jingyaogong/minimind_dataset
Text Generation · MULTILINGUAL · apache-2.0
Created by jingyaogong in 2024, jingyaogong/minimind_dataset is a multilingual text generation dataset in Parquet format. With 2.5K downloads and 108 likes, it is actively used by the community. It is released under the apache-2.0 license.
About jingyaogong/minimind_dataset
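📌 Data Introduction
Ⅰ Tokenizer
A tokenizer can be roughly understood as the "dictionary" an LLM uses: it maps natural language to token ids and decodes token ids back into text. The project also provides train_tokenizer.py as a vocabulary training example. Retraining the tokenizer is not recommended, because once the vocabulary and segmentation rules change, compatibility with model weights, data formats, inference interfaces, and the community ecosystem all degrades, which also makes the model harder to share. The tokenizer also affects metrics that are computed per token, such as PPL, so cross ...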
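The text-to-id mapping and the per-token nature of PPL can be illustrated with a minimal sketch. Everything here is hypothetical and for illustration only: the toy vocabulary, the greedy longest-match rule, and the fixed total loss value are assumptions, and none of it reflects the actual MiniMind tokenizer or train_tokenizer.py.

```python
import math

# Hypothetical toy vocabulary; a real tokenizer learns a far larger one from data.
VOCAB = ["<unk>", "h", "e", "l", "o", " ", "w", "r", "d", "hello", "world"]


def build_maps(vocab):
    """Return (token -> id, id -> token) lookup tables for a vocabulary list."""
    token_to_id = {tok: i for i, tok in enumerate(vocab)}
    id_to_token = {i: tok for tok, i in token_to_id.items()}
    return token_to_id, id_to_token


def encode(text, token_to_id):
    """Greedy longest-match segmentation (an assumed rule, not MiniMind's)."""
    max_len = max(len(tok) for tok in token_to_id)
    ids, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if piece in token_to_id:
                ids.append(token_to_id[piece])
                i += size
                break
        else:
            # No vocabulary entry matches: emit <unk> and skip one character.
            ids.append(token_to_id["<unk>"])
            i += 1
    return ids


def decode(ids, id_to_token):
    """Map token ids back to text by concatenating their strings."""
    return "".join(id_to_token[i] for i in ids)


def per_token_ppl(total_nll, n_tokens):
    """Perplexity = exp(total negative log-likelihood / number of tokens)."""
    return math.exp(total_nll / n_tokens)


word_t2i, word_i2t = build_maps(VOCAB)       # vocabulary with whole-word entries
char_t2i, char_i2t = build_maps(VOCAB[:9])   # same vocabulary, characters only

text = "hello world"
word_ids = encode(text, word_t2i)
char_ids = encode(text, char_t2i)

# Assumed: both models assign the same total loss (in nats) to the sentence.
TOTAL_NLL = 6.0
word_ppl = per_token_ppl(TOTAL_NLL, len(word_ids))
char_ppl = per_token_ppl(TOTAL_NLL, len(char_ids))
```

In this sketch the same sentence becomes 3 tokens under one vocabulary and 11 under the other, so the same total loss yields different per-token perplexities. Changing the vocabulary changes both the ids a model's weights were trained on and the denominator of the metric.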
Details
- Task: Text Generation
- Language: MULTILINGUAL
- Format: Parquet
- Rows / instances: N/A
- Creator: jingyaogong
- Year: 2024
- License: apache-2.0
- Downloads: 2539
- Likes: 108