jingyaogong/minimind_dataset

Text Generation · MULTILINGUAL · apache-2.0

Created by jingyaogong in 2024, jingyaogong/minimind_dataset is a multilingual text generation dataset distributed in Parquet format. It has 2,539 downloads and 108 likes, and is released under the apache-2.0 license.

About jingyaogong/minimind_dataset

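📌 Data overview Ⅰ Tokenizer: A tokenizer can be loosely thought of as the "dictionary" an LLM uses. It maps natural language to token ids and decodes token ids back into text; the project also provides train_tokenizer.py as an example of vocabulary training. Retraining the tokenizer is not recommended: once the vocabulary and segmentation rules change, compatibility with the model weights, data format, inference interface, and community ecosystem all degrades, which also makes the model harder to share. The tokenizer also affects per-token metrics such as PPL, so cross ...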
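The round trip described above (text to token ids and back) and the per-token nature of PPL can be illustrated with a minimal sketch. Everything here is hypothetical and for illustration only: the toy vocabulary, the greedy longest-match rule, and the helper names `encode`, `decode`, and `perplexity` are assumptions, not the MiniMind tokenizer or the logic in train_tokenizer.py.

```python
import math

# Toy vocabulary (hypothetical; NOT the MiniMind vocabulary).
VOCAB = ["<unk>", "mini", "mind", "token", "izer", " "]
TOKEN_TO_ID = {tok: i for i, tok in enumerate(VOCAB)}
# Longest real token length, excluding the special <unk> entry.
MAX_LEN = max(len(tok) for tok in VOCAB if tok != "<unk>")


def encode(text):
    """Map text to token ids with greedy longest-match segmentation."""
    ids = []
    pos = 0
    while pos < len(text):
        for length in range(min(MAX_LEN, len(text) - pos), 0, -1):
            piece = text[pos:pos + length]
            if piece != "<unk>" and piece in TOKEN_TO_ID:
                ids.append(TOKEN_TO_ID[piece])
                pos += length
                break
        else:
            # No vocabulary entry matches here: emit <unk> for one character.
            ids.append(TOKEN_TO_ID["<unk>"])
            pos += 1
    return ids


def decode(ids):
    """Map token ids back to text by concatenating vocabulary entries."""
    return "".join(VOCAB[i] for i in ids)


def perplexity(total_nll, num_tokens):
    """Per-token perplexity: exp of the mean negative log-likelihood."""
    return math.exp(total_nll / num_tokens)


ids = encode("minimind tokenizer")
print(ids)          # [1, 2, 5, 3, 4]
print(decode(ids))  # minimind tokenizer

# Same total negative log-likelihood, different token counts:
# the per-token PPL differs even though the text is the same.
print(perplexity(12.0, 4) > perplexity(12.0, 6))  # True
```

Because the vocabulary fixes which id each piece of text receives, ids produced under one vocabulary are meaningless to a model trained with another. The last line shows why a per-token metric such as PPL depends on how many tokens the tokenizer produces for the same text.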

Details

Task: Text Generation
Language: MULTILINGUAL
Format: Parquet
Rows / instances: N/A
Creator: jingyaogong
Year: 2024
License: apache-2.0
Downloads: 2539
Likes: 108