wenge-research/yayi2_pretrain_data
General NLP · ZH, EN · Benchmark
wenge-research/yayi2_pretrain_data is a general NLP benchmark dataset in Chinese (ZH) and English (EN), distributed in Parquet format.
This dataset is used as an LLM benchmark.
About wenge-research/yayi2_pretrain_data
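Introduction
This dataset is drawn from the YAYI training corpus. We selected about 100B of data, roughly 500GB in size. We hope that open-sourcing the YAYI pre-training data will advance the open-source community for Chinese pre-trained large models, and we are actively contributing to that goal. Through open source, we are building the YAYI large-model ecosystem together with every partner.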
We open-source the pre-training dataset in this release; it should contain more than 100B tokens depending on the tokeni...
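As a rough illustration of why a token total depends on how the text is tokenized, the sketch below counts the same mixed Chinese/English string two ways. Both functions are hypothetical toy tokenizers written for this example; neither is the YAYI tokenizer, and the sample string is not taken from the dataset.

```python
import re

def whitespace_tokens(text: str) -> list:
    """Toy tokenizer (assumption): split on whitespace only.
    An unspaced run of Chinese characters stays a single token."""
    return text.split()

def char_level_tokens(text: str) -> list:
    """Toy tokenizer (assumption): each CJK character is one token,
    each run of ASCII letters or digits is one token."""
    return re.findall(r"[\u4e00-\u9fff]|[A-Za-z0-9]+", text)

# Illustrative sample only; not drawn from yayi2_pretrain_data.
sample = "雅意 open-sources pretraining data 预训练数据"

counts = {
    "whitespace": len(whitespace_tokens(sample)),
    "char_level": len(char_level_tokens(sample)),
}
print(counts)
```

The two counts differ for the same input, which is why a corpus-level figure such as "more than 100B tokens" is only meaningful relative to a specific tokenizer.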
Details
- Task: General NLP
- Language: ZH, EN
- Format: Parquet
- Rows / instances: N/A
- Creator: wenge-research
- Year: 2023