m-a-p/MAP-CC
General NLP | Chinese
m-a-p/MAP-CC is a General NLP dataset in Chinese from m-a-p, distributed in Parquet format.
About m-a-p/MAP-CC
MAP-CC
Homepage | MAP-CC | CHC-Bench | CT-LLM | arXiv | GitHub
An open-source Chinese pretraining dataset with a scale of 800 billion tokens, offering the NLP community high-quality Chinese pretraining data.
Disclaimer
Th...
Details
- Task
- General NLP
- Language
- Chinese
- Format
- Parquet
- Rows / instances
- N/A
- Creator
- m-a-p
- Year
- 2024
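
At 800 billion tokens, the corpus is too large to download casually, so a first look is usually done by streaming a few records. The sketch below is one way to do that, assuming the Hugging Face `datasets` library is installed and that the repository exposes a `train` split (the split name is an assumption, and this card does not document the record fields). `peek`, `stream_map_cc`, and `main` are hypothetical helper names introduced here for illustration.

```python
from itertools import islice


def peek(records, n=3):
    """Return the first n items of an iterable as a list."""
    return list(islice(records, n))


def stream_map_cc(split="train"):
    """Open m-a-p/MAP-CC lazily. The split name is an assumption."""
    # Imported here so that peek() works without the library installed.
    from datasets import load_dataset

    # streaming=True avoids downloading the full corpus up front.
    return load_dataset("m-a-p/MAP-CC", split=split, streaming=True)


def main():
    for record in peek(stream_map_cc(), n=2):
        # Field names are not documented in this card, so only list the keys.
        print(sorted(record.keys()))
```

Calling `main()` requires network access. `peek` works on any iterable, so it can be tried on local data first.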