Chinese Datasets
We catalog 58 Chinese datasets for NLP and machine learning. Browse the list below or narrow down by task.
This page covers Chinese (Mandarin), the language with the most native speakers in the world and a major focus of multilingual NLP.
Updated June 2026
- OpenSQZ/AutoMathText-V2: Text Generation, Question Answering (EN, ZH)
- clzoro/GLM-5.1-1000000x: Text Generation, Question Answering (EN, ZH)
- MedDialog: Dialogue (Chinese)
- a-m-team/AM-DeepSeek-R1-Distilled-1.4M: Text Generation (ZH, EN)
- a686d380/h-corpus-2023: General NLP (ZH)
- NLP Chinese Corpus: Text Corpora (Chinese)
- SylvanL/Traditional-Chinese-Medicine-Dataset-SFT: Table Question Answering (ZH)
- Congliu/Chinese-DeepSeek-R1-Distill-data-110k: Text Generation, Question Answering (ZH)
- Congliu/Chinese-DeepSeek-R1-Distill-data-110k-SFT: Text Generation, Question Answering (ZH)
- silk-road/Wizard-LM-Chinese-instruct-evol: Text Generation, Question Answering (ZH, EN)
- MATINF: Classification, Question Answering, Summarization (Chinese)
- BelleGroup/train_1M_CN: General NLP (ZH)
- BelleGroup/train_3.5M_CN: General NLP (ZH)
- bigbio/med_qa: General NLP (EN, ZH)
- UCSD26/medical_dialog: Question Answering (EN, ZH)
- Mxode/Chinese-Instruct: Text Generation, Question Answering (ZH)
- shibing624/alpaca-zh: Text Generation (ZH)
- BAAI/Infinity-Instruct: Text Generation (EN, ZH)
- m-a-p/PIN-200M: General NLP (EN, ZH)
- Tencent AI Lab Embedding Corpus: Embeddings (Chinese)
- BAAI/COIG: General NLP (ZH)
- shibing624/medical: Text Generation (ZH)
- BelleGroup/school_math_0.25M: General NLP (ZH)
- proj-persona/PersonaHub: Text Generation, Text Classification, Token Classification, Fill Mask, Table Question Answering (EN, ZH)
- JosephusCheung/GuanacoDataset: Text Generation, Question Answering (ZH, EN, JA)
- openbmb/UltraData-Math: Text Generation (EN, ZH)
- LooksJuicy/ruozhiba: Text Generation (ZH)
- wangrui6/Zhihu-KOL: Question Answering (ZH)
- llamafactory/tiny-supervised-dataset: Text Generation, Question Answering (EN, ZH)
- MMInstruction/M3IT: Image To Text, Image Classification (EN, ZH)
- opencsg/chinese-fineweb-edu: Text Generation (ZH)
- shareAI/ShareGPT-Chinese-English-90k: Question Answering, Text Generation (EN, ZH)
- silk-road/alpaca-data-gpt4-chinese: Text Generation (ZH, EN)
- sunzeyeah/chinese_chatgpt_corpus: Text Generation, Question Answering, Reinforcement Learning (ZH)
- Wenetspeech4TTS/WenetSpeech4TTS: Text To Speech (ZH)
- Seikaijyu/Sex-novel-filtered: General NLP (ZH)
- LooksJuicy/Chinese-Roleplay-Novel: General NLP (ZH)
- Magpie-Align/Magpie-Qwen2-Pro-200K-Chinese: Question Answering (ZH)
- Seikaijyu/Beautiful-Chinese: General NLP (ZH)
- opencsg/Fineweb-Edu-Chinese-V2.2: Text Generation, Question Answering (ZH)
- opencsg/chinese-cosmopedia: Text Generation (ZH)
- BAAI/Infinity-Preference: Text Generation, Question Answering (EN, ZH)
- galaxyMindAiLabs/stem-reasoning-complex: Text Generation, Question Answering (EN, ZH)
- Limour/b-corpus: Text Generation (ZH)
- zai-org/LongCite-45k: Text Generation, Question Answering (EN, ZH)
- MiniMaxAI/SynLogic: Text Generation (EN, ZH)
- opencsg/chinese-fineweb-edu-v2: Text Generation (ZH)
- SparkAudio/voxbox: Text To Speech (ZH, EN)
- Jackrong/Claude-opus-4.6-TraceInversion-9000x: Text Generation (EN, ZH, KO)
- BAAI/IndustryCorpus2: General NLP (EN, ZH)
- Qwen/WebWorldData: Text Generation (EN, ZH)
- llm-wizard/alpaca-gpt4-data-zh: Text Generation (ZH)
- LinkSoul/Chinese-LLaVA-Vision-Instructions: General NLP (EN, ZH)
- manu/project_gutenberg: Text Generation (FR, EN, ZH)
- Flmc/DISC-Med-SFT: Question Answering (ZH)
- shibing624/roleplay-zh-sharegpt-gpt4-data: Text Generation (ZH)
- silk-road/ChatHaruhi-54K-Role-Playing-Dialogue: Text Generation (EN, ZH)
- Jackrong/Claude-opus-4.7-TraceInversion-5000x: Text Generation (EN, ZH, KO)
What tasks do Chinese datasets cover?
- Text Generation (34)
- Question Answering (19)
- General NLP (12)
- Table Question Answering (2)
- Text To Speech (2)
- Dialogue (1)
- Text Corpora (1)
- Classification (1)
- Summarization (1)
- Embeddings (1)
- Text Classification (1)
- Token Classification (1)
- Fill Mask (1)
- Image To Text (1)
- Image Classification (1)
- Reinforcement Learning (1)
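
A dataset can carry more than one task tag, so the per-task counts add up to more than the 58 entries in the list. The sketch below shows one way such a tally and a by-task filter could be computed. It uses five entries copied from the list; the `ENTRIES` tuple layout and the function names `count_tasks` and `filter_by_task` are assumptions made for illustration, not part of this directory.

```python
from collections import Counter

# Five entries copied from the list: (name, task tags, language tags).
# The tuple layout is an assumption made for this sketch.
ENTRIES = [
    ("MedDialog", ["Dialogue"], ["Chinese"]),
    ("shibing624/alpaca-zh", ["Text Generation"], ["ZH"]),
    ("wangrui6/Zhihu-KOL", ["Question Answering"], ["ZH"]),
    ("Mxode/Chinese-Instruct", ["Text Generation", "Question Answering"], ["ZH"]),
    ("SparkAudio/voxbox", ["Text To Speech"], ["ZH", "EN"]),
]


def count_tasks(entries):
    """Count how many entries carry each task tag (one count per tag per entry)."""
    counts = Counter()
    for _name, tasks, _languages in entries:
        counts.update(tasks)
    return counts


def filter_by_task(entries, task):
    """Return the names of entries tagged with the given task, in list order."""
    return [name for name, tasks, _languages in entries if task in tasks]


task_counts = count_tasks(ENTRIES)
qa_names = filter_by_task(ENTRIES, "Question Answering")

for task, n in task_counts.most_common():
    print(f"{task} ({n})")
print(qa_names)
```

In this sample, Mxode/Chinese-Instruct is counted under both Text Generation and Question Answering, which is why the tag total (6) exceeds the number of entries (5).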