General NLP Datasets
There are 200 General NLP datasets in our directory, 8 of which are benchmarks. Each entry links to its source, paper, and download. Browse the full list below or filter by language.
Updated June 2026
- GenAI4ELab/papercli-papers-neurips · General NLP · English
- 5551z/VisCoR-55K · General NLP · English
- 57xj5SHr/Tui9DGhp · General NLP · English
- clouditera/security-paper-datasets · General NLP · English
- Clybius/booru-essence-images · General NLP · English
- wegrthj/l36l5h-qi9l-data · General NLP · English
- a686d380/h-corpus-2023 · General NLP · ZH
- trankhacminhtk/trankhacminhtk · General NLP · English
- anon8231489123/ShareGPT_Vicuna_unfiltered · General NLP · EN
- zai-org/AgentInstruct · General NLP · EN
- buiminhquan1998/buiminhquan1998 · General NLP · English
- idegen/csts · General NLP · English
- lmms-lab/MMBench · General NLP · English
- genarenadata/backup-leaderboard-data · General NLP · English
- GGSheng/ai-backup · General NLP · English
- General-Level/General-Bench-Openset · General NLP · English
- tranthanhdat2009/tranthanhdat2009 · General NLP · English
- harvard-lil/cold-cases · General NLP · EN
- just-me7ss/American-Sign-Language-Dataset · General NLP · English
- mvp-lab/LLaVA-OneVision-1.5-Mid-Training-85M · General NLP · English
- MarkrAI/KoCommercial-Dataset · General NLP · KO
- PsiBotAI/SynData · General NLP · EN
- Symato/cc · General NLP · VI
- Kazimir-ai/text-to-image-prompts · General NLP · EN
- shenyunhang/VoiceAssistant-400K · General NLP · English
- stanford-vision-lab/gpic · General NLP · EN
- xlangai/DS-1000 · General NLP · CODE
- ZahidYasinMittha/American-Sign-Language-Dataset · General NLP · English
- codeparrot/codeparrot-clean · General NLP · English
- codeparrot/github-code-clean · General NLP · English
- codeparrot/self-instruct-starcoder · General NLP · EN · Benchmark
- vuthanhdat2001/vuthanhdat2001 · General NLP · English
- nvidia/PhysicalAI-Autonomous-Vehicles · General NLP · English
- artur-muratov/multilingual-speech-commands-15lang · General NLP · EN, RU, KK
- CohereLabs/aya_collection_language_split · General NLP · ACE, AFR, AMH
- CohereLabs/aya_redteaming · General NLP · EN, HI, FR
- nvidia/PhysicalAI-WorldModel-Synthetic-Autonomous-Driving-Scenarios · General NLP · EN
- xlangai/osworld_v2_assets · General NLP · English
- zed-industries/zeta · General NLP · English
- compsciencelab/mdCATH · General NLP · English
- lmms-lab/Video-MME · General NLP · English
- cot-leaderboard/cot-eval-traces-2.0 · General NLP · English
- CropNet/CropNet · General NLP · EN
- Crownelius/Opus-4.5-WritingStyle-1000x · General NLP · English
- Crownelius/Opus-4.6-Reasoning-2100x-formatted · General NLP · English
- CShorten/ML-ArXiv-Papers · General NLP · English
- llm-jp/jhle · General NLP · JA
- trinhminh2005/trinhminh2005 · General NLP · English
- CyberNative/Code_Vulnerability_Security_DPO · General NLP · English
- dangthihang1995/dangthihang1995 · General NLP · English
- dangthu2006/dangthu2006 · General NLP · English
- k9cli/video-vec2wav2-tokenizer · General NLP · English
- danielv835/personal_finance_v0.2 · General NLP · English
- DarrenDong/parity-experiments · General NLP · English
- BelleGroup/train_1M_CN · General NLP · ZH
- cfahlgren1/react-code-instructions · General NLP · English
- allenai/soda · General NLP · EN
- open-thoughts/OpenThoughts2-1M · General NLP · English
- BelleGroup/train_3.5M_CN · General NLP · ZH
- Anthropic/values-in-the-wild · General NLP · English
- openai/healthbench · General NLP · English
- HennyPr/ps2_hf2 · General NLP · English
- ksolovev/FineNews · General NLP · English
- Maynor996/upload2 · General NLP · English
- Maynor996/img_upload · General NLP · English
- nick007x/arxiv-papers · General NLP · English
- nvidia/Nemotron-Post-Training-Dataset-v1 · General NLP · English
- atokforps/latent_worker_early-a2_08 · General NLP · English
- MBZUAI/LaMini-instruction · General NLP · EN
- QuixiAI/WizardLM_alpaca_evol_instruct_70k_unfiltered · General NLP · English
- bigbio/med_qa · General NLP · EN, ZH
- Kaichengalex/YFCC15M · General NLP · English
- Emmyc2/psp · General NLP · English
- jat-project/jat-dataset-tokenized · General NLP · English
- Kthera/pesoz · General NLP · English
- Maximilians/ps2_hf1 · General NLP · English
- daniilakk/nbchr_pdfs · General NLP · English
- dattt67/pypi-fcg-dataset · General NLP · EN
- Locutusque/function-calling-chatml · General NLP · English
- dazhiyang/bsrn-crs · General NLP · EN
- japanese-asr/whisper_transcriptions.reazon_speech_all · General NLP · English
- Tele-AI/TeleChat-PTD · General NLP · English
- lavita/medical-qa-shared-task-v1-toy · General NLP · English
- PeakStars/Math-Instruct · General NLP · English
- world-igr-plum/regions · General NLP · English
- bao2001/bao2001 · General NLP · English
- Spawning/PD12M · General NLP · EN
- nguyenloan2002/nguyenloan2002 · General NLP · English
- shigure451/Japanese_Manga · General NLP · English
- XiaoPanPanKevinPan/aicapstone_group7_cutlery_v2_replay_2 · General NLP · English
- updatebao/geonamebase_1 · General NLP · English
- builddotai/Egocentric-100K · General NLP · English
- TIGER-Lab/arxiv-latex-5T · General NLP · EN
- preezy02/en-us-data-with-images · General NLP · English
- satellogic/EarthView · General NLP · English
- kaiokendev/SuperCOT-dataset · General NLP · English
- huggingface-course/documentation-images · General NLP · English
- izumi-lab/llm-japanese-dataset · General NLP · JA
- cccat6/ASCII_Art_VLA · General NLP · English
- Chelsea707/arxiv-cs-2020-2025-pdfs · General NLP · English
- rtrm/debug · General NLP · English
- dclure/laion-aesthetics-12m-umap · General NLP · EN
- ddanielle/DogSpeak_Dataset · General NLP · English
- deepghs/character_index · General NLP · English
- karpathy/climbmix-400b-shuffle · General NLP · English
- yyyzzzzyyy/envss · General NLP · English
- SaylorTwift/bbh · General NLP · English
- HuggingFaceTB/cosmopedia · General NLP · EN
- muybuenacuentajaja2/files · General NLP · English
- vrnp-2401/Noisy-Data · General NLP · English
- phamthibich2005/phamthibich2005 · General NLP · English
- deepghs/game_characters · General NLP · English
- hjq766/imabed · General NLP · English
- karpathy/fineweb-edu-100b-shuffle · General NLP · English
- ErikCikalleshi/new_york_times_news_2000_2007 · General NLP · English
- KAKA22/SpreadsheetBench-v2 · General NLP · English
- Felix92/docTR-resource-collection · General NLP · English
- hf-internal-testing/transformers_circleci_workflow_runs · General NLP · English
- InternRobotics/EBench-Dataset · General NLP · English
- Narsil/image_dummy · General NLP · English
- mlfoundations/dclm-pool-7b-2x · General NLP · English
- Dagonulca/figofigofigofigo · General NLP · English
- applied-ai-018/pretraining_v1-omega_books · General NLP · English
- m-a-p/PIN-200M · General NLP · EN, ZH
- nvidia/Llama-Nemotron-Post-Training-Dataset · General NLP · English
- jacobbieker/eumetsat-rss · General NLP · English
- teknium/OpenHermes-2.5 · General NLP · ENG
- hallucinations-leaderboard/results · General NLP · English
- HuggingFaceFW/finepdfs_lang_classification · General NLP · English
- atokforps/latent_worker_early-a2_00 · General NLP · English
- deepseek-ai/DeepSeek-ProverBench · General NLP · English
- deepset/prompt-injections · General NLP · English
- defeatbeta/yahoo-finance-data · General NLP · EN
- ccoffee20/flatpak · General NLP · English
- deepmind/math_dataset · General NLP · EN
- Drakesuper/rodridre · General NLP · English
- mlabonne/guanaco-llama2-1k · General NLP · English
- jzr99/mesh4d_dataset · General NLP · English
- NovaSky-AI/Sky-T1_data_17k · General NLP · English
- lmsys/chatbot_arena_conversations · General NLP · English
- boltzgen/inference-data · General NLP · English
- yeigen/fannie-mae-loan-performance · General NLP · English
- atokforps/latent_worker_early-a2_03 · General NLP · English
- atokforps/latent_worker_early-a2_01 · General NLP · English
- simplescaling/s1K-1.1 · General NLP · EN
- BAAI/COIG · General NLP · ZH
- nguyenbaolam1998/nguyenbaolam1998 · General NLP · English
- hf-internal-testing/diffusers-images · General NLP · English
- lmms-lab/MME · General NLP · English
- zhengyun21/PMC-Patients · General NLP · EN
- nyanko7/LLaMA-65B · General NLP · English
- atokforps/latent_v1_fullrun_alpha3_04 · General NLP · English
- BelleGroup/school_math_0.25M · General NLP · ZH
- indiejoseph/wikipedia-zh-yue-filtered · General NLP · English
- tangjia0424/wcb-pro · General NLP · English
- peiyi9979/Math-Shepherd · General NLP · English
- HuggingFaceTB/smol-smoltalk · General NLP · EN
- OpenAssistant/oasst1 · General NLP · EN, ES, RU
- lmsys/lmsys-chat-1m · General NLP · English
- Nerfgun3/bad_prompt · General NLP · EN
- deinal/spacecast-data · General NLP · English
- cais/hle · General NLP · English
- evaleval/EEE_datastore · General NLP · English
- DeliberatorArchiver/asmr-archive-data-02 · General NLP · JA · Benchmark
- Aasdfip/habitat_web_pose_train · General NLP · English
- AbbasABC/HFL-Dataset · General NLP · English
- Den4ikAI/russian_dialogues · General NLP · RU
- abdullah/IUG-CourseTranscripts · General NLP · English
- Aber-r/SA-1B_backup · General NLP · English
- HuggingFaceTB/smollm-corpus · General NLP · EN
- abiyo27/BibleTTS_Ewe-Bible · General NLP · English
- ospanbatyr/pq · General NLP · English
- Abrumu/Fashion_controlnet_dataset_V3 · General NLP · English
- GAIR/lima · General NLP · English
- garage-bAInd/Open-Platypus · General NLP · EN
- HuggingFaceTB/smoltalk · General NLP · EN
- AdithyaSK/RAG_Eval · General NLP · English
- open-web-math/open-web-math · General NLP · English
- bespokelabs/Bespoke-Stratos-17k · General NLP · EN
- arcinstitute/Stack-scBaseCount189M · General NLP · English · Benchmark
- neuralwork/arxiver · General NLP · English
- Intel/orca_dpo_pairs · General NLP · English
- Aeala/ShareGPT_Vicuna_unfiltered · General NLP · EN
- Dexora/Dexora_Real-World_Dataset · General NLP · English
- di-zhang-fdu/AIME_1983_2024 · General NLP · English · Benchmark
- agentica-org/DeepScaleR-Preview-Dataset · General NLP · EN
- agents-course/certificates · General NLP · English
- agents-course/course-images · General NLP · English
- nateraw/video-demo · General NLP · English
- hendrycks/competition_math · General NLP · EN · Benchmark
- SWE-bench/SWE-bench_Multilingual · General NLP · EN · Benchmark
- jtatman/stable-diffusion-prompts-stats-full-uncensored · General NLP · English
- banned-historical-archives/zhongyangribao · General NLP · English · Benchmark
- QuixiAI/dolphin-r1 · General NLP · English
- NousResearch/Hermes-3-Dataset · General NLP · English · Benchmark
- LinkSoul/instruction_merge_set · General NLP · English
- laion/relaion-high-resolution · General NLP · English
- Maxx0/sexting-nsfw-adultconten · General NLP · English
- openai/graphwalks · General NLP · English
- nick007x/github-code-2025 · General NLP · English