English Datasets
We catalog 200 English datasets for NLP and machine learning, including 3 benchmarks. Browse the list below or narrow down by task.
This page covers English, the most widely represented language in NLP research and the default for most large language models.
Updated June 2026
- SWE-bench · Code · English · Benchmark
- tasl-lab/uniocc · Image To 3D · English
- junma/CVPR-BiomedSegFM · Image Segmentation · English
- robbyant/mdm_depth · Depth Estimation · EN
- OpenSQZ/AutoMathText-V2 · Text Generation, Question Answering · EN, ZH
- GenAI4ELab/papercli-papers-neurips · General NLP · English
- 5551z/VisCoR-55K · General NLP · English
- 57xj5SHr/Tui9DGhp · General NLP · English
- 64bits/lima_vicuna_format · Text Generation · EN
- FinancialPhraseBank · Sentiment Analysis · English
- CLIPAMharic/AmharicCLIP-annotation · Image To Text · AM, EN
- clouditera/security-paper-datasets · General NLP · English
- Clybius/booru-essence-images · General NLP · English
- clzoro/GLM-5.1-1000000x · Text Generation, Question Answering · EN, ZH
- codefuse-ai/CodeExercise-Python-27k · Text Generation · EN
- JusperLee/Hive-ALL · Audio Classification, Audio To Audio · EN
- jxu124/OpenX-Embodiment · Robotics, Reinforcement Learning · EN
- IIRC · Reading Comprehension · English
- SubjQA · Question Answering · English
- SigmaLaw-ABSA · Text Corpora, Sentiment Analysis · English
- wegrthj/l36l5h-qi9l-data · General NLP · English
- aps/super_glue · Text Classification, Token Classification, Question Answering · EN
- cadene/agibot_alpha_v30 · Robotics · English
- cadene/droid_1.0.1 · Robotics · English
- a-m-team/AM-DeepSeek-R1-Distilled-1.4M · Text Generation · ZH, EN
- cais/wmdp · Text Generation · EN
- trankhacminhtk/trankhacminhtk · General NLP · English
- anon8231489123/ShareGPT_Vicuna_unfiltered · General NLP · EN
- zai-org/AgentInstruct · General NLP · EN
- buiminhquan1998/buiminhquan1998 · General NLP · English
- CoNLL 2003 ++ · Named Entity Recognition (NER) · English
- ACL Citation Coreference Corpus · Coreference Resolution · English
- KingNish/reasoning-base-20k · Text Generation · EN
- OpenMed/Medical-Reasoning-SFT-Mega · Text Generation, Question Answering · EN
- GrailQA · Question Answering, Knowledge Base · English
- WNUT 2016 · Named Entity Recognition (NER) · English
- Social Narrative Tree · Commonsense · English
- Social Bias Inference Corpus (SBIC) · Classification, Text Generation · English
- ENT-DESC · Data-To-Text Generation · English
- Multi-Xscience · Summarization · English
- Acronym Identification · Acronym Identification · English
- idegen/csts · General NLP · English
- OpenGVLab/InternVid · Feature Extraction · EN
- CC100-English · Text Corpora · English
- lmms-lab/MMBench · General NLP · English
- genarenadata/backup-leaderboard-data · General NLP · English
- GGSheng/ai-backup · General NLP · English
- General-Level/General-Bench-Openset · General NLP · English
- tranthanhdat2009/tranthanhdat2009 · General NLP · English
- harvard-lil/cold-cases · General NLP · EN
- JailbreakV-28K/JailBreakV-28k · Text Generation, Question Answering · English
- Benjy/typed_digital_signatures · Image Classification, Zero Shot Image Classification, Image Feature Extraction · EN
- just-me7ss/American-Sign-Language-Dataset · General NLP · English
- mvp-lab/LLaVA-OneVision-1.5-Mid-Training-85M · General NLP · English
- XDOF/ABC-130k · Robotics · EN
- PsiBotAI/SynData · General NLP · EN
- labelmaker/arkit_labelmaker · Image Segmentation · EN
- Kazimir-ai/text-to-image-prompts · General NLP · EN
- shenyunhang/VoiceAssistant-400K · General NLP · English
- stanford-vision-lab/gpic · General NLP · EN
- ZahidYasinMittha/American-Sign-Language-Dataset · General NLP · English
- Constructive Comments Corpus (C3) · Text Classification · English
- codeparrot/codeparrot-clean · General NLP · English
- codeparrot/github-code-clean · General NLP · English
- codeparrot/self-instruct-starcoder · General NLP · EN · Benchmark
- CogComp/trec · Text Classification · EN
- FewRel 1.0 · Relation Extraction · English
- vuthanhdat2001/vuthanhdat2001 · General NLP · English
- LEDGAR · Classification · English
- nvidia/PhysicalAI-Autonomous-Vehicles · General NLP · English
- artur-muratov/multilingual-speech-commands-15lang · General NLP · EN, RU, KK
- stanfordnlp/imdb · Text Classification · EN
- uoft-cs/cifar10 · Image Classification · EN
- jackyhate/text-to-image-2M · Text To Image, Image To Text, Image Classification · EN
- QED · Question Answering, Explainability · English
- Visual Genome · Visual Question Answering, Knowledge Base · English
- CohereLabs/aya_redteaming · General NLP · EN, HI, FR
- commoncrawl/host-index-testing-v2 · Text Generation · English
- Visual Commonsense Graphs · Visual Question Answering, Commonsense · English
- nvidia/PhysicalAI-WorldModel-Synthetic-Autonomous-Driving-Scenarios · General NLP · EN
- GeNeVA · Text-to-Image · English
- PubmedQA · Question Answering · English
- BioCreative II Gene Mention Recognition (BC2GM) · Information Extraction, Named Entity Recognition (NER) · English
- BC5CDR Drug/Chemical (BC5-Chem) · Information Extraction, Named Entity Recognition (NER) · English
- BC5CDR Disease (BC5-Disease) · Information Extraction, Named Entity Recognition (NER) · English
- JNLPBA · Information Extraction, Named Entity Recognition (NER) · English
- NCBI Disease Corpus · Information Extraction, Named Entity Recognition (NER) · English
- ChemProt · Relation Extraction · English
- Drug-Disease Interaction (DDI) · Relation Extraction · English
- Gene-Disease Associations (GAD) · Relation Extraction · English
- BIOSSES · Semantic Textual Similarity · English
- K-and-K/knights-and-knaves · Question Answering · EN
- mvp-lab/LLaVA-OneVision-2-Data · Video Text To Text, Visual Question Answering, Image Text To Text · EN
- allenai/openbookqa · Question Answering · EN
- Adverse Drug Effect (ADE) Corpus · Information Extraction · English
- nvidia/ToolScale · Text Generation · EN
- Locutusque/UltraTextbooks · Text Generation · EN, CODE
- TVQA · Multi-Modal Learning, Video Question Answering · English
- Paraphrase and Semantic Similarity in Twitter (PIT) · Classification · English
- Question Answering in Context (QuAC) · Question Answering, Reading Comprehension · English
- Reading Comprehension with Commonsense Reasoning Dataset (Record) · Question Answering, Reading Comprehension · English
- OpenWebTextCorpus · Text Corpora · English
- Personalized Dialog · Dialogue · English
- ArxivPapers · Text Corpora · English
- SegmentedTables & LinkedResults · Table Segmentation, Table Type Classification · English
- Gigaword · Summarization · English
- SAMSum · Summarization · English
- News Category Dataset · Classification · English
- NIPS Papers · Text Corpora · English
- xlangai/osworld_v2_assets · General NLP · English
- Question NLI · Natural Language Inference (NLI) · English
- Reading Comprehension with Multiple Hops (Qangaroo) · Question Answering, Reading Comprehension · English
- zed-industries/zeta · General NLP · English
- Relationship and Entity Extraction Evaluation Dataset (RE3D) · Classification, Entity and Relation Recognition · English
- Schema-Guided Dialogue State Tracking (DSTC 8) · Dialogue State Tracking · English
- kaist-ai/CoT-Collection · Text Generation, Text Classification · EN
- Civil Comments · Classification · English
- 1 Billion Word Language Model Benchmark (lm1b) · Language Modeling · English
- OpenGVLab/ShareGPT-4o · Visual Question Answering, Question Answering · EN
- community-datasets/yahoo_answers_topics · Text Classification · EN
- competitions/aiornot · Image Classification · English
- compsciencelab/mdCATH · General NLP · English
- ParaBank · Semantic Textual Similarity · English
- AmbigNQ · Question Answering, Reading Comprehension · English
- E2E · Text Generation · English
- LIAR Dataset · Classification, Fake News Detection · English
- lmms-lab/Video-MME · General NLP · English
- Humicroedit · Classification · English
- Atlas of Machine Commonsense (ATOMIC) · Commonsense, Knowledge Graph · English
- Genia · Part of Speech (POS), Constituency, Coreference, Event, Relation · English
- DNA Methylation Corpus · Information Extraction, Entity Extraction, Event Extraction · English
- Exhaustive PTM Corpus · Information Extraction, Event Extraction · English
- mTOR Pathway Corpus · Information Extraction, Entity Extraction, Event Extraction · English
- PTM Event Corpus · Information Extraction, Event Extraction · English
- T4SS Event Corpus · Information Extraction, Event Extraction · English
- The New York Times Annotated Corpus · Summarization, Information Extraction · English
- Situations With Adversarial Generations (SWAG) · Question Answering, Reading Comprehension · English
- Skytrax User Reviews Dataset · Classification, Sentiment Analysis · English
- Spider 1.0 · Semantic Parsing, SQL-to-Text · English
- SQuAD v2.0 · Question Answering, Reading Comprehension · English
- TIGER-Lab/WebInstructSub · Question Answering · EN
- Stanford Natural Language Inference (SNLI) Corpus · Natural Language Inference (NLI) · English
- The Benchmark of Linguistic Minimal Pairs (BLiMP) · Language Modeling · English
- ColBERT · Classification, Humor Detection · English
- PARANMT-50M · Paraphrasing Generation · English
- Igbo Text · Text Corpora, Machine Translation · Igbo, English
- Urhobo Text · Text Corpora, Machine Translation · Urhobo, English
- corbyrosset/researchy_questions · Question Answering · EN · Benchmark
- cot-leaderboard/cot-eval-traces-2.0 · General NLP · English
- CropNet/CropNet · General NLP · EN
- Crownelius/Opus-4.5-WritingStyle-1000x · General NLP · English
- Crownelius/Opus-4.6-Reasoning-2100x-formatted · General NLP · English
- CShorten/ML-ArXiv-Papers · General NLP · English
- DoQa · Question Answering, Dialogue · English
- Quda · Information Extraction, Visualization · English
- The Conversational Intelligence Challenge 2 (ConvAI2) · Dialogue · English
- The Stanford Sentiment Treebank (SST) · Classification, Sentiment Analysis · English
- Taskmaster-1 · Dialogue · English
- The Penn Treebank Project · POS · English
- Who Did What Dataset · Question Answering, Reading Comprehension · English
- kakaobrain/coyo-700m · Text To Image, Image To Text, Zero Shot Classification · EN
- Video Commonsense Reasoning (VCR) · Question Answering, Visual, Commonsense · English
- Clash of Clans · Sentiment Analysis · English
- Datasets Knowledge Embedding · Embeddings · English
- MPQA Opinion Corpus · Sentiment Analysis · English
- trinhminh2005/trinhminh2005 · General NLP · English
- CSU-JPG/TextAtlas5M · Text To Image · EN
- ctmedtech/DDR-dataset · Image Segmentation, Image Classification, Object Detection · EN
- cua-lite/ScaleCUA · Image Text To Text · English
- cyberagent/crello · Image Segmentation · EN
- CyberNative/Code_Vulnerability_Security_DPO · General NLP · English
- silk-road/Wizard-LM-Chinese-instruct-evol · Text Generation, Question Answering · ZH, EN
- Complex Sequential Question Answering (CSQA) · Question Answering, Knowledge Base · English
- Linked WikiText-2 · Knowledge Graph · English
- BuGL · Text Corpora · English
- Visual QA (VQA) · Visual Question Answering · English
- Topical-Chat · Dialogue · English
- Total-Text-Dataset · Scene Text Detection · English
- TriviaQA · Question Answering, Reading Comprehension · English
- MathQA · Question Answering, Reading Comprehension · English
- SherLIiC · Natural Language Inference (NLI), Lexical Inference/Entailment · English
- DiaBLa · Machine Translation, Dialogue · French, English
- NELA-GT-2019 · Text Corpora, Classification · English
- WebQuestions · Question Answering, Knowledge Base · English
- Multimodal EmotionLines Dataset (MELD) · Multi-Modal Learning · English
- Book Depository Dataset · Topic Modeling, Classification · English
- Twenty Newsgroups Dataset · Classification, Clustering · English
- neural-bridge/rag-dataset-12000 · Question Answering · EN
- bones-studio/seed · Robotics, Text To Video, Video Text To Text · EN
- pixparse/pdfa-eng-wds · Image To Text · EN
- lmarena-ai/arena-human-preference-55k · Text Classification · EN
- DAMO-NLP-SG/multimodal_textbook · Text Generation, Summarization · EN
- dangth2004/KITTI-Pseudo-Depth · Depth Estimation, Image To Image · EN
- dangthihang1995/dangthihang1995 · General NLP · English
- dangthu2006/dangthu2006 · General NLP · English
- SemEval-2016 Task 4 · Classification, Sentiment Analysis · English
- Wikipedia News Corpus · Text Corpora · English
- WSD English All-Words Fine-Grained Datasets · Word Sense Disambiguation · English
- Nanbeige/ToolMind · Text Generation · EN
- LDJnr/Puffin · Question Answering, Text Generation · EN
What tasks do English datasets cover?
- General NLP (45)
- Question Answering (30)
- Text Generation (19)
- Classification (15)
- Information Extraction (13)
- Reading Comprehension (10)
- Text Corpora (10)
- Sentiment Analysis (7)
- Text Classification (7)
- Named Entity Recognition (NER) (7)
- Dialogue (6)
- Robotics (5)
- Summarization (5)
- Image Classification (5)
- Visual Question Answering (5)
- Event Extraction (5)
- Image Segmentation (4)
- Image To Text (4)