Classification Datasets
There are 45 classification datasets in our directory. Each links to its source, paper, and download — browse the full list below or filter by language.
Classification is the task of sorting inputs into predefined categories, and it is one of the most common supervised-learning tasks.
Updated June 2026
- B5 Corpus | Classification | Portuguese
- Social Bias Inference Corpus (SBIC) | Classification, Text Generation | English
- LEDGAR | Classification | English
- Paraphrase and Semantic Similarity in Twitter (PIT) | Classification | English
- News Category Dataset | Classification | English
- Relationship and Entity Extraction Evaluation Dataset (RE3D) | Classification, Entity and Relation Recognition | English
- Civil Comments | Classification | English
- LIAR Dataset | Classification, Fake News Detection | English
- Humicroedit | Classification | English
- Skytrax User Reviews Dataset | Classification, Sentiment Analysis | English
- Ten Thousand German News Articles Dataset (10kGNAD) | Classification | German
- ColBERT | Classification, Humor Detection | English
- The Stanford Sentiment Treebank (SST) | Classification, Sentiment Analysis | English
- MATINF | Classification, Question Answering, Summarization | Chinese
- The EUR-Lex Dataset | Classification | Multi-Lingual
- Arabic Jordanian General Tweets (AJGT) | Classification, Sentiment Analysis | Arabic
- NELA-GT-2019 | Text Corpora, Classification | English
- Book Depository Dataset | Topic Modeling, Classification | English
- Twenty Newsgroups Dataset | Classification, Clustering | English
- SemEval-2016 Task 4 | Classification, Sentiment Analysis | English
- ArguAna TripAdvisor Corpus | Classification, Sentiment Analysis | English
- Yelp Open Dataset | Classification, Sentiment Analysis | English
- Abductive Natural Language Inference (aNLI) | Classification, Commonsense | English
- Blogger Authorship Corpus | Classification, Sentiment Analysis | English
- Dutch Book Reviews | Classification, Sentiment Analysis | Dutch
- Amazon Fine Food Reviews | Classification, Sentiment Analysis | English
- Buzz in Social Media Dataset | Classification | English
- Car Evaluation Dataset | Classification | English
- ClueWeb Corpora | Classification | English
- Corporate Messaging Corpus | Classification | English
- DEXTER Dataset | Classification | English
- Google Books N-grams | Classification, Clustering | Multi-Lingual
- Legal Case Reports | Classification | English
- Ling-Spam Dataset | Classification | English
- MovieTweetings | Classification, Regression | English
- Personae Corpus | Classification, Regression | Dutch
- Spambase Dataset | Classification | English
- Reuters-21578 Benchmark Corpus | Classification | English
- Sentiment Labeled Sentences Dataset | Classification, Sentiment Analysis | English
- Sentiment140 | Classification, Sentiment Analysis | English
- SMS Spam Collection Dataset | Classification | English
- Web of Science Dataset | Classification | English
- Twitter Dataset for Arabic Sentiment Analysis | Classification, Sentiment Analysis | Arabic
- Twitter US Airline Sentiment | Classification, Sentiment Analysis | English
- YouTube Comedy Slam Preference Dataset | Classification | English