List of Text Classification Datasets for Machine Learning Projects

High-quality datasets are key to good performance in natural language processing (NLP) projects. We collected a list of NLP datasets for the text classification task to help you get started with your machine learning projects. Below you will find a large curated training base for text classification.

What is the text classification task?

Text classification is the process of assigning text to classes or categories based on predefined features such as words or sentences.
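
As a rough illustration of this definition, the sketch below assigns a text to a class by counting predefined keyword features. The class names, keyword sets, and the `classify` function are hypothetical and chosen only for this example. Practical classifiers are usually trained on labeled datasets rather than written by hand.

```python
import re

# Hypothetical predefined features: a small keyword set per class.
KEYWORDS = {
    "sports": {"match", "goal", "team", "score"},
    "finance": {"stock", "market", "bank", "shares"},
}


def classify(text, keywords=KEYWORDS, default="unknown"):
    """Return the class whose keywords occur most often in the text."""
    # Lowercase the text and split it into simple word tokens.
    tokens = re.findall(r"[a-z']+", text.lower())
    # Count how many tokens match each class's keyword set.
    hits = {
        label: sum(1 for token in tokens if token in words)
        for label, words in keywords.items()
    }
    best_label = max(hits, key=hits.get)
    # Fall back to the default class when no keyword matched at all.
    return best_label if hits[best_label] > 0 else default


if __name__ == "__main__":
    print(classify("The team scored a late goal in the match"))
    print(classify("Bank shares fell as the stock market dropped"))
```

A trained model replaces the hand-written keyword sets with features learned from labeled examples, but the input and output are the same: a text goes in and a category comes out.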



Found 274 Text classification Datasets

Let’s get started!

B5 Corpus
Dataset is a collection of Facebook posts, including information about Brazilian authors such as gender, age, personality score (based on the B5 test), education level, political position, religion, and others.
Sentimental LIAR
Sentimental LIAR dataset is a modified and further extended version of the original LIAR dataset. It was modified to be a binary-label dataset that was then extended by adding sentiments derived using the Google NLP API.
CASA (IndoNLU)
An aspect-based sentiment analysis dataset consisting of around a thousand car reviews collected from multiple Indonesian online automobile platforms. Task is defined as a multi-label classification task, where each label represents a sentiment for a single aspect with three possible values: positive, negative, and neutral.
HoASA (IndoNLU)
An aspect-based sentiment analysis dataset consisting of hotel reviews collected from the hotel aggregator platform, AiryRooms. The dataset covers ten different aspects of hotel quality. There are four possible sentiment classes for each sentiment label: positive, negative, neutral, and positive-negative.
EmoT (IndoNLU)
Dataset used for emotion classification of tweets with 5 categories: anger, fear, happiness, love and sadness.
SmSA (IndoNLU)
Dataset is a collection of comments and reviews in Indonesian obtained from multiple online platforms. The text was crawled and then annotated by several Indonesian linguists to construct this dataset. There are three possible sentiments: positive, negative, and neutral.
Tunisian Arabish Corpus (TArC)
Dataset was extracted from social media and totals 43,313 tokens. The classification task consists of categorizing the text at the token level into three classes: arabizi, foreign and emotag.
Social Bias Inference Corpus (SBIC) 
Dataset contains 150k structured annotations of social media posts, covering over 34k implications about a thousand demographic groups.
Argumentation Annotated Student Peer Reviews Corpus (AASPRC)
Dataset contains 1,000 persuasive student peer reviews giving feedback on business models, annotated for their argumentative components and argumentative relations.
TweetEval
TweetEval consists of seven Twitter tasks, all framed as multi-class tweet classification: Emotion Recognition, Emoji Prediction, Irony Detection, Hate Speech Detection, Offensive Language Identification, Sentiment Analysis, and Stance Detection.
KINNEWS and KIRNEWS
There are 2 news classification datasets (KINNEWS and KIRNEWS), which were both collected from Rwanda and Burundi news websites and newspapers. In total, there are 21,268 and 4,612 news articles which are distributed across 14 and 12 categories for KINNEWS and KIRNEWS respectively.
TouTiao Text Classification for News Titles (TNEWS) (CLUE Benchmark)
Dataset consists of Chinese news published by TouTiao before May 2018, with a total of 73,360 titles. Each title is labeled with one of 15 news categories (finance, technology, sports, etc.) and the task is to predict which category the title belongs to.
IFLYTEK (CLUE Benchmark)
Dataset contains 17,332 annotated long texts of app descriptions, covering various application topics related to daily life. The task is to classify the descriptions into 119 categories.
RELX & RELX-Distant
Two datasets for cross-lingual relation classification are included: RELX and RELX-Distant. RELX contains 502 parallel sentences per language (5 languages in total), covering 18 directed relations plus no_relation, for a total of 37 categories. RELX-Distant was extracted from Wikipedia & Wikidata.
Bengali Hate Speech
Dataset contains Bengali text classified into 5 categories: personal hate, political hate, religious hate, geopolitical hate, & gender abusive hate.
NatCat
Dataset contains naturally annotated category-text pairs for training text classifiers derived from 3 sources: Wikipedia, Reddit, and Stack Exchange.
Offensive Language Identification Dataset (OLID)
Dataset contains a collection of 14,200 annotated English tweets using an annotation model that encompasses three levels: offensive language detection, categorization of offensive language, and offensive language target identification.
English Possible Idiomatic Expressions (EPIE)
Dataset containing 25,206 sentences labelled with lexical instances of 717 idiomatic expressions.
SFU Opinion and Comments Corpus (SOCC)
Dataset contains 10,339 opinion articles (editorials, columns, and op-eds) together with their 663,173 comments from 303,665 comment threads, from the main Canadian daily in English, The Globe and Mail, from January 2012 to December 2016. In addition, there is an annotated subset measuring toxicity, negation and its scope, and appraisal, containing 1,043 annotated comments in response to 10 different articles covering a variety of subjects: technology, immigration, terrorism, politics, budget, social issues, religion, property, and refugees.
Constructive Comments Corpus (C3)
Dataset is a subset of comments from the SFU Opinion and Comments Corpus. This subset, the Constructive Comments Corpus (C3), consists of 12,000 comments annotated by crowdworkers.
LEDGAR
LEDGAR is a multilabel corpus of legal provisions in contracts suited for text classification in the legal domain (legaltech). It features over 1.8M provisions and a set of over 180K labels. A smaller, cleaned version of the corpus is also available.
CoDEx 
Three graph datasets containing positive and hard negative triples, entity types, entity and relation descriptions, and Wikipedia page extracts for entities.
Wiki-CS
Dataset consists of nodes corresponding to Computer Science articles, with edges based on hyperlinks and 10 classes representing different branches of the field.
Vietnamese Students’ Feedback Corpus (UIT-VSFC)
Dataset contains over 16,000 sentences which are human-annotated with two different tasks: sentiment-based and topic-based classifications.
Intonation-Aided Intention Identification for Korean (3i4K)
Dataset contains a seven-class annotated corpus of single text utterances/intents in conversation.
Vietnamese Social Media Emotion Corpus (UIT-VSMEC)
Dataset contains 6,927 human-annotated sentences with six emotion labels, contributing to emotion recognition research in Vietnamese.
HoC (Hallmarks of Cancer)
Dataset consists of 1,852 PubMed publication abstracts manually annotated by experts according to the Hallmarks of Cancer taxonomy. The taxonomy consists of 37 classes in a hierarchy.
TRACT: Tweets Reporting Abuse Classification Task Corpus
Dataset used for a multi-class classification task involving three classes of tweets that mention abuse reporting: "report" (annotated as 1); "empathy" (annotated as 2); and "general" (annotated as 3).
Korean Hate Speech Dataset
Dataset contains ~9.4K manually labeled entertainment news comments for identifying Korean toxic speech.
SegmentedTables & LinkedResults
Dataset contains annotations of dataset mentions in captions, the type of table (leaderboard, ablation, irrelevant), and ground-truth cell annotations with the classes: dataset, metric, paper model, cited model, meta and task.
News Category Dataset
Dataset contains around 200k news headlines from the year 2012 to 2018 obtained from HuffPost.
UrbanSound & UrbanSound8K
UrbanSound: Dataset contains 1,302 labeled sound recordings. Each recording is labeled with the start and end times of sound events from 10 classes: air_conditioner, car_horn, children_playing, dog_bark, drilling, engine_idling, gun_shot, jackhammer, siren, and street_music. UrbanSound8K: Dataset contains 8,732 labeled sound excerpts (<=4s) of urban sounds from the same 10 classes.
Polusa
Dataset contains 0.9M articles covering policy topics published between Jan. 2017 and Aug. 2019 by 18 news outlets representing the political spectrum.
Civil Comments
Dataset contains the archive of the Civil Comments platform. Dataset was annotated for toxicity.
SciCite
Dataset used for classifying citation intents in academic papers. The main citation intent label for each JSON object is specified with the label key, while the citation context is specified with the context key.
Humicroedit
Dataset contains 15,095 edited news headlines and their numerically assessed humor.
FakeNewsNet
Repo contains two datasets with news content, social context, and spatiotemporal information from Politifact and Gossipcop.
LIAR Dataset
Dataset contains 12.8K manually labeled short statements in various contexts from POLITIFACT.COM, which provides a detailed analysis report and links to source documents for each case.
Fact Extraction and Verification (FEVER)
Dataset contains 185,445 claims generated by altering sentences extracted from Wikipedia and subsequently verified without knowledge of the sentence they were derived from. The claims are classified as Supported, Refuted or NotEnoughInfo.
ColBERT
Dataset contains 200k short texts (100k positive, 100k negative). Used for humor detection.
GoEmotions
Dataset contains 58K carefully curated Reddit comments labeled for 27 emotion categories: admiration, amusement, anger, annoyance, approval, caring, confusion, curiosity, desire, disappointment, disapproval, disgust, embarrassment, excitement, fear, gratitude, grief, joy, love, nervousness, optimism, pride, realization, relief, remorse, sadness, & surprise.
Cyberbullying Detection (CBD)
Dataset contains annotated tweets that identify harmful or non-harmful content.
Hate Speech Dataset
Dataset contains tweets that are labeled as hate speech or non-hate speech. Collected during the 2016 Philippine Presidential Elections.
Dengue Dataset
Dataset for multi-class classification of tweets into 5 classes: absent, dengue, health, mosquito & sick.
BuzzFace
Dataset focused on news stories (which are annotated for veracity) posted to Facebook during September 2016, consisting of nearly 1.7 million Facebook comments discussing the news content, Facebook plugin comments, Disqus plugin comments, and associated webpage content of the news articles.
Some Like it Hoax
Dataset contains 15,500 posts from 32 pages (14 conspiracy and 18 scientific).
BARD Bangla Article Classifier
A large corpus of Bangla documents classified into 5 classes: sports, state, economy, entertainment, and international.
BanFakeNews
A dataset for detecting fake news in Bangla. News articles were scraped from news portals in Bangladesh.
SemEval-2019 Task 6 
Dataset contains tweets labeled as either offensive or not offensive (Sub-task A), with offensive tweets further classified into categories (Sub-tasks B – C).
Ohsumed Dataset
Dataset containing references from MEDLINE, the on-line medical information database, consisting of titles and/or abstracts from 270 medical journals over a five-year period (1987-1991).
The EUR-Lex Dataset
Dataset is a collection of documents about European Union law. It contains many different types of documents, including treaties, legislation, case-law and legislative proposals, which are indexed with almost 4,000 labels.
PubMed 200k RCT Dataset
Dataset is based on PubMed for sequential sentence classification. The dataset consists of approximately 200,000 abstracts of randomized controlled trials, totaling 2.3 million sentences.
MATINF
A labeled dataset for classification, question answering and summarization. MATINF contains 1.07 million question-answer pairs with human-labeled categories and user-generated question descriptions.
NELA-GT-2019
Dataset contains 1.12M news articles from 260 sources collected between January 1st 2019 and December 31st 2019. Included are source-level ground truth labels from 7 different assessment sites.
Book Depository Dataset
Dataset contains books from bookdepository.com, not the actual content of the book but a list of metadata like title, description, dimensions, category and others.
Arabic Jordanian General Tweets (AJGT)
Dataset consists of 1,800 tweets annotated as positive or negative, written in Modern Standard Arabic (MSA) or Jordanian dialect.
Content-Based Categorized Dataset
Dataset contains 996 web pages that were extracted from the ArabicWeb16 dataset and labeled.
ASTD: Arabic Sentiment Tweets Dataset
Dataset contains over 10k Arabic sentiment tweets classified into 4 classes: subjective positive, subjective negative, subjective mixed, and objective.
Webis-CLS-10
The Cross-Lingual Sentiment (CLS) dataset comprises about 800,000 Amazon product reviews in the 4 languages: English, German, French, and Japanese.
SemEval-2016 Task 4
Dataset contains 5 subtasks involving the sentiment analysis of tweets.
ArguAna TripAdvisor Corpus
Dataset contains 2,100 hotel reviews balanced with respect to the reviews’ sentiment scores. Reviews are segmented into subsentence-level statements that have been manually classified as a fact, a positive, or a negative opinion.
Arabic Violence Twitter Corpus
Annotated Arabic tweets which mention a violent act. Tweets were classified into 8 classes: Crime, Accident, Crisis, Conflict, Human Rights Abuse, Violence, Opinion, or other. Requires using Twitter API to match IDs with tweets for retrieval.
AG News
Dataset contains more than 1 million news articles for topic classification. The 4 classes are: World, Sports, Business, and Sci/Tech.
Excitement Datasets
Datasets contain negative feedback from customers in which they state reasons for dissatisfaction with a given company. The datasets are available in English and Italian.
Large Movie Review Dataset - IMDb
Dataset contains 25,000 highly polar movie reviews for training, and 25,000 for testing.
Wisesight Sentiment Corpus
Dataset contains around 26,700 messages in Thai language from various social media with human-annotated sentiment classification (positive, neutral, negative, and question).
Abductive Natural Language Inference (aNLI)
Dataset poses a binary classification task: the goal is to pick the most plausible explanatory hypothesis given two observations from narrative contexts. It contains 20k commonsense narrative contexts and 200k explanations.
EmoBank
Dataset is a large-scale text corpus manually annotated with emotion according to the psychological Valence-Arousal-Dominance scheme.
Irony Sarcasm Analysis Corpus
Dataset contains tweets in 4 subgroups: irony, sarcasm, regular and figurative. Requires using Twitter API in order to obtain tweets.
Sentiment Corpus of App Reviews with Fine-grained Annotations in German (SCARE)
Dataset consists of fine-grained annotations for mobile application reviews from the Google Play Store. For each user review the mentioned application aspects, i.e., the design or the usability, as well as subjective phrases, which evaluate these aspects, are annotated. In addition, the polarity (positive, negative or neutral) of each subjective phrase is recorded as well as the relationship of an aspect to the main app in discussion. Requires emailing source for password to retrieve data.
Affective Text
Classification of emotions in 250 news headlines. Categories: anger, disgust, fear, joy, happiness, sadness, surprise.
Classify Emotional Relationships of Fictional Characters
Dataset contains 19 short stories that are shorter than 1,500 words, and depict at least four different characters.
DailyDialog
A manually labelled conversations dataset. Categories: no emotion, anger, disgust, fear, happiness, sadness, surprise.
Dataset for Intent Classification and Out-of-Scope Prediction
Dataset is a benchmark for evaluating intent classification systems for dialog systems / chatbots in the presence of out-of-scope queries.
Dutch Book Reviews
Dataset contains book reviews along with associated binary sentiment polarity labels.
Emotion-Stimulus
Dataset annotated with both the emotion and the stimulus using FrameNet’s emotions-directed frame. It contains 820 sentences with both cause and emotion and 1,594 sentences marked with their emotion tag. Categories: happiness, sadness, anger, fear, surprise, disgust and shame.
Event-focused Emotion Corpora for German and English
German and English emotion corpora for emotion classification, annotated with crowdsourcing in the style of the ISEAR resources.
Social Media Mining for Health (SMM4H)
Dataset contains medication-related text classification and concept normalization data from Twitter.
Switchboard Dialogue Act Corpus (SwDA)
A subset of the Switchboard-1 corpus consisting of 1,155 conversations and 42 tags.
The Emotion in Text
Dataset of tweets labelled with emotion. Categories: empty, sadness, enthusiasm, neutral, worry, love, fun, hate, happiness, relief, boredom, surprise, anger.
Amazon Fine Food Reviews
Dataset consists of reviews of fine foods from Amazon.
Amazon Reviews
US product reviews from Amazon.
Automated Essay Scoring
Dataset contains student-written essays with scores.
Blogger Authorship Corpus
Blog post entries of 19,320 people from blogger.com.
Buzz in Social Media Dataset
Data from Twitter and Tom's Hardware. This dataset focuses on specific buzz topics being discussed on those sites.
Car Evaluation Dataset
Car properties and their overall acceptability.
ClueWeb Corpora
Annotated web pages from the ClueWeb09 and ClueWeb12 corpora.
Corporate Messaging Corpus
Dataset contains statements classified as information, dialog (replies to users, etc.), or action (messages that ask for votes or ask users to click on links, etc.).
DEXTER Dataset
The task is to determine, from the given features, which articles are about corporate acquisitions.
Google Books N-grams
N-grams from a very large corpus of books.
Hate Speech Identification Dataset
Dataset contains lexicons and notebooks with content that is racist, sexist, homophobic, and offensive in general.
Home Depot Product Search Relevance
Dataset contains a number of products and real customer search terms from Home Depot's website.
Legal Case Reports
Federal Court of Australia cases from 2006 to 2009.
Ling-Spam Dataset
Corpus contains both legitimate and spam emails.
MovieLens
Dataset contains 22,000,000 ratings and 580,000 tags applied to 33,000 movies by 240,000 users.
MovieTweetings
Movie rating dataset based on public and well-structured tweets.
Paraphrase and Semantic Similarity in Twitter (PIT)
Dataset focuses on whether tweets have (almost) the same meaning/information or not.
Personae Corpus
Collected for experiments in Authorship Attribution and Personality Prediction. Consists of 145 Dutch-language essays.
Relationship and Entity Extraction Evaluation Dataset (RE3D)
Entity and Relation marked data from various news and government sources.
Reuters-21578 Benchmark Corpus
Dataset is a collection of 10,788 documents from the Reuters financial newswire service, partitioned into a training set with 7,769 documents and a test set with 3,019 documents.
Sentiment Labeled Sentences Dataset
Dataset contains 3,000 sentiment labeled sentences.
Sentiment140
Tweet data from 2009 including original text, time stamp, user and sentiment.
Short Answer Scoring
Student-written short-answer responses.
Skytrax User Reviews Dataset
User reviews of airlines, airports, seats, and lounges from Skytrax.
SMS Spam Collection Dataset
Dataset contains SMS spam messages.
Spambase Dataset
Dataset contains spam emails.
Ten Thousand German News Articles Dataset (10kGNAD)
Dataset consists of 10,273 German-language news articles from an Austrian online newspaper categorized into nine topics.
The Stanford Sentiment Treebank (SST)
Sentence sentiment classification of movie reviews.
Twenty Newsgroups Dataset
Dataset is a collection of newsgroup documents used for classification tasks.
Twitter Dataset for Arabic Sentiment Analysis
Dataset contains Arabic tweets.
Twitter US Airline Sentiment
Dataset contains airline-related tweets that were labeled with positive, negative, and neutral sentiment.
Web of Science Dataset
Hierarchical Datasets for Text Classification.
Yelp Open Dataset
Dataset containing millions of reviews on Yelp. In addition it contains business data including location data, attributes, and categories.
YouTube Comedy Slam Preference Dataset
User vote data for pairs of videos shown on YouTube. Users voted on funnier videos.
The BrWaC (Brazilian Portuguese Web as Corpus)
This dataset is a large corpus constructed following the WaCky framework and made public for research purposes.
BlogSet-BR
This dataset is a collection of blog posts crawled from the Blogspot platform, containing texts by Brazilian authors.
Datasets of Neuropsychological Language Tests in Brazilian Portuguese (DNLT-BP)
This dataset contains data collected from participants in clinical or academic studies and research who read and signed the Informed Consent Form. The research was evaluated and approved by the Research Ethics Committees of the institutions to which they are linked.
Historical Portuguese Corpora (HPC)
Dataset is a sub-project of the Historical Dictionary of Brazilian Portuguese project, which is funded by CNPq, Brazil. The HPC project develops tools and resources for manipulating historical corpora and managing historical dictionaries. The tools and resources were released into the public domain.
TweetSentBR
This dataset for sentiment polarity classification contains 800k tweets in Portuguese divided into positive, negative, and neutral classes.
B2W-Reviews01
This dataset contains about 130k customer reviews of e-commerce products, extracted from Americanas.com between Jan and May 2018. It includes annotated data from customer profiles, such as gender, age, and geographic location.
Mercadolibre Data Challenge 2019
This dataset was used in the MercadoLibre Data Challenge and contains multi-language product classification data from MercadoLibre.com.
B5 Corpus
Dataset is a collection of Facebook posts, including information about brazilian authors, like gender, age, personality score (Based in B5 test), education level, politic position, religious, and others.
Sentimental LIAR
Sentimental LIAR dataset is a modified and further extended version of the original LIAR dataset. It was modified to be a binary-label dataset that was then extended by adding sentiments derived using the Google NLP API.
CASA (IndoNLU)
An aspect-based sentiment analysis dataset consisting of around a thousand car reviews collected from multiple Indonesian online automobile platforms. Task is defined as a multi-label classification task, where each label represents a sentiment for a single aspect with three possible values: positive, negative, and neutral.
HoASA (IndoNLU)
An aspect-based sentiment analysis dataset consisting of hotel reviews collected from the hotel aggregator platform, AiryRooms. The dataset covers ten different aspects of hotel quality. There are four possible sentiment classes for each sentiment label: positive, negative, neutral, and positive-negative.
EmoT (IndoNLU)
Dataset used for emotion classification of tweets with 5 categories: anger, fear, happiness, love and sadness.
SmSA (IndoNLU)
Dataset is a collection of comments and reviews in Indonesian obtained from multiple online platforms. The text was crawled and then annotated by several Indonesian linguists to construct this dataset. There are three possible sentiments: positive, negative, and neutral.
Tunisian Arabish Corpus (TArC)
Dataset has been extracted from social media for an amount of 43,313 tokens. The classification task consists in categorizing the text at the token level into three classes: arabizi, foreign and emotag.
Social Bias Inference Corpus (SBIC) 
Dataset contains 150k structured annotations of social media posts, covering over 34k implications about a thousand demographic groups.
Argumentation Annotated Student Peer Reviews Corpus (AASPRC)
Dataset contains 1,000 persuasive student peer reviews about business model feedbacks annotated for their argumentative components and argumentative relations.
TweetEval
TweetEval consists of seven tasks in Twitter, all framed as multi-class tweet classification. Emotion Recognition, Emoji Prediction, Irony Detection, Hate Speech Detection, Offensive Language Identification, Sentiment Analysis, & Stance Detection.
KINNEWS and KIRNEWS
There are 2 news classification datasets (KINNEWS and KIRNEWS), which were both collected from Rwanda and Burundi news websites and newspapers. In total, there are 21,268 and 4,612 news articles which are distributed across 14 and 12 categories for KINNEWS and KIRNEWS respectively.
TouTiao Text Classification for News Titles (TNEWS) (CLUE Benchmark)
Dataset consists of Chinese news published by TouTiao before May 2018, with a total of 73,360 titles. Each title is labeled with one of 15 news categories (finance, technology, sports, etc.) and the task is to predict which category the title belongs to.
IFLYTEK (CLUE Benchmark)
Dataset contains 17,332 long text annotation data about app application descriptions, including various application topics related to daily life. The task is to classify the descriptions from 119 categories.
RELX & RELX-Distant
Two datasets for cross-lingual relation classification are included: RELX and RELXDistant. RELX contains 502 parallel sentences per language (total of 5 languages) with 18 relations with direction and no_relation in total of 37 categories. RELX-Distant was extracted from Wikipedia & Wikidata.
Bengali Hate Speech
Dataset contains Bengali text classified into 5 categories: personal hate, political hate, religious hate, geopolitical hate, & gender abusive hate.
NatCat
Dataset contains naturally annotated category-text pairs for training text classifiers derived from 3 sources: Wikipedia, Reddit, and Stack Exchange.
Offensive Language Identification Dataset (OLID)
Dataset contains a collection of 14,200 annotated English tweets using an annotation model that encompasses three levels: offensive language detection, categorization of offensive language, and offensive language target identification.
English Possible Idiomatic Expressions (EPIE)
Dataset containing 25,206 sentences labelled with lexical instances of 717 idiomatic expressions.
SFU Opinion and Comments Corpus (SOCC)
Dataset contains 10,339 opinion articles (editorials, columns, and op-eds) together with their 663,173 comments from 303,665 comment threads, from the main Canadian daily in English, The Globe and Mail, from January 2012 to December 2016. In addition there's a subset annotated corpus measuring toxicity, negation and its scope, and appraisal containing 1,043 annotated comments in responses to 10 different articles covering a variety of subjects: technology, immigration, terrorism, politics, budget, social issues, religion, property, and refugees.
Constructive Comments Corpus (C3)
Dataset is a subset of comments from the SFU Opinion and Comments Corpus. This subset, the Constructive Comments Corpus (C3) consists of 12,000 comments annotated by crowdworkers.
LEDGAR
LEDGAR is a multilabel corpus of legal provisions in contracts suited for text classification in the legal domain (legaltech). It features over 1.8M+ provisions and a set of 180K+ labels. A smaller, cleaned version of the corpus is also available.
CoDEx 
Three graph datasets containing positive and hard negative triples, entity types, entity and relation descriptions, and Wikipedia page extracts for entities.
Wiki-CS
Dataset consists of nodes corresponding to Computer Science articles, with edges based on hyperlinks and 10 classes representing different branches of the field.
Vietnamese Students’ Feedback Corpus (UIT-VSFC)
Dataset contains over 16,000 sentences which are human-annotated with two different tasks: sentiment-based and topic-based classifications.
Intonation-Aided Intention Identification for Korean (3i4K)
Dataset contains seven class annotated corpus of single text utterances/intents in conversation.
Vietnamese Social Media Emotion Corpus (UIT-VSMEC)
Dataset contains 6,927 human-annotated sentences with six emotion labels, contributing to emotion recognition research in Vietnamese.
HoC (Hallmarks of Cancer)
Dataset consists of 1,852 PubMed publication abstracts manually annotated by experts according to the Hallmarks of Cancer taxonomy. The taxonomy consists of 37 classes in a hierarchy.
TRACT: Tweets Reporting Abuse Classification Task Corpus
Dataset used for multi-class classification task involving three classes of tweets that mention abuse reportings: "report" (annotated as 1); "empathy" (annotated as 2); and "general" (annotated as 3).
Korean Hate Speech Dataset
Dataset contains ~9,4K manually labeled entertainment news comments for identifying Korean toxic speech.
SegmentedTables & LinkedResults
Dataset mentions in captions, the type of table (leaderboard, ablation, irrelevant) and ground truth cell annotations into classes: dataset, metric, paper model, cited model, meta and task.
News Category Dataset
Dataset contains around 200k news headlines from the year 2012 to 2018 obtained from HuffPost.
UrbanSound & UrbanSound8K
UrbanSound: Dataset contains 1,302 labeled sound recordings. Each recording is labeled with the start and end times of sound events from 10 classes: air_conditioner, car_horn, children_playing, dog_bark, drilling, enginge_idling, gun_shot, jackhammer, siren, and street_music. UrbanSound8K: Dataset contains 8,732 labeled sound excerpts (<=4s) of urban sounds from 10 classes: air_conditioner, car_horn, children_playing, dog_bark, drilling, enginge_idling, gun_shot, jackhammer, siren, and street_music.
Polusa
Dataset contains 0.9M articles covering policy topics published between Jan. 2017 and Aug. 2019 by 18 news outlets representing the political spectrum.
Civil Comments
Dataset contains the archive of the Civil Comments platform. Dataset was annotated for toxicity.
SciCite
Dataset used for classifying citation intents in academic papers. The main citation intent label for each JSON object is specified with the label key, while the citation context is specified with a context key.
Humicroedit
Dataset contains 15,095 edited news headlines and their numerically assessed humor.
FakeNewsNet
Repo contains two datasets with news content, social context, and spatiotemporal information from Politifact and Gossipcop.
LIAR Dataset
Dataset contains 12.8K manually labeled short statements in various contexts from POLITIFACT.COM, which provides a detailed analysis report and links to source documents for each case.
Fact Extraction and Verification (FEVER)
Dataset contains 185,445 claims generated by altering sentences extracted from Wikipedia and subsequently verified without knowledge of the sentence they were derived from. The claims are classified as supported, refuted, or not enough info.
ColBERT
Dataset contains 200k short texts (100k positive, 100k negative). Used for humor detection.
GoEmotions
Dataset contains 58K carefully curated Reddit comments labeled for 27 emotion categories: admiration, amusement, anger, annoyance, approval, caring, confusion, curiosity, desire, disappointment, disapproval, disgust, embarrassment, excitement, fear, gratitude, grief, joy, love, nervousness, optimism, pride, realization, relief, remorse, sadness, & surprise.
Cyberbullying Detection (CBD)
Dataset contains annotated tweets that identify harmful or non-harmful content.
Hate Speech Dataset
Dataset contains tweets that are labeled as hate speech or non-hate speech. Collected during the 2016 Philippine Presidential Elections.
Dengue Dataset
Dataset for multi-class classification of tweets into 5 classes: absent, dengue, health, mosquito & sick.
BuzzFace
Dataset focuses on news stories (annotated for veracity) posted to Facebook during September 2016. It consists of nearly 1.7 million Facebook comments discussing the news content, Facebook plugin comments, Disqus plugin comments, and associated webpage content of the news articles.
Some Like it Hoax
Dataset contains 15,500 posts from 32 pages (14 conspiracy and 18 scientific).
BARD Bangla Article Classifier
A large corpus of Bangla documents classified into 5 classes: sports, state, economy, entertainment, and international.
BanFakeNews
A dataset for detecting fake news in Bangla. News articles were scraped from news portals in Bangladesh.
SemEval-2019 Task 6 
Dataset contains tweets labeled as either offensive or not offensive (Sub-task A), with offensive tweets further classified into categories (Sub-tasks B – C).
Ohsumed Dataset
Dataset containing references from MEDLINE, the on-line medical information database, consisting of titles and/or abstracts from 270 medical journals over a five-year period (1987-1991).
The EUR-Lex Dataset
Dataset is a collection of documents about European Union law. It contains many different types of documents, including treaties, legislation, case-law and legislative proposals, which are indexed with almost 4,000 labels.
PubMed 200k RCT Dataset
Dataset is based on PubMed for sequential sentence classification. The dataset consists of approximately 200,000 abstracts of randomized controlled trials, totaling 2.3 million sentences.
MATINF
A labeled dataset for classification, question answering and summarization. MATINF contains 1.07 million question-answer pairs with human-labeled categories and user-generated question descriptions.
NELA-GT-2019
Dataset contains 1.12M news articles from 260 sources collected between January 1st 2019 and December 31st 2019. Included are source-level ground truth labels from 7 different assessment sites.
Book Depository Dataset
Dataset contains books from bookdepository.com, not the actual content of the book but a list of metadata like title, description, dimensions, category and others.
Arabic Jordanian General Tweets (AJGT)
Dataset consists of 1,800 tweets, written in Modern Standard Arabic (MSA) or Jordanian dialect, annotated as positive or negative.
Content-Based Categorized Dataset
Dataset contains 996 web pages extracted from the ArabicWeb16 dataset and labeled.
ASTD: Arabic Sentiment Tweets Dataset
Dataset contains over 10k Arabic sentiment tweets classified into 4 classes: subjective positive, subjective negative, subjective mixed, and objective.
Webis-CLS-10
The Cross-Lingual Sentiment (CLS) dataset comprises about 800,000 Amazon product reviews in the 4 languages: English, German, French, and Japanese.
SemEval-2016 Task 4
Dataset contains 5 subtasks involving the sentiment analysis of tweets.
ArguAna TripAdvisor Corpus
Dataset contains 2,100 hotel reviews balanced with respect to the reviews’ sentiment scores. Reviews are segmented into subsentence-level statements that have been manually classified as a fact, a positive opinion, or a negative opinion.
Arabic Violence Twitter Corpus
Annotated Arabic tweets which mention a violent act. Tweets were classified into 8 classes: Crime, Accident, Crisis, Conflict, Human Rights Abuse, Violence, Opinion, or Other. Requires using Twitter API to match IDs with tweets for retrieval.
AG News
Dataset contains more than 1 million news articles for topic classification. The 4 classes are: World, Sports, Business, and Sci/Tech.
Excitement Datasets
Datasets contain negative feedback from customers in which they state reasons for dissatisfaction with a given company. The datasets are available in English and Italian.
Large Movie Review Dataset - IMDb
Dataset contains 25,000 highly polar movie reviews for training, and 25,000 for testing.
Wisesight Sentiment Corpus
Dataset contains around 26,700 messages in Thai language from various social media with human-annotated sentiment classification (positive, neutral, negative, and question).
Abductive Natural Language Inference (aNLI)
Dataset poses a binary classification task: the goal is to pick the most plausible explanatory hypothesis given two observations from narrative contexts. It contains 20k commonsense narrative contexts and 200k explanations.
EmoBank
Dataset is a large-scale text corpus manually annotated with emotion according to the psychological Valence-Arousal-Dominance scheme.
Irony Sarcasm Analysis Corpus
Dataset contains tweets in 4 subgroups: irony, sarcasm, regular and figurative. Requires using Twitter API in order to obtain tweets.
Sentiment Corpus of App Reviews with Fine-grained Annotations in German (SCARE)
Dataset consists of fine-grained annotations for mobile application reviews from the Google Play Store. For each user review the mentioned application aspects, i.e., the design or the usability, as well as subjective phrases, which evaluate these aspects, are annotated. In addition, the polarity (positive, negative or neutral) of each subjective phrase is recorded as well as the relationship of an aspect to the main app in discussion. Requires emailing source for password to retrieve data.
Affective Text
Classification of emotions in 250 news headlines. Categories: anger, disgust, fear, joy, happiness, sadness, surprise.
Classify Emotional Relationships of Fictional Characters
Dataset contains 19 short stories that are shorter than 1,500 words, and depict at least four different characters.
DailyDialog
A manually labelled conversations dataset. Categories: no emotion, anger, disgust, fear, happiness, sadness, surprise.
Dataset for Intent Classification and Out-of-Scope Prediction
Dataset is a benchmark for evaluating intent classification systems for dialog systems / chatbots in the presence of out-of-scope queries.
Dutch Book Reviews
Dataset contains book reviews along with associated binary sentiment polarity labels.
Emotion-Stimulus
Dataset annotated with both the emotion and the stimulus using FrameNet’s emotions-directed frame. 820 sentences with both cause and emotion and 1594 sentences marked with their emotion tag. Categories: happiness, sadness, anger, fear, surprise, disgust and shame.
Event-focused Emotion Corpora for German and English
German and English emotion corpora for emotion classification, annotated with crowdsourcing in the style of the ISEAR resources.
Social Media Mining for Health (SMM4H)
Dataset contains Twitter data for medication-related text classification and concept normalization.
Switchboard Dialogue Act Corpus (SwDA)
A subset of the Switchboard-1 corpus consisting of 1,155 conversations and 42 tags.
The Emotion in Text
Dataset of tweets labelled with emotion. Categories: empty, sadness, enthusiasm, neutral, worry, love, fun, hate, happiness, relief, boredom, surprise, anger.
Amazon Fine Food Reviews
Dataset consists of reviews of fine foods from Amazon.
Amazon Reviews
US product reviews from Amazon.
Automated Essay Scoring
Dataset contains student-written essays with scores.
Blogger Authorship Corpus
Blog post entries of 19,320 people from blogger.com.
Buzz in Social Media Dataset
Data from Twitter and Tom's Hardware. This dataset focuses on specific buzz topics being discussed on those sites.
Car Evaluation Dataset
Car properties and their overall acceptability.
ClueWeb Corpora
Annotated web pages from the ClueWeb09 and ClueWeb12 corpora.
Corporate Messaging Corpus
Dataset contains statements classified as information, dialog (replies to users, etc.), or action (messages that ask for votes or ask users to click on links, etc.).
DEXTER Dataset
Task given is to determine, from features given, which articles are about corporate acquisitions.
Google Books N-grams
N-grams from a very large corpus of books.
Hate Speech Identification Dataset
Dataset contains lexicons and notebooks with content that is racist, sexist, homophobic, and offensive in general.
Home Depot Product Search Relevance
Dataset contains a number of products and real customer search terms from Home Depot's website.
Legal Case Reports
Federal Court of Australia cases from 2006 to 2009.
Ling-Spam Dataset
Corpus contains both legitimate and spam emails.
MovieLens
Dataset contains 22,000,000 ratings and 580,000 tags applied to 33,000 movies by 240,000 users.
MovieTweetings
Movie rating dataset based on public and well-structured tweets.
Paraphrase and Semantic Similarity in Twitter (PIT)
Dataset focuses on whether or not tweets have (almost) the same meaning/information.
Personae Corpus
Collected for experiments in Authorship Attribution and Personality Prediction. Consists of 145 Dutch-language essays.
Relationship and Entity Extraction Evaluation Dataset (RE3D)
Entity and Relation marked data from various news and government sources.
Reuters-21578 Benchmark Corpus
Dataset is a collection of 10,788 documents from the Reuters financial newswire service, partitioned into a training set with 7769 documents and a test set with 3019 documents.
Sentiment Labeled Sentences Dataset
Dataset contains 3,000 sentiment labeled sentences.
Sentiment140
Tweet data from 2009 including original text, time stamp, user and sentiment.
Short Answer Scoring
Student-written short-answer responses.
Skytrax User Reviews Dataset
User reviews of airlines, airports, seats, and lounges from Skytrax.
SMS Spam Collection Dataset
Dataset contains SMS spam messages.
Spambase Dataset
Dataset contains spam emails.
Ten Thousand German News Articles Dataset (10kGNAD)
Dataset consists of 10,273 German-language news articles from an Austrian online newspaper categorized into nine topics.
The Stanford Sentiment Treebank (SST)
Sentence sentiment classification of movie reviews.
Twenty Newsgroups Dataset
Dataset is a collection of newsgroup documents used for classification tasks.
Twitter Dataset for Arabic Sentiment Analysis
Dataset contains Arabic tweets.
Twitter US Airline Sentiment
Dataset contains airline-related tweets that were labeled with positive, negative, and neutral sentiment.
Web of Science Dataset
Hierarchical Datasets for Text Classification.
Yelp Open Dataset
Dataset containing millions of reviews on Yelp. In addition it contains business data including location data, attributes, and categories.
YouTube Comedy Slam Preference Dataset
User vote data for pairs of videos shown on YouTube. Users voted on funnier videos.
The BrWaC (Brazilian Portuguese Web as Corpus)
This dataset is a large corpus constructed in our lab following the Wacky framework, which was made public for research purposes.
BlogSet-BR
This dataset is a collection of blog posts crawled from the Blogspot platform, containing texts by Brazilian authors.
Datasets of Neuropsychological Language Tests in Brazilian Portuguese (DNLT-BP)
This dataset contains data collected from participants in clinical or academic studies and research who read and signed the Informed Consent Form. The research was evaluated and approved by the Research Ethics Committees of the institutions to which they are linked.
Historical Portuguese Corpora (HPC)
Dataset is a sub-project of the Historical Dictionary of Brazilian Portuguese project, which is funded by CNPq, Brazil. In the HPC project, tools and resources for manipulation of historical corpora and management of historical dictionaries are developed. The tools and resources were released into the public domain.
TweetSentBR
This dataset is for sentiment polarity classification and contains 800k tweets in Portuguese divided into positive, negative, and neutral classes.
B2W-Reviews01
This dataset contains reviews of e-commerce products: about 130k customer reviews extracted from Americanas.com between Jan and May 2018, including annotated data from customer profiles, like gender, age, and geographic location.
Mercadolibre Data Challenge 2019
This dataset was used in the MercadoLibre Data Challenge 2019 and contains multi-language product classification data from MercadoLibre.com.
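
Several entries in this list, SciCite for example, describe records as JSON objects with a label key and a text key. As a rough illustration only, the Python sketch below reads such records into (text, label) pairs and counts the labels. It assumes one JSON object per line and uses the key names "label" and "context" as given in the SciCite description; the sample sentences and label values are made up for the example and are not taken from the dataset.

```python
import json
from collections import Counter

# Hypothetical sample records (NOT real dataset rows). Assumes one JSON
# object per line, with the "label" and "context" keys described above.
SAMPLE_JSONL = """\
{"label": "background", "context": "Prior work has studied this problem extensively."}
{"label": "method", "context": "We follow the training procedure of the cited paper."}
{"label": "background", "context": "This effect was first reported a decade ago."}
"""


def read_examples(lines):
    """Yield (context, label) pairs from an iterable of JSON-lines strings."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        yield record["context"], record["label"]


# Parse the sample and count how often each label occurs.
examples = list(read_examples(SAMPLE_JSONL.splitlines()))
label_counts = Counter(label for _, label in examples)
print(label_counts)
```

To use this with a downloaded file, pass an open file handle to read_examples instead of the sample string, and check the dataset's own documentation for the actual key names before relying on them.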