List of Summarization Datasets for Machine Learning Projects
High-quality datasets are the key to good performance in natural language processing (NLP) projects. We collected a list of NLP datasets for the summarization task to help you get started with your machine learning projects. Below you will find a large curated training base for summarization.
What is Summarization task?
Summarization is a natural language processing (NLP) task in which, given a document of arbitrary length, a summarizer must return a shorter, relevant subset of the input for a specific purpose.
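To make the task concrete, here is a minimal sketch of an extractive summarizer using a classic Luhn-style word-frequency heuristic. It is an illustration only, not tied to any dataset or model on this page: sentences are scored by the average corpus frequency of their words, and the top-scoring ones are returned in their original order.

```python
import re
from collections import Counter

def extractive_summary(document: str, num_sentences: int = 2) -> str:
    """Return the top-scoring sentences of `document` in original order.

    Sentences are scored by the average frequency of their words across
    the whole document, a classic Luhn-style extractive heuristic.
    """
    sentences = re.split(r"(?<=[.!?])\s+", document.strip())
    freq = Counter(re.findall(r"[a-z']+", document.lower()))
    # Score each sentence by mean word frequency (length-normalised).
    scored = []
    for i, sent in enumerate(sentences):
        tokens = re.findall(r"[a-z']+", sent.lower())
        if tokens:
            scored.append((sum(freq[t] for t in tokens) / len(tokens), i))
    # Pick the best sentences, then restore document order.
    top = sorted(i for _, i in sorted(scored, reverse=True)[:num_sentences])
    return " ".join(sentences[i] for i in top)
```

Abstractive summarization, by contrast, generates new text rather than selecting input sentences; most of the datasets below target that harder setting.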
Custom fine-tune with Summarization datasets
Metatext is a powerful no-code tool to train, tune, and integrate custom NLP models.
➡️ Try for free
Found 31 Summarization Datasets
Let’s get started!
WikiSummary
A summarization dataset extracted from Wikipedia.
IndoSum
Dataset for text summarization in Indonesian that is compiled from online news articles and publicly available.
Multi-Xscience
A multi-document summarization dataset created from scientific articles. Multi-XScience introduces a challenging multi-document summarization task: writing the related-work section of a paper based on its abstract and the articles it references.
NewSHead
Dataset contains 369,940 English news stories with 932,571 unique URLs: 359,940 stories for training, 5,000 for validation, and 5,000 for testing. Each news story contains at least three (and up to five) articles.
NCLS-Corpora
Contains two datasets for cross-lingual summarization (CLS): EN2ZHSUM, with 370,759 English-to-Chinese CLS pairs, and ZH2ENSUM, with 1,699,713 Chinese-to-English CLS pairs.
MEDIQA-Answer Summarization
Dataset containing question-driven summaries of answers to consumer health questions.
Wikipedia Current Events Portal (WCEP) Dataset
Dataset is used for multi-document summarization (MDS) and consists of short, human-written summaries about news events, obtained from the Wikipedia Current Events Portal (WCEP), each paired with a cluster of news articles associated with an event.
Gigaword
Headline-generation dataset built from a corpus of around 4 million article/headline pairs drawn from Gigaword.
Opinosis
Dataset contains sentences extracted from reviews for 51 topics. Topics and opinions are obtained from Tripadvisor, Edmunds.com and Amazon.com.
BillSum
Dataset contains summaries of US Congressional and California state bills.
SAMSum
Dataset contains over 16K chat dialogues with manually annotated summaries.
Annotated Enron Subject Line Corpus (AESLC)
Dataset contains email messages of employees in the Enron Corporation; the associated task is generating a subject line from the email body.
Multi-News
Dataset consists of news articles and human-written summaries of these articles from the site newser.com. Each summary is professionally written by editors and includes links to the original articles cited.
The New York Times Annotated Corpus
Dataset contains over 1.8 million articles written and published by the New York Times between January 1, 1987 and June 19, 2007 with article metadata provided by the New York Times Newsroom.
BigPatent
Dataset consists of 1.3 million records of U.S. patent documents along with human-written abstractive summaries.
LCSTS
Dataset constructed from the Chinese microblogging website Sina Weibo. It consists of over 2 million real Chinese short texts with short summaries given by the author of each text. Requires application.
Essex Arabic Summaries Corpus (EASC)
Dataset contains 153 Arabic articles and 765 human-generated extractive summaries of those articles. These summaries were generated using Mechanical Turk.
KALIMAT Multipurpose Arabic Corpus
Dataset contains 20,291 Arabic articles collected from the Omani newspaper Alwatan, with extractive single-document and multi-document system summaries and named-entity-annotated articles. The data has 6 categories: culture, economy, local news, international news, religion, and sports.
SCITLDR
Dataset combining TLDRs written by human experts with author-written TLDRs of computer science papers from OpenReview.
MLSUM
Dataset collected from online newspapers; it contains 1.5M+ article/summary pairs in 5 languages: French, German, Spanish, Russian, and Turkish.
Polish Summaries Corpus (PSC)
Dataset contains news articles and their summaries.
MATINF
A labeled dataset for classification, question answering, and summarization. MATINF contains 1.07 million question-answer pairs with human-labeled categories and user-generated question descriptions.
MultiLing Pilot 2011 Dataset
Dataset is derived from publicly available WikiNews English texts and is available in 7 languages: Arabic, Czech, English, French, Greek, Hebrew, and Hindi.
Webis-TLDR-17 Corpus
Dataset contains 3 million pairs of content and self-written summaries mined from Reddit. It is one of the first large-scale summarization datasets from the social media domain.
Webis-Snippet-20 Corpus
Dataset comprises four abstractive snippet datasets derived from ClueWeb09, ClueWeb12, and DMOZ descriptions. More than 10 million <webpage, abstractive snippet> pairs and 3.5 million <query, webpage, abstractive snippet> pairs were collected.
CASS
Dataset is composed of decisions made by the French Court of Cassation and summaries of these decisions written by lawyers.
How2
Dataset of instructional videos (about 2,000 hours of clips) covering a wide variety of topics, with word-level time alignments to the ground-truth English subtitles. A 300-hour subset was also translated into Portuguese subtitles.
X-Sum
The XSum dataset consists of 226,711 Wayback-archived BBC articles (2010 to 2017) covering a wide variety of domains: News, Politics, Sports, Weather, Business, Technology, Science, Health, Family, Education, Entertainment, and Arts.
Cornell Newsroom
Dataset contains 1.3 million articles and summaries written by authors and editors in the newsrooms of 38 major publications. The summaries are obtained from search and social metadata between 1998 and 2017.
WikiHow
Dataset contains article and summary pairs extracted and constructed from an online knowledge base written by different human authors.
CorpusTCC
This dataset contains scientific texts in the computer science field from the Brazilian academic community.
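Whichever dataset you train on, generated summaries are conventionally scored against reference summaries with ROUGE. Below is a simplified sketch of the ROUGE-1 F-score (unigram overlap on whitespace tokens); real evaluations typically use the full ROUGE toolkit, which adds stemming, ROUGE-2, and ROUGE-L variants.

```python
from collections import Counter

def rouge_1(candidate: str, reference: str) -> float:
    """Unigram-overlap F1 between a candidate summary and a reference,
    i.e. a simplified ROUGE-1 F-score on lowercased whitespace tokens."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Clipped overlap: each unigram counts at most as often as in either side.
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)
```

Identical summaries score 1.0, disjoint ones 0.0, and partial overlap falls in between, which makes the metric easy to sanity-check on a handful of pairs before running a full evaluation.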
Classify and extract text 10x better and faster 🦾
Metatext helps you classify and extract information from text and documents using customized language models built with your data and expertise.