Clustering Datasets
There are 9 clustering datasets in our directory, 1 of which is a benchmark. Each links to its source, paper, and download; browse the full list below or filter by language.
Clustering is the task of grouping similar items together without any predefined labels.
Updated June 2026
- Twenty Newsgroups Dataset | Tasks: Classification, Clustering | Language: English
- Examiner Pseudo-News Corpus | Tasks: Clustering, Events, Sentiment Analysis | Language: English
- ASU Twitter Dataset | Tasks: Clustering, Graph Analysis | Language: English
- Google Books N-grams | Tasks: Classification, Clustering | Language: Multi-Lingual
- SNAP Social Circles: Twitter Database | Tasks: Clustering, Graph Analysis | Language: English
- The Irish Times IRS | Tasks: Clustering, Events, Language Detection | Language: English
- Worldwide News - Aggregate of 20K Feeds | Tasks: Clustering, Events, Machine Translation | Language: Multi-Lingual
- Yahoo! Music User Ratings of Musical Artists | Tasks: Clustering, PCA | Language: English
- News Headlines Dataset for Sarcasm Detection | Tasks: Clustering, Events, Language Detection | Language: English | Benchmark
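
To make the idea of grouping items without predefined labels concrete, here is a minimal k-means sketch in plain Python. It is only an illustration, not the method used by any dataset listed here: the `kmeans` function, the toy points, and the choice of starting centroids are all assumptions made up for this example.

```python
import math

def kmeans(points, centroids, iterations=10):
    """Group points around the nearest centroid; no labels are used."""
    centroids = list(centroids)
    assignments = []
    for _ in range(iterations):
        # Assignment step: attach each point to its closest centroid.
        assignments = [
            min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(len(centroids)):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return assignments, centroids

# Toy data (hypothetical): two visibly separate groups of 2D points.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.8, 8.1)]
assignments, centroids = kmeans(points, centroids=[points[0], points[3]])
print(assignments)
```

With real text datasets such as those above, documents would first have to be turned into numeric vectors (for example word counts) before a distance-based method like this could be applied.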