Multi-Lingual Datasets
We catalog 41 Multi-Lingual datasets for NLP and machine learning. Browse the list below or narrow it down by task. Each entry gives the dataset name, its task labels, and its language category.
Updated June 2026
- RELX & RELX-Distant | Relation Classification | Multi-Lingual
- MTOP | Semantic Parsing | Multi-Lingual
- CC Net | Text Corpora | Multi-Lingual
- JW300 | Machine Translation | Multi-Lingual
- Tanzil | Machine Translation | Multi-Lingual
- VoxForge | Speech Corpora | Multi-Lingual
- ParaCrawl Corpus | Machine Translation | Multi-Lingual
- Leipzig Corpora Collection | Text Corpora | Multi-Lingual
- The EUR-Lex Dataset | Classification | Multi-Lingual
- BSNLP-2019 | Named Entity Recognition (NER), Entity Linking | Multi-Lingual
- The Cross-lingual Natural Language Inference corpus (XNLI) | Entailment | Multi-Lingual
- WikiAnn | Named Entity Recognition (NER) | Multi-Lingual
- Train-O-Matic Large | Word Sense Disambiguation | Multi-Lingual
- Train-O-Matic Small | Word Sense Disambiguation | Multi-Lingual
- OneSeC Small | Word Sense Disambiguation | Multi-Lingual
- Bible Corpus | Machine Translation | Multi-Lingual
- Bianet | Machine Translation | Multi-Lingual
- ECB Corpus | Text Corpora, Machine Translation | Multi-Lingual
- EMEA | Machine Translation | Multi-Lingual
- Eubookshop | Text Corpora, Machine Translation | Multi-Lingual
- WMT 14 English-German | Machine Translation | Multi-Lingual
- WMT 15 English-Czech | Machine Translation | Multi-Lingual
- WMT 19 Multiple Datasets | Text Corpora, Machine Translation | Multi-Lingual
- Books Corpus | Machine Translation | Multi-Lingual
- Parallel Meaning Bank | Text Corpora | Multi-Lingual
- OpenSubtitles | Dialogue | Multi-Lingual
- Web Inventory of Transcribed and Translated Talks (WIT3) | Machine Translation | Multi-Lingual
- CommonCrawl | Text Corpora | Multi-Lingual
- A Novel Approach to a Semantically-Aware Representation of Items (NASARI) | Semantic Textual Similarity | Multi-Lingual
- Code-Mixed-Dialog | Dialogue | Multi-Lingual
- Common Voice | Speech Recognition | Multi-Lingual
- DSL Corpus Collection (DSLCC) | Discriminating between similar languages | Multi-Lingual
- European Parliament Proceedings (Europarl) | Text Corpora, Machine Translation | Multi-Lingual
- DBpedia | Knowledge Base | Multi-Lingual
- Google Books N-grams | Classification, Clustering | Multi-Lingual
- Gutenberg Book Corpus | Text Corpora | Multi-Lingual
- Microsoft Speech Language Translation Corpus (MSLT) | Speech Recognition, Machine Translation | Multi-Lingual
- One Week of Global News Feeds | Text Corpora | Multi-Lingual
- The Winograd Schema Challenge | Coreference Resolution | Multi-Lingual
- VoxCeleb | Speech Recognition, Visual | Multi-Lingual
- Worldwide News - Aggregate of 20K Feeds | Clustering, Events, Machine Translation | Multi-Lingual
What tasks do Multi-Lingual datasets cover?
- Machine Translation (16)
- Text Corpora (10)
- Word Sense Disambiguation (3)
- Speech Recognition (3)
- Classification (2)
- Named Entity Recognition (NER) (2)
- Dialogue (2)
- Clustering (2)
- Relation Classification (1)
- Semantic Parsing (1)
- Speech Corpora (1)
- Entity Linking (1)
- Entailment (1)
- Semantic Textual Similarity (1)
- Discriminating between similar languages (1)
- Knowledge Base (1)
- Coreference Resolution (1)
- Visual (1)
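The per-task counts add up to more than 41 because a dataset with several task labels is counted once under each label. A minimal Python sketch of that tally follows, assuming the labels are comma-separated as in the listing. The `ENTRIES` sample and the `count_tasks` helper are hypothetical names used only for illustration and cover just four of the entries above.

```python
from collections import Counter

# Hypothetical sample: (dataset name, comma-separated task labels),
# copied from four entries in the listing above.
ENTRIES = [
    ("JW300", "Machine Translation"),
    ("ECB Corpus", "Text Corpora, Machine Translation"),
    ("Google Books N-grams", "Classification, Clustering"),
    ("Worldwide News - Aggregate of 20K Feeds",
     "Clustering, Events, Machine Translation"),
]


def count_tasks(entries):
    """Count how many datasets carry each task label."""
    counts = Counter()
    for _name, tasks in entries:
        # A dataset with several labels contributes one count to each of them.
        for task in tasks.split(","):
            counts[task.strip()] += 1
    return counts


task_counts = count_tasks(ENTRIES)
for task, n in task_counts.most_common():
    print(f"{task} ({n})")
```

Run against all 41 entries instead of this sample, the same loop would give the kind of per-task totals shown above.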