Arabic Datasets
This directory catalogs 22 datasets for NLP and machine learning in Arabic, a morphologically rich language spoken across the Middle East and North Africa. Browse the list below or narrow it down by task.
Updated June 2026
- Arabic Dataset for Commonsense Validation | Commonsense Validation | Arabic
- CC100-Arabic | Text Corpora | Arabic
- CohereLabs/include-base-44 | Multiple Choice | SQ, AR, HY
- CohereLabs/xP3x | Other | AF, AR, AZ
- Essex Arabic Summaries Corpus (EASC) | Summarization | Arabic
- KALIMAT Multipurpose Arabic Corpus | Summarization, Named Entity Recognition (NER), Part-of-Speech (POS) | Arabic
- Saudi Newspapers Corpus | Text Corpora | Arabic
- Arabic in Business and Management Corpora (ABMC) | Text Corpora | Arabic
- Arabic Jordanian General Tweets (AJGT) | Classification, Sentiment Analysis | Arabic
- Khaleej-2004 Corpus | Text Corpora | Arabic
- Watan-2004 Corpus | Text Corpora | Arabic
- Parallel Arabic DIalectal Corpus (PADIC) | Text Corpora | Arabic
- legacy-datasets/mc4 | Text Generation, Fill Mask | AF, AM, AR
- legacy-datasets/common_voice | Automatic Speech Recognition | AB, AR, AS
- SemEvalCQA | Question Answering, Reading Comprehension | Arabic, English
- Twitter Dataset for Arabic Sentiment Analysis | Classification, Sentiment Analysis | Arabic
- neulab/PangeaInstruct | Visual Question Answering, Question Answering | AM, AR, BG
- wikimedia/wikisource | Text Generation, Fill Mask | AR, AS, AZ
- Helsinki-NLP/open_subtitles | Translation | AF, AR, BG
- miracl/miracl | Text Retrieval | AR, BN, EN
- ClusterlabAi/101_billion_arabic_words_dataset | Text Generation | AR
- papluca/language-identification | Text Classification | AR, BG, DE
What tasks do Arabic datasets cover?
- Text Corpora (6)
- Text Generation (3)
- Summarization (2)
- Classification (2)
- Sentiment Analysis (2)
- Fill Mask (2)
- Question Answering (2)
- Commonsense Validation (1)
- Multiple Choice (1)
- Other (1)
- Named Entity Recognition (NER) (1)
- Part-of-Speech (POS) (1)
- Automatic Speech Recognition (1)
- Reading Comprehension (1)
- Visual Question Answering (1)
- Translation (1)
- Text Retrieval (1)
- Text Classification (1)
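The per-task counts can be reproduced from the dataset list above, and the same data can be used to narrow the list down by task. The following is a minimal sketch, not part of this directory's tooling: the names `DATASETS`, `count_tasks`, and `filter_by_task` are hypothetical, and the entries are transcribed by hand from the list on this page.

```python
from collections import Counter

# (name, tasks) pairs transcribed by hand from the list on this page.
# DATASETS is a hypothetical name used only for this illustration.
DATASETS = [
    ("Arabic Dataset for Commonsense Validation", ["Commonsense Validation"]),
    ("CC100-Arabic", ["Text Corpora"]),
    ("CohereLabs/include-base-44", ["Multiple Choice"]),
    ("CohereLabs/xP3x", ["Other"]),
    ("Essex Arabic Summaries Corpus (EASC)", ["Summarization"]),
    ("KALIMAT Multipurpose Arabic Corpus",
     ["Summarization", "Named Entity Recognition (NER)", "Part-of-Speech (POS)"]),
    ("Saudi Newspapers Corpus", ["Text Corpora"]),
    ("Arabic in Business and Management Corpora (ABMC)", ["Text Corpora"]),
    ("Arabic Jordanian General Tweets (AJGT)", ["Classification", "Sentiment Analysis"]),
    ("Khaleej-2004 Corpus", ["Text Corpora"]),
    ("Watan-2004 Corpus", ["Text Corpora"]),
    ("Parallel Arabic DIalectal Corpus (PADIC)", ["Text Corpora"]),
    ("legacy-datasets/mc4", ["Text Generation", "Fill Mask"]),
    ("legacy-datasets/common_voice", ["Automatic Speech Recognition"]),
    ("SemEvalCQA", ["Question Answering", "Reading Comprehension"]),
    ("Twitter Dataset for Arabic Sentiment Analysis", ["Classification", "Sentiment Analysis"]),
    ("neulab/PangeaInstruct", ["Visual Question Answering", "Question Answering"]),
    ("wikimedia/wikisource", ["Text Generation", "Fill Mask"]),
    ("Helsinki-NLP/open_subtitles", ["Translation"]),
    ("miracl/miracl", ["Text Retrieval"]),
    ("ClusterlabAi/101_billion_arabic_words_dataset", ["Text Generation"]),
    ("papluca/language-identification", ["Text Classification"]),
]


def count_tasks(datasets):
    """Count how many datasets are tagged with each task."""
    counts = Counter()
    for _name, tasks in datasets:
        counts.update(tasks)
    return counts


def filter_by_task(datasets, task):
    """Return the names of datasets tagged with the given task, in list order."""
    return [name for name, tasks in datasets if task in tasks]


task_counts = count_tasks(DATASETS)

# Print tasks from most to least common, in the "Task (n)" style used above.
for task, n in task_counts.most_common():
    print(f"{task} ({n})")
```

Because a dataset can carry several task tags, the counts sum to more than 22. Calling `filter_by_task(DATASETS, "Summarization")` returns the two summarization entries, EASC and KALIMAT.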