Portuguese Datasets
This page catalogs 15 Portuguese datasets for NLP and machine learning. Portuguese is an official language of Brazil, Portugal, and several African nations, and its NLP coverage is growing.
Updated June 2026
- B5 Corpus: Classification (Portuguese)
- Historical Portuguese Corpora (HPC): Text Corpora, Text Classification (Portuguese)
- ITD - Dataset de Acordãos do STF de 2010 a 2018: Text Corpora (Portuguese)
- Rhetalho (Portuguese)
- Lex2Kids: Text Corpora (Portuguese)
- PortugueseGLUE: GLUE (Portuguese)
- TweetSentBR: Text Classification (Portuguese)
- Mercadolibre Data Challenge 2019: Text Classification (Portuguese, Spanish)
- CorpusTCC: Summarization (Portuguese)
- MilkQA: Question Answering (Portuguese)
- CC100-Portuguese: Text Corpora (Portuguese)
- HAREM: Named Entity Recognition (NER) (Portuguese)
- CAPES: Machine Translation (Portuguese, English)
- ShadenA/MathNet: Question Answering, Text Generation, Image To Text (EN, PT, ES)
- LEMAS-Project/LEMAS-Dataset-train: Text To Speech, Automatic Speech Recognition (IT, PT, ES)