commoncrawl/CommonLID

Text Classification | ACE, ACF, AEB

The commoncrawl/CommonLID dataset is a text classification resource published by commoncrawl in 2026, with ACE, ACF, and AEB among its listed languages. It has 151 downloads and 53 likes. Its license is listed as "other", and it falls in the 100K<n<1M size category.

About commoncrawl/CommonLID

CommonLID is a community-created language identification (LID) benchmark. It consists of web text manually annotated with the language it is written in. CommonLID contains annotations for 109 languages, where 78 of those la...
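To illustrate how a benchmark of language-annotated text is typically used, the following is a minimal sketch of scoring a language identifier per language. It does not load CommonLID itself: the record layout (text, language code pairs), the function name `per_language_accuracy`, the toy records, and the toy predictor are all assumptions made for illustration, not the dataset's actual schema or any official evaluation script.

```python
from collections import defaultdict

def per_language_accuracy(records, predict):
    """Return accuracy per gold language.

    records: iterable of (text, gold_language_code) pairs (assumed layout).
    predict: callable mapping a text to a predicted language code.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for text, gold in records:
        total[gold] += 1
        if predict(text) == gold:
            correct[gold] += 1
    return {lang: correct[lang] / total[lang] for lang in total}

# Hypothetical toy data standing in for annotated web text.
toy_records = [
    ("hello world", "eng"),
    ("good morning", "eng"),
    ("bonjour le monde", "fra"),
    ("salut", "fra"),
]

def toy_predict(text):
    # Deliberately naive stand-in for a real LID model.
    return "fra" if "bonjour" in text else "eng"

scores = per_language_accuracy(toy_records, toy_predict)
```

Reporting accuracy per language rather than a single overall figure keeps languages with few examples from being hidden by those with many, which matters for a benchmark that spans many languages.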

Details

Task: Text Classification
Language: ACE, ACF, AEB
Format: Parquet
Rows / instances: N/A
Size: 100K<n<1M
Creator: commoncrawl
Year: 2026
License: other
Downloads: 151
Likes: 53
