cis-lmu/GlotCC-V1

General NLP | AAU, AAZ, AB | cc0-1.0

cis-lmu/GlotCC-V1 is a General NLP dataset released by cis-lmu in 2024 and listed under the language tags AAU, AAZ, AB. It has 846 downloads and 60 likes. It is released under the cc0-1.0 license and falls in the 1B<n<10B size category.

About cis-lmu/GlotCC-V1

Dataset Summary: GlotCC-V1.0 is a document-level, general-domain dataset derived from CommonCrawl, covering more than 1000 languages. It is built from CommonCrawl using GlotLID language identification and the Ungoliant pipeline. We release our ...

Details

Task: General NLP
Language: AAU, AAZ, AB
Format: Parquet
Rows / instances: N/A
Size: 1B<n<10B
Creator: cis-lmu
Year: 2024
License: cc0-1.0
Downloads: 846
Likes: 60