cis-lmu/GlotCC-V1
General NLP | AAU, AAZ, AB | cc0-1.0
The cis-lmu/GlotCC-V1 dataset is a General NLP resource published by cis-lmu in 2024 and tagged with the languages AAU, AAZ, AB. It has 846 downloads and 60 likes. It is released under the cc0-1.0 license and falls in the 1B<n<10B size category.
About cis-lmu/GlotCC-V1
Dataset Summary
GlotCC-V1.0 is a document-level, general-domain dataset derived from CommonCrawl, covering more than 1000 languages. It is built from CommonCrawl using GlotLID language identification and the Ungoliant pipeline. We release our ...
Details
- Task: General NLP
- Language: AAU, AAZ, AB
- Format: Parquet
- Rows / instances: N/A
- Size: 1B<n<10B
- Creator: cis-lmu
- Year: 2024
- License: cc0-1.0
- Downloads: 846
- Likes: 60
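
Because the dataset is distributed as Parquet and falls in the 1B<n<10B size category, streaming a few records is usually more practical than downloading everything. The following is a minimal sketch, assuming the Hugging Face `datasets` library is installed, that the repository is hosted on the Hugging Face Hub under the identifier shown on this page, and that each language is exposed as a separate configuration with a `train` split. The helper name `preview_glotcc` and the configuration name used in the note below are hypothetical and are not taken from this page.

```python
from itertools import islice

REPO_ID = "cis-lmu/GlotCC-V1"  # dataset identifier as listed on this page


def preview_glotcc(config_name, n=3, split="train"):
    """Return the first n records of one language subset, streamed lazily."""
    # Imported inside the function so this module can be imported
    # even when the `datasets` library is not installed.
    from datasets import load_dataset

    # ASSUMPTION: per-language configurations and a "train" split exist.
    stream = load_dataset(REPO_ID, name=config_name, split=split, streaming=True)

    # islice stops after n records, so only a small part of the data is fetched.
    return list(islice(stream, n))
```

Calling `preview_glotcc("eng-Latn")` would fetch three records if a configuration with that name exists; the name is a guess, so check the dataset's own listing of configurations before relying on it.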