HCAI-Lab/dolma3-6t-corpus-manifest
General NLP, English
HCAI-Lab/dolma3-6t-corpus-manifest is a General NLP dataset in English from HCAI-Lab in Parquet format. It has been downloaded 15.2K times.
About HCAI-Lab/dolma3-6t-corpus-manifest
dolma3-6t-corpus-manifest
Unified per-document manifest that joins topic, format, quality, token-count, and source-shard metadata for the Dolma3 6T corpus. The manifest is stored as a partitioned Parquet dataset that spans shards.
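As a rough illustration of how a per-document manifest like this might be used, the sketch below sums token counts per topic. The column names `topic` and `token_count`, the helper names, and the local directory argument are assumptions made for the example; the card lists the kinds of metadata that are joined but does not give the actual schema. Reading the Parquet files relies on the third-party `pyarrow` package, which is imported lazily so the aggregation itself runs on plain dictionaries.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, Mapping


def iter_manifest_rows(root: str) -> Iterator[Mapping[str, object]]:
    """Yield manifest rows as dicts from a local directory of Parquet files.

    `root` is a hypothetical local copy of the manifest. The column names
    below are assumed, not taken from the dataset card.
    """
    import pyarrow.dataset as ds  # third-party; only needed for real files

    dataset = ds.dataset(root, format="parquet")
    # Read only the two assumed columns, one record batch at a time.
    for batch in dataset.to_batches(columns=["topic", "token_count"]):
        yield from batch.to_pylist()


def tokens_per_topic(rows: Iterable[Mapping[str, object]]) -> Dict[str, int]:
    """Sum the assumed `token_count` column grouped by the assumed `topic` column."""
    totals: Dict[str, int] = defaultdict(int)
    for row in rows:
        totals[str(row["topic"])] += int(row["token_count"])
    return dict(totals)


# Small in-memory stand-in for manifest rows (made-up values, for illustration).
sample_rows = [
    {"topic": "science", "token_count": 120},
    {"topic": "code", "token_count": 300},
    {"topic": "science", "token_count": 80},
]
sample_totals = tokens_per_topic(sample_rows)
```

With a local copy of the files, `tokens_per_topic(iter_manifest_rows(path))` would stream the manifest batch by batch rather than loading it all into memory, provided the real column names are substituted for the assumed ones.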
Provenance
This dataset was renamed on 2026-05-25 as...
Details
- Task: General NLP
- Language: English
- Format: Parquet
- Rows / instances: N/A
- Creator: HCAI-Lab
- Year: 2026
- Downloads: 15172
- Likes: 0