Avelina/smollm-corpus
Text Generation, EN, odc-by
Avelina/smollm-corpus is an English-language text generation dataset published by Avelina in Parquet format. It is distributed under the odc-by license, falls in the 100M<n<1B size category, and has been downloaded 10.2K times.
About Avelina/smollm-corpus
SmolLM-Corpus: Now shuffled and sharded!
This is a version of the SmolLM-Corpus in which the 3 subsets have been interleaved, shuffled, and sharded into 23698 jsonl.zst files for easy streaming!
The dataset is composed of the cosmopedia-v2 and finewe...
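Because the shards are Zstandard-compressed JSON Lines files, a single shard can be read record by record without decompressing it to disk first. The following is a minimal sketch, not the dataset author's loader: it assumes the third-party `zstandard` package is installed, the shard path is a hypothetical placeholder, and the `text` field in the demo is an assumed record key that may not match the actual schema.

```python
import io
import json


def iter_jsonl(text_stream):
    """Yield one parsed JSON object per non-empty line of a text stream."""
    for line in text_stream:
        line = line.strip()
        if line:
            yield json.loads(line)


def iter_shard(path):
    """Stream records from one .jsonl.zst shard without loading it whole.

    Assumes the third-party `zstandard` package (pip install zstandard).
    """
    import zstandard  # imported lazily so iter_jsonl works without it

    with open(path, "rb") as fh:
        # stream_reader decompresses incrementally as bytes are requested.
        reader = zstandard.ZstdDecompressor().stream_reader(fh)
        yield from iter_jsonl(io.TextIOWrapper(reader, encoding="utf-8"))


if __name__ == "__main__":
    # In-memory demo of the line parser; "text" is an assumed field name.
    sample = io.StringIO('{"text": "first"}\n\n{"text": "second"}\n')
    for record in iter_jsonl(sample):
        print(record["text"])

    # Hypothetical usage against a downloaded shard:
    # for record in iter_shard("path/to/shard.jsonl.zst"):
    #     ...
```

Since the shards are already shuffled, reading them in order should give a mixed sample of the three subsets, and processing can stop after any number of records.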
Details
- Task: Text Generation
- Language: EN
- Format: Parquet
- Rows / instances: N/A
- Size: 100M<n<1B
- Creator: Avelina
- Year: 2025
- License: odc-by
- Downloads: 10153
- Likes: 5