Avelina/smollm-corpus

Text Generation · EN · odc-by

Avelina/smollm-corpus is an English text generation dataset from Avelina in Parquet format. It is distributed under the odc-by license, falls in the 100M<n<1B size category, and has been downloaded 10.2K times.

About Avelina/smollm-corpus

SmolLM-Corpus: Now shuffled and sharded! This is a version of the SmolLM-Corpus where the 3 subsets have been interleaved, shuffled and sharded as 23698 jsonl.zst files for easy streaming! The dataset is comprised of the cosmopedia-v2 and finewe...
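Since the shards are described as jsonl.zst files meant for streaming, a minimal sketch of reading one shard record by record might look like the following. It assumes a shard has already been downloaded locally and that the third-party `zstandard` package is installed; the function names and the file path are hypothetical, and the record schema is not stated in this listing, so the code yields whole JSON objects without assuming any field names.

```python
import io
import json
from typing import BinaryIO, Iterator


def iter_jsonl(stream: BinaryIO) -> Iterator[dict]:
    """Yield one parsed JSON object per non-empty line of a binary stream."""
    for line in io.TextIOWrapper(stream, encoding="utf-8"):
        line = line.strip()
        if line:  # skip blank lines between records
            yield json.loads(line)


def iter_zst_shard(path: str) -> Iterator[dict]:
    """Stream records from one local .jsonl.zst shard without loading it whole."""
    # Third-party dependency (assumption); imported lazily so that
    # iter_jsonl stays usable on uncompressed streams without it.
    import zstandard

    with open(path, "rb") as fh:
        # stream_reader decompresses incrementally as lines are consumed.
        reader = zstandard.ZstdDecompressor().stream_reader(fh)
        yield from iter_jsonl(reader)
```

With a shard on disk, `for record in iter_zst_shard("shard.jsonl.zst"): ...` (a placeholder path) would process one record at a time, which keeps memory use flat regardless of shard size.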

Details

Task: Text Generation
Language: EN
Format: Parquet
Rows / instances: N/A
Size: 100M<n<1B
Creator: Avelina
Year: 2025
License: odc-by
Downloads: 10153
Likes: 5
