HuggingFaceTB/smollm-corpus

HuggingFaceTB/smollm-corpus is an English-language general NLP dataset of 236,980,453 examples, distributed in Parquet format under the odc-by license. It falls in the 100M<n<1B size category and has been downloaded about 32.3K times.
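
The figures above can be cross-checked with a few lines of arithmetic. The sketch below is illustrative only: the bucket boundaries, the catch-all label for larger datasets, and the helper names `size_category` and `compact_count` are assumptions made for this example, not part of any published API.

```python
# Size buckets as (exclusive upper bound, label).
# Assumption: a count exactly on a boundary goes into the higher bucket.
BUCKETS = [
    (1_000, "n<1K"),
    (10_000, "1K<n<10K"),
    (100_000, "10K<n<100K"),
    (1_000_000, "100K<n<1M"),
    (10_000_000, "1M<n<10M"),
    (100_000_000, "10M<n<100M"),
    (1_000_000_000, "100M<n<1B"),
]


def size_category(n_rows: int) -> str:
    """Return the size bucket label for a row count."""
    for upper, label in BUCKETS:
        if n_rows < upper:
            return label
    # Hypothetical catch-all for anything at or above one billion rows.
    return "n>=1B"


def compact_count(n: int) -> str:
    """Abbreviate a count with one decimal place and a K or M suffix."""
    if n >= 1_000_000:
        return f"{n / 1_000_000:.1f}M"
    if n >= 1_000:
        return f"{n / 1_000:.1f}K"
    return str(n)


print(size_category(236_980_453))  # row count listed for this dataset
print(compact_count(32_318))       # download count listed for this dataset
```

With the row count and download count listed for this dataset, the two calls reproduce the size category and the abbreviated download figure quoted in the summary.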

About HuggingFaceTB/smollm-corpus

SmolLM-Corpus: This dataset is a curated collection of high-quality educational and synthetic data designed for training small language models. You can find more details about the models trained on this dataset in our SmolLM blog post. ...

Details

Task: General NLP
Language: EN
Format: Parquet
Rows / instances: 236,980,453
Size: 100M<n<1B
Creator: HuggingFaceTB
Year: 2024
License: odc-by
Downloads: 32,318
Likes: 468