
ArmelR/the-pile-splitted

General NLP, English

ArmelR/the-pile-splitted is an English-language General NLP dataset distributed in Parquet format. It falls in the 10M<n<100M size category and has been downloaded 15.5K times.

About ArmelR/the-pile-splitted

Dataset description: The Pile is an 800GB dataset of English text designed by EleutherAI to train large-scale language models. The original version of the dataset can be found here. The dataset is divided into 22 smaller high-quality datasets. ...

Details

Task: General NLP
Language: English
Format: Parquet
Rows / instances: N/A
Size: 10M<n<100M
Creator: ArmelR
Year: 2023
Downloads: 15489
Likes: 23
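
The Size field uses a bucket label rather than an exact row count (the row count itself is listed as N/A). As a minimal sketch of how such a label could be turned into numeric bounds, the helper below parses strings like 10M<n<100M. The function name parse_size_category and the suffix table are hypothetical, and the assumption that K, M, B and T mean thousand, million, billion and trillion is ours, not something this page states.

```python
import re
from typing import Optional, Tuple

# Assumed meaning of the suffixes used in size-category labels.
_SUFFIX = {"": 1, "K": 10**3, "M": 10**6, "B": 10**9, "T": 10**12}


def _to_int(token: str) -> int:
    """Convert a token such as '10M' or '1K' into an integer."""
    match = re.fullmatch(r"(\d+)([KMBT]?)", token)
    if match is None:
        raise ValueError(f"unrecognised size token: {token!r}")
    return int(match.group(1)) * _SUFFIX[match.group(2)]


def parse_size_category(label: str) -> Tuple[Optional[int], Optional[int]]:
    """Return (lower, upper) bounds for a label like '10M<n<100M'.

    A missing bound is returned as None, e.g. 'n<1K' gives (None, 1000).
    """
    label = label.strip()
    both = re.fullmatch(r"(\w+)<n<(\w+)", label)
    if both:
        return _to_int(both.group(1)), _to_int(both.group(2))
    upper_only = re.fullmatch(r"n<(\w+)", label)
    if upper_only:
        return None, _to_int(upper_only.group(1))
    lower_only = re.fullmatch(r"n>(\w+)", label)
    if lower_only:
        return _to_int(lower_only.group(1)), None
    raise ValueError(f"unrecognised size category: {label!r}")


if __name__ == "__main__":
    low, high = parse_size_category("10M<n<100M")
    print(f"between {low:,} and {high:,} rows")
```

Under these assumptions, the label on this page corresponds to somewhere between ten million and one hundred million rows. The label does not say whether the endpoints are inclusive.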
