ArmelR/the-pile-splitted
General NLP | English
ArmelR/the-pile-splitted is a General NLP dataset in English, distributed in Parquet format. It falls in the 10M<n<100M size category and has been downloaded 15.5K times.
About ArmelR/the-pile-splitted
Dataset description
The Pile is an 800GB dataset of English text
designed by EleutherAI to train large-scale language models. The original version of
the dataset can be found here.
The dataset is divided into 22 smaller high-quality datasets. ...
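Because the data ships as Parquet and spans many component datasets, it can help to inspect a few records before downloading anything in full. The following is a minimal sketch, assuming the Hugging Face `datasets` library is installed. The helper name `peek`, the `subset` argument, and the `"train"` split are illustrative assumptions rather than details confirmed by this page, so check the dataset's own page for the actual subset and split names.

```python
from itertools import islice

# Repository name as given on this page.
REPO_ID = "ArmelR/the-pile-splitted"


def peek(repo_id=REPO_ID, subset=None, split="train", n=3):
    """Return the first n records of one subset without a full download.

    `subset` and `split` are assumptions: pass the names that the
    dataset page actually lists.
    """
    # Imported lazily so this module can be loaded without `datasets`.
    from datasets import load_dataset

    # streaming=True iterates over records lazily instead of fetching
    # every Parquet file up front.
    stream = load_dataset(repo_id, name=subset, split=split, streaming=True)
    return list(islice(stream, n))
```

Calling `peek(subset=...)` with one of the component dataset names should return a short list of records as dictionaries, which is usually enough to confirm the column layout before committing to a larger download.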
Details
- Task: General NLP
- Language: English
- Format: Parquet
- Rows / instances: N/A
- Size: 10M<n<100M
- Creator: ArmelR
- Year: 2023
- Downloads: 15489
- Likes: 23