stas/openwebtext-10k
General NLP, English
stas/openwebtext-10k is an English-language dataset for general NLP work, distributed in Parquet format.
About stas/openwebtext-10k
An open-source replication of OpenAI's WebText dataset.
This is a small subset containing the first 10K records of the original dataset, created for testing.
The full 8M-record dataset is available at https://huggingface.co/datasets/openwebtext
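As a rough illustration of how a "first N records" subset like this one can be taken, and how the published subset might be loaded, here is a minimal sketch. The helper names `first_n` and `load_openwebtext_10k` are hypothetical and not part of the dataset. The loader assumes the Hugging Face `datasets` package is installed, that network access is available, and that the data is exposed as a single "train" split; none of this is stated on this page.

```python
from itertools import islice
from typing import Iterable, Iterator


def first_n(records: Iterable[dict], n: int = 10_000) -> Iterator[dict]:
    """Return an iterator over at most the first n records.

    Hypothetical helper: islice stops after n items, so the rest of the
    source (for example the full 8M-record dataset) is never read.
    """
    return islice(records, n)


def load_openwebtext_10k():
    """Load the 10K subset from the Hugging Face Hub (needs network access).

    Assumption: the `datasets` package is installed and the subset has a
    single "train" split.
    """
    # Imported lazily so that first_n stays usable without `datasets`.
    from datasets import load_dataset

    return load_dataset("stas/openwebtext-10k", split="train")
```

Calling `load_openwebtext_10k()` downloads the data on first use, so it is kept separate from `first_n`, which works on any iterable of records and can be tried offline.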
Details
- Task: General NLP
- Language: English
- Format: Parquet
- Rows / instances: 10,000 (the first 10K records of the full dataset)
- Creator: stas
- Year: 2022