
stas/openwebtext-10k

General NLP, English

stas/openwebtext-10k is an English-language general NLP dataset distributed in Parquet format.

About stas/openwebtext-10k

An open-source replication of OpenAI's WebText dataset. This is a small subset containing the first 10K records of the original dataset, created for testing. The full 8M-record dataset is available at https://huggingface.co/datasets/openwebtext
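
As an illustration of how a "first N records" test subset like this one can be derived, here is a minimal Python sketch. It is not the creator's actual build script: the helper take_first and the stand-in generator fake_corpus are hypothetical names introduced here, and the stand-in produces placeholder documents rather than real OpenWebText records.

```python
from itertools import islice
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")


def take_first(records: Iterable[T], n: int) -> Iterator[T]:
    """Return an iterator over at most the first n items of records, in order.

    Works on lazy streams, so the full corpus never has to fit in memory.
    """
    if n < 0:
        raise ValueError("n must be non-negative")
    return islice(records, n)


def fake_corpus(total: int) -> Iterator[dict]:
    """Hypothetical stand-in for the full corpus: yields placeholder records."""
    for i in range(total):
        yield {"text": f"document {i}"}


# Keep only the first 10,000 records of a larger (here: synthetic) stream.
subset = list(take_first(fake_corpus(25_000), 10_000))
print(len(subset), subset[0]["text"])
```

Because the helper only consumes as many items as it returns, the same approach should apply to any iterable of records, including a streamed copy of the full dataset.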

Details

Task: General NLP
Language: English
Format: Parquet
Rows / instances: N/A
Creator: stas
Year: 2022