ruggsea/infini-news-corpus
Text Generation, Text Classification, Text Retrieval | ENG, SPA, RUS
Created by ruggsea in 2026, ruggsea/infini-news-corpus is a dataset for text generation, text classification, and text retrieval in ENG, SPA, and RUS, distributed in Parquet format.
About ruggsea/infini-news-corpus
INFINI-NEWS Corpus
A multilingual news corpus extracted from Common Crawl CC-News WARC files.
One row per article, with body text extracted via trafilatura, WARC provenance, and derived metadata (publish date, language, topic, byte hashes) in a...
Details
- Task: Text Generation, Text Classification, Text Retrieval
- Language: ENG, SPA, RUS
- Format: Parquet
- Rows / instances: N/A
- Creator: ruggsea
- Year: 2026
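
The "one row per article" layout described above can be pictured with a small sketch. This is an illustration only, not the corpus's actual schema: every field name below (`url`, `warc_filename`, `warc_record_offset`, `text_sha256`, and so on) is hypothetical, and the use of SHA-256 over the UTF-8 bytes of the body text is an assumption, since the listing only says "byte hashes".

```python
import hashlib
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass(frozen=True)
class ArticleRow:
    """One article per row. All field names here are hypothetical."""
    url: str
    text: str                      # body text (the corpus extracts it via trafilatura)
    warc_filename: str             # WARC provenance: which file the record came from
    warc_record_offset: int        # WARC provenance: byte offset of the record
    language: str                  # e.g. "ENG", "SPA", "RUS"
    publish_date: Optional[str]    # derived metadata; may be missing
    topic: Optional[str]           # derived metadata; may be missing
    text_sha256: str               # assumed form of the "byte hash"


def make_row(url: str, text: str, warc_filename: str, warc_record_offset: int,
             language: str, publish_date: Optional[str] = None,
             topic: Optional[str] = None) -> ArticleRow:
    """Build a row and derive the hash from the UTF-8 bytes of the body text."""
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
    return ArticleRow(url, text, warc_filename, warc_record_offset,
                      language, publish_date, topic, digest)


if __name__ == "__main__":
    row = make_row("https://example.com/story", "Example article body.",
                   "example.warc.gz", 0, "ENG")
    print(sorted(asdict(row).keys()))
```

Hashing the extracted bytes is one common way to detect exact duplicate articles across crawl snapshots; whether this corpus hashes the extracted text, the raw HTML, or both is not stated in the listing.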