ruggsea/infini-news-corpus

Text Generation, Text Classification, Text Retrieval; ENG, SPA, RUS

Created by ruggsea in 2026, ruggsea/infini-news-corpus is a text generation dataset in English, Spanish, and Russian (ENG, SPA, RUS), distributed in Parquet format.

About ruggsea/infini-news-corpus

INFINI-NEWS Corpus

A multilingual news corpus extracted from Common Crawl CC-News WARC files. One row per article, with body text extracted via trafilatura, WARC provenance, and derived metadata (publish date, language, topic, byte hashes) in a...
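
Since the corpus is described as one row per article with a language field among the derived metadata, a first look at it might be a per-language tally over a streamed sample. The following is a minimal sketch under stated assumptions, not the author's documented usage: it assumes the Hugging Face `datasets` library, a split named `train`, and a column named `language`. None of these are confirmed by this page, so check the actual schema before relying on them.

```python
from collections import Counter
from typing import Iterable, Iterator, Mapping

REPO_ID = "ruggsea/infini-news-corpus"  # dataset identifier from this page

# Hypothetical column name: the description mentions a language field,
# but this page does not show the actual schema.
LANGUAGE_COLUMN = "language"


def stream_rows(split: str = "train") -> Iterator[Mapping]:
    """Lazily yield rows from the Hub (needs `pip install datasets` and network access)."""
    from datasets import load_dataset  # imported here so the helper below works offline

    # The split name "train" is an assumption; the repo may use other splits or configs.
    yield from load_dataset(REPO_ID, split=split, streaming=True)


def count_by_language(rows: Iterable[Mapping], limit: int = 1000) -> Counter:
    """Tally at most `limit` rows by the (assumed) language column."""
    counts: Counter = Counter()
    for i, row in enumerate(rows):
        if i >= limit:
            break
        # Rows without the assumed column are grouped under "unknown".
        counts[row.get(LANGUAGE_COLUMN, "unknown")] += 1
    return counts
```

Calling `count_by_language(stream_rows())` would then sample the first 1000 articles without downloading the full Parquet files. Streaming is used here because the page lists the row count as N/A, so the total size is not known in advance.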

Details

Task: Text Generation, Text Classification, Text Retrieval
Language: ENG, SPA, RUS
Format: Parquet
Rows / instances: N/A
Creator: ruggsea
Year: 2026