
hltcoe/megawika

Summarization | Question Answering | Text Generation | AF, AR, AZ | cc-by-sa-4.0

hltcoe/megawika is a dataset for summarization, question answering, and text generation, listed here for AF, AR, and AZ and distributed in Parquet format. It is released under the cc-by-sa-4.0 license, falls in the 10M<n<100M size category, and has been downloaded 19,405 times.
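The size category is a label of the form `lower<n<upper` rather than an exact count. As a rough illustration of how to read it, the sketch below converts such a label into numeric bounds. The helper name `parse_size_category` and the suffix table are assumptions made for this example, not part of any dataset tooling.

```python
import re

# Multipliers for the suffixes assumed to appear in size-category labels.
_SUFFIX = {"": 1, "K": 10**3, "M": 10**6, "B": 10**9, "T": 10**12}


def parse_size_category(label: str) -> tuple[int, int]:
    """Return the (lower, upper) bounds of a label such as '10M<n<100M'."""
    match = re.fullmatch(r"(\d+)([KMBT]?)<n<(\d+)([KMBT]?)", label.strip())
    if match is None:
        raise ValueError(f"unrecognized size category: {label!r}")
    lo_num, lo_suf, hi_num, hi_suf = match.groups()
    return int(lo_num) * _SUFFIX[lo_suf], int(hi_num) * _SUFFIX[hi_suf]


lower, upper = parse_size_category("10M<n<100M")
print(lower, upper)  # 10000000 100000000
```

Read this way, the label says the dataset holds between 10 million and 100 million instances, which is consistent with the 30 million passages mentioned in the description, assuming the category counts passages.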

About hltcoe/megawika

MegaWika is a multi- and crosslingual text dataset containing 30 million Wikipedia passages with their scraped and cleaned web citations. The passages span 50 Wikipedias in 50 languages, and the articles in which the passages were originally embed...

Details

Task: Summarization, Question Answering, Text Generation
Language: AF, AR, AZ
Format: Parquet
Rows / instances: N/A
Size: 10M<n<100M
Creator: hltcoe
Year: 2026
License: cc-by-sa-4.0
Downloads: 19405
Likes: 41

FAQ