Cognitive-Lab/NayanaOCR_Corpus_2025

Image To Text, Visual Question Answering, Image Text To Text | AR, BN, DE | cc-by-nc-4.0

Cognitive-Lab/NayanaOCR_Corpus_2025 is an image-to-text dataset tagged AR, BN, and DE that provides 1,006,170 labeled examples in Parquet format. It is released under the cc-by-nc-4.0 license, falls in the 1M<n<10M size category, and has been downloaded 12.9K times.
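As a quick sanity check on the figures quoted above, the following minimal sketch maps a row count to a size bucket written in the same "1M<n<10M" style and abbreviates a download count to the "12.9K" form. The helper names `size_category` and `short_count` are hypothetical, and the handling of exact bucket boundaries is an assumption, not something the listing specifies.

```python
# Hypothetical helpers for illustration only; not part of any dataset tooling.

# Bucket edges paired with the labels used in "1M<n<10M"-style categories.
_BOUNDS = [
    (1_000, "1K"),
    (10_000, "10K"),
    (100_000, "100K"),
    (1_000_000, "1M"),
    (10_000_000, "10M"),
    (100_000_000, "100M"),
    (1_000_000_000, "1B"),
]


def size_category(n_rows: int) -> str:
    """Return a size bucket label for a row count.

    Assumption: a count exactly on an edge goes to the upper bucket.
    """
    if n_rows < _BOUNDS[0][0]:
        return "n<1K"
    for (low, low_label), (high, high_label) in zip(_BOUNDS, _BOUNDS[1:]):
        if low <= n_rows < high:
            return f"{low_label}<n<{high_label}"
    return "n>1B"


def short_count(n: int) -> str:
    """Abbreviate a count in thousands to one decimal place, e.g. 12.9K."""
    return f"{n / 1000:.1f}K"


rows = 1_006_170      # row count from the listing
downloads = 12_870    # download count from the listing

print(size_category(rows))     # bucket for the listed row count
print(short_count(downloads))  # abbreviated download count
```

With the listed values, the row count lands in the 1M<n<10M bucket and the download count abbreviates to 12.9K, matching the summary.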

About Cognitive-Lab/NayanaOCR_Corpus_2025

🪷 NayanaOCR Corpus 2025: a 1M-page, 22-language, fully parallel synthetic OCR + VQA corpus for document-centric vision-language models, with every page rendered in every language. NayanaOCR Corpus 2025 is one of the largest open-source multilingual...

Details

Task: Image To Text, Visual Question Answering, Image Text To Text
Language: AR, BN, DE
Format: Parquet
Rows / instances: 1,006,170
Size: 1M<n<10M
Creator: Cognitive-Lab
Year: 2026
License: cc-by-nc-4.0
Downloads: 12,870
Likes: 15