
allenai/olmOCR-mix-1025

General NLP | English | odc-by

Created by allenai in 2025, allenai/olmOCR-mix-1025 is an English General NLP dataset distributed in Parquet format. It has about 1.9K downloads and 34 likes. It is released under the odc-by license and falls in the 100K<n<1M size category.

About allenai/olmOCR-mix-1025

olmOCR-mix-1025 is a dataset of ~270,000 PDF pages that have been OCRed into plain text in a natural reading order using gpt-4.1 and a special prompting strategy that preserves any born-digital content from each page. This data...

Details

Task: General NLP
Language: English
Format: Parquet
Rows / instances: N/A
Size: 100K<n<1M
Creator: allenai
Year: 2025
License: odc-by
Downloads: 1915
Likes: 34
