pixparse/pdfa-eng-wds
Image To Text, EN
pixparse/pdfa-eng-wds is an image-to-text dataset in English (EN), distributed in Parquet format.
About pixparse/pdfa-eng-wds
Dataset Card for PDF Association dataset (PDFA)
Dataset Summary
The PDFA dataset is a document dataset filtered from the SafeDocs corpus, also known as CC-MAIN-2021-31-PDF-UNTRUNCATED. The original purpose of that corpus is for comprehensive pdf d...
Details
- Task: Image To Text
- Language: EN
- Format: Parquet
- Rows / instances: N/A
- Creator: pixparse
- Year: 2024