
HuggingFaceM4/OBELICS

General NLP | EN | cc-by-4.0

Created by HuggingFaceM4 in 2023, HuggingFaceM4/OBELICS is a General NLP dataset in English containing 275,696,552 records in Parquet format. It has 6.2K downloads and 171 likes. It is released under the cc-by-4.0 license and falls in the 100M<n<1B size category.

About HuggingFaceM4/OBELICS

Dataset Card for OBELICS

OBELICS is an open, massive, and curated collection of interleaved image-text web documents, containing 141M English documents, 115B text tokens, and 353M images, extracted from Common Crawl dumps between February 2020 ...
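As a rough sanity check on the figures in the excerpt above, the per-document averages can be worked out directly. This is only a back-of-the-envelope sketch: the constants are the rounded counts quoted in the dataset card excerpt, and the function name `per_document_averages` is made up for illustration.

```python
# Rounded counts quoted in the dataset card excerpt (not exact values).
NUM_DOCUMENTS = 141_000_000
NUM_TEXT_TOKENS = 115_000_000_000
NUM_IMAGES = 353_000_000


def per_document_averages(docs, tokens, images):
    """Return (text tokens per document, images per document)."""
    return tokens / docs, images / docs


avg_tokens, avg_images = per_document_averages(
    NUM_DOCUMENTS, NUM_TEXT_TOKENS, NUM_IMAGES
)
print(f"~{avg_tokens:.0f} text tokens per document")  # ~816
print(f"~{avg_images:.1f} images per document")       # ~2.5
```

Because the inputs are rounded, the results are indicative only.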

Details

Task: General NLP
Language: EN
Format: Parquet
Rows / instances: 275,696,552
Size: 100M<n<1B
Creator: HuggingFaceM4
Year: 2023
License: cc-by-4.0
Downloads: 6,179
Likes: 171
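
Given the row count listed above, downloading the full dataset just to inspect it is rarely practical. The following is a minimal sketch of streaming a few records with the Hugging Face `datasets` library instead. It assumes the library is installed, that the repository exposes a `train` split, and that network access is available; the helper name `first_records` is hypothetical, and no record field names are assumed.

```python
from itertools import islice

REPO_ID = "HuggingFaceM4/OBELICS"


def first_records(n=3, repo_id=REPO_ID):
    """Stream the first n records without downloading the whole dataset."""
    # Imported lazily so this module loads even if `datasets` is not installed.
    from datasets import load_dataset

    # Assumption: the split is named "train". streaming=True yields records
    # lazily instead of materializing all Parquet files locally.
    stream = load_dataset(repo_id, split="train", streaming=True)
    return list(islice(stream, n))
```

Calling `first_records(2)` and printing `record.keys()` for each result is one way to discover the actual schema before writing any processing code.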
