HuggingFaceM4/OBELICS
General NLP | EN | cc-by-4.0
HuggingFaceM4/OBELICS is a General NLP dataset in English (EN), created by HuggingFaceM4 in 2023. It contains 275,696,552 records in Parquet format, which places it in the 100M<n<1B size category. It is released under the cc-by-4.0 license and has 6.2K downloads and 171 likes.
About HuggingFaceM4/OBELICS
Dataset Card for OBELICS
OBELICS is an open, massive, and curated collection of interleaved image-text web documents, containing 141M English documents, 115B text tokens, and 353M images, extracted from Common Crawl dumps between February 2020 ...
Details
- Task: General NLP
- Language: EN
- Format: Parquet
- Rows / instances: 275,696,552
- Size: 100M<n<1B
- Creator: HuggingFaceM4
- Year: 2023
- License: cc-by-4.0
- Downloads: 6,179
- Likes: 171
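
With this many rows, streaming a few records is usually more practical than downloading the whole dataset. The following is a minimal sketch, not an official recipe. It assumes the Hugging Face `datasets` library is installed and that the dataset exposes a `train` split. The helper names `interleave` and `stream_documents` are hypothetical. The record layout (parallel `texts` and `images` lists where each position holds either a text or an image) is also an assumption that this page does not confirm.

```python
from typing import Iterator, List, Optional, Tuple


def interleave(
    texts: List[Optional[str]], images: List[Optional[str]]
) -> List[Tuple[str, str]]:
    """Merge two parallel lists into one ordered list of (kind, value) pairs.

    Assumption: at each position one of texts[i] / images[i] is set.
    Positions where both are None are skipped.
    """
    merged = []
    for text, image in zip(texts, images):
        if text is not None:
            merged.append(("text", text))
        elif image is not None:
            merged.append(("image", image))
    return merged


def stream_documents(limit: int = 3) -> Iterator[dict]:
    """Yield the first `limit` records without downloading the full dataset."""
    # Imported lazily so interleave() works without the library installed.
    from datasets import load_dataset

    # Assumption: the dataset has a "train" split.
    stream = load_dataset("HuggingFaceM4/OBELICS", split="train", streaming=True)
    for index, record in enumerate(stream):
        if index >= limit:
            break
        yield record
```

If the assumed field names hold, calling `interleave(doc["texts"], doc["images"])` on each `doc` from `stream_documents()` would give the document's content in reading order. Inspect the keys of the first record before relying on them.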