Question 1

What is the OpenGVLab/OmniCorpus-CC-210M dataset?

Accepted Answer

🐳 OmniCorpus: A Unified Multimodal Corpus of 10 Billion-Level Images Interleaved with Text

This repository contains 210 million image-text interleaved documents filtered from the OmniCorpus-CC dataset, which was sourced from Common Crawl.

Repo...

Question 2

Is OpenGVLab/OmniCorpus-CC-210M a benchmark?

Accepted Answer

OpenGVLab/OmniCorpus-CC-210M is a dataset that can be used for training or evaluation; our catalog does not track it as a standard LLM benchmark.

Question 3

Where can I download OpenGVLab/OmniCorpus-CC-210M?

Accepted Answer

OpenGVLab/OmniCorpus-CC-210M is available at its source: https://huggingface.co/datasets/OpenGVLab/OmniCorpus-CC-210M.
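
With 210 million documents, it may be more practical to stream a few records than to download everything at once. The following is a minimal sketch, not an official loading recipe. It assumes the third-party `datasets` package is installed, that the repository loads without an explicit configuration name, and that a "train" split exists. The helper names `dataset_url` and `peek_records` are hypothetical and used only for illustration.

```python
from itertools import islice

REPO_ID = "OpenGVLab/OmniCorpus-CC-210M"


def dataset_url(repo_id: str) -> str:
    """Build the Hugging Face dataset page URL for a repo id."""
    return f"https://huggingface.co/datasets/{repo_id}"


def peek_records(repo_id: str = REPO_ID, split: str = "train", n: int = 2) -> list:
    """Stream the first n records instead of downloading the whole corpus.

    Assumptions (not confirmed by the dataset page): the repo loads
    without a config name and exposes a "train" split.
    """
    # Third-party dependency, imported lazily so the URL helper works without it.
    from datasets import load_dataset

    # streaming=True returns an iterable that fetches data on demand.
    stream = load_dataset(repo_id, split=split, streaming=True)
    return list(islice(stream, n))
```

Calling `peek_records()` needs network access. If the call fails because of a missing split or configuration, the dataset page linked above is the place to check the actual layout.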
