Visual Question Answering Datasets
Visual Question Answering is the task of answering natural-language questions about the contents of an image. Our directory lists 21 datasets for it. Each entry links to its source, paper, and download; browse the full list below or filter by language.
Updated June 2026
- Visual Genome | Visual Question Answering, Knowledge Base | English
- Visual Commonsense Graphs | Visual Question Answering, Commonsense | English
- mvp-lab/LLaVA-OneVision-2-Data | Video Text To Text, Visual Question Answering, Image Text To Text | EN
- OpenGVLab/ShareGPT-4o | Visual Question Answering, Question Answering | EN
- Visual QA (VQA) | Visual Question Answering | English
- tomg-group-umd/pixelprose | Image To Text, Text To Image, Visual Question Answering | EN
- nvidia/Llama-Nemotron-VLM-Dataset-v1 | Visual Question Answering, Image Text To Text, Image To Text | English
- Xkev/LLaVA-CoT-100k | Visual Question Answering, Image Text To Text | EN
- HuggingFaceFV/finevideo | Visual Question Answering, Video Text To Text | EN
- HuggingFaceM4/Docmatix | Visual Question Answering | EN
- nvidia/Nemotron-VLM-Dataset-v2 | Visual Question Answering, Image Text To Text, Video Text To Text | English
- raidium/RadImageNet-VQA | Visual Question Answering | EN
- neulab/PangeaInstruct | Visual Question Answering, Question Answering | AM, AR, BG
- OpenDataArena/MMFineReason-SFT-123K-Qwen3-VL-235B-Thinking | Visual Question Answering, Question Answering, Text Generation | EN
- ranjaykrishna/visual_genome | Image To Text, Object Detection, Visual Question Answering | EN
- lmms-lab/M4-Instruct-Data | Visual Question Answering, Question Answering | EN
- nvidia/Nemotron-Image-Training-v3 | Visual Question Answering, Image Text To Text | English
- ScienceOne-AI/S1-MMAlign | Image To Text, Visual Question Answering, Feature Extraction | EN
- VLR-CVC/DocVQA-2026 | Visual Question Answering, Document Question Answering, Image Text To Text, Question Answering | EN
- multimodal-reasoning-lab/Zebra-CoT | Any To Any, Image Text To Text, Visual Question Answering | English
- openbmb/RLHF-V-Dataset | Text Generation, Visual Question Answering | EN