Image Captioning Models
Our directory lists 36 AI and NLP models for image captioning. Browse the full list below, or explore models by provider.
Updated June 2026
- VILA1.5-13B | Tasks: Chat, Visual question answering, Image captioning, Language modeling/generation, Question answering | Organization: NVIDIA, Massachusetts Institute of Technology (MIT)
- Wu Dao 2.0 | Tasks: Image captioning, Chat, Image generation, Text-to-image, Language modeling/generation, Question answering, Visual question answering | Organization: Beijing Academy of Artificial Intelligence / BAAI
- Claude Opus 4.5 | Tasks: Code generation, Language modeling/generation, Quantitative reasoning, Search, Visual question answering, Translation, Image captioning, Instruction interpretation, Mathematical reasoning, Visual puzzles, Code autocompletion, Chat, Character recognition (OCR), Language modeling, Language generation, Text autocompletion, Retrieval-augmented generation, System control | Organization: Anthropic
- Gemini Robotics-ER 1.5 | Tasks: Instruction interpretation, Robotic manipulation, Image captioning, Object detection, Search, Language modeling/generation, Question answering, Speech recognition (ASR) | Organization: Google DeepMind
- Grok 4 | Tasks: Language modeling/generation, Question answering, Search, Visual question answering, Character recognition (OCR), Image captioning, Quantitative reasoning | Organization: xAI
- Gemini 2.5 Pro (Jun 2025) | Tasks: Language modeling/generation, Question answering, Code generation, Quantitative reasoning, Visual question answering, Translation, Image captioning, Video description, Speech recognition (ASR) | Organization: Google DeepMind
- Claude Sonnet 4 | Tasks: Code generation, Language modeling/generation, Quantitative reasoning, Search, Visual question answering, Translation, Image captioning, Instruction interpretation, Mathematical reasoning, Visual puzzles, Code autocompletion, Chat, Character recognition (OCR), Language modeling, Language generation, Text autocompletion, Retrieval-augmented generation, System control | Organization: Anthropic
- Claude Opus 4 | Tasks: Code generation, Language modeling/generation, Quantitative reasoning, Search, Visual question answering, Translation, Image captioning, Instruction interpretation, Mathematical reasoning, Visual puzzles, Code autocompletion, Chat, Character recognition (OCR), Language modeling, Language generation, Text autocompletion, Retrieval-augmented generation, System control | Organization: Anthropic
- Gemini 2.5 Pro (May 2025) | Tasks: Language modeling/generation, Question answering, Code generation, Quantitative reasoning, Visual question answering, Translation, Image captioning, Video description, Speech recognition (ASR) | Organization: Google DeepMind
- Gemini 2.5 Pro (Mar 2025) | Tasks: Language modeling/generation, Question answering, Code generation, Quantitative reasoning, Visual question answering, Translation, Image captioning, Video description, Speech recognition (ASR) | Organization: Google DeepMind
- Amazon Nova Pro | Tasks: Language modeling/generation, Retrieval-augmented generation, Visual question answering, Image captioning, Video description, Character recognition (OCR), Code generation, Translation | Organization: Amazon
- GPT-4 Turbo (Apr 2024) | Tasks: Chat, Language modeling/generation, Image generation, Speech synthesis, Table tasks, Visual question answering, Image captioning | Organization: OpenAI
- Qwen-VL-Max | Tasks: Chat, Image captioning, Face recognition, Visual question answering | Organization: Alibaba
- GPT-4 Turbo (Nov 2023) | Tasks: Chat, Language modeling/generation, Image generation, Speech synthesis, Table tasks, Visual question answering, Image captioning | Organization: OpenAI
- Qwen3-Omni-30B-A3B | Tasks: Language modeling/generation, Question answering, Visual question answering, Image captioning, Video description, Speech recognition (ASR), Speech synthesis, Speech-to-text, Text-to-speech (TTS) | Organization: Alibaba
- Kimi k1.5 | Tasks: Language modeling/generation, Code generation, Quantitative reasoning, Question answering, Visual question answering, Translation, Image captioning, Visual puzzles | Organization: Moonshot
- Llama 3.2 11B | Tasks: Visual question answering, Image captioning, Object detection | Organization: Meta AI
- Oryx 34B | Tasks: Visual question answering, Video compression, Image captioning, Video description, Language modeling/generation | Organization: Tsinghua University, Tencent, Nanyang Technological University
- LLaVA-OV-72B | Tasks: Image captioning, Visual question answering, Video description, Object recognition, Action recognition, Language modeling/generation | Organization: ByteDance, Nanyang Technological University, Chinese University of Hong Kong (CUHK), Hong Kong University of Science and Technology (HKUST)
- Cambrian-1-34B | Tasks: Image captioning, Visual question answering, Character recognition (OCR) | Organization: New York University (NYU)
- Claude 3.5 Sonnet | Tasks: Chat, Image captioning, Code generation, Language modeling/generation | Organization: Anthropic
- Reka Core | Tasks: Chat, Language modeling/generation, Image captioning, Code generation, Code autocompletion, Question answering, Visual question answering, Video description, Speech recognition (ASR), Speech-to-text, Quantitative reasoning | Organization: Reka AI
- MM1-30B | Tasks: Chat, Image captioning, Visual question answering | Organization: Apple
- Claude 3 Sonnet | Tasks: Chat, Image captioning, Code generation, Language modeling/generation | Organization: Anthropic
- Claude 3 Opus | Tasks: Chat, Image captioning, Code generation, Language modeling/generation | Organization: Anthropic
- Gemini Nano-2 | Tasks: Chat, Image captioning, Speech recognition (ASR) | Organization: Google DeepMind
- Gemini Nano-1 | Tasks: Chat, Image captioning, Speech recognition (ASR) | Organization: Google DeepMind
- VILA-13B | Tasks: Chat, Visual question answering, Image captioning, Language modeling/generation, Question answering | Organization: NVIDIA, Massachusetts Institute of Technology (MIT)
- SPHINX (Llama 2 13B) | Tasks: Visual question answering, Image captioning | Organization: Shanghai AI Lab, Chinese University of Hong Kong (CUHK), ShanghaiTech University
- mPLUG-Owl2 | Tasks: Visual question answering, Image captioning, Language modeling/generation | Organization: Alibaba
- CogVLM-17B | Tasks: Image captioning, Visual question answering, Chat | Organization: Tsinghua University, Z.ai (Zhipu AI), Beihang University
- PaLI-3 | Tasks: Visual question answering, Character recognition (OCR), Image captioning | Organization: Google DeepMind, Google Research, Google Cloud
- Qwen-VL | Tasks: Image captioning, Chat, Question answering, Visual question answering | Organization: Alibaba
- PaLI-X | Tasks: Image captioning, Video description, Character recognition (OCR), Visual question answering | Organization: Google Research
- PaLM-E | Tasks: Visual question answering, Robotic manipulation, Image captioning, Language generation | Organization: Google, TU Berlin
- BLIP-2 (Q-Former) | Tasks: Visual question answering, Image captioning | Organization: Salesforce Research