ViLaBench

Benchmark collection for Vision-Language Models (VLMs), hosted by the AntResearchNLP team.

The benchmark and result data are carefully compiled and merged from the official technical reports and blogs of leading multimodal models, including Google's Gemini series, OpenAI's GPT and o-series, Seed1.5-VL, MiMo-VL, Kimi-VL, Qwen2.5-VL, and InternVL3.

This collection gives researchers and developers a comprehensive, standardized platform for comparing multimodal models across evaluation benchmarks, supporting research and development in the vision-language field. Through unified data formats and a visualization interface, users can quickly see how different models perform on various tasks, which serves as a useful reference for model selection and improvement. New benchmarks and results are welcome via GitHub!

Table Headers Explanation
  • Benchmark: The name of the vision-language benchmark. Click to visit the official page. 🔗
  • Year: The year the benchmark was published or released.
  • Cognitive Levels: The main cognitive ability required:
    • Understanding - Basic comprehension and recognition tasks
    • Reasoning - Logical inference and problem-solving tasks
    • Comprehensive - Involving both basic understanding and advanced reasoning tasks
  • Domain: The application domain or context (e.g., natural/synthetic images, charts).
  • Modalities: The input data type(s) required (e.g., Single-Image, Multi-Image, Video).
  • Score: Model performance scores on the benchmark. Click the chart to zoom in for more details.
  • Note: Benchmarks are grouped by category. Click on category headers to collapse/expand groups.
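To make the unified data format concrete, the sketch below shows how a single table row could be represented as a structured record. The field names mirror the table headers above; the class name, types, and all example values are illustrative assumptions, not the project's actual schema.

```python
# A minimal, hypothetical sketch of one benchmark entry in a unified format.
# Field names mirror the table headers; they are NOT the project's real schema.
from dataclasses import dataclass, field


@dataclass
class BenchmarkEntry:
    benchmark: str                     # benchmark name
    url: str                           # link to the official page
    year: int                          # year published or released
    cognitive_level: str               # "Understanding" | "Reasoning" | "Comprehensive"
    domain: str                        # e.g. natural/synthetic images, charts
    modalities: list[str] = field(default_factory=list)      # "Single-Image", "Multi-Image", "Video"
    scores: dict[str, float] = field(default_factory=dict)   # model name -> reported score


# Example usage (all values are placeholders, not real results):
entry = BenchmarkEntry(
    benchmark="ExampleBench",
    url="https://example.com",
    year=2024,
    cognitive_level="Reasoning",
    domain="chart",
    modalities=["Single-Image"],
    scores={"ModelA": 71.3, "ModelB": 68.9},
)
```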