Benchmark collection for Vision-Language Models (VLMs), hosted by the AntResearchNLP team.
The benchmark and result data are carefully compiled and merged from the technical reports and official blogs of leading multimodal models, including Google's Gemini series, OpenAI's GPT and o series, Seed1.5-VL, MiMo-VL, Kimi-VL, Qwen2.5-VL, InternVL3, and other prominent models' official technical documentation.
This collection provides researchers and developers with a comprehensive, standardized platform for comparing multimodal model evaluation benchmarks, helping to advance research and development in the vision-language model field. Through unified data formats and visualization interfaces, users can more intuitively see how different models perform across tasks, giving them a useful reference for model selection and improvement. Contributions of new benchmarks and results are welcome on GitHub!
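As a rough illustration of what a unified result record might look like, the sketch below shows one possible per-entry schema. The field names, values, and URL are hypothetical assumptions for illustration only, not the repository's actual data format.

```python
# Minimal sketch of a single benchmark-result record in a hypothetical unified format.
# All field names and values are illustrative; scores should always come from the
# original technical reports or official blogs.
import json

record = {
    "model": "Qwen2.5-VL-72B",      # model name as reported by its source document
    "benchmark": "MMMU",            # benchmark identifier
    "split": "val",                 # evaluation split, if the source specifies one
    "metric": "accuracy",           # metric reported by the source
    "score": 0.0,                   # placeholder; fill in with the officially reported number
    "source": "https://example.com/technical-report",  # placeholder citation URL
}

print(json.dumps(record, ensure_ascii=False, indent=2))
```

Keeping every entry in one flat, consistent schema like this is what makes cross-model comparison and visualization straightforward, since results from different reports can be loaded and filtered uniformly.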