Multimodal Benchmarks
Multimodal Self-Instruct: Synthetic Abstract Image and Visual Reasoning
Instruction Using Language Model
Paper
• 2407.07053
• Published
• 47
LMMs-Eval: Reality Check on the Evaluation of Large Multimodal Models
Paper
• 2407.12772
• Published
• 35
VLMEvalKit: An Open-Source Toolkit for Evaluating Large Multi-Modality
Models
Paper
• 2407.11691
• Published
• 16
MMIU: Multimodal Multi-image Understanding for Evaluating Large
Vision-Language Models
Paper
• 2408.02718
• Published
• 62
Teaching CLIP to Count to Ten
Paper
• 2302.12066
• Published
GRAB: A Challenging GRaph Analysis Benchmark for Large Multimodal Models
Paper
• 2408.11817
• Published
• 9
MME-RealWorld: Could Your Multimodal LLM Challenge High-Resolution
Real-World Scenarios that are Difficult for Humans?
Paper
• 2408.13257
• Published
• 26
UrBench: A Comprehensive Benchmark for Evaluating Large Multimodal
Models in Multi-View Urban Scenarios
Paper
• 2408.17267
• Published
• 23
VLM4Bio: A Benchmark Dataset to Evaluate Pretrained Vision-Language
Models for Trait Discovery from Biological Images
Paper
• 2408.16176
• Published
• 8
MMMU-Pro: A More Robust Multi-discipline Multimodal Understanding
Benchmark
Paper
• 2409.02813
• Published
• 33
DSBench: How Far Are Data Science Agents to Becoming Data Science
Experts?
Paper
• 2409.07703
• Published
• 66
OmniBench: Towards The Future of Universal Omni-Language Models
Paper
• 2409.15272
• Published
• 30
YesBut: A High-Quality Annotated Multimodal Dataset for evaluating
Satire Comprehension capability of Vision-Language Models
Paper
• 2409.13592
• Published
• 50
Vinoground: Scrutinizing LMMs over Dense Temporal Reasoning with Short
Videos
Paper
• 2410.02763
• Published
• 7
HumanEval-V: Benchmarking High-Level Visual Reasoning with Complex
Diagrams in Coding Tasks
Paper
• 2410.12381
• Published
• 43
WorldMedQA-V: a multilingual, multimodal medical examination dataset for
multimodal language models evaluation
Paper
• 2410.12722
• Published
• 5
MMIE: Massive Multimodal Interleaved Comprehension Benchmark for Large
Vision-Language Models
Paper
• 2410.10139
• Published
• 51
MEGA-Bench: Scaling Multimodal Evaluation to over 500 Real-World Tasks
Paper
• 2410.10563
• Published
• 37
LiveXiv -- A Multi-Modal Live Benchmark Based on Arxiv Papers Content
Paper
• 2410.10783
• Published
• 26
TemporalBench: Benchmarking Fine-grained Temporal Understanding for
Multimodal Video Models
Paper
• 2410.10818
• Published
• 16
MMCOMPOSITION: Revisiting the Compositionality of Pre-trained
Vision-Language Models
Paper
• 2410.09733
• Published
• 8
TVBench: Redesigning Video-Language Evaluation
Paper
• 2410.07752
• Published
• 6
MixEval-X: Any-to-Any Evaluations from Real-World Data Mixtures
Paper
• 2410.13754
• Published
• 75
The Curse of Multi-Modalities: Evaluating Hallucinations of Large
Multimodal Models across Language, Visual, and Audio
Paper
• 2410.12787
• Published
• 30
JMMMU: A Japanese Massive Multi-discipline Multimodal Understanding
Benchmark for Culture-aware Evaluation
Paper
• 2410.17250
• Published
• 14
NaturalBench: Evaluating Vision-Language Models on Natural Adversarial
Samples
Paper
• 2410.14669
• Published
• 39
CAMEL-Bench: A Comprehensive Arabic LMM Benchmark
Paper
• 2410.18976
• Published
• 13
TP-Eval: Tap Multimodal LLMs' Potential in Evaluation by Customizing
Prompts
Paper
• 2410.18071
• Published
• 7
CLEAR: Character Unlearning in Textual and Visual Modalities
Paper
• 2410.18057
• Published
• 209
MMAU: A Massive Multi-Task Audio Understanding and Reasoning Benchmark
Paper
• 2410.19168
• Published
• 24
BenchX: A Unified Benchmark Framework for Medical Vision-Language
Pretraining on Chest X-Rays
Paper
• 2410.21969
• Published
• 10
TOMATO: Assessing Visual Temporal Reasoning Capabilities in Multimodal
Foundation Models
Paper
• 2410.23266
• Published
• 20
DynaMath: A Dynamic Visual Benchmark for Evaluating Mathematical
Reasoning Robustness of Vision Language Models
Paper
• 2411.00836
• Published
• 15
M3SciQA: A Multi-Modal Multi-Document Scientific QA Benchmark for
Evaluating Foundation Models
Paper
• 2411.04075
• Published
• 16
M-Longdoc: A Benchmark For Multimodal Super-Long Document Understanding
And A Retrieval-Aware Tuning Framework
Paper
• 2411.06176
• Published
• 45
JanusFlow: Harmonizing Autoregression and Rectified Flow for Unified
Multimodal Understanding and Generation
Paper
• 2411.07975
• Published
• 31
VLRewardBench: A Challenging Benchmark for Vision-Language Generative
Reward Models
Paper
• 2411.17451
• Published
• 11
Interleaved Scene Graph for Interleaved Text-and-Image Generation
Assessment
Paper
• 2411.17188
• Published
• 20
VideoLLM Knows When to Speak: Enhancing Time-Sensitive Video
Comprehension with Video-Text Duet Interaction Format
Paper
• 2411.17991
• Published
• 5
MME-Survey: A Comprehensive Survey on Evaluation of Multimodal LLMs
Paper
• 2411.15296
• Published
• 21
VisOnlyQA: Large Vision Language Models Still Struggle with Visual
Perception of Geometric Information
Paper
• 2412.00947
• Published
• 8
AV-Odyssey Bench: Can Your Multimodal LLMs Really Understand
Audio-Visual Information?
Paper
• 2412.02611
• Published
• 25
3DSRBench: A Comprehensive 3D Spatial Reasoning Benchmark
Paper
• 2412.07825
• Published
• 12
OmniDocBench: Benchmarking Diverse PDF Document Parsing with
Comprehensive Annotations
Paper
• 2412.07626
• Published
• 28
Euclid: Supercharging Multimodal LLMs with Synthetic High-Fidelity
Visual Descriptions
Paper
• 2412.08737
• Published
• 54
BiMediX2: Bio-Medical EXpert LMM for Diverse Medical Modalities
Paper
• 2412.07769
• Published
• 30
Multi-Dimensional Insights: Benchmarking Real-World Personalization in
Large Multimodal Models
Paper
• 2412.12606
• Published
• 41
OmniEval: An Omnidirectional and Automatic RAG Evaluation Benchmark in
Financial Domain
Paper
• 2412.13018
• Published
• 41
Thinking in Space: How Multimodal Large Language Models See, Remember,
and Recall Spaces
Paper
• 2412.14171
• Published
• 24
MMFactory: A Universal Solution Search Engine for Vision-Language Tasks
Paper
• 2412.18072
• Published
• 18
MotionBench: Benchmarking and Improving Fine-grained Video Motion
Understanding for Vision Language Models
Paper
• 2501.02955
• Published
• 44
OVO-Bench: How Far is Your Video-LLMs from Real-World Online Video
Understanding?
Paper
• 2501.05510
• Published
• 44
LlamaV-o1: Rethinking Step-by-step Visual Reasoning in LLMs
Paper
• 2501.06186
• Published
• 65
MMDocIR: Benchmarking Multi-Modal Retrieval for Long Documents
Paper
• 2501.08828
• Published
• 30
Multimodal LLMs Can Reason about Aesthetics in Zero-Shot
Paper
• 2501.09012
• Published
• 10
VideoWorld: Exploring Knowledge Learning from Unlabeled Videos
Paper
• 2501.09781
• Published
• 27
MMVU: Measuring Expert-Level Multi-Discipline Video Understanding
Paper
• 2501.12380
• Published
• 84
MSTS: A Multimodal Safety Test Suite for Vision-Language Models
Paper
• 2501.10057
• Published
• 10
Video-MMMU: Evaluating Knowledge Acquisition from Multi-Discipline
Professional Videos
Paper
• 2501.13826
• Published
• 23
EmbodiedEval: Evaluate Multimodal LLMs as Embodied Agents
Paper
• 2501.11858
• Published
• 7
Redundancy Principles for MLLMs Benchmarks
Paper
• 2501.13953
• Published
• 29
PhysBench: Benchmarking and Enhancing Vision-Language Models for
Physical World Understanding
Paper
• 2501.16411
• Published
• 19
MM-IQ: Benchmarking Human-Like Abstraction and Reasoning in Multimodal
Models
Paper
• 2502.00698
• Published
• 24
SARChat-Bench-2M: A Multi-Task Vision-Language Benchmark for SAR Image
Interpretation
Paper
• 2502.08168
• Published
• 12
EmbodiedBench: Comprehensive Benchmarking Multi-modal Large Language
Models for Vision-Driven Embodied Agents
Paper
• 2502.09560
• Published
• 35
MME-CoT: Benchmarking Chain-of-Thought in Large Multimodal Models for
Reasoning Quality, Robustness, and Efficiency
Paper
• 2502.09621
• Published
• 28
mmE5: Improving Multimodal Multilingual Embeddings via High-quality
Synthetic Data
Paper
• 2502.08468
• Published
• 16
ZeroBench: An Impossible Visual Benchmark for Contemporary Large
Multimodal Models
Paper
• 2502.09696
• Published
• 43
MM-RLHF: The Next Step Forward in Multimodal LLM Alignment
Paper
• 2502.10391
• Published
• 34
MVL-SIB: A Massively Multilingual Vision-Language Benchmark for
Cross-Modal Topical Matching
Paper
• 2502.12852
• Published
• 3
GIMMICK -- Globally Inclusive Multimodal Multitask Cultural Knowledge
Benchmarking
Paper
• 2502.13766
• Published
• 3
VLM²-Bench: A Closer Look at How Well VLMs Implicitly Link Explicit
Matching Visual Cues
Paper
• 2502.12084
• Published
• 35
KITAB-Bench: A Comprehensive Multi-Domain Benchmark for Arabic OCR and
Document Understanding
Paper
• 2502.14949
• Published
• 9
Evaluating Multimodal Generative AI with Korean Educational Standards
Paper
• 2502.15422
• Published
• 10
Multimodal Inconsistency Reasoning (MMIR): A New Benchmark for
Multimodal Reasoning Models
Paper
• 2502.16033
• Published
• 18
M3-AGIQA: Multimodal, Multi-Round, Multi-Aspect AI-Generated Image
Quality Assessment
Paper
• 2502.15167
• Published
• 2
OmniAlign-V: Towards Enhanced Alignment of MLLMs with Human Preference
Paper
• 2502.18411
• Published
• 74
TheoremExplainAgent: Towards Multimodal Explanations for LLM Theorem
Understanding
Paper
• 2502.19400
• Published
• 47
QuoTA: Query-oriented Token Assignment via CoT Query Decouple for Long
Video Comprehension
Paper
• 2503.08689
• Published
• 4
VisualSimpleQA: A Benchmark for Decoupled Evaluation of Large
Vision-Language Models in Fact-Seeking Question Answering
Paper
• 2503.06492
• Published
• 11
VisualPRM: An Effective Process Reward Model for Multimodal Reasoning
Paper
• 2503.10291
• Published
• 36
R1-Onevision: Advancing Generalized Multimodal Reasoning through
Cross-Modal Formalization
Paper
• 2503.10615
• Published
• 17
MicroVQA: A Multimodal Reasoning Benchmark for Microscopy-Based
Scientific Research
Paper
• 2503.13399
• Published
• 22
Creation-MMBench: Assessing Context-Aware Creative Intelligence in MLLM
Paper
• 2503.14478
• Published
• 48
DeepPerception: Advancing R1-like Cognitive Visual Perception in MLLMs
for Knowledge-Intensive Visual Grounding
Paper
• 2503.12797
• Published
• 32
MPBench: A Comprehensive Multimodal Reasoning Benchmark for Process
Errors Identification
Paper
• 2503.12505
• Published
• 11
PEBench: A Fictitious Dataset to Benchmark Machine Unlearning for
Multimodal Large Language Models
Paper
• 2503.12545
• Published
• 7
Judge Anything: MLLM as a Judge Across Any Modality
Paper
• 2503.17489
• Published
• 23
Video SimpleQA: Towards Factuality Evaluation in Large Video Language
Models
Paper
• 2503.18923
• Published
• 14
Exploring Hallucination of Large Multimodal Models in Video
Understanding: Benchmark, Analysis and Mitigation
Paper
• 2503.19622
• Published
• 31
LEGO-Puzzles: How Good Are MLLMs at Multi-Step Spatial Reasoning?
Paper
• 2503.19990
• Published
• 35
VideoWebArena: Evaluating Long Context Multimodal Agents with Video
Understanding Web Tasks
Paper
• 2410.19100
• Published
• 6
Mobile-Agent-E: Self-Evolving Mobile Assistant for Complex Tasks
Paper
• 2501.11733
• Published
• 28
ViLBench: A Suite for Vision-Language Process Reward Modeling
Paper
• 2503.20271
• Published
• 7
VBench-2.0: Advancing Video Generation Benchmark Suite for Intrinsic
Faithfulness
Paper
• 2503.21755
• Published
• 33
4D-Bench: Benchmarking Multi-modal Large Language Models for 4D Object
Understanding
Paper
• 2503.17827
• Published
• 8
UPME: An Unsupervised Peer Review Framework for Multimodal Large
Language Model Evaluation
Paper
• 2503.14941
• Published
• 5
KOFFVQA: An Objectively Evaluated Free-form VQA Benchmark for Large
Vision-Language Models in the Korean Language
Paper
• 2503.23730
• Published
• 3
Exploring the Effect of Reinforcement Learning on Video Understanding:
Insights from SEED-Bench-R1
Paper
• 2503.24376
• Published
• 38
OmniMMI: A Comprehensive Multi-modal Interaction Benchmark in Streaming
Video Contexts
Paper
• 2503.22952
• Published
• 17
Envisioning Beyond the Pixels: Benchmarking Reasoning-Informed Visual
Editing
Paper
• 2504.02826
• Published
• 68
GPT-ImgEval: A Comprehensive Benchmark for Diagnosing GPT4o in Image
Generation
Paper
• 2504.02782
• Published
• 57
MME-Unify: A Comprehensive Benchmark for Unified Multimodal
Understanding and Generation Models
Paper
• 2504.03641
• Published
• 14
CrossWordBench: Evaluating the Reasoning Capabilities of LLMs and LVLMs
with Controllable Puzzle Generation
Paper
• 2504.00043
• Published
• 10
VCR-Bench: A Comprehensive Evaluation Framework for Video
Chain-of-Thought Reasoning
Paper
• 2504.07956
• Published
• 46
MDK12-Bench: A Multi-Discipline Benchmark for Evaluating Reasoning in
Multimodal Large Language Models
Paper
• 2504.05782
• Published
• 3
ColorBench: Can VLMs See and Understand the Colorful World? A
Comprehensive Benchmark for Color Perception, Reasoning, and Robustness
Paper
• 2504.10514
• Published
• 48
VisualPuzzles: Decoupling Multimodal Reasoning Evaluation from Domain
Knowledge
Paper
• 2504.10342
• Published
• 11
ChartQAPro: A More Diverse and Challenging Benchmark for Chart Question
Answering
Paper
• 2504.05506
• Published
• 25
LearnAct: Few-Shot Mobile GUI Agent with a Unified Demonstration
Benchmark
Paper
• 2504.13805
• Published
• 11
Seeing from Another Perspective: Evaluating Multi-View Understanding in
MLLMs
Paper
• 2504.15280
• Published
• 25
MM-IFEngine: Towards Multimodal Instruction Following
Paper
• 2504.07957
• Published
• 35
IV-Bench: A Benchmark for Image-Grounded Video Perception and Reasoning
in Multimodal LLMs
Paper
• 2504.15415
• Published
• 23
VisuLogic: A Benchmark for Evaluating Visual Reasoning in Multi-modal
Large Language Models
Paper
• 2504.15279
• Published
• 78
Towards Understanding Camera Motions in Any Video
Paper
• 2504.15376
• Published
• 155
VideoVista-CulturalLingo: 360° Horizons-Bridging Cultures,
Languages, and Domains in Video Comprehension
Paper
• 2504.17821
• Published
• 24
Can Large Language Models Help Multimodal Language Analysis? MMLA: A
Comprehensive Benchmark
Paper
• 2504.16427
• Published
• 18
Benchmarking Multimodal Mathematical Reasoning with Explicit Visual
Dependency
Paper
• 2504.18589
• Published
• 13
RoboVerse: Towards a Unified Platform, Dataset and Benchmark for
Scalable and Generalizable Robot Learning
Paper
• 2504.18904
• Published
• 9
BigDocs: An Open and Permissively-Licensed Dataset for Training
Multimodal Models on Document and Code Tasks
Paper
• 2412.04626
• Published
• 13
Beyond Recognition: Evaluating Visual Perspective Taking in Vision
Language Models
Paper
• 2505.03821
• Published
• 24
OmniGIRL: A Multilingual and Multimodal Benchmark for GitHub Issue
Resolution
Paper
• 2505.04606
• Published
• 9
OSUniverse: Benchmark for Multimodal GUI-navigation AI Agents
Paper
• 2505.03570
• Published
• 8
On Path to Multimodal Generalist: General-Level and General-Bench
Paper
• 2505.04620
• Published
• 82
ViMRHP: A Vietnamese Benchmark Dataset for Multimodal Review Helpfulness Prediction via Human-AI Collaborative Annotation
Paper
• 2505.07416
• Published
• 2
VCRBench: Exploring Long-form Causal Reasoning Capabilities of Large
Video Language Models
Paper
• 2505.08455
• Published
• 5
EWMBench: Evaluating Scene, Motion, and Semantic Quality in Embodied
World Models
Paper
• 2505.09694
• Published
• 20
PointArena: Probing Multimodal Grounding Through Language-Guided
Pointing
Paper
• 2505.09990
• Published
• 12
MMLongBench: Benchmarking Long-Context Vision-Language Models
Effectively and Thoroughly
Paper
• 2505.10610
• Published
• 55
ChartMuseum: Testing Visual Reasoning Capabilities of Large
Vision-Language Models
Paper
• 2505.13444
• Published
• 17
ViPlan: A Benchmark for Visual Planning with Symbolic Predicates and
Vision-Language Models
Paper
• 2505.13180
• Published
• 13
VideoEval-Pro: Robust and Realistic Long Video Understanding Evaluation
Paper
• 2505.14640
• Published
• 16
HumaniBench: A Human-Centric Framework for Large Multimodal Models
Evaluation
Paper
• 2505.11454
• Published
• 5
VideoGameQA-Bench: Evaluating Vision-Language Models for Video Game
Quality Assurance
Paper
• 2505.15952
• Published
• 20
SpatialScore: Towards Unified Evaluation for Multimodal Spatial
Understanding
Paper
• 2505.17012
• Published
• 12
RAVENEA: A Benchmark for Multimodal Retrieval-Augmented Visual Culture
Understanding
Paper
• 2505.14462
• Published
• 4
AudioTrust: Benchmarking the Multifaceted Trustworthiness of Audio Large
Language Models
Paper
• 2505.16211
• Published
• 18
FullFront: Benchmarking MLLMs Across the Full Front-End Engineering
Workflow
Paper
• 2505.17399
• Published
• 14
RBench-V: A Primary Assessment for Visual Reasoning Models with
Multi-modal Outputs
Paper
• 2505.16770
• Published
• 12
Are Vision-Language Models Safe in the Wild? A Meme-Based Benchmark
Study
Paper
• 2505.15389
• Published
• 8
Can MLLMs Guide Me Home? A Benchmark Study on Fine-Grained Visual
Reasoning from Transit Maps
Paper
• 2505.18675
• Published
• 26
MMIG-Bench: Towards Comprehensive and Explainable Evaluation of
Multi-Modal Image Generation Models
Paper
• 2505.19415
• Published
• 2
MME-Reasoning: A Comprehensive Benchmark for Logical Reasoning in MLLMs
Paper
• 2505.21327
• Published
• 83
MMMR: Benchmarking Massive Multi-Modal Reasoning Tasks
Paper
• 2505.16459
• Published
• 45
MME-VideoOCR: Evaluating OCR-Based Capabilities of Multimodal LLMs in
Video Scenarios
Paper
• 2505.21333
• Published
• 38
NOVA: A Benchmark for Anomaly Localization and Clinical Reasoning in
Brain MRI
Paper
• 2505.14064
• Published
• 19
VF-Eval: Evaluating Multimodal LLMs for Generating Feedback on AIGC
Videos
Paper
• 2505.23693
• Published
• 53
VideoReasonBench: Can MLLMs Perform Vision-Centric Complex Video
Reasoning?
Paper
• 2505.23359
• Published
• 38
Breaking Down Video LLM Benchmarks: Knowledge, Spatial Perception, or
True Temporal Understanding?
Paper
• 2505.14321
• Published
• 11
MMSI-Bench: A Benchmark for Multi-Image Spatial Intelligence
Paper
• 2505.23764
• Published
• 3
Open CaptchaWorld: A Comprehensive Web-based Platform for Testing and
Benchmarking Multimodal LLM Agents
Paper
• 2505.24878
• Published
• 23
More Thinking, Less Seeing? Assessing Amplified Hallucination in
Multimodal Reasoning Models
Paper
• 2505.21523
• Published
• 13
EarthMind: Towards Multi-Granular and Multi-Sensor Earth Observation
with Large Multimodal Models
Paper
• 2506.01667
• Published
• 21
VS-Bench: Evaluating VLMs for Strategic Reasoning and Decision-Making in
Multi-Agent Environments
Paper
• 2506.02387
• Published
• 58
OmniSpatial: Towards Comprehensive Spatial Reasoning Benchmark for
Vision Language Models
Paper
• 2506.03135
• Published
• 40
FinMME: Benchmark Dataset for Financial Multi-Modal Reasoning Evaluation
Paper
• 2505.24714
• Published
• 37
MMR-V: What's Left Unsaid? A Benchmark for Multimodal Deep Reasoning in
Videos
Paper
• 2506.04141
• Published
• 29
VLMs Can Aggregate Scattered Training Patches
Paper
• 2506.03614
• Published
• 2
VideoMathQA: Benchmarking Mathematical Reasoning via Multimodal
Understanding in Videos
Paper
• 2506.05349
• Published
• 24
AV-Reasoner: Improving and Benchmarking Clue-Grounded Audio-Visual
Counting for MLLMs
Paper
• 2506.05328
• Published
• 21
Unfolding Spatial Cognition: Evaluating Multimodal Models on Visual
Simulations
Paper
• 2506.04633
• Published
• 20
MORSE-500: A Programmatically Controllable Video Benchmark to
Stress-Test Multimodal Reasoning
Paper
• 2506.05523
• Published
• 34
MMRefine: Unveiling the Obstacles to Robust Refinement in Multimodal
Large Language Models
Paper
• 2506.04688
• Published
• 3
MCA-Bench: A Multimodal Benchmark for Evaluating CAPTCHA Robustness
Against VLM-based Attacks
Paper
• 2506.05982
• Published
• 2
Scientists' First Exam: Probing Cognitive Abilities of MLLM via
Perception, Understanding, and Reasoning
Paper
• 2506.10521
• Published
• 73
VideoMolmo: Spatio-Temporal Grounding Meets Pointing
Paper
• 2506.05336
• Published
• 9
SciVer: Evaluating Foundation Models for Multimodal Scientific Claim
Verification
Paper
• 2506.15569
• Published
• 12
ShotBench: Expert-Level Cinematic Understanding in Vision-Language
Models
Paper
• 2506.21356
• Published
• 22
Do Vision-Language Models Have Internal World Models? Towards an Atomic
Evaluation
Paper
• 2506.21876
• Published
• 28
SMMILE: An Expert-Driven Benchmark for Multimodal Medical In-Context
Learning
Paper
• 2506.21355
• Published
• 10
MARBLE: A Hard Benchmark for Multimodal Spatial Reasoning and Planning
Paper
• 2506.22992
• Published
• 12
Kwai Keye-VL Technical Report
Paper
• 2507.01949
• Published
• 131
HalluSegBench: Counterfactual Visual Reasoning for Segmentation
Hallucination Evaluation
Paper
• 2506.21546
• Published
• 2
OST-Bench: Evaluating the Capabilities of MLLMs in Online
Spatio-temporal Scene Understanding
Paper
• 2507.07984
• Published
• 43
Can Multimodal Foundation Models Understand Schematic Diagrams? An
Empirical Study on Information-Seeking QA over Scientific Papers
Paper
• 2507.10787
• Published
• 13
MMHU: A Massive-Scale Multimodal Benchmark for Human Behavior
Understanding
Paper
• 2507.12463
• Published
• 27
MMBench-GUI: Hierarchical Multi-Platform Evaluation Framework for GUI
Agents
Paper
• 2507.19478
• Published
• 32
MoHoBench: Assessing Honesty of Multimodal Large Language Models via
Unanswerable Visual Questions
Paper
• 2507.21503
• Published
• 3
AgroBench: Vision-Language Model Benchmark in Agriculture
Paper
• 2507.20519
• Published
• 8
DeepPHY: Benchmarking Agentic VLMs on Physical Reasoning
Paper
• 2508.05405
• Published
• 64
Can Large Multimodal Models Actively Recognize Faulty Inputs? A
Systematic Evaluation Framework of Their Input Scrutiny Ability
Paper
• 2508.04017
• Published
• 11
MM-BrowseComp: A Comprehensive Benchmark for Multimodal Browsing Agents
Paper
• 2508.13186
• Published
• 19
MMAU-Pro: A Challenging and Comprehensive Benchmark for Holistic
Evaluation of Audio General Intelligence
Paper
• 2508.13992
• Published
• 7
RotBench: Evaluating Multimodal Large Language Models on Identifying
Image Rotation
Paper
• 2508.13968
• Published
• 1
Seeing, Listening, Remembering, and Reasoning: A Multimodal Agent with
Long-Term Memory
Paper
• 2508.09736
• Published
• 58
MathReal: We Keep It Real! A Real Scene Benchmark for Evaluating Math
Reasoning in Multimodal Large Language Models
Paper
• 2508.06009
• Published
• 16
MEENA (PersianMMMU): Multimodal-Multilingual Educational Exams for
N-level Assessment
Paper
• 2508.17290
• Published
• 8
SEAM: Semantically Equivalent Across Modalities Benchmark for
Vision-Language Models
Paper
• 2508.18179
• Published
• 9
ELV-Halluc: Benchmarking Semantic Aggregation Hallucinations in Long
Video Understanding
Paper
• 2508.21496
• Published
• 55
MAS-Bench: A Unified Benchmark for Shortcut-Augmented Hybrid Mobile GUI
Agents
Paper
• 2509.06477
• Published
• 3
WildScore: Benchmarking MLLMs in-the-Wild Symbolic Music Reasoning
Paper
• 2509.04744
• Published
• 12
Measuring Epistemic Humility in Multimodal Large Language Models
Paper
• 2509.09658
• Published
• 7
Can Multimodal LLMs See Materials Clearly? A Multimodal Benchmark on
Materials Characterization
Paper
• 2509.09307
• Published
• 6
GeoPQA: Bridging the Visual Perception Gap in MLLMs for Geometric
Reasoning
Paper
• 2509.17437
• Published
• 17
VIR-Bench: Evaluating Geospatial and Temporal Understanding of MLLMs via
Travel Video Itinerary Reconstruction
Paper
• 2509.19002
• Published
• 3
OpenGVL - Benchmarking Visual Temporal Progress for Data Curation
Paper
• 2509.17321
• Published
• 3
GSM8K-V: Can Vision Language Models Solve Grade School Math Word
Problems in Visual Contexts
Paper
• 2509.25160
• Published
• 32
IWR-Bench: Can LVLMs reconstruct interactive webpage from a user
interaction video?
Paper
• 2509.24709
• Published
• 7
VisualOverload: Probing Visual Understanding of VLMs in Really Dense
Scenes
Paper
• 2509.25339
• Published
• 10
MedQ-Bench: Evaluating and Exploring Medical Image Quality Assessment
Abilities in MLLMs
Paper
• 2510.01691
• Published
• 4
Graph2Eval: Automatic Multimodal Task Generation for Agents via
Knowledge Graphs
Paper
• 2510.00507
• Published
• 2
UNIDOC-BENCH: A Unified Benchmark for Document-Centric Multimodal RAG
Paper
• 2510.03663
• Published
• 16
SciVideoBench: Benchmarking Scientific Video Reasoning in Large
Multimodal Models
Paper
• 2510.08559
• Published
• 9
PhysToolBench: Benchmarking Physical Tool Understanding for MLLMs
Paper
• 2510.09507
• Published
• 11
ExpVid: A Benchmark for Experiment Video Understanding & Reasoning
Paper
• 2510.11606
• Published
• 6
OmniVideoBench: Towards Audio-Visual Understanding Evaluation for Omni
MLLMs
Paper
• 2510.10689
• Published
• 47
InternSVG: Towards Unified SVG Tasks with Multimodal Large Language
Models
Paper
• 2510.11341
• Published
• 35
MultiVerse: A Multi-Turn Conversation Benchmark for Evaluating Large
Vision and Language Models
Paper
• 2510.16641
• Published
• 5
MT-Video-Bench: A Holistic Video Understanding Benchmark for Evaluating
Multimodal LLMs in Multi-Turn Dialogues
Paper
• 2510.17722
• Published
• 20
CRAG-MM: Multi-modal Multi-turn Comprehensive RAG Benchmark
Paper
• 2510.26160
• Published
• 17
TIR-Bench: A Comprehensive Benchmark for Agentic Thinking-with-Images
Reasoning
Paper
• 2511.01833
• Published
• 16
Do Vision-Language Models Measure Up? Benchmarking Visual Measurement
Reading with MeasureBench
Paper
• 2510.26865
• Published
• 12
VCode: a Multimodal Coding Benchmark with SVG as Symbolic Visual
Representation
Paper
• 2511.02778
• Published
• 102
When Modalities Conflict: How Unimodal Reasoning Uncertainty Governs
Preference Dynamics in MLLMs
Paper
• 2511.02243
• Published
• 25
Can Visual Input Be Compressed? A Visual Token Compression Benchmark for
Large Multimodal Models
Paper
• 2511.02650
• Published
• 10
MME-CC: A Challenging Multi-Modal Evaluation Benchmark of Cognitive
Capacity
Paper
• 2511.03146
• Published
• 8
GUI-360: A Comprehensive Dataset and Benchmark for Computer-Using Agents
Paper
• 2511.04307
• Published
• 15
MVU-Eval: Towards Multi-Video Understanding Evaluation for Multimodal
LLMs
Paper
• 2511.07250
• Published
• 18
MM-CRITIC: A Holistic Evaluation of Large Multimodal Models as Multimodal Critique
Paper
• 2511.09067
• Published
• 2
TiViBench: Benchmarking Think-in-Video Reasoning for Video Generative Models
Paper
• 2511.13704
• Published
• 43
GGBench: A Geometric Generative Reasoning Benchmark for Unified Multimodal Models
Paper
• 2511.11134
• Published
• 33
Can World Simulators Reason? Gen-ViRe: A Generative Visual Reasoning Benchmark
Paper
• 2511.13853
• Published
• 36
MVI-Bench: A Comprehensive Benchmark for Evaluating Robustness to Misleading Visual Inputs in LVLMs
Paper
• 2511.14159
• Published
• 25
Reasoning via Video: The First Evaluation of Video Models' Reasoning Abilities through Maze-Solving Tasks
Paper
• 2511.15065
• Published
• 77
TopoPerception: A Shortcut-Free Evaluation of Global Visual Perception in Large Vision-Language Models
Paper
• 2511.11831
• Published
• 1
V-ReasonBench: Toward Unified Reasoning Benchmark Suite for Video Generation Models
Paper
• 2511.16668
• Published
• 55
M3-Bench: Multi-Modal, Multi-Hop, Multi-Threaded Tool-Using MLLM Agent Benchmark
Paper
• 2511.17729
• Published
• 17
Multimodal Evaluation of Russian-language Architectures
Paper
• 2511.15552
• Published
• 79
VQ-VA World: Towards High-Quality Visual Question-Visual Answering
Paper
• 2511.20573
• Published
• 7
Multi-Crit: Benchmarking Multimodal Judges on Pluralistic Criteria-Following
Paper
• 2511.21662
• Published
• 11
CaptionQA: Is Your Caption as Useful as the Image Itself?
Paper
• 2511.21025
• Published
• 28
World in a Frame: Understanding Culture Mixing as a New Challenge for Vision-Language Models
Paper
• 2511.22787
• Published
• 10
SO-Bench: A Structural Output Evaluation of Multimodal LLMs
Paper
• 2511.21750
• Published
• 6
From Pixels to Feelings: Aligning MLLMs with Human Cognitive Perception of Images
Paper
• 2511.22805
• Published
• 4
Envision: Benchmarking Unified Understanding & Generation for Causal World Process Insights
Paper
• 2512.01816
• Published
• 93
RULER-Bench: Probing Rule-based Reasoning Abilities of Next-level Video Generation Models for Vision Foundation Intelligence
Paper
• 2512.02622
• Published
• 10
Benchmarking Scientific Understanding and Reasoning for Video Generation using VideoScience-Bench
Paper
• 2512.02942
• Published
• 5
IF-Bench: Benchmarking and Enhancing MLLMs for Infrared Images with Generative Visual Prompting
Paper
• 2512.09663
• Published
• 4
OmniSafeBench-MM: A Unified Benchmark and Toolbox for Multimodal Jailbreak Attack-Defense Evaluation
Paper
• 2512.06589
• Published
• 19
V-REX: Benchmarking Exploratory Visual Reasoning via Chain-of-Questions
Paper
• 2512.11995
• Published
• 10
From Macro to Micro: Benchmarking Microscopic Spatial Intelligence on Molecules via Vision-Language Models
Paper
• 2512.10867
• Published
• 16
GroundingME: Exposing the Visual Grounding Gap in MLLMs through Multi-Dimensional Evaluation
Paper
• 2512.17495
• Published
• 20
HERBench: A Benchmark for Multi-Evidence Integration in Video Question Answering
Paper
• 2512.14870
• Published
• 15
A Benchmark and Agentic Framework for Omni-Modal Reasoning and Tool Use in Long Videos
Paper
• 2512.16978
• Published
• 6
Multimodal RewardBench 2: Evaluating Omni Reward Models for Interleaved Text and Image
Paper
• 2512.16899
• Published
• 14
VABench: A Comprehensive Benchmark for Audio-Video Generation
Paper
• 2512.09299
• Published
• 8
JMMMU-Pro: Image-based Japanese Multi-discipline Multimodal Understanding Benchmark via Vibe Benchmark Construction
Paper
• 2512.14620
• Published
• 2
UniPercept: Towards Unified Perceptual-Level Image Understanding across Aesthetics, Quality, Structure, and Texture
Paper
• 2512.21675
• Published
• 25
SciEvalKit: An Open-source Evaluation Toolkit for Scientific General Intelligence
Paper
• 2512.22334
• Published
• 36
TSRBench: A Comprehensive Multi-task Multi-modal Time Series Reasoning Benchmark for Generalist Models
Paper
• 2601.18744
• Published
• 10
Vision-DeepResearch Benchmark: Rethinking Visual and Textual Search for Multimodal Large Language Models
Paper
• 2602.02185
• Published
• 125
Toward Cognitive Supersensing in Multimodal Large Language Model
Paper
• 2602.01541
• Published
• 16
AdaptMMBench: Benchmarking Adaptive Multimodal Reasoning for Mode Selection and Reasoning Process
Paper
• 2602.02676
• Published
• 10
WorldVQA: Measuring Atomic World Knowledge in Multimodal Large Language Models
Paper
• 2602.02537
• Published
• 6