Multimodal Benchmarks
Multimodal Self-Instruct: Synthetic Abstract Image and Visual Reasoning
Instruction Using Language Model
Paper
• 2407.07053
• Published
• 47
LMMs-Eval: Reality Check on the Evaluation of Large Multimodal Models
Paper
• 2407.12772
• Published
• 35
VLMEvalKit: An Open-Source Toolkit for Evaluating Large Multi-Modality
Models
Paper
• 2407.11691
• Published
• 16
MMIU: Multimodal Multi-image Understanding for Evaluating Large
Vision-Language Models
Paper
• 2408.02718
• Published
• 62
Teaching CLIP to Count to Ten
Paper
• 2302.12066
• Published
GRAB: A Challenging GRaph Analysis Benchmark for Large Multimodal Models
Paper
• 2408.11817
• Published
• 9
MME-RealWorld: Could Your Multimodal LLM Challenge High-Resolution
Real-World Scenarios that are Difficult for Humans?
Paper
• 2408.13257
• Published
• 26
UrBench: A Comprehensive Benchmark for Evaluating Large Multimodal
Models in Multi-View Urban Scenarios
Paper
• 2408.17267
• Published
• 23
VLM4Bio: A Benchmark Dataset to Evaluate Pretrained Vision-Language
Models for Trait Discovery from Biological Images
Paper
• 2408.16176
• Published
• 8
MMMU-Pro: A More Robust Multi-discipline Multimodal Understanding
Benchmark
Paper
• 2409.02813
• Published
• 33
DSBench: How Far Are Data Science Agents to Becoming Data Science
Experts?
Paper
• 2409.07703
• Published
• 66
OmniBench: Towards The Future of Universal Omni-Language Models
Paper
• 2409.15272
• Published
• 30
YesBut: A High-Quality Annotated Multimodal Dataset for evaluating
Satire Comprehension capability of Vision-Language Models
Paper
• 2409.13592
• Published
• 50
Vinoground: Scrutinizing LMMs over Dense Temporal Reasoning with Short
Videos
Paper
• 2410.02763
• Published
• 7
HumanEval-V: Benchmarking High-Level Visual Reasoning with Complex
Diagrams in Coding Tasks
Paper
• 2410.12381
• Published
• 43
WorldMedQA-V: a multilingual, multimodal medical examination dataset for
multimodal language models evaluation
Paper
• 2410.12722
• Published
• 5
MMIE: Massive Multimodal Interleaved Comprehension Benchmark for Large
Vision-Language Models
Paper
• 2410.10139
• Published
• 51
MEGA-Bench: Scaling Multimodal Evaluation to over 500 Real-World Tasks
Paper
• 2410.10563
• Published
• 37
LiveXiv -- A Multi-Modal Live Benchmark Based on Arxiv Papers Content
Paper
• 2410.10783
• Published
• 26
TemporalBench: Benchmarking Fine-grained Temporal Understanding for
Multimodal Video Models
Paper
• 2410.10818
• Published
• 16
MMCOMPOSITION: Revisiting the Compositionality of Pre-trained
Vision-Language Models
Paper
• 2410.09733
• Published
• 8
TVBench: Redesigning Video-Language Evaluation
Paper
• 2410.07752
• Published
• 6
MixEval-X: Any-to-Any Evaluations from Real-World Data Mixtures
Paper
• 2410.13754
• Published
• 75
The Curse of Multi-Modalities: Evaluating Hallucinations of Large
Multimodal Models across Language, Visual, and Audio
Paper
• 2410.12787
• Published
• 30
JMMMU: A Japanese Massive Multi-discipline Multimodal Understanding
Benchmark for Culture-aware Evaluation
Paper
• 2410.17250
• Published
• 14
NaturalBench: Evaluating Vision-Language Models on Natural Adversarial
Samples
Paper
• 2410.14669
• Published
• 39
CAMEL-Bench: A Comprehensive Arabic LMM Benchmark
Paper
• 2410.18976
• Published
• 13
TP-Eval: Tap Multimodal LLMs' Potential in Evaluation by Customizing
Prompts
Paper
• 2410.18071
• Published
• 7
CLEAR: Character Unlearning in Textual and Visual Modalities
Paper
• 2410.18057
• Published
• 209
MMAU: A Massive Multi-Task Audio Understanding and Reasoning Benchmark
Paper
• 2410.19168
• Published
• 24
BenchX: A Unified Benchmark Framework for Medical Vision-Language
Pretraining on Chest X-Rays
Paper
• 2410.21969
• Published
• 10
TOMATO: Assessing Visual Temporal Reasoning Capabilities in Multimodal
Foundation Models
Paper
• 2410.23266
• Published
• 20
DynaMath: A Dynamic Visual Benchmark for Evaluating Mathematical
Reasoning Robustness of Vision Language Models
Paper
• 2411.00836
• Published
• 15
M3SciQA: A Multi-Modal Multi-Document Scientific QA Benchmark for
Evaluating Foundation Models
Paper
• 2411.04075
• Published
• 16
M-Longdoc: A Benchmark For Multimodal Super-Long Document Understanding
And A Retrieval-Aware Tuning Framework
Paper
• 2411.06176
• Published
• 45
JanusFlow: Harmonizing Autoregression and Rectified Flow for Unified
Multimodal Understanding and Generation
Paper
• 2411.07975
• Published
• 31
VLRewardBench: A Challenging Benchmark for Vision-Language Generative
Reward Models
Paper
• 2411.17451
• Published
• 11
Interleaved Scene Graph for Interleaved Text-and-Image Generation
Assessment
Paper
• 2411.17188
• Published
• 20
VideoLLM Knows When to Speak: Enhancing Time-Sensitive Video
Comprehension with Video-Text Duet Interaction Format
Paper
• 2411.17991
• Published
• 5
MME-Survey: A Comprehensive Survey on Evaluation of Multimodal LLMs
Paper
• 2411.15296
• Published
• 21
VisOnlyQA: Large Vision Language Models Still Struggle with Visual
Perception of Geometric Information
Paper
• 2412.00947
• Published
• 8
AV-Odyssey Bench: Can Your Multimodal LLMs Really Understand
Audio-Visual Information?
Paper
• 2412.02611
• Published
• 25
3DSRBench: A Comprehensive 3D Spatial Reasoning Benchmark
Paper
• 2412.07825
• Published
• 12
OmniDocBench: Benchmarking Diverse PDF Document Parsing with
Comprehensive Annotations
Paper
• 2412.07626
• Published
• 28
Euclid: Supercharging Multimodal LLMs with Synthetic High-Fidelity
Visual Descriptions
Paper
• 2412.08737
• Published
• 54
BiMediX2: Bio-Medical EXpert LMM for Diverse Medical Modalities
Paper
• 2412.07769
• Published
• 30
Multi-Dimensional Insights: Benchmarking Real-World Personalization in
Large Multimodal Models
Paper
• 2412.12606
• Published
• 41
OmniEval: An Omnidirectional and Automatic RAG Evaluation Benchmark in
Financial Domain
Paper
• 2412.13018
• Published
• 41
Thinking in Space: How Multimodal Large Language Models See, Remember,
and Recall Spaces
Paper
• 2412.14171
• Published
• 24
MMFactory: A Universal Solution Search Engine for Vision-Language Tasks
Paper
• 2412.18072
• Published
• 18
MotionBench: Benchmarking and Improving Fine-grained Video Motion
Understanding for Vision Language Models
Paper
• 2501.02955
• Published
• 44
OVO-Bench: How Far is Your Video-LLMs from Real-World Online Video
Understanding?
Paper
• 2501.05510
• Published
• 44
LlamaV-o1: Rethinking Step-by-step Visual Reasoning in LLMs
Paper
• 2501.06186
• Published
• 65
MMDocIR: Benchmarking Multi-Modal Retrieval for Long Documents
Paper
• 2501.08828
• Published
• 30
Multimodal LLMs Can Reason about Aesthetics in Zero-Shot
Paper
• 2501.09012
• Published
• 10
VideoWorld: Exploring Knowledge Learning from Unlabeled Videos
Paper
• 2501.09781
• Published
• 27
MMVU: Measuring Expert-Level Multi-Discipline Video Understanding
Paper
• 2501.12380
• Published
• 84
MSTS: A Multimodal Safety Test Suite for Vision-Language Models
Paper
• 2501.10057
• Published
• 10
Video-MMMU: Evaluating Knowledge Acquisition from Multi-Discipline
Professional Videos
Paper
• 2501.13826
• Published
• 23
EmbodiedEval: Evaluate Multimodal LLMs as Embodied Agents
Paper
• 2501.11858
• Published
• 7
Redundancy Principles for MLLMs Benchmarks
Paper
• 2501.13953
• Published
• 29
PhysBench: Benchmarking and Enhancing Vision-Language Models for
Physical World Understanding
Paper
• 2501.16411
• Published
• 19
MM-IQ: Benchmarking Human-Like Abstraction and Reasoning in Multimodal
Models
Paper
• 2502.00698
• Published
• 24
SARChat-Bench-2M: A Multi-Task Vision-Language Benchmark for SAR Image
Interpretation
Paper
• 2502.08168
• Published
• 12
EmbodiedBench: Comprehensive Benchmarking Multi-modal Large Language
Models for Vision-Driven Embodied Agents
Paper
• 2502.09560
• Published
• 35
MME-CoT: Benchmarking Chain-of-Thought in Large Multimodal Models for
Reasoning Quality, Robustness, and Efficiency
Paper
• 2502.09621
• Published
• 28
mmE5: Improving Multimodal Multilingual Embeddings via High-quality
Synthetic Data
Paper
• 2502.08468
• Published
• 16
ZeroBench: An Impossible Visual Benchmark for Contemporary Large
Multimodal Models
Paper
• 2502.09696
• Published
• 43
MM-RLHF: The Next Step Forward in Multimodal LLM Alignment
Paper
• 2502.10391
• Published
• 34
MVL-SIB: A Massively Multilingual Vision-Language Benchmark for
Cross-Modal Topical Matching
Paper
• 2502.12852
• Published
• 3
GIMMICK -- Globally Inclusive Multimodal Multitask Cultural Knowledge
Benchmarking
Paper
• 2502.13766
• Published
• 3
VLM²-Bench: A Closer Look at How Well VLMs Implicitly Link Explicit
Matching Visual Cues
Paper
• 2502.12084
• Published
• 35
KITAB-Bench: A Comprehensive Multi-Domain Benchmark for Arabic OCR and
Document Understanding
Paper
• 2502.14949
• Published
• 9
Evaluating Multimodal Generative AI with Korean Educational Standards
Paper
• 2502.15422
• Published
• 10
Multimodal Inconsistency Reasoning (MMIR): A New Benchmark for
Multimodal Reasoning Models
Paper
• 2502.16033
• Published
• 18
M3-AGIQA: Multimodal, Multi-Round, Multi-Aspect AI-Generated Image
Quality Assessment
Paper
• 2502.15167
• Published
• 2
OmniAlign-V: Towards Enhanced Alignment of MLLMs with Human Preference
Paper
• 2502.18411
• Published
• 74
TheoremExplainAgent: Towards Multimodal Explanations for LLM Theorem
Understanding
Paper
• 2502.19400
• Published
• 47
QuoTA: Query-oriented Token Assignment via CoT Query Decouple for Long
Video Comprehension
Paper
• 2503.08689
• Published
• 4
VisualSimpleQA: A Benchmark for Decoupled Evaluation of Large
Vision-Language Models in Fact-Seeking Question Answering
Paper
• 2503.06492
• Published
• 11
VisualPRM: An Effective Process Reward Model for Multimodal Reasoning
Paper
• 2503.10291
• Published
• 36
R1-Onevision: Advancing Generalized Multimodal Reasoning through
Cross-Modal Formalization
Paper
• 2503.10615
• Published
• 17
MicroVQA: A Multimodal Reasoning Benchmark for Microscopy-Based
Scientific Research
Paper
• 2503.13399
• Published
• 22
Creation-MMBench: Assessing Context-Aware Creative Intelligence in MLLM
Paper
• 2503.14478
• Published
• 48
DeepPerception: Advancing R1-like Cognitive Visual Perception in MLLMs
for Knowledge-Intensive Visual Grounding
Paper
• 2503.12797
• Published
• 32
MPBench: A Comprehensive Multimodal Reasoning Benchmark for Process
Errors Identification
Paper
• 2503.12505
• Published
• 11
PEBench: A Fictitious Dataset to Benchmark Machine Unlearning for
Multimodal Large Language Models
Paper
• 2503.12545
• Published
• 7
Judge Anything: MLLM as a Judge Across Any Modality
Paper
• 2503.17489
• Published
• 23
Video SimpleQA: Towards Factuality Evaluation in Large Video Language
Models
Paper
• 2503.18923
• Published
• 14
Exploring Hallucination of Large Multimodal Models in Video
Understanding: Benchmark, Analysis and Mitigation
Paper
• 2503.19622
• Published
• 31
LEGO-Puzzles: How Good Are MLLMs at Multi-Step Spatial Reasoning?
Paper
• 2503.19990
• Published
• 35
VideoWebArena: Evaluating Long Context Multimodal Agents with Video
Understanding Web Tasks
Paper
• 2410.19100
• Published
• 6
Mobile-Agent-E: Self-Evolving Mobile Assistant for Complex Tasks
Paper
• 2501.11733
• Published
• 28
ViLBench: A Suite for Vision-Language Process Reward Modeling
Paper
• 2503.20271
• Published
• 7
VBench-2.0: Advancing Video Generation Benchmark Suite for Intrinsic
Faithfulness
Paper
• 2503.21755
• Published
• 33
4D-Bench: Benchmarking Multi-modal Large Language Models for 4D Object
Understanding
Paper
• 2503.17827
• Published
• 8
UPME: An Unsupervised Peer Review Framework for Multimodal Large
Language Model Evaluation
Paper
• 2503.14941
• Published
• 5
KOFFVQA: An Objectively Evaluated Free-form VQA Benchmark for Large
Vision-Language Models in the Korean Language
Paper
• 2503.23730
• Published
• 3
Exploring the Effect of Reinforcement Learning on Video Understanding:
Insights from SEED-Bench-R1
Paper
• 2503.24376
• Published
• 38
OmniMMI: A Comprehensive Multi-modal Interaction Benchmark in Streaming
Video Contexts
Paper
• 2503.22952
• Published
• 17
Envisioning Beyond the Pixels: Benchmarking Reasoning-Informed Visual
Editing
Paper
• 2504.02826
• Published
• 68
GPT-ImgEval: A Comprehensive Benchmark for Diagnosing GPT4o in Image
Generation
Paper
• 2504.02782
• Published
• 57
MME-Unify: A Comprehensive Benchmark for Unified Multimodal
Understanding and Generation Models
Paper
• 2504.03641
• Published
• 14
CrossWordBench: Evaluating the Reasoning Capabilities of LLMs and LVLMs
with Controllable Puzzle Generation
Paper
• 2504.00043
• Published
• 10
VCR-Bench: A Comprehensive Evaluation Framework for Video
Chain-of-Thought Reasoning
Paper
• 2504.07956
• Published
• 46
MDK12-Bench: A Multi-Discipline Benchmark for Evaluating Reasoning in
Multimodal Large Language Models
Paper
• 2504.05782
• Published
• 3
ColorBench: Can VLMs See and Understand the Colorful World? A
Comprehensive Benchmark for Color Perception, Reasoning, and Robustness
Paper
• 2504.10514
• Published
• 48
VisualPuzzles: Decoupling Multimodal Reasoning Evaluation from Domain
Knowledge
Paper
• 2504.10342
• Published
• 11
ChartQAPro: A More Diverse and Challenging Benchmark for Chart Question
Answering
Paper
• 2504.05506
• Published
• 25
LearnAct: Few-Shot Mobile GUI Agent with a Unified Demonstration
Benchmark
Paper
• 2504.13805
• Published
• 11
Seeing from Another Perspective: Evaluating Multi-View Understanding in
MLLMs
Paper
• 2504.15280
• Published
• 25
MM-IFEngine: Towards Multimodal Instruction Following
Paper
• 2504.07957
• Published
• 35
IV-Bench: A Benchmark for Image-Grounded Video Perception and Reasoning
in Multimodal LLMs
Paper
• 2504.15415
• Published
• 23
VisuLogic: A Benchmark for Evaluating Visual Reasoning in Multi-modal
Large Language Models
Paper
• 2504.15279
• Published
• 78
Towards Understanding Camera Motions in Any Video
Paper
• 2504.15376
• Published
• 155
VideoVista-CulturalLingo: 360° Horizons-Bridging Cultures,
Languages, and Domains in Video Comprehension
Paper
• 2504.17821
• Published
• 24
Can Large Language Models Help Multimodal Language Analysis? MMLA: A
Comprehensive Benchmark
Paper
• 2504.16427
• Published
• 18
Benchmarking Multimodal Mathematical Reasoning with Explicit Visual
Dependency
Paper
• 2504.18589
• Published
• 13
RoboVerse: Towards a Unified Platform, Dataset and Benchmark for
Scalable and Generalizable Robot Learning
Paper
• 2504.18904
• Published
• 9
BigDocs: An Open and Permissively-Licensed Dataset for Training
Multimodal Models on Document and Code Tasks
Paper
• 2412.04626
• Published
• 13
Beyond Recognition: Evaluating Visual Perspective Taking in Vision
Language Models
Paper
• 2505.03821
• Published
• 24
OmniGIRL: A Multilingual and Multimodal Benchmark for GitHub Issue
Resolution
Paper
• 2505.04606
• Published
• 9
OSUniverse: Benchmark for Multimodal GUI-navigation AI Agents
Paper
• 2505.03570
• Published
• 8
On Path to Multimodal Generalist: General-Level and General-Bench
Paper
• 2505.04620
• Published
• 82
ViMRHP: A Vietnamese Benchmark Dataset for Multimodal Review Helpfulness Prediction via Human-AI Collaborative Annotation
Paper
• 2505.07416
• Published
• 2
VCRBench: Exploring Long-form Causal Reasoning Capabilities of Large
Video Language Models
Paper
• 2505.08455
• Published
• 5
EWMBench: Evaluating Scene, Motion, and Semantic Quality in Embodied
World Models
Paper
• 2505.09694
• Published
• 20
PointArena: Probing Multimodal Grounding Through Language-Guided
Pointing
Paper
• 2505.09990
• Published
• 12
MMLongBench: Benchmarking Long-Context Vision-Language Models
Effectively and Thoroughly
Paper
• 2505.10610
• Published
• 55
ChartMuseum: Testing Visual Reasoning Capabilities of Large
Vision-Language Models
Paper
• 2505.13444
• Published
• 17
ViPlan: A Benchmark for Visual Planning with Symbolic Predicates and
Vision-Language Models
Paper
• 2505.13180
• Published
• 13
VideoEval-Pro: Robust and Realistic Long Video Understanding Evaluation
Paper
• 2505.14640
• Published
• 16
HumaniBench: A Human-Centric Framework for Large Multimodal Models
Evaluation
Paper
• 2505.11454
• Published
• 5
VideoGameQA-Bench: Evaluating Vision-Language Models for Video Game
Quality Assurance
Paper
• 2505.15952
• Published
• 20
SpatialScore: Towards Unified Evaluation for Multimodal Spatial
Understanding
Paper
• 2505.17012
• Published
• 12
RAVENEA: A Benchmark for Multimodal Retrieval-Augmented Visual Culture
Understanding
Paper
• 2505.14462
• Published
• 4
AudioTrust: Benchmarking the Multifaceted Trustworthiness of Audio Large
Language Models
Paper
• 2505.16211
• Published
• 18
FullFront: Benchmarking MLLMs Across the Full Front-End Engineering
Workflow
Paper
• 2505.17399
• Published
• 14
RBench-V: A Primary Assessment for Visual Reasoning Models with
Multi-modal Outputs
Paper
• 2505.16770
• Published
• 12
Are Vision-Language Models Safe in the Wild? A Meme-Based Benchmark
Study
Paper
• 2505.15389
• Published
• 8
Can MLLMs Guide Me Home? A Benchmark Study on Fine-Grained Visual
Reasoning from Transit Maps
Paper
• 2505.18675
• Published
• 26
MMIG-Bench: Towards Comprehensive and Explainable Evaluation of
Multi-Modal Image Generation Models
Paper
• 2505.19415
• Published
• 2
MME-Reasoning: A Comprehensive Benchmark for Logical Reasoning in MLLMs
Paper
• 2505.21327
• Published
• 83
MMMR: Benchmarking Massive Multi-Modal Reasoning Tasks
Paper
• 2505.16459
• Published
• 45
MME-VideoOCR: Evaluating OCR-Based Capabilities of Multimodal LLMs in
Video Scenarios
Paper
• 2505.21333
• Published
• 38
NOVA: A Benchmark for Anomaly Localization and Clinical Reasoning in
Brain MRI
Paper
• 2505.14064
• Published
• 19
VF-Eval: Evaluating Multimodal LLMs for Generating Feedback on AIGC
Videos
Paper
• 2505.23693
• Published
• 53
VideoReasonBench: Can MLLMs Perform Vision-Centric Complex Video
Reasoning?
Paper
• 2505.23359
• Published
• 38
Breaking Down Video LLM Benchmarks: Knowledge, Spatial Perception, or
True Temporal Understanding?
Paper
• 2505.14321
• Published
• 11
MMSI-Bench: A Benchmark for Multi-Image Spatial Intelligence
Paper
• 2505.23764
• Published
• 3
Open CaptchaWorld: A Comprehensive Web-based Platform for Testing and
Benchmarking Multimodal LLM Agents
Paper
• 2505.24878
• Published
• 23
More Thinking, Less Seeing? Assessing Amplified Hallucination in
Multimodal Reasoning Models
Paper
• 2505.21523
• Published
• 13
EarthMind: Towards Multi-Granular and Multi-Sensor Earth Observation
with Large Multimodal Models
Paper
• 2506.01667
• Published
• 21
VS-Bench: Evaluating VLMs for Strategic Reasoning and Decision-Making in
Multi-Agent Environments
Paper
• 2506.02387
• Published
• 58
OmniSpatial: Towards Comprehensive Spatial Reasoning Benchmark for
Vision Language Models
Paper
• 2506.03135
• Published
• 40
FinMME: Benchmark Dataset for Financial Multi-Modal Reasoning Evaluation
Paper
• 2505.24714
• Published
• 37
MMR-V: What's Left Unsaid? A Benchmark for Multimodal Deep Reasoning in
Videos
Paper
• 2506.04141
• Published
• 29
VLMs Can Aggregate Scattered Training Patches
Paper
• 2506.03614
• Published
• 2
VideoMathQA: Benchmarking Mathematical Reasoning via Multimodal
Understanding in Videos
Paper
• 2506.05349
• Published
• 24
AV-Reasoner: Improving and Benchmarking Clue-Grounded Audio-Visual
Counting for MLLMs
Paper
• 2506.05328
• Published
• 21
Unfolding Spatial Cognition: Evaluating Multimodal Models on Visual
Simulations
Paper
• 2506.04633
• Published
• 20
MORSE-500: A Programmatically Controllable Video Benchmark to
Stress-Test Multimodal Reasoning
Paper
• 2506.05523
• Published
• 34
MMRefine: Unveiling the Obstacles to Robust Refinement in Multimodal
Large Language Models
Paper
• 2506.04688
• Published
• 3
MCA-Bench: A Multimodal Benchmark for Evaluating CAPTCHA Robustness
Against VLM-based Attacks
Paper
• 2506.05982
• Published
• 2
Scientists' First Exam: Probing Cognitive Abilities of MLLM via
Perception, Understanding, and Reasoning
Paper
• 2506.10521
• Published
• 73
VideoMolmo: Spatio-Temporal Grounding Meets Pointing
Paper
• 2506.05336
• Published
• 9
SciVer: Evaluating Foundation Models for Multimodal Scientific Claim
Verification
Paper
• 2506.15569
• Published
• 12
ShotBench: Expert-Level Cinematic Understanding in Vision-Language
Models
Paper
• 2506.21356
• Published
• 22
Do Vision-Language Models Have Internal World Models? Towards an Atomic
Evaluation
Paper
• 2506.21876
• Published
• 28
SMMILE: An Expert-Driven Benchmark for Multimodal Medical In-Context
Learning
Paper
• 2506.21355
• Published
• 10
MARBLE: A Hard Benchmark for Multimodal Spatial Reasoning and Planning
Paper
• 2506.22992
• Published
• 12
Kwai Keye-VL Technical Report
Paper
• 2507.01949
• Published
• 131
HalluSegBench: Counterfactual Visual Reasoning for Segmentation
Hallucination Evaluation
Paper
• 2506.21546
• Published
• 2
OST-Bench: Evaluating the Capabilities of MLLMs in Online
Spatio-temporal Scene Understanding
Paper
• 2507.07984
• Published
• 43
Can Multimodal Foundation Models Understand Schematic Diagrams? An
Empirical Study on Information-Seeking QA over Scientific Papers
Paper
• 2507.10787
• Published
• 13
MMHU: A Massive-Scale Multimodal Benchmark for Human Behavior
Understanding
Paper
• 2507.12463
• Published
• 27
MMBench-GUI: Hierarchical Multi-Platform Evaluation Framework for GUI
Agents
Paper
• 2507.19478
• Published
• 32
MoHoBench: Assessing Honesty of Multimodal Large Language Models via
Unanswerable Visual Questions
Paper
• 2507.21503
• Published
• 3
AgroBench: Vision-Language Model Benchmark in Agriculture
Paper
• 2507.20519
• Published
• 8
DeepPHY: Benchmarking Agentic VLMs on Physical Reasoning
Paper
• 2508.05405
• Published
• 64
Can Large Multimodal Models Actively Recognize Faulty Inputs? A
Systematic Evaluation Framework of Their Input Scrutiny Ability
Paper
• 2508.04017
• Published
• 11
MM-BrowseComp: A Comprehensive Benchmark for Multimodal Browsing Agents
Paper
• 2508.13186
• Published
• 19
MMAU-Pro: A Challenging and Comprehensive Benchmark for Holistic
Evaluation of Audio General Intelligence
Paper
• 2508.13992
• Published
• 7
RotBench: Evaluating Multimodal Large Language Models on Identifying
Image Rotation
Paper
• 2508.13968
• Published
• 1
Seeing, Listening, Remembering, and Reasoning: A Multimodal Agent with
Long-Term Memory
Paper
• 2508.09736
• Published
• 58
MathReal: We Keep It Real! A Real Scene Benchmark for Evaluating Math
Reasoning in Multimodal Large Language Models
Paper
• 2508.06009
• Published
• 16
MEENA (PersianMMMU): Multimodal-Multilingual Educational Exams for
N-level Assessment
Paper
• 2508.17290
• Published
• 8
SEAM: Semantically Equivalent Across Modalities Benchmark for
Vision-Language Models
Paper
• 2508.18179
• Published
• 9
ELV-Halluc: Benchmarking Semantic Aggregation Hallucinations in Long
Video Understanding
Paper
• 2508.21496
• Published
• 55
MAS-Bench: A Unified Benchmark for Shortcut-Augmented Hybrid Mobile GUI
Agents
Paper
• 2509.06477
• Published
• 3
WildScore: Benchmarking MLLMs in-the-Wild Symbolic Music Reasoning
Paper
• 2509.04744
• Published
• 12
Measuring Epistemic Humility in Multimodal Large Language Models
Paper
• 2509.09658
• Published
• 7
Can Multimodal LLMs See Materials Clearly? A Multimodal Benchmark on
Materials Characterization
Paper
• 2509.09307
• Published
• 6
GeoPQA: Bridging the Visual Perception Gap in MLLMs for Geometric
Reasoning
Paper
• 2509.17437
• Published
• 17
VIR-Bench: Evaluating Geospatial and Temporal Understanding of MLLMs via
Travel Video Itinerary Reconstruction
Paper
• 2509.19002
• Published
• 3
OpenGVL - Benchmarking Visual Temporal Progress for Data Curation
Paper
• 2509.17321
• Published
• 3
GSM8K-V: Can Vision Language Models Solve Grade School Math Word
Problems in Visual Contexts
Paper
• 2509.25160
• Published
• 32
IWR-Bench: Can LVLMs reconstruct interactive webpage from a user
interaction video?
Paper
• 2509.24709
• Published
• 7
VisualOverload: Probing Visual Understanding of VLMs in Really Dense
Scenes
Paper
• 2509.25339
• Published
• 10
MedQ-Bench: Evaluating and Exploring Medical Image Quality Assessment
Abilities in MLLMs
Paper
• 2510.01691
• Published
• 4
Graph2Eval: Automatic Multimodal Task Generation for Agents via
Knowledge Graphs
Paper
• 2510.00507
• Published
• 2
UNIDOC-BENCH: A Unified Benchmark for Document-Centric Multimodal RAG
Paper
• 2510.03663
• Published
• 16
SciVideoBench: Benchmarking Scientific Video Reasoning in Large
Multimodal Models
Paper
• 2510.08559
• Published
• 9
PhysToolBench: Benchmarking Physical Tool Understanding for MLLMs
Paper
• 2510.09507
• Published
• 11
ExpVid: A Benchmark for Experiment Video Understanding & Reasoning
Paper
• 2510.11606
• Published
• 6
OmniVideoBench: Towards Audio-Visual Understanding Evaluation for Omni
MLLMs
Paper
• 2510.10689
• Published
• 47
InternSVG: Towards Unified SVG Tasks with Multimodal Large Language
Models
Paper
• 2510.11341
• Published
• 35
MultiVerse: A Multi-Turn Conversation Benchmark for Evaluating Large
Vision and Language Models
Paper
• 2510.16641
• Published
• 5
MT-Video-Bench: A Holistic Video Understanding Benchmark for Evaluating
Multimodal LLMs in Multi-Turn Dialogues
Paper
• 2510.17722
• Published
• 20
CRAG-MM: Multi-modal Multi-turn Comprehensive RAG Benchmark
Paper
• 2510.26160
• Published
• 17
TIR-Bench: A Comprehensive Benchmark for Agentic Thinking-with-Images
Reasoning
Paper
• 2511.01833
• Published
• 16
Do Vision-Language Models Measure Up? Benchmarking Visual Measurement
Reading with MeasureBench
Paper
• 2510.26865
• Published
• 12
VCode: a Multimodal Coding Benchmark with SVG as Symbolic Visual
Representation
Paper
• 2511.02778
• Published
• 102
When Modalities Conflict: How Unimodal Reasoning Uncertainty Governs
Preference Dynamics in MLLMs
Paper
• 2511.02243
• Published
• 25
Can Visual Input Be Compressed? A Visual Token Compression Benchmark for
Large Multimodal Models
Paper
• 2511.02650
• Published
• 10
MME-CC: A Challenging Multi-Modal Evaluation Benchmark of Cognitive
Capacity
Paper
• 2511.03146
• Published
• 8
GUI-360: A Comprehensive Dataset and Benchmark for Computer-Using Agents
Paper
• 2511.04307
• Published
• 15
MVU-Eval: Towards Multi-Video Understanding Evaluation for Multimodal
LLMs
Paper
• 2511.07250
• Published
• 18
MM-CRITIC: A Holistic Evaluation of Large Multimodal Models as Multimodal Critique
Paper
• 2511.09067
• Published
• 2
TiViBench: Benchmarking Think-in-Video Reasoning for Video Generative Models
Paper
• 2511.13704
• Published
• 43
GGBench: A Geometric Generative Reasoning Benchmark for Unified Multimodal Models
Paper
• 2511.11134
• Published
• 33
Can World Simulators Reason? Gen-ViRe: A Generative Visual Reasoning Benchmark
Paper
• 2511.13853
• Published
• 36
MVI-Bench: A Comprehensive Benchmark for Evaluating Robustness to Misleading Visual Inputs in LVLMs
Paper
• 2511.14159
• Published
• 25
Reasoning via Video: The First Evaluation of Video Models' Reasoning Abilities through Maze-Solving Tasks
Paper
• 2511.15065
• Published
• 77
TopoPerception: A Shortcut-Free Evaluation of Global Visual Perception in Large Vision-Language Models
Paper
• 2511.11831
• Published
• 1
V-ReasonBench: Toward Unified Reasoning Benchmark Suite for Video Generation Models
Paper
• 2511.16668
• Published
• 55
M3-Bench: Multi-Modal, Multi-Hop, Multi-Threaded Tool-Using MLLM Agent Benchmark
Paper
• 2511.17729
• Published
• 17
Multimodal Evaluation of Russian-language Architectures
Paper
• 2511.15552
• Published
• 79
VQ-VA World: Towards High-Quality Visual Question-Visual Answering
Paper
• 2511.20573
• Published
• 7
Multi-Crit: Benchmarking Multimodal Judges on Pluralistic Criteria-Following
Paper
• 2511.21662
• Published
• 11
CaptionQA: Is Your Caption as Useful as the Image Itself?
Paper
• 2511.21025
• Published
• 28
World in a Frame: Understanding Culture Mixing as a New Challenge for Vision-Language Models
Paper
• 2511.22787
• Published
• 10
SO-Bench: A Structural Output Evaluation of Multimodal LLMs
Paper
• 2511.21750
• Published
• 6
From Pixels to Feelings: Aligning MLLMs with Human Cognitive Perception of Images
Paper
• 2511.22805
• Published
• 4
Envision: Benchmarking Unified Understanding & Generation for Causal World Process Insights
Paper
• 2512.01816
• Published
• 93
RULER-Bench: Probing Rule-based Reasoning Abilities of Next-level Video Generation Models for Vision Foundation Intelligence
Paper
• 2512.02622
• Published
• 10
Benchmarking Scientific Understanding and Reasoning for Video Generation using VideoScience-Bench
Paper
• 2512.02942
• Published
• 5
IF-Bench: Benchmarking and Enhancing MLLMs for Infrared Images with Generative Visual Prompting
Paper
• 2512.09663
• Published
• 4
OmniSafeBench-MM: A Unified Benchmark and Toolbox for Multimodal Jailbreak Attack-Defense Evaluation
Paper
• 2512.06589
• Published
• 19
V-REX: Benchmarking Exploratory Visual Reasoning via Chain-of-Questions
Paper
• 2512.11995
• Published
• 10
From Macro to Micro: Benchmarking Microscopic Spatial Intelligence on Molecules via Vision-Language Models
Paper
• 2512.10867
• Published
• 16
GroundingME: Exposing the Visual Grounding Gap in MLLMs through Multi-Dimensional Evaluation
Paper
• 2512.17495
• Published
• 20
HERBench: A Benchmark for Multi-Evidence Integration in Video Question Answering
Paper
• 2512.14870
• Published
• 15
A Benchmark and Agentic Framework for Omni-Modal Reasoning and Tool Use in Long Videos
Paper
• 2512.16978
• Published
• 6
Multimodal RewardBench 2: Evaluating Omni Reward Models for Interleaved Text and Image
Paper
• 2512.16899
• Published
• 14
VABench: A Comprehensive Benchmark for Audio-Video Generation
Paper
• 2512.09299
• Published
• 8
JMMMU-Pro: Image-based Japanese Multi-discipline Multimodal Understanding Benchmark via Vibe Benchmark Construction
Paper
• 2512.14620
• Published
• 2
UniPercept: Towards Unified Perceptual-Level Image Understanding across Aesthetics, Quality, Structure, and Texture
Paper
• 2512.21675
• Published
• 25
SciEvalKit: An Open-source Evaluation Toolkit for Scientific General Intelligence
Paper
• 2512.22334
• Published
• 36
TSRBench: A Comprehensive Multi-task Multi-modal Time Series Reasoning Benchmark for Generalist Models
Paper
• 2601.18744
• Published
• 10
Vision-DeepResearch Benchmark: Rethinking Visual and Textual Search for Multimodal Large Language Models
Paper
• 2602.02185
• Published
• 125
Toward Cognitive Supersensing in Multimodal Large Language Model
Paper
• 2602.01541
• Published
• 16
AdaptMMBench: Benchmarking Adaptive Multimodal Reasoning for Mode Selection and Reasoning Process
Paper
• 2602.02676
• Published
• 10
WorldVQA: Measuring Atomic World Knowledge in Multimodal Large Language Models
Paper
• 2602.02537
• Published
• 6