Multimodal Image Classification - a Zhang124 Collection

Multimodal Image Classification

updated Jan 25

What Do You See? Enhancing Zero-Shot Image Classification with Multimodal Large Language Models

Paper • 2405.15668 • Published May 24, 2024
On Large Multimodal Models as Open-World Image Classifiers

Paper • 2503.21851 • Published Mar 27, 2025 • 5
Benchmarking Large Language Models for Image Classification of Marine Mammals

Paper • 2410.19848 • Published Oct 22, 2024
Parameter-Inverted Image Pyramid Networks for Visual Perception and Multimodal Understanding

Paper • 2501.07783 • Published Jan 14, 2025 • 8
VALE: A Multimodal Visual and Language Explanation Framework for Image Classifiers using eXplainable AI and Language Models

Paper • 2408.12808 • Published Aug 23, 2024
Sparse Attention Vectors: Generative Multimodal Model Features Are Discriminative Vision-Language Classifiers

Paper • 2412.00142 • Published Nov 28, 2024 • 5
Interpretable Bilingual Multimodal Large Language Model for Diverse Biomedical Tasks

Paper • 2410.18387 • Published Oct 24, 2024
How Well Does GPT-4o Understand Vision? Evaluating Multimodal Foundation Models on Standard Computer Vision Tasks

Paper • 2507.01955 • Published Jul 2, 2025 • 36
MMIG-Bench: Towards Comprehensive and Explainable Evaluation of Multi-Modal Image Generation Models

Paper • 2505.19415 • Published May 26, 2025 • 2
MM-DINOv2: Adapting Foundation Models for Multi-Modal Medical Image Analysis

Paper • 2509.06617 • Published Sep 8, 2025 • 1
GAIA: A Global, Multi-modal, Multi-scale Vision-Language Dataset for Remote Sensing Image Analysis

Paper • 2502.09598 • Published Feb 13, 2025
DuPLUS: Dual-Prompt Vision-Language Framework for Universal Medical Image Segmentation and Prognosis

Paper • 2510.03483 • Published Oct 3, 2025 • 1
Cross the Gap: Exposing the Intra-modal Misalignment in CLIP via Modality Inversion

Paper • 2502.04263 • Published Feb 6, 2025 • 1
GeoPix: Multi-Modal Large Language Model for Pixel-level Image Understanding in Remote Sensing

Paper • 2501.06828 • Published Jan 12, 2025
A multi-modal dataset for insect biodiversity with imagery and DNA at the trap and individual level

Paper • 2507.06972 • Published Jul 9, 2025
RS-RAG: Bridging Remote Sensing Imagery and Comprehensive Knowledge with a Multi-Modal Dataset and Retrieval-Augmented Generation Model

Paper • 2504.04988 • Published Apr 7, 2025
Towards Explainable Fake Image Detection with Multi-Modal Large Language Models

Paper • 2504.14245 • Published Apr 19, 2025
MINT: Multi-modal Chain of Thought in Unified Generative Models for Enhanced Image Generation

Paper • 2503.01298 • Published Mar 3, 2025 • 1
How Do Images Align and Complement LiDAR? Towards a Harmonized Multi-modal 3D Panoptic Segmentation

Paper • 2505.18956 • Published May 25, 2025 • 1
MMGR: Multi-Modal Generative Reasoning

Paper • 2512.14691 • Published Dec 16, 2025 • 119
CSFMamba: Cross State Fusion Mamba Operator for Multimodal Remote Sensing Image Classification

Paper • 2509.00677 • Published Aug 31, 2025
Head Pursuit: Probing Attention Specialization in Multimodal Transformers

Paper • 2510.21518 • Published Oct 24, 2025
Distill CLIP (DCLIP): Enhancing Image-Text Retrieval via Cross-Modal Transformer Distillation

Paper • 2505.21549 • Published May 25, 2025
MIRAGE: Multimodal foundation model and benchmark for comprehensive retinal OCT image analysis

Paper • 2506.08900 • Published Jun 10, 2025 • 4