Multimodal Reasoning
InfiR : Crafting Effective Small Language Models and Multimodal Small
Language Models in Reasoning
Paper
• 2502.11573
• Published
• 9
Boosting Multimodal Reasoning with MCTS-Automated Structured Thinking
Paper
• 2502.02339
• Published
• 23
video-SALMONN-o1: Reasoning-enhanced Audio-visual Large Language Model
Paper
• 2502.11775
• Published
• 9
Mulberry: Empowering MLLM with o1-like Reasoning and Reflection via
Collective Monte Carlo Tree Search
Paper
• 2412.18319
• Published
• 39
LLaVA-o1: Let Vision Language Models Reason Step-by-Step
Paper
• 2411.10440
• Published
• 129
Multimodal Inconsistency Reasoning (MMIR): A New Benchmark for
Multimodal Reasoning Models
Paper
• 2502.16033
• Published
• 18
MedVLM-R1: Incentivizing Medical Reasoning Capability of Vision-Language
Models (VLMs) via Reinforcement Learning
Paper
• 2502.19634
• Published
• 63
Visual-RFT: Visual Reinforcement Fine-Tuning
Paper
• 2503.01785
• Published
• 86
MM-Eureka: Exploring Visual Aha Moment with Rule-based Large-scale
Reinforcement Learning
Paper
• 2503.07365
• Published
• 61
Vision-R1: Incentivizing Reasoning Capability in Multimodal Large
Language Models
Paper
• 2503.06749
• Published
• 31
LMM-R1: Empowering 3B LMMs with Strong Reasoning Abilities Through
Two-Stage Rule-Based RL
Paper
• 2503.07536
• Published
• 88
Diving into Self-Evolving Training for Multimodal Reasoning
Paper
• 2412.17451
• Published
• 42
Insight-V: Exploring Long-Chain Visual Reasoning with Multimodal Large
Language Models
Paper
• 2411.14432
• Published
• 25
R1-Omni: Explainable Omni-Multimodal Emotion Recognition with
Reinforcing Learning
Paper
• 2503.05379
• Published
• 38
VisualPRM: An Effective Process Reward Model for Multimodal Reasoning
Paper
• 2503.10291
• Published
• 36
R1-Onevision: Advancing Generalized Multimodal Reasoning through
Cross-Modal Formalization
Paper
• 2503.10615
• Published
• 17
Multimodal Chain-of-Thought Reasoning: A Comprehensive Survey
Paper
• 2503.12605
• Published
• 35
R1-VL: Learning to Reason with Multimodal Large Language Models via
Step-wise Group Relative Policy Optimization
Paper
• 2503.12937
• Published
• 30
VideoMind: A Chain-of-LoRA Agent for Long Video Reasoning
Paper
• 2503.13444
• Published
• 17
DeepPerception: Advancing R1-like Cognitive Visual Perception in MLLMs
for Knowledge-Intensive Visual Grounding
Paper
• 2503.12797
• Published
• 32
OpenVLThinker: An Early Exploration to Complex Vision-Language Reasoning
via Iterative Self-Improvement
Paper
• 2503.17352
• Published
• 24
MathFlow: Enhancing the Perceptual Flow of MLLMs for Visual Mathematical
Problems
Paper
• 2503.16549
• Published
• 15
Vision-R1: Evolving Human-Free Alignment in Large Vision-Language Models
via Vision-Guided Reinforcement Learning
Paper
• 2503.18013
• Published
• 20
Video-R1: Reinforcing Video Reasoning in MLLMs
Paper
• 2503.21776
• Published
• 79
UI-R1: Enhancing Action Prediction of GUI Agents by Reinforcement
Learning
Paper
• 2503.21620
• Published
• 62
OThink-MR1: Stimulating multimodal generalized reasoning capabilities
via dynamic reinforcement learning
Paper
• 2503.16081
• Published
• 28
Improved Visual-Spatial Reasoning via R1-Zero-Like Training
Paper
• 2504.00883
• Published
• 67
Rethinking RL Scaling for Vision Language Models: A Transparent,
From-Scratch Framework and Comprehensive Evaluation Scheme
Paper
• 2504.02587
• Published
• 32
Why Reasoning Matters? A Survey of Advancements in Multimodal Reasoning
(v1)
Paper
• 2504.03151
• Published
• 15
Skywork R1V: Pioneering Multimodal Reasoning with Chain-of-Thought
Paper
• 2504.05599
• Published
• 85
VideoChat-R1: Enhancing Spatio-Temporal Perception via Reinforcement
Fine-Tuning
Paper
• 2504.06958
• Published
• 13
OmniCaptioner: One Captioner to Rule Them All
Paper
• 2504.07089
• Published
• 20
Paper
• 2504.07491
• Published
• 137
InternVL3: Exploring Advanced Training and Test-Time Recipes for
Open-Source Multimodal Models
Paper
• 2504.10479
• Published
• 306
VL-Rethinker: Incentivizing Self-Reflection of Vision-Language Models
with Reinforcement Learning
Paper
• 2504.08837
• Published
• 43
TinyLLaVA-Video-R1: Towards Smaller LMMs for Video Reasoning
Paper
• 2504.09641
• Published
• 16
VisuoThink: Empowering LVLM Reasoning with Multimodal Tree Search
Paper
• 2504.09130
• Published
• 12
NoisyRollout: Reinforcing Visual Reasoning with Data Augmentation
Paper
• 2504.13055
• Published
• 19
InfiGUI-R1: Advancing Multimodal GUI Agents from Reactive Actors to
Deliberative Reasoners
Paper
• 2504.14239
• Published
• 14
Skywork R1V2: Multimodal Hybrid Reinforcement Learning for Reasoning
Paper
• 2504.16656
• Published
• 57
Unified Multimodal Chain-of-Thought Reward Model through Reinforcement
Fine-Tuning
Paper
• 2505.03318
• Published
• 92
Perception, Reason, Think, and Plan: A Survey on Large Multimodal
Reasoning Models
Paper
• 2505.04921
• Published
• 186
X-Reasoner: Towards Generalizable Reasoning Across Modalities and
Domains
Paper
• 2505.03981
• Published
• 15
Seed1.5-VL Technical Report
Paper
• 2505.07062
• Published
• 155
Skywork-VL Reward: An Effective Reward Model for Multimodal
Understanding and Reasoning
Paper
• 2505.07263
• Published
• 30
Omni-R1: Do You Really Need Audio to Fine-Tune Your Audio LLM?
Paper
• 2505.09439
• Published
• 10
OpenThinkIMG: Learning to Think with Images via Visual Tool
Reinforcement Learning
Paper
• 2505.08617
• Published
• 42
GuardReasoner-VL: Safeguarding VLMs via Reinforced Reasoning
Paper
• 2505.11049
• Published
• 61
Visual Planning: Let's Think Only with Images
Paper
• 2505.11409
• Published
• 57
MM-PRM: Enhancing Multimodal Mathematical Reasoning with Scalable
Step-Level Supervision
Paper
• 2505.13427
• Published
• 26
VisionReasoner: Unified Visual Perception and Reasoning via
Reinforcement Learning
Paper
• 2505.12081
• Published
• 18
Emerging Properties in Unified Multimodal Pretraining
Paper
• 2505.14683
• Published
• 133
VisualQuality-R1: Reasoning-Induced Image Quality Assessment via
Reinforcement Learning to Rank
Paper
• 2505.14460
• Published
• 33
Visionary-R1: Mitigating Shortcuts in Visual Reasoning with
Reinforcement Learning
Paper
• 2505.14677
• Published
• 15
UniVG-R1: Reasoning Guided Universal Visual Grounding with Reinforcement
Learning
Paper
• 2505.14231
• Published
• 53
Pixel Reasoner: Incentivizing Pixel-Space Reasoning with
Curiosity-Driven Reinforcement Learning
Paper
• 2505.15966
• Published
• 53
GoT-R1: Unleashing Reasoning Capability of MLLM for Visual Generation
with Reinforcement Learning
Paper
• 2505.17022
• Published
• 27
SophiaVL-R1: Reinforcing MLLMs Reasoning with Thinking Reward
Paper
• 2505.17018
• Published
• 15
Think or Not? Selective Reasoning via Reinforcement Learning for
Vision-Language Models
Paper
• 2505.16854
• Published
• 11
GRIT: Teaching MLLMs to Think with Images
Paper
• 2505.15879
• Published
• 13
SAKURA: On the Multi-hop Reasoning of Large Audio-Language Models Based
on Speech and Audio Information
Paper
• 2505.13237
• Published
• 1
VLM-R^3: Region Recognition, Reasoning, and Refinement for Enhanced
Multimodal Chain-of-Thought
Paper
• 2505.16192
• Published
• 12
Training-Free Reasoning and Reflection in MLLMs
Paper
• 2505.16151
• Published
• 9
Omni-R1: Reinforcement Learning for Omnimodal Reasoning via Two-System
Collaboration
Paper
• 2505.20256
• Published
• 19
G1: Bootstrapping Perception and Reasoning Abilities of Vision-Language
Model via Reinforcement Learning
Paper
• 2505.13426
• Published
• 13
STAR-R1: Spatial TrAnsformation Reasoning by Reinforcing Multimodal LLMs
Paper
• 2505.15804
• Published
• 10
Jodi: Unification of Visual Generation and Understanding via Joint
Modeling
Paper
• 2505.19084
• Published
• 20
VerIPO: Cultivating Long Reasoning in Video-LLMs via Verifier-Guided
Iterative Policy Optimization
Paper
• 2505.19000
• Published
• 42
Video-Holmes: Can MLLM Think Like Holmes for Complex Video Reasoning?
Paper
• 2505.21374
• Published
• 28
Active-O3: Empowering Multimodal Large Language Models with Active
Perception via GRPO
Paper
• 2505.21457
• Published
• 16
Beyond Distillation: Pushing the Limits of Medical LLM Reasoning with
Minimalist Rule-Based RL
Paper
• 2505.17952
• Published
• 20
R1-ShareVL: Incentivizing Reasoning Capability of Multimodal Large
Language Models via Share-GRPO
Paper
• 2505.16673
• Published
• 2
Sherlock: Self-Correcting Reasoning in Vision-Language Models
Paper
• 2505.22651
• Published
• 48
Unsupervised Post-Training for Multi-Modal LLM Reasoning via GRPO
Paper
• 2505.22453
• Published
• 46
Advancing Multimodal Reasoning via Reinforcement Learning with Cold
Start
Paper
• 2505.22334
• Published
• 36
Fostering Video Reasoning via Next-Event Prediction
Paper
• 2505.22457
• Published
• 29
Thinking with Generated Images
Paper
• 2505.22525
• Published
• 15
Spatial-MLLM: Boosting MLLM Capabilities in Visual-based Spatial
Intelligence
Paper
• 2505.23747
• Published
• 69
UniRL: Self-Improving Unified Multimodal Models via Supervised and
Reinforcement Learning
Paper
• 2505.23380
• Published
• 22
cadrille: Multi-modal CAD Reconstruction with Online Reinforcement
Learning
Paper
• 2505.22914
• Published
• 37
Grounded Reinforcement Learning for Visual Reasoning
Paper
• 2505.23678
• Published
• 2
More Thinking, Less Seeing? Assessing Amplified Hallucination in
Multimodal Reasoning Models
Paper
• 2505.21523
• Published
• 13
SRPO: Enhancing Multimodal LLM Reasoning via Reflection-Aware
Reinforcement Learning
Paper
• 2506.01713
• Published
• 48
Advancing Multimodal Reasoning: From Optimized Cold Start to Staged
Reinforcement Learning
Paper
• 2506.04207
• Published
• 48
AV-Reasoner: Improving and Benchmarking Clue-Grounded Audio-Visual
Counting for MLLMs
Paper
• 2506.05328
• Published
• 21
Perceptual Decoupling for Scalable Multi-modal Reasoning via
Reward-Optimized Captioning
Paper
• 2506.04559
• Published
• 2
Look Before You Leap: A GUI-Critic-R1 Model for Pre-Operative Error
Diagnosis in GUI Automation
Paper
• 2506.04614
• Published
• 19
ComfyUI-R1: Exploring Reasoning Models for Workflow Generation
Paper
• 2506.09790
• Published
• 53
DeepVideo-R1: Video Reinforcement Fine-Tuning via Difficulty-aware
Regressive GRPO
Paper
• 2506.07464
• Published
• 14
Ego-R1: Chain-of-Tool-Thought for Ultra-Long Egocentric Video Reasoning
Paper
• 2506.13654
• Published
• 43
VGR: Visual Grounded Reasoning
Paper
• 2506.11991
• Published
• 20
Enhancing Step-by-Step and Verifiable Medical Reasoning in MLLMs
Paper
• 2506.16962
• Published
• 10
GRPO-CARE: Consistency-Aware Reinforcement Learning for Multimodal
Reasoning
Paper
• 2506.16141
• Published
• 27
ThinkSound: Chain-of-Thought Reasoning in Multimodal Large Language
Models for Audio Generation and Editing
Paper
• 2506.21448
• Published
• 8
GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable
Reinforcement Learning
Paper
• 2507.01006
• Published
• 251
HumanOmniV2: From Understanding to Omni-Modal Reasoning with Context
Paper
• 2506.21277
• Published
• 14
Kwai Keye-VL Technical Report
Paper
• 2507.01949
• Published
• 131
Thinking with Images for Multimodal Reasoning: Foundations, Methods, and
Future Frontiers
Paper
• 2506.23918
• Published
• 90
High-Resolution Visual Reasoning via Multi-Turn Grounding-Based
Reinforcement Learning
Paper
• 2507.05920
• Published
• 12
Perception-Aware Policy Optimization for Multimodal Reasoning
Paper
• 2507.06448
• Published
• 48
Skywork-R1V3 Technical Report
Paper
• 2507.06167
• Published
• 73
Open Vision Reasoner: Transferring Linguistic Cognitive Behavior for
Visual Reasoning
Paper
• 2507.05255
• Published
• 75
VisionThink: Smart and Efficient Vision Language Model via Reinforcement
Learning
Paper
• 2507.13348
• Published
• 79
Zebra-CoT: A Dataset for Interleaved Vision Language Reasoning
Paper
• 2507.16746
• Published
• 34
ThinkAct: Vision-Language-Action Reasoning via Reinforced Visual Latent
Planning
Paper
• 2507.16815
• Published
• 42
Semi-off-Policy Reinforcement Learning for Vision-Language Slow-thinking
Reasoning
Paper
• 2507.16814
• Published
• 21
VL-Cogito: Progressive Curriculum Reinforcement Learning for Advanced
Multimodal Reasoning
Paper
• 2507.22607
• Published
• 47
Cosmos-Reason1: From Physical Common Sense To Embodied Reasoning
Paper
• 2503.15558
• Published
• 50
MolmoAct: Action Reasoning Models that can Reason in Space
Paper
• 2508.07917
• Published
• 44
We-Math 2.0: A Versatile MathBook System for Incentivizing Visual
Mathematical Reasoning
Paper
• 2508.10433
• Published
• 144
HumanSense: From Multimodal Perception to Empathetic Context-Aware
Responses through Reasoning MLLMs
Paper
• 2508.10576
• Published
• 8
R-4B: Incentivizing General-Purpose Auto-Thinking Capability in MLLMs
via Bi-Mode Annealing and Reinforce Learning
Paper
• 2508.21113
• Published
• 110
LLaVA-Critic-R1: Your Critic Model is Secretly a Strong Policy Model
Paper
• 2509.00676
• Published
• 85
Planning with Reasoning using Vision Language World Model
Paper
• 2509.02722
• Published
• 24
Focusing by Contrastive Attention: Enhancing VLMs' Visual Reasoning
Paper
• 2509.06461
• Published
• 20
Look Again, Think Slowly: Enhancing Visual Reflection in Vision-Language
Models
Paper
• 2509.12132
• Published
• 7
Multimodal Reasoning for Science: Technical Report and 1st Place
Solution to the ICML 2025 SeePhys Challenge
Paper
• 2509.06079
• Published
• 6
BaseReward: A Strong Baseline for Multimodal Reward Model
Paper
• 2509.16127
• Published
• 21
BTL-UI: Blink-Think-Link Reasoning Model for GUI Agent
Paper
• 2509.15566
• Published
• 14
MARS2 2025 Challenge on Multimodal Reasoning: Datasets, Methods,
Results, Discussion, and Outlook
Paper
• 2509.14142
• Published
• 10
MMR1: Enhancing Multimodal Reasoning with Variance-Aware Sampling and
Open Resources
Paper
• 2509.21268
• Published
• 104
Vision-Zero: Scalable VLM Self-Improvement via Strategic Gamified
Self-Play
Paper
• 2509.25541
• Published
• 140
More Thought, Less Accuracy? On the Dual Nature of Reasoning in
Vision-Language Models
Paper
• 2509.25848
• Published
• 80
VLA-R1: Enhancing Reasoning in Vision-Language-Action Models
Paper
• 2510.01623
• Published
• 12
Video-LMM Post-Training: A Deep Dive into Video Reasoning with Large
Multimodal Models
Paper
• 2510.05034
• Published
• 51
UniVideo: Unified Understanding, Generation, and Editing for Videos
Paper
• 2510.08377
• Published
• 81
TTRV: Test-Time Reinforcement Learning for Vision Language Models
Paper
• 2510.06783
• Published
• 12
Generative Universal Verifier as Multimodal Meta-Reasoner
Paper
• 2510.13804
• Published
• 27
Open-o3 Video: Grounded Video Reasoning with Explicit Spatio-Temporal
Evidence
Paper
• 2510.20579
• Published
• 56
Directional Reasoning Injection for Fine-Tuning MLLMs
Paper
• 2510.15050
• Published
• 12
Video-Thinker: Sparking "Thinking with Videos" via Reinforcement
Learning
Paper
• 2510.23473
• Published
• 85
SeeingEye: Agentic Information Flow Unlocks Multimodal Reasoning In
Text-only LLMs
Paper
• 2510.25092
• Published
• 8
Omni-Reward: Towards Generalist Omni-Modal Reward Modeling with
Free-Form Preferences
Paper
• 2510.23451
• Published
• 28
When Visualizing is the First Step to Reasoning: MIRA, a Benchmark for
Visual Chain-of-Thought
Paper
• 2511.02779
• Published
• 59
Thinking with Video: Video Generation as a Promising Multimodal
Reasoning Paradigm
Paper
• 2511.04570
• Published
• 240
V-Thinker: Interactive Thinking with Images
Paper
• 2511.04460
• Published
• 97
MathSE: Improving Multimodal Mathematical Reasoning via Self-Evolving Iterative Reflection and Reward-Guided Fine-Tuning
Paper
• 2511.06805
• Published
• 13
REVISOR: Beyond Textual Reflection, Towards Multimodal Introspective Reasoning in Long-Form Video Understanding
Paper
• 2511.13026
• Published
• 26
VisPlay: Self-Evolving Vision-Language Models from Images
Paper
• 2511.15661
• Published
• 43
Thinking-while-Generating: Interleaving Textual Reasoning throughout Visual Generation
Paper
• 2511.16671
• Published
• 16
MASS: Motion-Aware Spatial-Temporal Grounding for Physics Reasoning and Comprehension in Vision-Language Models
Paper
• 2511.18373
• Published
• 7
OpenMMReasoner: Pushing the Frontiers for Multimodal Reasoning with an Open and General Recipe
Paper
• 2511.16334
• Published
• 93
GeoVista: Web-Augmented Agentic Visual Reasoning for Geolocalization
Paper
• 2511.15705
• Published
• 97
SPHINX: A Synthetic Environment for Visual Perception and Reasoning
Paper
• 2511.20814
• Published
• 2
Think Visually, Reason Textually: Vision-Language Synergy in ARC
Paper
• 2511.15703
• Published
• 9
MIRA: Multimodal Iterative Reasoning Agent for Image Editing
Paper
• 2511.21087
• Published
• 10
REASONEDIT: Towards Reasoning-Enhanced Image Editing Models
Paper
• 2511.22625
• Published
• 47
Geometrically-Constrained Agent for Spatial Reasoning
Paper
• 2511.22659
• Published
• 41
DualVLA: Building a Generalizable Embodied Agent via Partial Decoupling of Reasoning and Action
Paper
• 2511.22134
• Published
• 22
Skywork-R1V4: Toward Agentic Multimodal Intelligence through Interleaved Thinking with Images and DeepResearch
Paper
• 2512.02395
• Published
• 49
Revisiting the Necessity of Lengthy Chain-of-Thought in Vision-centric Reasoning Generalization
Paper
• 2511.22586
• Published
• 7
Artemis: Structured Visual Reasoning for Perception Policy Learning
Paper
• 2512.01988
• Published
• 2
CodeV: Code with Images for Faithful Visual Reasoning via Tool-Aware Policy Optimization
Paper
• 2511.19661
• Published
• 2
OneThinker: All-in-one Reasoning Model for Image and Video
Paper
• 2512.03043
• Published
• 33
Thinking with Programming Vision: Towards a Unified View for Thinking with Images
Paper
• 2512.03746
• Published
• 17
ARM-Thinker: Reinforcing Multimodal Generative Reward Models with Agentic Tool Use and Visual Reasoning
Paper
• 2512.05111
• Published
• 50
Colon-X: Advancing Intelligent Colonoscopy from Multimodal Understanding to Clinical Reasoning
Paper
• 2512.03667
• Published
• 5
Rethinking Chain-of-Thought Reasoning for Videos
Paper
• 2512.09616
• Published
• 19
VG-Refiner: Towards Tool-Refined Referring Grounded Reasoning via Agentic Reinforcement Learning
Paper
• 2512.06373
• Published
• 9
Thinking with Images via Self-Calling Agent
Paper
• 2512.08511
• Published
• 23
Robust-R1: Degradation-Aware Reasoning for Robust Visual Understanding
Paper
• 2512.17532
• Published
• 67
Reasoning Within the Mind: Dynamic Multimodal Interleaving in Latent Space
Paper
• 2512.12623
• Published
• 4
MMGR: Multi-Modal Generative Reasoning
Paper
• 2512.14691
• Published
• 119
Latent Implicit Visual Reasoning
Paper
• 2512.21218
• Published
• 69
See Less, See Right: Bi-directional Perceptual Shaping For Multimodal Reasoning
Paper
• 2512.22120
• Published
• 15
InSight-o3: Empowering Multimodal Foundation Models with Generalized Visual Search
Paper
• 2512.18745
• Published
• 12
VideoAuto-R1: Video Auto Reasoning via Thinking Once, Answering Twice
Paper
• 2601.05175
• Published
• 36
Forest Before Trees: Latent Superposition for Efficient Visual Reasoning
Paper
• 2601.06803
• Published
• 10
Omni-R1: Towards the Unified Generative Paradigm for Multimodal Reasoning
Paper
• 2601.09536
• Published
• 5
Urban Socio-Semantic Segmentation with Vision-Language Reasoning
Paper
• 2601.10477
• Published
• 155
LaViT: Aligning Latent Visual Thoughts for Multi-modal Reasoning
Paper
• 2601.10129
• Published
• 12
FantasyVLN: Unified Multimodal Chain-of-Thought Reasoning for Vision-Language Navigation
Paper
• 2601.13976
• Published
• 21
Render-of-Thought: Rendering Textual Chain-of-Thought as Images for Visual Latent Reasoning
Paper
• 2601.14750
• Published
• 17
PROGRESSLM: Towards Progress Reasoning in Vision-Language Models
Paper
• 2601.15224
• Published
• 12
MMFineReason: Closing the Multimodal Reasoning Gap via Open Data-Centric Methods
Paper
• 2601.21821
• Published
• 59
VTC-R1: Vision-Text Compression for Efficient Long-Context Reasoning
Paper
• 2601.22069
• Published
• 7
Thinking with Comics: Enhancing Multimodal Reasoning through Structured Visual Storytelling
Paper
• 2602.02453
• Published
• 36
Training Data Efficiency in Multimodal Process Reward Models
Paper
• 2602.04145
• Published
• 76
SwimBird: Eliciting Switchable Reasoning Mode in Hybrid Autoregressive MLLMs
Paper
• 2602.06040
• Published
• 10