CoMT: A Novel Benchmark for Chain of Multi-modal Thought on Large Vision-Language Models Paper • 2412.12932 • Published Dec 17, 2024 • 2
Iris: Breaking GUI Complexity with Adaptive Focus and Self-Refining Paper • 2412.10342 • Published Dec 13, 2024
Towards Multimodal Empathetic Response Generation: A Rich Text-Speech-Vision Avatar-based Benchmark Paper • 2502.04976 • Published Feb 7, 2025
Derm1M: A Million-scale Vision-Language Dataset Aligned with Clinical Ontology Knowledge for Dermatology Paper • 2503.14911 • Published Mar 19, 2025 • 3
Unveiling the Cognitive Compass: Theory-of-Mind-Guided Multimodal Emotion Reasoning Paper • 2602.00971 • Published Feb 28
SOAR: Self-Correction for Optimal Alignment and Refinement in Diffusion Models Paper • 2604.12617 • Published Apr 14 • 6
Taming Actor-Observer Asymmetry in Agents via Dialectical Alignment Paper • 2604.19548 • Published 28 days ago • 16
Towards Semantic Equivalence of Tokenization in Multimodal LLM Paper • 2406.05127 • Published Jun 7, 2024
So-Fake: Benchmarking and Explaining Social Media Image Forgery Detection Paper • 2505.18660 • Published May 24, 2025 • 2
Mixed-R1: Unified Reward Perspective For Reasoning Capability in Multimodal Large Language Models Paper • 2505.24164 • Published May 30, 2025
SMAP: Self-supervised Motion Adaptation for Physically Plausible Humanoid Whole-body Control Paper • 2505.19463 • Published May 26, 2025
MCM-DPO: Multifaceted Cross-Modal Direct Preference Optimization for Alt-text Generation Paper • 2510.00647 • Published Oct 1, 2025
A Reason-then-Describe Instruction Interpreter for Controllable Video Generation Paper • 2511.20563 • Published Nov 25, 2025 • 1