Collections
Discover the best community collections!
Collections including paper arxiv:2503.10639
-
UCFE: A User-Centric Financial Expertise Benchmark for Large Language Models
Paper • 2410.14059 • Published • 62 -
Sketch-of-Thought: Efficient LLM Reasoning with Adaptive Cognitive-Inspired Sketching
Paper • 2503.05179 • Published • 46 -
Token-Efficient Long Video Understanding for Multimodal LLMs
Paper • 2503.04130 • Published • 96 -
GoT: Unleashing Reasoning Capability of Multimodal Large Language Model for Visual Generation and Editing
Paper • 2503.10639 • Published • 53
-
MLLM-as-a-Judge for Image Safety without Human Labeling
Paper • 2501.00192 • Published • 31 -
2.5 Years in Class: A Multimodal Textbook for Vision-Language Pretraining
Paper • 2501.00958 • Published • 107 -
Xmodel-2 Technical Report
Paper • 2412.19638 • Published • 26 -
HuatuoGPT-o1, Towards Medical Complex Reasoning with LLMs
Paper • 2412.18925 • Published • 104
-
WebRL: Training LLM Web Agents via Self-Evolving Online Curriculum Reinforcement Learning
Paper • 2411.02337 • Published • 36 -
Mixture-of-Transformers: A Sparse and Scalable Architecture for Multi-Modal Foundation Models
Paper • 2411.04996 • Published • 51 -
Large Language Models Orchestrating Structured Reasoning Achieve Kaggle Grandmaster Level
Paper • 2411.03562 • Published • 68 -
StructRAG: Boosting Knowledge Intensive Reasoning of LLMs via Inference-time Hybrid Information Structurization
Paper • 2410.08815 • Published • 47
-
PhysDreamer: Physics-Based Interaction with 3D Objects via Video Generation
Paper • 2404.13026 • Published • 24 -
Distilling Diversity and Control in Diffusion Models
Paper • 2503.10637 • Published • 14 -
GoT: Unleashing Reasoning Capability of Multimodal Large Language Model for Visual Generation and Editing
Paper • 2503.10639 • Published • 53 -
GPT-ImgEval: A Comprehensive Benchmark for Diagnosing GPT4o in Image Generation
Paper • 2504.02782 • Published • 57
-
RL + Transformer = A General-Purpose Problem Solver
Paper • 2501.14176 • Published • 28 -
Towards General-Purpose Model-Free Reinforcement Learning
Paper • 2501.16142 • Published • 30 -
SFT Memorizes, RL Generalizes: A Comparative Study of Foundation Model Post-training
Paper • 2501.17161 • Published • 123 -
MaxInfoRL: Boosting exploration in reinforcement learning through information gain maximization
Paper • 2412.12098 • Published • 4
-
BrushEdit: All-In-One Image Inpainting and Editing
Paper • 2412.10316 • Published • 35 -
ColorFlow: Retrieval-Augmented Image Sequence Colorization
Paper • 2412.11815 • Published • 26 -
FluxSpace: Disentangled Semantic Editing in Rectified Flow Transformers
Paper • 2412.09611 • Published • 11 -
FireFlow: Fast Inversion of Rectified Flow for Image Semantic Editing
Paper • 2412.07517 • Published • 11
-
SPIQA: A Dataset for Multimodal Question Answering on Scientific Papers
Paper • 2407.09413 • Published • 11 -
MAVIS: Mathematical Visual Instruction Tuning
Paper • 2407.08739 • Published • 33 -
Kvasir-VQA: A Text-Image Pair GI Tract Dataset
Paper • 2409.01437 • Published • 71 -
MMEvol: Empowering Multimodal Large Language Models with Evol-Instruct
Paper • 2409.05840 • Published • 49
-
DeepPerception: Advancing R1-like Cognitive Visual Perception in MLLMs for Knowledge-Intensive Visual Grounding
Paper • 2503.12797 • Published • 32 -
CapArena: Benchmarking and Analyzing Detailed Image Captioning in the LLM Era
Paper • 2503.12329 • Published • 27 -
GoT: Unleashing Reasoning Capability of Multimodal Large Language Model for Visual Generation and Editing
Paper • 2503.10639 • Published • 53 -
SmolVLM: Redefining small and efficient multimodal models
Paper • 2504.05299 • Published • 200