Collections

Discover the best community collections!

Collections including paper arxiv:2503.10639

DeepPerception: Advancing R1-like Cognitive Visual Perception in MLLMs for Knowledge-Intensive Visual Grounding

Paper • 2503.12797 • Published Mar 17 • 32
CapArena: Benchmarking and Analyzing Detailed Image Captioning in the LLM Era

Paper • 2503.12329 • Published Mar 16 • 27
GoT: Unleashing Reasoning Capability of Multimodal Large Language Model for Visual Generation and Editing

Paper • 2503.10639 • Published Mar 13 • 53
SmolVLM: Redefining small and efficient multimodal models

Paper • 2504.05299 • Published Apr 7 • 200

GoT: Unleashing Reasoning Capability of Multimodal Large Language Model for Visual Generation and Editing

Paper • 2503.10639 • Published Mar 13 • 53

UCFE: A User-Centric Financial Expertise Benchmark for Large Language Models

Paper • 2410.14059 • Published Oct 17, 2024 • 62
Sketch-of-Thought: Efficient LLM Reasoning with Adaptive Cognitive-Inspired Sketching

Paper • 2503.05179 • Published Mar 7 • 46
Token-Efficient Long Video Understanding for Multimodal LLMs

Paper • 2503.04130 • Published Mar 6 • 96
GoT: Unleashing Reasoning Capability of Multimodal Large Language Model for Visual Generation and Editing

Paper • 2503.10639 • Published Mar 13 • 53

MLLM-as-a-Judge for Image Safety without Human Labeling

Paper • 2501.00192 • Published Dec 31, 2024 • 31
2.5 Years in Class: A Multimodal Textbook for Vision-Language Pretraining

Paper • 2501.00958 • Published Jan 1 • 107
Xmodel-2 Technical Report

Paper • 2412.19638 • Published Dec 27, 2024 • 26
HuatuoGPT-o1, Towards Medical Complex Reasoning with LLMs

Paper • 2412.18925 • Published Dec 25, 2024 • 104

WebRL: Training LLM Web Agents via Self-Evolving Online Curriculum Reinforcement Learning

Paper • 2411.02337 • Published Nov 4, 2024 • 36
Mixture-of-Transformers: A Sparse and Scalable Architecture for Multi-Modal Foundation Models

Paper • 2411.04996 • Published Nov 7, 2024 • 51
Large Language Models Orchestrating Structured Reasoning Achieve Kaggle Grandmaster Level

Paper • 2411.03562 • Published Nov 5, 2024 • 68
StructRAG: Boosting Knowledge Intensive Reasoning of LLMs via Inference-time Hybrid Information Structurization

Paper • 2410.08815 • Published Oct 11, 2024 • 47

GoT: Unleashing Reasoning Capability of Multimodal Large Language Model for Visual Generation and Editing

Paper • 2503.10639 • Published Mar 13 • 53
R1-Onevision: Advancing Generalized Multimodal Reasoning through Cross-Modal Formalization

Paper • 2503.10615 • Published Mar 13 • 17

PhysDreamer: Physics-Based Interaction with 3D Objects via Video Generation

Paper • 2404.13026 • Published Apr 19, 2024 • 24
Distilling Diversity and Control in Diffusion Models

Paper • 2503.10637 • Published Mar 13 • 14
GoT: Unleashing Reasoning Capability of Multimodal Large Language Model for Visual Generation and Editing

Paper • 2503.10639 • Published Mar 13 • 53
GPT-ImgEval: A Comprehensive Benchmark for Diagnosing GPT4o in Image Generation

Paper • 2504.02782 • Published Apr 3 • 57

RL+reason model

RL + Transformer = A General-Purpose Problem Solver

Paper • 2501.14176 • Published Jan 24 • 28
Towards General-Purpose Model-Free Reinforcement Learning

Paper • 2501.16142 • Published Jan 27 • 30
SFT Memorizes, RL Generalizes: A Comparative Study of Foundation Model Post-training

Paper • 2501.17161 • Published Jan 28 • 123
MaxInfoRL: Boosting exploration in reinforcement learning through information gain maximization

Paper • 2412.12098 • Published Dec 16, 2024 • 4

BrushEdit: All-In-One Image Inpainting and Editing

Paper • 2412.10316 • Published Dec 13, 2024 • 35
ColorFlow: Retrieval-Augmented Image Sequence Colorization

Paper • 2412.11815 • Published Dec 16, 2024 • 26
FluxSpace: Disentangled Semantic Editing in Rectified Flow Transformers

Paper • 2412.09611 • Published Dec 12, 2024 • 11
FireFlow: Fast Inversion of Rectified Flow for Image Semantic Editing

Paper • 2412.07517 • Published Dec 10, 2024 • 11

Multimodal Dataset

SPIQA: A Dataset for Multimodal Question Answering on Scientific Papers

Paper • 2407.09413 • Published Jul 12, 2024 • 11
MAVIS: Mathematical Visual Instruction Tuning

Paper • 2407.08739 • Published Jul 11, 2024 • 33
Kvasir-VQA: A Text-Image Pair GI Tract Dataset

Paper • 2409.01437 • Published Sep 2, 2024 • 71
MMEvol: Empowering Multimodal Large Language Models with Evol-Instruct

Paper • 2409.05840 • Published Sep 9, 2024 • 49
