arXiv:2511.00391

VinciCoder: Unifying Multimodal Code Generation via Coarse-to-fine Visual Reinforcement Learning

Published on Nov 1, 2025
AI-generated summary

VinciCoder, a unified multimodal code generation model, achieves state-of-the-art performance using a two-stage training framework that combines Supervised Finetuning and Visual Reinforcement Learning.

Abstract

Multimodal code generation has garnered significant interest within the research community. Despite the notable success of recent vision-language models (VLMs) on specialized tasks like chart-to-code generation, their reliance on single-task training regimens fosters a narrow paradigm that hinders the development of generalized VIsioN Code Intelligence. In this work, we introduce VinciCoder, a unified multimodal code generation model that addresses this limitation via a two-stage training framework. We begin by constructing a large-scale Supervised Finetuning (SFT) corpus comprising 1.6M image-code pairs for tasks involving direct code generation and visual-based code refinement. Subsequently, we introduce a Visual Reinforcement Learning (ViRL) strategy, which employs a coarse-to-fine reward mechanism to improve visual fidelity by calculating visual similarity across local and global image patches. Extensive experiments on diverse multimodal code generation benchmarks demonstrate that VinciCoder achieves state-of-the-art performance, surpassing recent open-source models. The ablation study further validates the effectiveness of our proposed coarse-to-fine ViRL strategy. The data, code, and model are available at https://github.com/DocTron-hub/VinciCoder.
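To make the coarse-to-fine reward idea concrete, below is a minimal sketch of how a visual-similarity reward over global and local image patches might be computed. It is not the authors' implementation: the patch grid, the pixel-difference similarity, the 256x256 resolution, and the weighting between global and local terms are all illustrative assumptions; the paper only states that the reward combines similarity at local and global patch levels between the image rendered from generated code and the reference image.

```python
# Hedged sketch of a coarse-to-fine visual-similarity reward (NOT the paper's
# exact method). Assumptions: the generated code has already been rendered to
# an image, and the reward blends a global (whole-image) similarity with the
# mean similarity over a grid of local patches.
import numpy as np
from PIL import Image


def _similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Stand-in similarity: 1 minus the mean absolute pixel error (inputs in [0, 1])."""
    return 1.0 - float(np.mean(np.abs(a - b)))


def coarse_to_fine_reward(rendered: Image.Image,
                          reference: Image.Image,
                          grid: int = 4,
                          global_weight: float = 0.5) -> float:
    """Combine a coarse global score with averaged fine-grained local patch scores."""
    size = (256, 256)  # assumed common resolution for comparison
    r = np.asarray(rendered.convert("L").resize(size), dtype=np.float32) / 255.0
    g = np.asarray(reference.convert("L").resize(size), dtype=np.float32) / 255.0

    # Coarse term: similarity over the whole image.
    global_score = _similarity(r, g)

    # Fine term: split both images into a grid x grid set of local patches
    # and average their similarities, rewarding local visual fidelity.
    ph, pw = size[0] // grid, size[1] // grid
    local_scores = [
        _similarity(r[i * ph:(i + 1) * ph, j * pw:(j + 1) * pw],
                    g[i * ph:(i + 1) * ph, j * pw:(j + 1) * pw])
        for i in range(grid) for j in range(grid)
    ]
    local_score = float(np.mean(local_scores))

    return global_weight * global_score + (1.0 - global_weight) * local_score
```

In an RL setup such as ViRL, a scalar reward like this would be assigned to each rollout after rendering its generated code; the actual similarity metric and patch scheme used by VinciCoder may differ.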
