DiffLMM Model Card
Paper
The model was presented in the paper Emergent Visual Grounding in Large Multimodal Models Without Grounding Supervision.
Project Page
https://GroundLMM-ICCV.github.io/
Code
The official implementation is available at: https://github.com/Shengcao-Cao/GroundLMM
Model details
Model type: DiffLMM is a multimodal model built on LLaVA and Stable Diffusion, with enhanced grounding ability and preserved conversation ability.
Sample Usage
DiffLMM is used just like LLaVA-1.5-7B. This checkpoint includes only the LoRA weights, so the base model Vicuna-7B-v1.5 must always be loaded together with it.
For example, you may have a conversation with DiffLMM just as with LLaVA:
CUDA_VISIBLE_DEVICES=0 python -m llava.serve.cli \
--model-path Shengcao1006/difflmm-llava-v1.5-7b-lora \
--model-base lmsys/vicuna-7b-v1.5 \
--image-file images/llava_logo.png \
--conv-mode llava_v1 \
--temperature 0.2 \
--max-new-tokens 512
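If you prefer to load the model programmatically, the LoRA weights can be loaded on top of the Vicuna base with the standard loading utilities from the LLaVA codebase. The snippet below is a minimal sketch, assuming the llava package from the official LLaVA repository is installed; verify the exact function signatures against your local version.

```python
# Minimal loading sketch (not part of the original model card):
# the LoRA checkpoint is combined with the Vicuna base model via LLaVA's utilities.
from llava.model.builder import load_pretrained_model
from llava.mm_utils import get_model_name_from_path

model_path = "Shengcao1006/difflmm-llava-v1.5-7b-lora"
model_base = "lmsys/vicuna-7b-v1.5"  # required, since the checkpoint contains only LoRA weights

tokenizer, model, image_processor, context_len = load_pretrained_model(
    model_path=model_path,
    model_base=model_base,
    model_name=get_model_name_from_path(model_path),
)
```

From there, inference proceeds exactly as with LLaVA-1.5-7B.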
License
This project is released under the Apache 2.0 license. Code adapted from other open-source repositories follows its original license.
Acknowledgements
Our work is greatly inspired by several open-source repositories, and we greatly appreciate their open-source work!
Citation
If you find our research interesting or use our code, model, or method in your research, please consider citing our work:
@inproceedings{cao2025emergent,
title={Emergent Visual Grounding in Large Multimodal Models Without Grounding Supervision},
author={Cao, Shengcao and Gui, Liang-Yan and Wang, Yu-Xiong},
booktitle={ICCV Findings},
year={2025}
}