DiffLMM Model Card

Paper

The model was presented in the paper Emergent Visual Grounding in Large Multimodal Models Without Grounding Supervision.

Project Page

https://GroundLMM-ICCV.github.io/

Code

The official implementation is available at: https://github.com/Shengcao-Cao/GroundLMM

Model details

Model type: DiffLMM is a multimodal model built on LLaVA and Stable Diffusion, with enhanced grounding ability and preserved conversation ability.

Sample Usage

DiffLMM can be used just like LLaVA-1.5-7B. This checkpoint includes only the LoRA weights, so the base model Vicuna-7B-v1.5 must always be loaded together with it.

For example, you can have a conversation with DiffLMM just as you would with LLaVA:

CUDA_VISIBLE_DEVICES=0 python -m llava.serve.cli \
    --model-path Shengcao1006/difflmm-llava-v1.5-7b-lora \
    --model-base lmsys/vicuna-7b-v1.5 \
    --image-file images/llava_logo.png \
    --conv-mode llava_v1 \
    --temperature 0.2 \
    --max-new-tokens 512
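
If you prefer to load the checkpoint programmatically, the minimal sketch below uses LLaVA's standard loader (load_pretrained_model from llava.model.builder); it assumes the llava package from the official LLaVA repository is installed and that both checkpoints can be fetched from the Hugging Face Hub:

from llava.mm_utils import get_model_name_from_path
from llava.model.builder import load_pretrained_model

model_path = "Shengcao1006/difflmm-llava-v1.5-7b-lora"
model_base = "lmsys/vicuna-7b-v1.5"  # LoRA checkpoints require the base model

# Returns the tokenizer, the model with LoRA weights applied,
# the image processor, and the context length.
tokenizer, model, image_processor, context_len = load_pretrained_model(
    model_path=model_path,
    model_base=model_base,
    model_name=get_model_name_from_path(model_path),
)

The returned model can then be used with LLaVA's usual conversation templates (e.g., the llava_v1 conversation mode shown above).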

License

This project is released under the Apache 2.0 license. Code adapted from other open-source repositories remains under its original licenses.

Acknowledgements

Our work is greatly inspired by open-source repositories such as LLaVA and Stable Diffusion. We greatly appreciate their open-source work!

Citation

If you find our research interesting or use our code, model, or method in your research, please consider citing our work.

@inproceedings{cao2025emergent,
  title={Emergent Visual Grounding in Large Multimodal Models Without Grounding Supervision},
  author={Cao, Shengcao and Gui, Liang-Yan and Wang, Yu-Xiong},
  booktitle={ICCV Findings},
  year={2025}
}