---
base_model: colqwen2.5-base
library_name: peft
---

RegionRet

RegionRet is a LoRA adapter model for region-level vision-language retrieval, fine-tuned from ColQwen2.5-Base using Parameter-Efficient Fine-Tuning (PEFT).

Model Details

  • Model Type: LoRA Adapter (PEFT)
  • Base Model: ColQwen2.5-Base
  • Task Type: Feature Extraction
  • Framework: PEFT 0.14.0

LoRA Configuration

  • Rank (r): 32
  • LoRA Alpha: 32
  • LoRA Dropout: 0.1
  • Target Modules: MLP projections (down_proj, gate_proj, up_proj) and attention projections (k_proj, q_proj, v_proj, o_proj), plus custom_text_proj
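
The hyperparameters above correspond to a PEFT `adapter_config.json` along the lines of the following sketch (field names follow PEFT's on-disk format; the `base_model_name_or_path` value is an assumption):

```json
{
  "base_model_name_or_path": "vidore/colqwen2.5-base",
  "peft_type": "LORA",
  "r": 32,
  "lora_alpha": 32,
  "lora_dropout": 0.1,
  "target_modules": [
    "down_proj", "gate_proj", "up_proj",
    "k_proj", "q_proj", "v_proj", "o_proj",
    "custom_text_proj"
  ],
  "task_type": "FEATURE_EXTRACTION"
}
```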

Model Architecture

  • Processor: ColQwen2_5_Processor
  • Max Visual Tokens: 1536
  • Attention: Flash Attention 2
  • Precision: bfloat16
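
ColQwen-style retrievers produce multi-vector embeddings for both queries and pages, scored by late interaction (MaxSim): each query token is matched against its best document token, and the matches are summed. A minimal sketch with NumPy stand-in embeddings (shapes and values are illustrative, not produced by the model):

```python
import numpy as np

def maxsim_score(query_emb: np.ndarray, doc_emb: np.ndarray) -> float:
    """Late-interaction (MaxSim) score: for each query token, take the
    best-matching document token, then sum over query tokens."""
    sims = query_emb @ doc_emb.T          # (n_query_tokens, n_doc_tokens)
    return float(sims.max(axis=1).sum())  # best match per query token, summed

# Toy multi-vector embeddings: 2 query tokens, 3 doc tokens, dim 4
rng = np.random.default_rng(0)
q = rng.normal(size=(2, 4))
d = rng.normal(size=(3, 4))
score = maxsim_score(q, d)
```

Ranking a corpus then amounts to computing this score between one query and every candidate page and sorting.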

Uses

Please refer to the RegionRAG repository for usage: https://github.com/Aeryn666/RegionRAG.
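
As a rough loading sketch using colpali-engine and the Transformers PEFT integration (the adapter path and base-model repo id below are assumptions; verify both against the repository above):

```python
def load_regionret(adapter_path: str = "Aeryn666/RegionRet"):
    """Load the ColQwen2.5 base model and attach the RegionRet LoRA adapter.

    Requires a GPU with bfloat16 / Flash Attention 2 support and the
    colpali-engine, peft, and flash-attn packages installed.
    """
    import torch
    from colpali_engine.models import ColQwen2_5, ColQwen2_5_Processor

    model = ColQwen2_5.from_pretrained(
        "vidore/colqwen2.5-base",        # assumed base-model repo id
        torch_dtype=torch.bfloat16,
        attn_implementation="flash_attention_2",
        device_map="auto",
    )
    model.load_adapter(adapter_path)     # Transformers' PEFT integration
    processor = ColQwen2_5_Processor.from_pretrained(adapter_path)
    return model, processor
```

The heavy imports live inside the function so the sketch can be read (and imported) without the GPU-only dependencies installed.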

Training Details

Training Data

  • VisRAG-Ret-Train-In-domain-data
  • Visual-CoT (DocVQA, TextCap, TextVQA, InfographicsVQA)

Training Configuration

  • Loss Function: RegionContraLoss (global_tau=0.02, local_tau=0.25, local_coef=0.01)
  • Epochs: 5
  • Batch Size: 80 per device
  • Learning Rate: 2e-4
  • Precision: bfloat16
  • Gradient Checkpointing: Enabled
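
RegionContraLoss itself is defined in the RegionRAG codebase; as an illustration of the temperature-scaled contrastive objective behind its global term (global_tau=0.02), here is a generic in-batch InfoNCE sketch in NumPy — not the authors' implementation:

```python
import numpy as np

def info_nce(sim_matrix: np.ndarray, tau: float = 0.02) -> float:
    """Generic InfoNCE over an in-batch similarity matrix.

    sim_matrix[i, j] is the similarity between query i and document j;
    the matching document for query i sits on the diagonal.
    """
    logits = sim_matrix / tau                        # temperature scaling
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))       # cross-entropy on diagonal

# Toy batch: 3 queries vs. 3 documents, correct pairs on the diagonal
sims = np.array([[0.9, 0.1, 0.0],
                 [0.2, 0.8, 0.1],
                 [0.0, 0.3, 0.7]])
loss = info_nce(sims, tau=0.02)
```

A small tau (such as 0.02) sharpens the softmax, so the loss strongly penalizes any negative document scoring close to the positive.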

Limitations

  • Requires the ColQwen2.5-Base base model; this repository contains only the LoRA adapter weights
  • Optimized for region-level vision-language retrieval; performance on other tasks is not evaluated
  • A GPU with bfloat16 and Flash Attention 2 support is recommended

Citation

If you use this model, please cite:

@misc{li2025regionragregionlevelretrievalaugmentedgeneration,
      title={RegionRAG: Region-level Retrieval-Augmented Generation for Visual Document Understanding}, 
      author={Yinglu Li and Zhiying Lu and Zhihang Liu and Yiwei Sun and Chuanbin Liu and Hongtao Xie},
      year={2025},
      eprint={2510.27261},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2510.27261}, 
}

License

Please refer to the license of the base model ColQwen2.5.