TWIN Collection
Datasets and models from the paper "Same or Not? Enhancing Visual Perception in Vision-Language Models"
This repository contains the InternVL3.5-1B model post-trained on the TWIN dataset, as introduced in the paper Same or Not? Enhancing Visual Perception in Vision-Language Models.
TWIN is a large-scale dataset of 561,000 image-pair queries designed to enhance the perceptual abilities of Vision-Language Models (VLMs). It tasks models with determining whether two visually similar images depict the same object, encouraging attention to nuanced visual cues. Fine-tuning on TWIN yields significant gains in fine-grained recognition across domains such as art, animals, plants, and landmarks.
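For reference, a minimal loading sketch is below. It assumes the standard Hugging Face `transformers` workflow for InternVL-style checkpoints (custom modeling code loaded via `trust_remote_code`); the repository id placeholder and the image-pair prompt format are illustrative assumptions, so consult the base model card for the full image preprocessing and generation recipe.

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Placeholder: substitute this repository's actual Hub id (not stated in this card).
repo_id = "<this-repo-id>"

# InternVL checkpoints ship custom modeling code, so trust_remote_code is required.
model = AutoModel.from_pretrained(
    repo_id,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
).eval()
tokenizer = AutoTokenizer.from_pretrained(repo_id, trust_remote_code=True)

# A TWIN-style query: given two visually similar images, ask whether they depict
# the same object. Image loading/tiling and the chat-style generation call follow
# the InternVL3.5 base model card and are omitted here for brevity.
question = "Image-1: <image>\nImage-2: <image>\nDo these two images show the same object?"
```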
If you use TWIN in your research, please consider citing the work:
```bibtex
@misc{marsili2025notenhancingvisualperception,
    title={Same or Not? Enhancing Visual Perception in Vision-Language Models},
    author={Damiano Marsili and Aditya Mehta and Ryan Y. Lin and Georgia Gkioxari},
    year={2025},
    eprint={2512.23592},
    archivePrefix={arXiv},
    primaryClass={cs.CV},
    url={https://arxiv.org/abs/2512.23592},
}
```
Base model: OpenGVLab/InternVL3_5-1B-Pretrained