---
license: apache-2.0
pipeline_tag: image-feature-extraction
tags:
- model_hub_mixin
- pytorch_model_hub_mixin
- vision
- image-tokenization
---

# Communication-Inspired Tokenization for Structured Image Representations

Aram Davtyan · Yusuf Sahin · Yasaman Haghighi · Sebastian Stapf · Pablo Acuaviva · Alexandre Alahi · Paolo Favaro

Official pre-trained models for the paper: [Communication-Inspired Tokenization for Structured Image Representations](https://arxiv.org/abs/2602.20731).

[[Website](https://araachie.github.io/comit/)] [[Code](https://github.com/Araachie/comit)] [[Paper](https://arxiv.org/abs/2602.20731)]

## Installation

Follow the instructions at [https://github.com/Araachie/comit](https://github.com/Araachie/comit).

## Usage

Example usage, downloading `COMiT-B` from the Hugging Face Hub:

```python
import torch

from comit import COMiT

device = "cuda" if torch.cuda.is_available() else "cpu"

model = COMiT.from_pretrained("cvg-unibe/comit-b")
model.eval().to(device)
```

With a pretrained COMiT model, images can be encoded into token sequences as follows:

```python
with torch.no_grad():
    token_dict = model.tokenize(
        batch,
        global_crop=False,  # Whether to use the global crop as the first observation
        order="adaptive",  # One of ["raster_scan", "random", "adaptive"] or a list of crop indices
        num_crops=3,  # Used to truncate the list of crops to embed
    )
```

By default, the tokenization pipeline returns a list of 256 6-dimensional tokens.
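To build intuition for what a 6-dimensional token is, the sketch below shows one common way per-dimension codes can be flattened into single codebook indices: mixed-radix encoding, as used by FSQ-style quantizers. This is a toy illustration only; the 4-levels-per-dimension setting and the internals of COMiT's actual quantizer are assumptions, not taken from the paper.

```python
# Toy sketch of mapping per-dimension quantized codes to flat indices.
# ASSUMPTION: an FSQ-style quantizer with a fixed number of levels per
# dimension; COMiT's quantizer may work differently.

def toy_codes_to_indices(codes, levels):
    """Mixed-radix encoding: each token is a tuple of per-dimension level
    indices, and `levels` gives the radix (number of levels) per digit."""
    indices = []
    for code in codes:
        idx = 0
        for digit, base in zip(code, levels):
            assert 0 <= digit < base, "code out of range for this dimension"
            idx = idx * base + digit
        indices.append(idx)
    return indices

# Example: 6-dimensional codes with 4 levels each -> codebook of 4**6 = 4096.
levels = [4] * 6
codes = [[0, 0, 0, 0, 0, 0], [3, 3, 3, 3, 3, 3], [1, 0, 2, 0, 0, 0]]
print(toy_codes_to_indices(codes, levels))  # -> [0, 4095, 1152]
```

This hypothetical `toy_codes_to_indices` plays the role that `model.quantizer.codes_to_indices` plays in the real pipeline: it collapses a multi-dimensional code into one integer ID per token.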
If token indices are needed instead, they can be obtained via:

```python
token_ids = model.quantizer.codes_to_indices(token_dict["msgs"])
```

To visually probe the information in the token sequences, one can decode the tokens back into images:

```python
with torch.no_grad():
    detoken_dict = model.detokenize(
        msgs=token_dict["msgs"],
        offsets=token_dict["offsets"],
        num_steps=10,  # Number of denoising steps
        odesolver="euler",  # The numerical velocity field integration method
        cfg_weight=7.5,  # CFG strength
    )
```

For convenience, we also provide the `reconstruct` method, which pipelines `tokenize` and `detokenize` into a single call:

```python
with torch.no_grad():
    rec_dict = model.reconstruct(
        batch,
        global_crop=False,
        order="adaptive",
        num_crops=3,
        num_steps=10,
        odesolver="euler",
        cfg_weight=7.5,
    )
```

## Licensing

Unless otherwise noted, the model weights are licensed under the Apache License 2.0. For the code licensing, see [https://github.com/Araachie/comit?tab=readme-ov-file#licensing](https://github.com/Araachie/comit?tab=readme-ov-file#licensing).

## Citation

If you find this work helpful, please consider citing it:

```bibtex
@misc{davtyan2026comit,
  title={Communication-Inspired Tokenization for Structured Image Representations},
  author={Aram Davtyan and Yusuf Sahin and Yasaman Haghighi and Sebastian Stapf and Pablo Acuaviva and Alexandre Alahi and Paolo Favaro},
  year={2026},
  eprint={2602.20731},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2602.20731},
}
```