Rethinking Patch Dependence for Masked Autoencoders
Paper • 2401.14391 • Published • 26
How to use birder-project/vit_reg4_so150m_p14_avg_mim-bio with Birder:
# No code snippets available yet for this library. # To use this model, check the repository files and the library's documentation. # Want to help? PRs adding snippets are welcome at: # https://github.com/huggingface/huggingface.js
A SoViT 150m p14 image encoder pretrained using Masked Image Modeling (MIM). It was trained with a mask unit size of 2 for 300 epochs. This model has not been fine-tuned for a specific classification task and is intended to be used as a general-purpose feature extractor or a backbone for downstream tasks like object detection, segmentation, or custom classification.
Model Type: Image classification and detection backbone
Model Stats:
Dataset: Trained on a diverse dataset of approximately 31M images, including:
Papers:
import birder
from birder.inference.classification import infer_image
# Option 1: manual setup (more control over preprocessing)
net, model_info = birder.load_pretrained_model("vit_reg4_so150m_p14_avg_mim-bio", inference=True)
# Get the image size the model was trained on
size = birder.get_size_from_signature(model_info.signature)
# Create an inference transform
transform = birder.classification_transform(size, model_info.rgb_stats)
# Option 2: helper (quick start with default preprocessing)
net, model_info, transform = birder.load_pretrained_model_and_transform("vit_reg4_so150m_p14_avg_mim-bio", inference=True)
image = "path/to/image.jpeg" # or a PIL image
out, embedding = infer_image(net, image, transform, return_embedding=True)
# embedding is a NumPy array with shape of (1, 896)
from PIL import Image
import birder
net, model_info, transform = birder.load_pretrained_model_and_transform("vit_reg4_so150m_p14_avg_mim-bio", inference=True)
image = Image.open("path/to/image.jpeg")
features = net.detection_features(transform(image).unsqueeze(0))
# features is a dict (stage name -> torch.Tensor)
print([(k, v.size()) for k, v in features.items()])
# Output example:
# [('stage1', torch.Size([1, 896, 16, 16]))]
@misc{dosovitskiy2021imageworth16x16words,
title={An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale},
author={Alexey Dosovitskiy and Lucas Beyer and Alexander Kolesnikov and Dirk Weissenborn and Xiaohua Zhai and Thomas Unterthiner and Mostafa Dehghani and Matthias Minderer and Georg Heigold and Sylvain Gelly and Jakob Uszkoreit and Neil Houlsby},
year={2021},
eprint={2010.11929},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2010.11929},
}
@misc{darcet2024visiontransformersneedregisters,
title={Vision Transformers Need Registers},
author={Timothée Darcet and Maxime Oquab and Julien Mairal and Piotr Bojanowski},
year={2024},
eprint={2309.16588},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2309.16588},
}
@misc{alabdulmohsin2024gettingvitshapescaling,
title={Getting ViT in Shape: Scaling Laws for Compute-Optimal Model Design},
author={Ibrahim Alabdulmohsin and Xiaohua Zhai and Alexander Kolesnikov and Lucas Beyer},
year={2024},
eprint={2305.13035},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2305.13035},
}
@misc{fu2025rethinkingpatchdependencemasked,
title={Rethinking Patch Dependence for Masked Autoencoders},
author={Letian Fu and Long Lian and Renhao Wang and Baifeng Shi and Xudong Wang and Adam Yala and Trevor Darrell and Alexei A. Efros and Ken Goldberg},
year={2025},
eprint={2401.14391},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2401.14391},
}