Model Card for vit_reg4_so150m_p14_avg_mim-bio

A SoViT 150m p14 image encoder pretrained using Masked Image Modeling (MIM). It was trained with a mask unit size of 2 for 300 epochs. This model has not been fine-tuned for a specific classification task and is intended to be used as a general-purpose feature extractor or a backbone for downstream tasks like object detection, segmentation, or custom classification.

Model Details

Model Type: Image classification and detection backbone
Model Stats:
- Params (M): 133.6
- Input image size: 224 x 224
Dataset: Trained on a diverse dataset of approximately 31M images, including:
- TreeOfLife-10M-EOL-NaturalImages
- iNaturalist 2021
- BIOSCAN-5M (pretrain split)
- TreeOfLife-200M (subset)
- IP102 v1.1
- iWildCam 2022 (subset)
- The Birder dataset (private dataset)
Papers:
- An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale: https://arxiv.org/abs/2010.11929
- Vision Transformers Need Registers: https://arxiv.org/abs/2309.16588
- Getting ViT in Shape: Scaling Laws for Compute-Optimal Model Design: https://arxiv.org/abs/2305.13035
- Rethinking Patch Dependence for Masked Autoencoders: https://arxiv.org/abs/2401.14391

Model Usage

Image Embeddings

import birder
from birder.inference.classification import infer_image

# Option 1: manual setup (more control over preprocessing)
net, model_info = birder.load_pretrained_model("vit_reg4_so150m_p14_avg_mim-bio", inference=True)

# Get the image size the model was trained on
size = birder.get_size_from_signature(model_info.signature)

# Create an inference transform
transform = birder.classification_transform(size, model_info.rgb_stats)

# Option 2: helper (quick start with default preprocessing)
net, model_info, transform = birder.load_pretrained_model_and_transform("vit_reg4_so150m_p14_avg_mim-bio", inference=True)

image = "path/to/image.jpeg"  # or a PIL image
out, embedding = infer_image(net, image, transform, return_embedding=True)
# embedding is a NumPy array with shape of (1, 896)

Detection Feature Map

from PIL import Image
import birder

net, model_info, transform = birder.load_pretrained_model_and_transform("vit_reg4_so150m_p14_avg_mim-bio", inference=True)

image = Image.open("path/to/image.jpeg")
features = net.detection_features(transform(image).unsqueeze(0))
# features is a dict (stage name -> torch.Tensor)
print([(k, v.size()) for k, v in features.items()])
# Output example:
# [('stage1', torch.Size([1, 896, 16, 16]))]

Citation

@misc{dosovitskiy2021imageworth16x16words,
      title={An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale},
      author={Alexey Dosovitskiy and Lucas Beyer and Alexander Kolesnikov and Dirk Weissenborn and Xiaohua Zhai and Thomas Unterthiner and Mostafa Dehghani and Matthias Minderer and Georg Heigold and Sylvain Gelly and Jakob Uszkoreit and Neil Houlsby},
      year={2021},
      eprint={2010.11929},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2010.11929},
}

@misc{darcet2024visiontransformersneedregisters,
      title={Vision Transformers Need Registers},
      author={Timothée Darcet and Maxime Oquab and Julien Mairal and Piotr Bojanowski},
      year={2024},
      eprint={2309.16588},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2309.16588},
}

@misc{alabdulmohsin2024gettingvitshapescaling,
      title={Getting ViT in Shape: Scaling Laws for Compute-Optimal Model Design},
      author={Ibrahim Alabdulmohsin and Xiaohua Zhai and Alexander Kolesnikov and Lucas Beyer},
      year={2024},
      eprint={2305.13035},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2305.13035},
}

@misc{fu2025rethinkingpatchdependencemasked,
      title={Rethinking Patch Dependence for Masked Autoencoders},
      author={Letian Fu and Long Lian and Renhao Wang and Baifeng Shi and Xudong Wang and Adam Yala and Trevor Darrell and Alexei A. Efros and Ken Goldberg},
      year={2025},
      eprint={2401.14391},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2401.14391},
}

Downloads last month: 147

Papers for birder-project/vit_reg4_so150m_p14_avg_mim-bio