Dot Only Model

Model Description

A SODA-VEC embedding model trained with the dot-product loss only. It learns biomedical text representations from L2-normalized embeddings using a contrastive (dot-product) objective alone.

This model is part of the SODA-VEC (Scientific Open Domain Adaptation for Vector Embeddings) project, which focuses on creating high-quality embedding models for biomedical and life sciences text.

Key Features:

  • Trained on 26.5M biomedical title-abstract pairs from PubMed Central
  • Based on ModernBERT-base architecture
  • Optimized for biomedical text similarity and semantic search
  • Produces 768-dimensional embeddings with mean pooling

Training Details

Training Data

The model was trained on 26.5M biomedical title-abstract pairs drawn from PubMed Central.

Training Procedure

Loss Function: Dot Only: normalized embeddings trained with the dot-product loss alone (diagonal + off-diagonal terms of the batch similarity matrix); see the sketch below.

Coefficients: dot=1.0

Base Model: answerdotai/ModernBERT-base
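
A minimal sketch of this objective, assuming the diagonal entries of the similarity matrix correspond to matched title-abstract pairs within a batch and the off-diagonal entries to mismatched pairs; the exact loss implemented in scripts/soda-vec-train.py may differ:

import torch
import torch.nn.functional as F

def dot_only_loss(title_emb: torch.Tensor, abstract_emb: torch.Tensor) -> torch.Tensor:
    # L2-normalize both views so dot products are cosine similarities
    title_emb = F.normalize(title_emb, p=2, dim=1)
    abstract_emb = F.normalize(abstract_emb, p=2, dim=1)
    # Batch similarity matrix: diagonal = matched title-abstract pairs,
    # off-diagonal = mismatched pairs within the batch
    sim = title_emb @ abstract_emb.T
    # Push matched pairs toward similarity 1 and mismatched pairs toward 0
    # (illustrative choice of penalty; not necessarily the training loss)
    target = torch.eye(sim.size(0), device=sim.device)
    return F.mse_loss(sim, target)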

Training Configuration:

  • GPUs: 4
  • Batch Size per GPU: 16
  • Gradient Accumulation: 4
  • Effective Batch Size: 256 (4 GPUs × 16 per GPU × 4 accumulation steps)
  • Learning Rate: 2e-05
  • Warmup Steps: 100
  • Pooling Strategy: mean
  • Epochs: 1 (full dataset pass)

Training Command:

python scripts/soda-vec-train.py --config dot_only --coeff_dot 1 --push_to_hub --hub_org EMBO --save_limit 5
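
For reference, the configuration above maps roughly onto standard transformers TrainingArguments as sketched here; this is a hypothetical equivalent, and the actual soda-vec-train.py script may configure training differently:

from transformers import TrainingArguments

# Hypothetical mapping of the training configuration listed above;
# the real soda-vec-train.py script may differ.
args = TrainingArguments(
    output_dir="dot_only",
    per_device_train_batch_size=16,   # per GPU
    gradient_accumulation_steps=4,    # 4 GPUs x 16 x 4 = 256 effective batch size
    learning_rate=2e-5,
    warmup_steps=100,
    num_train_epochs=1,
)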

Model Architecture

  • Base Architecture: ModernBERT-base (22 layers, 768 hidden size)
  • Pooling: Mean pooling over token embeddings
  • Output Dimension: 768
  • Normalization: L2-normalized embeddings

Usage

Using Sentence-Transformers

from sentence_transformers import SentenceTransformer

# Load the model
model = SentenceTransformer("EMBO/dot_only")

# Encode sentences
sentences = [
    "CRISPR-Cas9 gene editing in human cells",
    "Genome editing using CRISPR technology"
]

embeddings = model.encode(sentences)
print(f"Embedding shape: {embeddings.shape}")

# Compute similarity
from sentence_transformers.util import cos_sim
similarity = cos_sim(embeddings[0], embeddings[1])
print(f"Similarity: {similarity.item():.4f}")

Using Hugging Face Transformers

from transformers import AutoTokenizer, AutoModel
import torch
import torch.nn.functional as F

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("EMBO/dot_only")
model = AutoModel.from_pretrained("EMBO/dot_only")

# Encode sentences
sentences = [
    "CRISPR-Cas9 gene editing in human cells",
    "Genome editing using CRISPR technology"
]

inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
    
# Mean pooling over token embeddings, ignoring padding tokens
mask = inputs["attention_mask"].unsqueeze(-1).float()
embeddings = (outputs.last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1)

# L2-normalize (the model is trained with normalized embeddings)
embeddings = F.normalize(embeddings, p=2, dim=1)

# Compute similarity
similarity = F.cosine_similarity(embeddings[0:1], embeddings[1:2])
print(f"Similarity: {similarity.item():.4f}")

Intended Use

This model is designed for:

  • Biomedical Semantic Search: Finding relevant papers, abstracts, or text passages (see the example below)
  • Scientific Text Similarity: Computing similarity between biomedical texts
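
A minimal semantic-search sketch using sentence-transformers; the corpus texts here are illustrative:

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("EMBO/dot_only")

# Illustrative corpus and query
corpus = [
    "CRISPR-Cas9 gene editing in human cells",
    "Protein folding prediction with deep learning",
    "Single-cell RNA sequencing of tumor microenvironments",
]
query = "Genome editing using CRISPR technology"

corpus_emb = model.encode(corpus, convert_to_tensor=True)
query_emb = model.encode(query, convert_to_tensor=True)

# Rank corpus entries by cosine similarity to the query
hits = util.semantic_search(query_emb, corpus_emb, top_k=3)[0]
for hit in hits:
    print(f"{hit['score']:.4f}  {corpus[hit['corpus_id']]}")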