# Harrier-Arabic-Matryoshka-270m

A 270M-parameter Arabic sentence embedding model based on microsoft/harrier-oss-v1-270m, fine-tuned for Arabic semantic similarity with Matryoshka Representation Learning.

Matryoshka training was applied across the dimension ladder 640 → 512 → 256 → 128 → 64, so you can truncate the output embedding to any of these sizes with minimal quality loss — useful for faster retrieval and lighter indexes.
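Truncation itself is just slicing off the trailing dimensions and re-normalizing to unit length. A minimal numpy sketch of that step, using a random vector as a stand-in for a real 640-dim embedding (the helper name `truncate_embedding` is illustrative, not part of any library):

```python
import numpy as np

def truncate_embedding(emb: np.ndarray, dim: int) -> np.ndarray:
    """Keep the first `dim` components and re-normalize to unit length."""
    truncated = emb[..., :dim]
    return truncated / np.linalg.norm(truncated, axis=-1, keepdims=True)

# Random unit vector standing in for a full 640-dim embedding.
rng = np.random.default_rng(0)
full = rng.normal(size=640)
full /= np.linalg.norm(full)

small = truncate_embedding(full, 256)
print(small.shape)  # (256,)
```

Re-normalizing after the slice matters whenever you compare vectors by dot product, since the truncated vector is no longer unit length.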

## Model details

| Field | Value |
|---|---|
| Base model | microsoft/harrier-oss-v1-270m |
| Parameters | ~270M |
| Full embedding dimension | 640 |
| Matryoshka dims | 640, 512, 256, 128, 64 |
| Max sequence length | 32,768 |
| Pooling | inherited from base |
| Language | Arabic (cross-lingual capabilities inherited from base) |

## Evaluation

Spearman correlation on Arabic semantic textual similarity tasks (MTEB), full-dim embeddings:

| Task | Subset | Baseline (harrier-oss-v1-270m) | This model | Δ |
|---|---|---|---|---|
| STS17 | ar-ar | 0.7598 | 0.8135 | +0.054 |
| STS17 | en-ar | 0.7601 | 0.8145 | +0.054 |
| STS22.v2 | ar | 0.6510 | 0.6457 | -0.005 |
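Spearman correlation scores how well the *ranking* of model similarity scores matches the ranking of human judgments, ignoring absolute values. A pure-Python sketch of the metric (no tie handling, for simplicity):

```python
def spearman(xs, ys):
    """Spearman rank correlation as the Pearson correlation of the ranks."""
    def ranks(vs):
        order = sorted(range(len(vs)), key=lambda i: vs[i])
        r = [0.0] * len(vs)
        for rank, i in enumerate(order):
            r[i] = float(rank)
        return r

    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx) ** 0.5
    vy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (vx * vy)

# A perfectly monotone relationship scores 1.0 regardless of scale.
print(spearman([0.1, 0.4, 0.9], [10, 20, 30]))  # 1.0
```

In practice the MTEB harness computes this with `scipy.stats.spearmanr`, which also handles ties.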

## Usage

### Standard (full 640-dim embeddings)

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer(
    "Omartificial-Intelligence-Space/Harrier-Arabic-Matryoshka-270m",
    trust_remote_code=True,
)

sentences = [
    "تعلم اللغة العربية ممتع ومثير.",
    "دراسة العربية تجربة شيقة.",
    "القطط تحب اللعب في الحديقة.",
]

embeddings = model.encode(sentences, normalize_embeddings=True)
print(embeddings.shape)  # (3, 640)
```

### Truncated (Matryoshka) embeddings

Pick any dim from the ladder for smaller, faster vectors:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer(
    "Omartificial-Intelligence-Space/Harrier-Arabic-Matryoshka-270m",
    trust_remote_code=True,
    truncate_dim=256,  # one of: 640, 512, 256, 128, 64
)

embeddings = model.encode(["..."], normalize_embeddings=True)
print(embeddings.shape)  # (1, 256)
```

### Cosine similarity

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

model = SentenceTransformer(
    "Omartificial-Intelligence-Space/Harrier-Arabic-Matryoshka-270m",
    trust_remote_code=True,
)

a = model.encode("تعلم اللغة العربية ممتع ومثير.", normalize_embeddings=True)
b = model.encode("دراسة العربية تجربة شيقة.", normalize_embeddings=True)
print(cos_sim(a, b))
```
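Because the embeddings are encoded with `normalize_embeddings=True`, cosine similarity reduces to a plain dot product. A numpy equivalent of `cos_sim`, shown with random unit vectors standing in for real embeddings:

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity; for unit-length vectors this is just the dot product."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
a = rng.normal(size=640)
a /= np.linalg.norm(a)
b = rng.normal(size=640)
b /= np.linalg.norm(b)

print(np.isclose(cosine(a, b), np.dot(a, b)))  # True
```

Skipping the division for pre-normalized vectors is why large-scale retrieval systems store normalized embeddings and rank by dot product.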

## Intended use

- Arabic semantic textual similarity
- Arabic sentence/passage retrieval and re-ranking
- Cross-lingual retrieval against the base model's supported languages
- Clustering and deduplication of Arabic text
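For the retrieval use case, the common pattern is to encode the corpus once, then rank it against each query by cosine score. A model-free numpy sketch of that top-k step, with random vectors standing in for the normalized embeddings `model.encode(...)` would return:

```python
import numpy as np

rng = np.random.default_rng(42)

# Stand-ins for normalized 256-dim embeddings of a 100-document corpus.
corpus = rng.normal(size=(100, 256))
corpus /= np.linalg.norm(corpus, axis=1, keepdims=True)

# A query that is a lightly perturbed copy of document 7.
query = corpus[7] + 0.01 * rng.normal(size=256)
query /= np.linalg.norm(query)

scores = corpus @ query          # cosine scores (all vectors are unit length)
top_k = np.argsort(-scores)[:5]  # indices of the 5 best matches
print(top_k[0])  # 7
```

With truncated Matryoshka embeddings, the same matrix product runs over fewer columns, which is where the latency and index-size savings come from.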

## Citation

If you use this model, please cite the base model and Matryoshka Representation Learning:

```bibtex
@misc{harrieross,
  title  = {Harrier OSS v1},
  author = {Microsoft},
  url    = {https://huggingface.co/microsoft/harrier-oss-v1-270m}
}

@inproceedings{kusupati2022matryoshka,
  title     = {Matryoshka Representation Learning},
  author    = {Kusupati, Aditya and others},
  booktitle = {NeurIPS},
  year      = {2022}
}
```

## License

This model inherits the license of its base model. See microsoft/harrier-oss-v1-270m for terms.
