# EmbeddingGemma-300M ONNX
Modified from `google/embeddinggemma-300m`.

This is a Model Derivative under the Gemma Terms of Use. See `LICENSE`,
`NOTICE`, and `PROHIBITED_USE_POLICY.md` in this repository.
## What this is
ONNX (opset 17) export of EmbeddingGemma-300M with the full
sentence-transformers pipeline (transformer backbone, mean pooling, and
the two dense projection layers) baked into a single graph, so one forward
pass yields the final 768-dim sentence embedding. Intended for CPU/GPU
inference via ONNX Runtime in resource-constrained environments (e.g.
Jetson Orin, Kubernetes init containers) where pulling in torch +
sentence-transformers is undesirable.
## Files
| File | Precision | Size | Notes |
|---|---|---|---|
| `model.onnx` | fp32 | ~1.2 GB | Reference export, matches upstream accuracy. |
| `model_int8.onnx` | INT8 (dynamic, per-channel) | ~310 MB | ORT `quantize_dynamic` with `QInt8`, `per_channel=True`. Weights-only; activations stay fp32. |
Sanity check against the reference `SentenceTransformer.encode(...)`
implementation and an independently served TEI endpoint of the upstream
model:
| Comparison | Cosine |
|---|---|
| `model.onnx` (fp32) vs upstream `ST.encode` | 1.00000 |
| `model.onnx` (fp32) vs TEI-served reference | 1.00000 |
| `model_int8.onnx` vs upstream `ST.encode` | 0.984 |
| `model_int8.onnx` vs `model.onnx` | 0.984 |
The fp32 export is numerically equivalent to the upstream model. The INT8 variant drifts by ~0.016 cosine: retrieval ranking is preserved, but do not mix fp32 and INT8 vectors in the same index.
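A minimal sketch of how such a check can be reproduced, comparing this export against the upstream `SentenceTransformer` pipeline (the probe texts and local paths are illustrative, and the upstream repo is gated):

```python
# Cosine similarity between this ONNX export and the upstream
# sentence-transformers pipeline on a couple of probe sentences.
import numpy as np
import onnxruntime as ort
from sentence_transformers import SentenceTransformer
from transformers import AutoTokenizer

texts = ["Which planet is known as the Red Planet?", "Mars is the fourth planet."]

# Reference vectors from the (gated) upstream model.
ref = SentenceTransformer("google/embeddinggemma-300m").encode(
    texts, normalize_embeddings=True
)

# Candidate vectors from the ONNX graph in this repo.
tok = AutoTokenizer.from_pretrained(".")
sess = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])
enc = tok(texts, padding=True, truncation=True, max_length=2048, return_tensors="np")
(cand,) = sess.run(
    ["sentence_embedding"],
    {"input_ids": enc["input_ids"].astype(np.int64),
     "attention_mask": enc["attention_mask"].astype(np.int64)},
)
cand = cand / np.linalg.norm(cand, axis=1, keepdims=True)

# Row-wise cosine: expect ~1.00000 for fp32, ~0.984 for the INT8 variant.
print((ref * cand).sum(axis=1))
```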
Note (2026-04-19): an earlier upload of this repository was exported with an older
`transformers` version than the one the upstream model was saved with, which caused the ONNX graph to silently diverge from the reference implementation. That artifact has been replaced. If you pulled `model.onnx` before 2026-04-19, re-pull: the SHA-256 of the current fp32 artifact is `e7a1688794…`. See `NOTICE` for the full revision history.
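To confirm you have the current artifact, hash the file locally (`sha256sum model.onnx`, or the pure-Python equivalent below) and check the digest against the prefix above:

```python
# Compute the SHA-256 of the local fp32 artifact; for the post-2026-04-19
# upload the hex digest should start with e7a1688794.
import hashlib

h = hashlib.sha256()
with open("model.onnx", "rb") as f:
    for chunk in iter(lambda: f.read(1 << 20), b""):
        h.update(chunk)
print(h.hexdigest())
```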
## I/O
| Name | Kind | Shape | Dtype |
|---|---|---|---|
| `input_ids` | input | `[batch, seq]` | int64 |
| `attention_mask` | input | `[batch, seq]` | int64 |
| `token_embeddings` | output | `[batch, seq, 768]` | float32 |
| `sentence_embedding` | output | `[batch, 768]` | float32 |
Max input length: 2048 tokens. Activations are not valid in float16; keep fp32 (or bf16 on supported hardware), as noted in the upstream model card.
## Usage
```python
import onnxruntime as ort
from transformers import AutoTokenizer
import numpy as np

tok = AutoTokenizer.from_pretrained(".")  # or the HF repo id of this model
sess = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])
# or for the INT8 variant:
# sess = ort.InferenceSession("model_int8.onnx", providers=["CPUExecutionProvider"])

texts = ["Which planet is known as the Red Planet?"]
enc = tok(texts, padding=True, truncation=True, max_length=2048, return_tensors="np")
outputs = sess.run(
    ["sentence_embedding"],
    {"input_ids": enc["input_ids"].astype(np.int64),
     "attention_mask": enc["attention_mask"].astype(np.int64)},
)
emb = outputs[0]  # shape (1, 768)
# L2-normalize if you want cosine similarity; the upstream model already outputs
# normalized vectors, but re-normalize after any MRL truncation.
```
For Matryoshka (MRL) truncation to 512 / 256 / 128 dims, slice
`sentence_embedding[:, :D]` and re-normalize, as sketched below.
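A sketch, continuing from the `emb` array in the snippet above (the choice of `D = 256` is illustrative):

```python
# Matryoshka truncation: keep the leading D dims, then re-normalize so
# cosine similarity stays meaningful at the reduced dimensionality.
D = 256  # or 512 / 128
mrl = emb[:, :D]
mrl = mrl / np.linalg.norm(mrl, axis=1, keepdims=True)
```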
## How it was produced
See `export.py`. Summary:

```bash
huggingface-cli login   # upstream is gated
uv add optimum optimum-onnx "transformers>=4.57,<5" \
    "sentence-transformers>=5.1" torch onnx onnxruntime
uv run python export.py
# which runs:
#   python -m optimum.exporters.onnx \
#     --model google/embeddinggemma-300m \
#     --library sentence_transformers \
#     --opset 17 ./embeddinggemma-300m-onnx
# then onnxruntime.quantization.quantize_dynamic(
#   weight_type=QInt8, per_channel=True) -> model_int8.onnx
```
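The quantization step summarized above is the stock ORT dynamic quantizer; roughly (file names as shipped in this repo):

```python
# Weights-only dynamic quantization: weight tensors become per-channel
# QInt8, activations remain fp32 at runtime.
from onnxruntime.quantization import QuantType, quantize_dynamic

quantize_dynamic(
    model_input="model.onnx",
    model_output="model_int8.onnx",
    weight_type=QuantType.QInt8,
    per_channel=True,
)
```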
Do not downgrade `transformers` below 4.57. The model was saved with
`transformers==4.57.0.dev0`; older versions ship an earlier Gemma 3 Text
forward pass that has since changed. Exporting under a downgraded
`transformers` still passes the exporter's own max-diff check (the
reference PyTorch model runs the same stale forward) but yields ONNX
outputs that disagree with any correctly configured serving stack by
~0.3 cosine. Always verify a fresh export against an independent oracle
before publishing; see the sanity-check table above and the sketch below.
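One way to run that oracle check, assuming a TEI container is serving the upstream model locally (the URL and port are placeholders; the payload follows TEI's `/embed` API):

```python
# Fetch reference embeddings from an independently served TEI endpoint and
# compare them against the fresh export (see the Usage snippet for the
# ONNX side). Anything well below 1.0 cosine for fp32 means a bad export.
import json
import urllib.request

import numpy as np

def tei_embed(texts, url="http://localhost:8080/embed"):
    req = urllib.request.Request(
        url,
        data=json.dumps({"inputs": texts}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return np.asarray(json.load(resp), dtype=np.float32)

ref = tei_embed(["Which planet is known as the Red Planet?"])
ref = ref / np.linalg.norm(ref, axis=1, keepdims=True)
```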
The `--library sentence_transformers` flag makes Optimum trace the full
SBERT forward (backbone → pooling → dense → dense), which is why a single
`sentence_embedding` output is emitted.
## License & redistribution
This model is distributed under the Gemma Terms of Use
(see `LICENSE`). You must comply with §3.1 (Distribution and
Redistribution) and the Gemma Prohibited Use Policy (see
`PROHIBITED_USE_POLICY.md`) for any further redistribution or use.
Required on redistribution (per §3.1):
- Propagate the Section 3.2 use restrictions as an enforceable term to your downstream users.
- Provide all third-party recipients a copy of the Gemma Terms of Use.
- Mark any files you further modify with a prominent modification notice.
- Include a `NOTICE` text file with: "Gemma is provided under and subject to the Gemma Terms of Use found at ai.google.dev/gemma/terms".
This repository ships all four artifacts (`LICENSE`, `NOTICE`,
`PROHIBITED_USE_POLICY.md`, and this `README.md` documenting the
modifications).
## Credits
- Upstream model: Google DeepMind, `google/embeddinggemma-300m` (model card, paper).
- Conversion tooling: 🤗 Optimum + the PyTorch ONNX exporter.