Access this EmbeddingGemma ONNX derivative

This repository contains a Model Derivative of Google's EmbeddingGemma-300M and is distributed under the Gemma Terms of Use (https://ai.google.dev/gemma/terms) and subject to the Gemma Prohibited Use Policy (https://ai.google.dev/gemma/prohibited_use_policy). By requesting access you agree to be bound by both documents, which are included verbatim in this repository as LICENSE and PROHIBITED_USE_POLICY.md.

EmbeddingGemma-300M (ONNX)

Modified from google/embeddinggemma-300m. This is a Model Derivative under the Gemma Terms of Use. See LICENSE, NOTICE, and PROHIBITED_USE_POLICY.md in this repository.

What this is

ONNX (opset 17) export of EmbeddingGemma-300M with the full sentence-transformers pipeline (transformer backbone, mean pooling, and the two dense projection layers) baked into a single graph, so one forward pass yields the final 768-dim sentence embedding. Intended for CPU/GPU inference via ONNX Runtime in resource-constrained environments (e.g. Jetson Orin, Kubernetes init containers) where pulling in torch + sentence-transformers is undesirable.

Files

| File | Precision | Size | Notes |
|------|-----------|------|-------|
| `model.onnx` | fp32 | ~1.2 GB | Reference export; matches upstream accuracy. |
| `model_int8.onnx` | INT8 (dynamic, per-channel) | ~310 MB | ORT `quantize_dynamic` with `QInt8`, `per_channel=True`. Weights-only; activations stay fp32. |

Sanity check against the reference `SentenceTransformer.encode(...)` implementation and an independently served TEI endpoint of the upstream model:

| Comparison | Cosine |
|------------|--------|
| `model.onnx` (fp32) vs upstream `ST.encode` | 1.00000 |
| `model.onnx` (fp32) vs TEI-served reference | 1.00000 |
| `model_int8.onnx` vs upstream `ST.encode` | 0.984 |
| `model_int8.onnx` vs `model.onnx` | 0.984 |

The fp32 export is numerically equivalent to the upstream model. The INT8 variant drifts by ~0.016 in cosine similarity; retrieval ranking is preserved, but do not mix fp32 and INT8 vectors in the same index.
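To reproduce the fp32 row yourself, here is a minimal sketch (it assumes torch, sentence-transformers, and access to the gated upstream checkpoint; any task prompt must be applied identically on both sides for the comparison to be meaningful):

```python
import numpy as np
import onnxruntime as ort
from sentence_transformers import SentenceTransformer
from transformers import AutoTokenizer

text = "Which planet is known as the Red Planet?"

# Reference: upstream sentence-transformers pipeline (requires gated access).
st = SentenceTransformer("google/embeddinggemma-300m")
ref = st.encode([text])[0]

# Candidate: this ONNX export.
tok = AutoTokenizer.from_pretrained(".")
sess = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])
enc = tok([text], padding=True, truncation=True, max_length=2048, return_tensors="np")
(onnx_emb,) = sess.run(
    ["sentence_embedding"],
    {"input_ids": enc["input_ids"].astype(np.int64),
     "attention_mask": enc["attention_mask"].astype(np.int64)},
)

cos = float(np.dot(ref, onnx_emb[0])
            / (np.linalg.norm(ref) * np.linalg.norm(onnx_emb[0])))
print(f"cosine: {cos:.5f}")  # ~1.00000 for the fp32 export
```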

Note (2026-04-19): an earlier upload of this repository was exported with an older transformers version than the one the upstream model was saved with, which caused the ONNX graph to silently diverge from the reference implementation. That artifact has been replaced. If you pulled `model.onnx` before 2026-04-19, re-pull; the SHA-256 of the current fp32 artifact starts with e7a1688794…. See NOTICE for the full revision history.

I/O

| Name | Kind | Shape | Dtype |
|------|------|-------|-------|
| `input_ids` | input | `[batch, seq]` | int64 |
| `attention_mask` | input | `[batch, seq]` | int64 |
| `token_embeddings` | output | `[batch, seq, 768]` | float32 |
| `sentence_embedding` | output | `[batch, 768]` | float32 |

Max input length: 2048 tokens. Activations are not valid in float16; keep fp32 (or bf16 on supported hardware), as noted on the upstream model card.
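The I/O signature can be confirmed directly from the loaded graph:

```python
import onnxruntime as ort

sess = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])
for arg in sess.get_inputs():
    print("input :", arg.name, arg.shape, arg.type)   # name, symbolic shape, dtype
for arg in sess.get_outputs():
    print("output:", arg.name, arg.shape, arg.type)
```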

Usage

```python
import numpy as np
import onnxruntime as ort
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained(".")  # or the HF repo id of this model
sess = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])
# or for the INT8 variant:
# sess = ort.InferenceSession("model_int8.onnx", providers=["CPUExecutionProvider"])

texts = ["Which planet is known as the Red Planet?"]
enc = tok(texts, padding=True, truncation=True, max_length=2048, return_tensors="np")
outputs = sess.run(
    ["sentence_embedding"],
    {"input_ids": enc["input_ids"].astype(np.int64),
     "attention_mask": enc["attention_mask"].astype(np.int64)},
)
emb = outputs[0]  # shape (1, 768)
# L2-normalize if you want cosine similarity; the upstream model already outputs
# normalized vectors, but re-normalize after any MRL truncation.
```

For Matryoshka (MRL) truncation to 512 / 256 / 128 dims, slice `sentence_embedding[:, :D]` and re-normalize, as in the sketch below.
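A minimal sketch (`truncate_mrl` is an illustrative helper, not something shipped in this repo):

```python
import numpy as np

def truncate_mrl(emb: np.ndarray, dim: int) -> np.ndarray:
    """Matryoshka truncation: keep the first `dim` components, then re-normalize."""
    out = emb[:, :dim]
    return out / np.linalg.norm(out, axis=1, keepdims=True)

emb_256 = truncate_mrl(emb, 256)  # `emb` from the usage snippet above; shape (1, 256)
```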

How it was produced

See `export.py`. Summary:

```bash
huggingface-cli login  # upstream is gated
uv add optimum optimum-onnx "transformers>=4.57,<5" \
       "sentence-transformers>=5.1" torch onnx onnxruntime
uv run python export.py
# which runs:
#   python -m optimum.exporters.onnx \
#       --model google/embeddinggemma-300m \
#       --library-name sentence_transformers \
#       --opset 17 ./embeddinggemma-300m-onnx
# then onnxruntime.quantization.quantize_dynamic(
#       weight_type=QInt8, per_channel=True) -> model_int8.onnx
```
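The quantization step corresponds roughly to the following onnxruntime call (a sketch; see `export.py` for the exact invocation):

```python
from onnxruntime.quantization import QuantType, quantize_dynamic

# Weights-only dynamic quantization: weights are stored as per-channel INT8,
# activations remain fp32 at runtime.
quantize_dynamic(
    model_input="model.onnx",
    model_output="model_int8.onnx",
    weight_type=QuantType.QInt8,
    per_channel=True,
)
```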

Do not downgrade transformers below 4.57. The model was saved with `transformers==4.57.0.dev0`; older versions ship an earlier Gemma 3 Text forward pass that has since changed. Exporting under a downgraded transformers still passes the exporter's own max-diff check (the reference PyTorch model runs the same stale forward) but yields ONNX outputs that disagree with any correctly configured serving stack by ~0.3 in cosine similarity. Always verify a fresh export against an independent oracle before publishing; see the sanity-check table above.

The `--library-name sentence_transformers` flag makes Optimum trace the full SBERT forward (backbone → pooling → dense → dense), which is why the graph emits the final `sentence_embedding` output directly.

License & redistribution

This model is distributed under the Gemma Terms of Use (see LICENSE). You must comply with §3.1 (Distribution and Redistribution) and the Gemma Prohibited Use Policy (see PROHIBITED_USE_POLICY.md) for any further redistribution or use.

Required on redistribution (quoting §3.1):

  1. Propagate the Section 3.2 use restrictions as an enforceable term to your downstream users.
  2. Provide all third-party recipients a copy of the Gemma Terms of Use.
  3. Mark any files you further modify with a prominent modification notice.
  4. Include a NOTICE text file with: "Gemma is provided under and subject to the Gemma Terms of Use found at ai.google.dev/gemma/terms".

This repository ships all four artifacts (LICENSE, NOTICE, PROHIBITED_USE_POLICY.md, and this README.md documenting the modifications).

Credits

  • Upstream model: Google DeepMind β€” google/embeddinggemma-300m (model card, paper).
  • Conversion tooling: πŸ€— Optimum + PyTorch ONNX exporter.