jina-embeddings-v2-base-code: selectively quantized int8 ONNX

MatMul-only int8 quantization of jinaai/jina-embeddings-v2-base-code, produced for use with any ORT-based inference pipeline that handles the standard input_ids / attention_mask / token_type_ids → last_hidden_state contract. A drop-in replacement for the upstream fp32 export at roughly half the size, with meaningfully faster CPU inference.

What's in this repo

File                   Size    Notes
model_int8.onnx        302 MB  Single-file ONNX, no external data sidecar
tokenizer.json         4.3 MB  Copied verbatim from the upstream model
tokenizer_config.json  1.2 KB  Copied verbatim
config.json            1.1 KB  Copied verbatim

Quantization recipe

Generated with ONNX Runtime's quantize_dynamic, targeting only MatMul operators. This leaves the LayerNorm Pow/ReduceMean/Sqrt chain, residual Adds, and attention softmax at fp32, since those are the ops that tolerate quantization poorly. 72 of 96 MatMul nodes quantize cleanly to int8.

from onnxruntime.quantization import quantize_dynamic, QuantType

# Dynamic quantization: weights are stored as int8; activations are
# quantized on the fly at inference time. Restricting op_types_to_quantize
# to MatMul leaves the quantization-sensitive ops above untouched.
quantize_dynamic(
    model_input="model.onnx",
    model_output="model_int8.onnx",
    op_types_to_quantize=["MatMul"],
    weight_type=QuantType.QInt8,
)
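
To sanity-check which nodes were converted, you can diff op-type counts in the exported graph. A minimal sketch; the exact replacement ops depend on the ONNX Runtime version, but dynamically quantized MatMuls typically show up as MatMulInteger fed by DynamicQuantizeLinear:

import onnx
from collections import Counter

counts = Counter(n.op_type for n in onnx.load("model_int8.onnx").graph.node)
# Remaining fp32 MatMuls vs. their int8 replacements.
for op in ("MatMul", "MatMulInteger", "DynamicQuantizeLinear"):
    print(f"{op}: {counts.get(op, 0)}")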

Measured against upstream fp32 on a 20-query code-retrieval fixture

Metric                     fp32             int8             Δ
Model size                 641 MB           302 MB           −53%
Indexing throughput (CPU)  6.66 chunks/sec  9.17 chunks/sec  +37%
hit@5                      0.900            0.850            −1 query
hit@10                     0.950            0.950            same
NDCG@10                    0.798            0.762            −0.036

Fixture: 20 natural-language queries over a medium-size Rust codebase (ripgrep 14.1.0, 952 chunks, mixed library + CLI code). Retrieval pipeline is hybrid dense + BM25 + RRF fusion; "hit@k" = expected file in the top-k results.
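
For reference, reciprocal-rank fusion (RRF) merges the dense and BM25 rankings by summing reciprocal ranks. The fixture's harness is not part of this repo; below is a generic sketch, with the conventional k=60 damping constant assumed rather than taken from the fixture:

def rrf_fuse(rankings, k=60):
    # score(d) = sum over each input ranking of 1 / (k + rank of d in that ranking)
    scores = {}
    for ranking in rankings:  # each ranking: list of doc ids, best first
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# fused = rrf_fuse([dense_top_docs, bm25_top_docs])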

Architecture notes for ORT consumers

  • Inputs: input_ids (int64), attention_mask (int64), token_type_ids (int64). All 2D [batch, seq] with unbounded dynamic dims.
  • Output: last_hidden_state β€” [batch, seq, 768] raw token embeddings. Pool + L2-normalize in the caller.
  • Pooling: upstream convention is mean pooling over attention_mask.
  • No instruction prefix β€” queries and documents get raw text.
  • Encoder-only BERT with ALiBi attention β€” no position_ids input.

CoreML note

Both the fp32 upstream and this int8 export retain unbounded intermediate-tensor dims that CoreML's MIL runtime refuses to compile. A fixed-shape [32, 512] variant compiles cleanly, but because CoreML partitions the graph into many small subgraphs on this architecture, each inference call incurs enough inter-partition CPU↔ANE data shuffling to end up slower than pure CPU. We therefore recommend CPUExecutionProvider for this model on Apple Silicon today.
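
To pin the CPU provider explicitly (ORT falls back to CPU by default, but being explicit avoids surprises if the CoreML EP is installed):

import onnxruntime as ort

sess = ort.InferenceSession(
    "model_int8.onnx",
    providers=["CPUExecutionProvider"],  # bypass CoreMLExecutionProvider on Apple Silicon
)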

License

Apache-2.0, matching the upstream jinaai/jina-embeddings-v2-base-code.
