jina-embeddings-v2-base-code: selectively quantized int8 ONNX

MatMul-only int8 quantization of jinaai/jina-embeddings-v2-base-code, produced for use with any ORT-based inference pipeline that handles the standard input_ids / attention_mask / token_type_ids → last_hidden_state contract. A drop-in replacement for the upstream fp32 export at roughly half the size, with meaningfully faster CPU inference.

What's in this repo

File                   Size    Notes
model_int8.onnx        302 MB  Single-file ONNX, no external data sidecar
tokenizer.json         4.3 MB  Copied verbatim from the upstream model
tokenizer_config.json  1.2 KB  Copied verbatim
config.json            1.1 KB  Copied verbatim

Quantization recipe

Generated with ONNX Runtime's quantize_dynamic, targeting only MatMul operators. This leaves the LayerNorm Pow/ReduceMean/Sqrt chain, residual Adds, and attention softmax at fp32, since those are the ops that tolerate quantization poorly. 72 of 96 MatMul nodes quantize cleanly to int8.

from onnxruntime.quantization import quantize_dynamic, QuantType

# Dynamic quantization: weights are stored as int8; activations are
# quantized on the fly at inference time. Restricting op_types_to_quantize
# to MatMul leaves the quantization-sensitive ops above untouched.
quantize_dynamic(
    model_input="model.onnx",
    model_output="model_int8.onnx",
    op_types_to_quantize=["MatMul"],
    weight_type=QuantType.QInt8,
)
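
To sanity-check which nodes were converted, you can diff op-type counts in the exported graph. A minimal sketch; the exact replacement ops depend on the ONNX Runtime version, but dynamically quantized MatMuls typically show up as MatMulInteger fed by DynamicQuantizeLinear:

import onnx
from collections import Counter

counts = Counter(n.op_type for n in onnx.load("model_int8.onnx").graph.node)
# Remaining fp32 MatMuls vs. their int8 replacements.
for op in ("MatMul", "MatMulInteger", "DynamicQuantizeLinear"):
    print(f"{op}: {counts.get(op, 0)}")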

Measured against upstream fp32 on a 20-query code-retrieval fixture

Metric                     fp32             int8             Δ
Model size                 641 MB           302 MB           −53%
Indexing throughput (CPU)  6.66 chunks/sec  9.17 chunks/sec  +37%
hit@5                      0.900            0.850            −1 query
hit@10                     0.950            0.950            same
NDCG@10                    0.798            0.762            −0.036

Fixture: 20 natural-language queries over a medium-size Rust codebase (ripgrep 14.1.0, 952 chunks, mixed library + CLI code). Retrieval pipeline is hybrid dense + BM25 + RRF fusion; "hit@k" = expected file in the top-k results.
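
For reference, reciprocal-rank fusion (RRF) merges the dense and BM25 rankings by summing reciprocal ranks. The fixture's harness is not part of this repo; below is a generic sketch, with the conventional k=60 damping constant assumed rather than taken from the fixture:

def rrf_fuse(rankings, k=60):
    # score(d) = sum over each input ranking of 1 / (k + rank of d in that ranking)
    scores = {}
    for ranking in rankings:  # each ranking: list of doc ids, best first
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# fused = rrf_fuse([dense_top_docs, bm25_top_docs])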

Architecture notes for ORT consumers

  • Inputs: input_ids (int64), attention_mask (int64), token_type_ids (int64). All 2D [batch, seq] with unbounded dynamic dims.
  • Output: last_hidden_state β€” [batch, seq, 768] raw token embeddings. Pool + L2-normalize in the caller.
  • Pooling: upstream convention is mean pooling over attention_mask.
  • No instruction prefix β€” queries and documents get raw text.
  • Encoder-only BERT with ALiBi attention β€” no position_ids input.

CoreML note

Both the fp32 upstream and this int8 export retain unbounded intermediate-tensor dims that CoreML's MIL runtime refuses to compile. A fixed-shape [32, 512] variant compiles cleanly, but because CoreML partitions the graph into many small subgraphs on this architecture, each inference call incurs enough inter-partition CPU↔ANE data shuffling to end up slower than pure CPU. We therefore recommend CPUExecutionProvider for this model on Apple Silicon today.
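
To pin the CPU provider explicitly (ORT falls back to CPU by default, but being explicit avoids surprises if the CoreML EP is installed):

import onnxruntime as ort

sess = ort.InferenceSession(
    "model_int8.onnx",
    providers=["CPUExecutionProvider"],  # bypass CoreMLExecutionProvider on Apple Silicon
)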

License

Apache-2.0, matching the upstream jinaai/jina-embeddings-v2-base-code.
