# jina-embeddings-v2-base-code: selectively quantized int8 ONNX
MatMul-only int8 quantization of `jinaai/jina-embeddings-v2-base-code`,
produced for use with any ORT-based inference pipeline that handles the
standard `input_ids` / `attention_mask` / `token_type_ids` → `last_hidden_state`
contract. It is a drop-in replacement for the upstream fp32 export at roughly
half the size, with meaningfully faster CPU inference.
## What's in this repo
| File | Size | Notes |
|---|---|---|
| `model_int8.onnx` | 302 MB | Single-file ONNX, no external data sidecar |
| `tokenizer.json` | 4.3 MB | Copied verbatim from the upstream model |
| `tokenizer_config.json` | 1.2 KB | Copied verbatim |
| `config.json` | 1.1 KB | Copied verbatim |
## Quantization recipe
Generated with ONNX Runtime's `quantize_dynamic`, targeting only `MatMul`
operators. This leaves the LayerNorm Pow/ReduceMean/Sqrt chain, the residual
Adds, and the attention softmax at fp32, i.e. the ops that tolerate
quantization poorly. 72 of 96 MatMul nodes quantize cleanly to int8.
```python
from onnxruntime.quantization import quantize_dynamic, QuantType

quantize_dynamic(
    model_input="model.onnx",
    model_output="model_int8.onnx",
    op_types_to_quantize=["MatMul"],
    weight_type=QuantType.QInt8,
)
```
## Measured against upstream fp32 on a 20-query code-retrieval fixture
| | fp32 | int8 | Δ |
|---|---|---|---|
| Model size | 641 MB | 302 MB | −53% |
| Indexing throughput (CPU) | 6.66 chunks/sec | 9.17 chunks/sec | +37% |
| hit@5 | 0.900 | 0.850 | −1 query |
| hit@10 | 0.950 | 0.950 | same |
| NDCG@10 | 0.798 | 0.762 | −0.036 |
Fixture: 20 natural-language queries over a medium-size Rust codebase (ripgrep 14.1.0, 952 chunks, mixed library + CLI code). Retrieval pipeline is hybrid dense + BM25 + RRF fusion; "hit@k" = expected file in the top-k results.
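For readers unfamiliar with the fusion step, here is a minimal sketch of reciprocal-rank fusion (RRF) and the hit@k metric as described above. The `k=60` constant is the conventional RRF default and the file names are illustrative; the fixture's exact parameters are not published here:

```python
# Sketch: RRF fusion of two ranked lists (e.g. dense + BM25), plus hit@k.
def rrf_fuse(rankings, k=60):
    """Fuse ranked lists of doc ids; higher RRF score = better."""
    scores = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

def hit_at_k(ranking, expected, k):
    """True if the expected file appears in the top-k results."""
    return expected in ranking[:k]

# Illustrative file names, not the fixture's actual data.
dense = ["search.rs", "main.rs", "walk.rs"]
bm25 = ["search.rs", "args.rs", "main.rs"]
fused = rrf_fuse([dense, bm25])
```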
## Architecture notes for ORT consumers
- Inputs: `input_ids` (int64), `attention_mask` (int64), `token_type_ids` (int64). All 2D `[batch, seq]` with unbounded dynamic dims.
- Output: `last_hidden_state` → `[batch, seq, 768]` raw token embeddings. Pool + L2-normalize in the caller.
- Pooling: upstream convention is mean pooling over `attention_mask`.
- No instruction prefix: queries and documents get raw text.
- Encoder-only BERT with ALiBi attention, so there is no `position_ids` input.
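The caller-side pooling convention above can be sketched in plain numpy; shapes follow the `[batch, seq, 768]` output contract. A minimal sketch, not the upstream reference implementation:

```python
# Sketch: mean-pool last_hidden_state over attention_mask, then L2-normalize.
import numpy as np

def mean_pool(last_hidden_state, attention_mask):
    """Average token embeddings, counting only non-padding positions."""
    mask = attention_mask[..., None].astype(last_hidden_state.dtype)  # [B, S, 1]
    summed = (last_hidden_state * mask).sum(axis=1)                   # [B, H]
    counts = np.clip(mask.sum(axis=1), 1e-9, None)                    # [B, 1]
    return summed / counts

def l2_normalize(x, eps=1e-12):
    """Scale each row to unit L2 norm (guarding against zero vectors)."""
    return x / np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), eps)
```

Typical use after `session.run`: feed the three int64 inputs, take `last_hidden_state`, then `l2_normalize(mean_pool(hidden, mask))` gives cosine-ready 768-dim embeddings.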
## CoreML note
Both the fp32 upstream and this int8 export retain unbounded intermediate-tensor
dims that CoreML's MIL runtime refuses to compile. A fixed-shape `[32, 512]`
variant compiles cleanly, but because CoreML partitions the graph into many
small subgraphs on this architecture, each inference call incurs enough
inter-partition CPU↔ANE data shuffling to end up slower than pure CPU.
`CPUExecutionProvider` is recommended for this model on Apple Silicon today.
## License

Apache-2.0, matching the upstream `jinaai/jina-embeddings-v2-base-code`.