# Octen-Embedding-0.6B – FP16 ONNX

An FP16-converted ONNX export of Octen/Octen-Embedding-0.6B, a Qwen3-derived retrieval embedding model producing 1024-dimensional vectors, with a 32k-token context window and last-token pooling.

1.2 GB on disk (half the memory footprint of FP32), with retrieval quality equivalent to FP32 under our validation gates.

## Quality

| Metric | Value | Threshold |
|--------|-------|-----------|
| cos_min vs. PyTorch FP32 reference (6-text multilingual probe) | 1.000000 | ≥ 0.99 |
| cos_mean vs. the same reference | 1.000000 | n/a |

Validated with the fastembed-rs cosine_parity harness on probe/ort-rc12 (ONNX Runtime 1.24). The probe set covers EN/DE/ZH plus retrieval-style sentences; all per-row cosines are ≥ 0.999999.
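The gate above reduces to a min/mean over per-row cosine similarities. A minimal sketch of that check, with tiny stand-in vectors rather than real 1024-dim embeddings (the actual harness is fastembed-rs's cosine_parity):

```rust
/// Cosine similarity between two equal-length vectors.
fn cosine(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let nb: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    dot / (na * nb)
}

fn main() {
    // Stand-in rows: one embedding per probe text, FP32 reference vs FP16.
    let fp32_rows = vec![vec![0.6f32, 0.8, 0.0], vec![1.0, 0.0, 0.0]];
    let fp16_rows = vec![vec![0.6f32, 0.8, 1e-4], vec![1.0, 1e-4, 0.0]];

    let cosines: Vec<f32> = fp32_rows
        .iter()
        .zip(&fp16_rows)
        .map(|(a, b)| cosine(a, b))
        .collect();
    let cos_min = cosines.iter().cloned().fold(f32::INFINITY, f32::min);
    let cos_mean = cosines.iter().sum::<f32>() / cosines.len() as f32;

    // Gate from the table above: the worst row must stay >= 0.99.
    assert!(cos_min >= 0.99);
    println!("cos_min={cos_min:.6} cos_mean={cos_mean:.6}");
}
```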

## Files

| File | Size | Description |
|------|------|-------------|
| model.fp16.onnx | ~5 MB | ONNX graph (weights stored as external data) |
| model.fp16.onnx.data | ~1.2 GB | FP16 weights (external data) |
| tokenizer.json, config.json, tokenizer_config.json, special_tokens_map.json | small | tokenizer and model configuration |

## Conversion

Streaming FP32→FP16 conversion via convert_fp16_streaming.py, which bypasses the 2 GB protobuf serialization limit by writing the external data file directly instead of materializing an intermediate Python-side proto.

## Use via fastembed-rs

```rust
use fastembed::{EmbeddingModel, InitOptions, TextEmbedding};

let embedder = TextEmbedding::try_new(
    InitOptions::new(EmbeddingModel::OctenEmbedding0_6BFp16))?;
let vectors = embedder.embed(vec!["hello world"], None)?;
```

Pooling: last-token (auto-applied by fastembed-rs). Use "Query: " / "Document: " prefixes for asymmetric retrieval.
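Last-token pooling takes the hidden state of the final non-padding token as the sentence embedding, rather than averaging over all tokens. A minimal sketch (function and variable names are illustrative; fastembed-rs does this internally):

```rust
/// Last-token pooling: return the hidden state of the last real token.
/// `hidden` is [seq_len][dim]; `attention_mask` is 1 for real tokens, 0 for pads.
fn last_token_pool(hidden: &[Vec<f32>], attention_mask: &[u32]) -> Vec<f32> {
    let last = attention_mask
        .iter()
        .rposition(|&m| m == 1)
        .expect("at least one real token");
    hidden[last].clone()
}

fn main() {
    let hidden = vec![
        vec![0.1, 0.2], // token 0
        vec![0.3, 0.4], // token 1: last real token -> the embedding
        vec![0.0, 0.0], // padding
    ];
    let mask = [1, 1, 0];
    assert_eq!(last_token_pool(&hidden, &mask), vec![0.3, 0.4]);

    // Asymmetric retrieval: prefix the two sides differently before embedding.
    let query = format!("Query: {}", "what is rust?");
    let document = format!("Document: {}", "Rust is a systems language.");
    println!("{query} | {document}");
}
```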

## License

Apache 2.0, inherited from the base model.
