bge-reranker-base → Core ML (.mlpackage) for Apple Silicon
Core ML port of BAAI/bge-reranker-base targeting the Apple Neural Engine on M-series Macs. Produced by the maintainer-side conversion tool at github.com/tcashel/juice-bge-reranker-coreml. Consumed by the Juice macOS app via swift-transformers.
This card is the integration contract. The Swift consumer relies on every section below; do not change a tensor name, shape, or token ID without bumping the variant tag (which the consumer pins in any per-model cache key, see below).
Requirements
- Apple Silicon Mac (M1 / M2 / M3 / M4 / later). The headline -ane build requires the Apple Neural Engine.
- macOS 15.0 (Sequoia) or later. This is the artifact's minimum_deployment_target; older macOS versions cannot load the .mlpackage.
- Swift consumer: swift-transformers ≥ 1.3.0 for HubApi (snapshot download) and AutoTokenizer (XLM-R Unigram path). Direct MLModel load via CoreML also works.
Usage
End-to-end working examples live in the GitHub repo's examples/ directory — both load the artifact, score one (query, doc) pair, and print the sigmoid-mapped relevance.
Swift (swift-transformers + CoreML)
The canonical consumer pattern; mirrors what the Juice macOS app does. Full source at examples/swift/Sources/Predict/main.swift. Key steps:
import CoreML
import Hub
import Tokenizers
let repo = Hub.Repo(id: "tcashel/bge-reranker-base-coreml", type: .models)
let folder = try await HubApi.shared.snapshot(from: repo, revision: "v0.1-ane")
let tokenizer = try await AutoTokenizer.from(modelFolder: folder)
// XLM-R paired-input template (swift-transformers does not expose textPair for Unigram):
let bos: Int32 = 0, eos: Int32 = 2, pad: Int32 = 1
let q = tokenizer.encode(text: query, addSpecialTokens: false).map(Int32.init)
let d = tokenizer.encode(text: doc, addSpecialTokens: false).map(Int32.init)
var ids: [Int32] = [bos] + q + [eos, eos] + d + [eos]
// ... pad to seq β {128, 256, 512}, fill 20 batch rows with <pad>, then:
let config = MLModelConfiguration()
config.computeUnits = .cpuAndNeuralEngine
let model = try MLModel(contentsOf: folder.appendingPathComponent("model.mlpackage"), configuration: config)
let prediction = try await model.prediction(from: provider)
let logit = Double(truncating: prediction.featureValue(for: "logit")!.multiArrayValue![[0, 0]])
let score = 1.0 / (1.0 + exp(-logit))
Run:
cd examples/swift
swift run Predict --tag v0.1-ane --query "what is the capital of france?" --doc "Paris is the capital of France."
Python (coremltools + transformers for tokenization)
For verifying the artifact end-to-end on macOS without a Swift toolchain. Full source at examples/predict.py:
import math, numpy as np
from coremltools.models import MLModel
from huggingface_hub import snapshot_download
from transformers import AutoTokenizer
folder = snapshot_download(repo_id="tcashel/bge-reranker-base-coreml", revision="v0.1-ane")
tokenizer = AutoTokenizer.from_pretrained(folder, use_fast=True)
model = MLModel(f"{folder}/model.mlpackage")
# Python's transformers builds the paired-input template internally:
enc = tokenizer(query, doc, padding="max_length", truncation=True, max_length=128, return_tensors="np")
# Pad up to the fixed batch=20 (read row 0 of the output, discard the rest):
ids = np.full((20, 1, 1, 128), 1, dtype=np.int32); ids[0, 0, 0, :] = enc["input_ids"][0]
mask = np.zeros((20, 1, 1, 128), dtype=np.int32); mask[0, 0, 0, :] = enc["attention_mask"][0]
logit = float(model.predict({"input_ids": ids, "attention_mask": mask})["logit"][0, 0])
score = 1.0 / (1.0 + math.exp(-logit))
Run:
pixi run python examples/predict.py --source hub --tag v0.1-ane
Identity
- Source model: BAAI/bge-reranker-base@<source_revision_sha> (set by convert.py).
- Conversion type: PyTorch (FP32) → Core ML .mlpackage (FP16) with the apple/ml-ane-transformers primitives (Conv2d 1×1 projections, BC1S layout, LayerNormANE) so the encoder lowers to the Apple Neural Engine. This is a precision reduction (FP32 → FP16) and format conversion, not integer quantization — there is no INT8/INT4 mapping. No fine-tuning, distillation, or weight pruning was applied; weights are bit-equivalent up to FP16 rounding.
- Conversion stack: see <variant>_provenance.json published alongside the artifact (records exact torch / transformers / coremltools versions and host machine).
- License: MIT (inherited from the upstream model).
Variants
| Tag | Compute units | Intended use |
|---|---|---|
| v{X}-ane | cpuAndNeuralEngine | Headline build. The 12-layer encoder backbone (924 ops: einsum, conv, softmax, layer_norm, gelu, transpose, residual add/mul) runs on the Apple Neural Engine. 31 boundary ops (embedding gather over the 250k vocab, position-id arithmetic, mask construction, casts) dispatch to CPU; this is the Pareto frontier for XLM-RoBERTa-class models with very large vocabularies. verify_ane.py enforces this exact 924/31 residency fingerprint as a regression gate — any drift fails. M-series Macs only. |
| v{X}-cpugpu | cpuAndGPU | Known-good fallback — the same ANE port converted with compute_units=CPU_AND_GPU. Used by Swift if the -ane build fails to load (e.g. driver or macOS version mismatch). |
The Swift caller pins the tag in Hub.snapshot(repo: "tcashel/bge-reranker-base-coreml", revision: "<tag>") and embeds the same <tag> in any consumer-side cache key tied to model identity β rotating the tag invalidates downstream caches.
Repository layout. This repo uses git tags (not subdirectories or sibling repos) to distinguish variants — v{X}-ane and v{X}-cpugpu point to different commits, each containing exactly one variant's files at the repo root (one model.mlpackage, one set of tokenizer files, one provenance.json). The main branch reflects whichever variant was published last, so consumers should always pin to a specific tag rather than reading from main. This layout optimizes for the Swift consumer: HubApi.shared.snapshot(from:, revision: <tag>) returns a flat ready-to-use directory.
Architecture
Heads-up — XLM-RoBERTa, not BERT. The encoder geometry is BERT-like (12L / 768H / 12 heads, GELU, post-LN), so a casual reader may pattern-match it as a BERT cross-encoder. It isn't. The upstream config.json declares model_type: xlm-roberta, architectures: ["XLMRobertaForSequenceClassification"]. The tokenizer and special-token IDs differ accordingly (see below); don't reach for [CLS]/[SEP].
- 12 transformer encoder layers, hidden 768, 12 attention heads, intermediate FFN 3072.
- Single-segment model (type_vocab_size = 1).
- Classification head reads the <s> token (position 0): dense(768→768) → tanh → out_proj(768→1). No pooler.
- Output: a single logit per pair. Apply sigmoid on the Swift side to get a relevance score in [0, 1].
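To make the head's shapes concrete, here is a NumPy sketch of dense → tanh → out_proj followed by the sigmoid mapping. The weights and hidden states are random stand-ins, not the real checkpoint; only the shapes and the op sequence mirror the description above.

```python
import numpy as np

rng = np.random.default_rng(0)
# <s> (position 0) hidden state for each of the 20 batch rows
hidden = rng.standard_normal((20, 768)).astype(np.float32)

# dense(768 -> 768) -> tanh -> out_proj(768 -> 1); random stand-in weights
W_dense = (rng.standard_normal((768, 768)) * 0.01).astype(np.float32)
b_dense = np.zeros(768, dtype=np.float32)
W_out = (rng.standard_normal((768, 1)) * 0.01).astype(np.float32)
b_out = np.zeros(1, dtype=np.float32)

logit = np.tanh(hidden @ W_dense + b_dense) @ W_out + b_out  # (20, 1), one logit per pair
score = 1.0 / (1.0 + np.exp(-logit))                         # sigmoid -> relevance in (0, 1)
print(logit.shape, score.shape)  # (20, 1) (20, 1)
```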
Tokenizer
Class: XLMRobertaTokenizer (SentencePiece Unigram). Consumed in Swift via swift-transformers' AutoTokenizer.from(modelFolder:), which dispatches to UnigramTokenizer for this tokenizer_class.

Files in this repo (under tokenizer/): tokenizer.json (the fast-tokenizer file Swift consumes), tokenizer_config.json, special_tokens_map.json, sentencepiece.bpe.model. All four are required — tokenizer.json is the load path; the others are belt-and-braces.

Special tokens:

| Token | ID |
|---|---|
| <s> (BOS / CLS-equivalent) | 0 |
| <pad> | 1 |
| </s> (EOS / SEP-equivalent) | 2 |
| <unk> | 3 |
| <mask> | 250001 |

Padding: right-side pad with <pad> (id 1).

Vocab size: 250 002.

Max position embeddings: 514 (= 512 max content tokens + padding_idx + 1 offset).
Paired-input template (must be constructed by the Swift consumer)
<s> {query} </s></s> {document} </s>
The doubled </s></s> separator is XLM-RoBERTa-specific (NOT the single BERT [SEP] you might expect from the encoder geometry). swift-transformers does not expose encode(text:textPair:) for the Unigram path, so the Swift consumer must build the paired input itself: encode the query and document separately with addSpecialTokens: false, then splice in the special-token IDs explicitly, exactly as the Swift example in Usage does. Do not let the tokenizer add its single-sequence special tokens and then concatenate the results — that produces the wrong template.
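The ID-splicing step can be sketched language-agnostically as pure list arithmetic. The BOS/EOS IDs below come from the special-token table above; the query/document token lists are made-up stand-ins, not real tokenizer output.

```python
# Special-token IDs from the tokenizer table: <s> = 0, <pad> = 1, </s> = 2
BOS, PAD, EOS = 0, 1, 2

def build_pair(q_ids, d_ids):
    """Splice the XLM-R paired template: <s> {query} </s></s> {document} </s>."""
    return [BOS] + q_ids + [EOS, EOS] + d_ids + [EOS]

# Stand-in token IDs for illustration only
ids = build_pair([101, 102], [201, 202, 203])
print(ids)  # [0, 101, 102, 2, 2, 201, 202, 203, 2]
```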
Truncation policy
If the tokenized template exceeds the target sequence length S, truncate the document side from the right. Never truncate the query β query terms drive both lexical and semantic match in the cross-encoder. Reserve 4 token slots for the special tokens (<s>, </s>, </s>, </s>):
max_doc_tokens = S - len(query_tokens) - 4
If max_doc_tokens <= 0, the query alone fills the budget — drop the document; the score is essentially noise, and the consumer should down-weight or skip this candidate at the orchestrator.
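The policy above reduces to a few lines; this sketch uses stand-in integer token lists rather than real tokenizer output.

```python
def truncate_doc(query_tokens, doc_tokens, S):
    """Right-truncate the document; never the query; reserve 4 special-token slots."""
    max_doc_tokens = S - len(query_tokens) - 4
    if max_doc_tokens <= 0:
        return None  # query fills the budget: drop the document, skip/down-weight candidate
    return doc_tokens[:max_doc_tokens]

q = list(range(10))    # 10 stand-in query tokens
d = list(range(200))   # 200 stand-in document tokens
kept = truncate_doc(q, d, S=128)
print(len(kept))                                  # 128 - 10 - 4 = 114
print(truncate_doc(list(range(130)), d, S=128))   # None -> document dropped
```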
Input tensors (Core ML)
Both variants share the same input shape contract β they're the same architecture (the ANE-friendly port) converted with different compute_units. The (1, 1) middle dims are constant on the cpuAndGPU path (no overhead) and required by ANE's BC1S layout on the ANE path.
| Name | Dtype | Shape | Notes |
|---|---|---|---|
| input_ids | Int32 | (20, 1, 1, S) | S ∈ {128, 256, 512} via EnumeratedShapes. Token IDs in [0, 250001]. |
| attention_mask | Int32 | (20, 1, 1, S) | 1 for real tokens, 0 for <pad>. |
There is no token_type_ids input β type_vocab_size = 1, so token-type embedding is constant and folded internally.
Batch is fixed at 20 (sized for the consumer's typical post-RRF candidate pool). Smaller actual batches must be padded with <pad> rows on the Swift side; the corresponding attention_mask rows should be all-zeros. The classification head still emits 20 logits — the consumer reads the first actual_batch of them and discards the rest.
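A NumPy sketch of the batch-padding step (the real consumer does this in Swift; pad_batch is an illustrative helper, not part of the artifact):

```python
import numpy as np

PAD_ID, BATCH, S = 1, 20, 128  # <pad> id from the tokenizer table; fixed contract dims

def pad_batch(rows):
    """rows: list of (ids, mask) 1-D int arrays of length S, one per real candidate."""
    ids = np.full((BATCH, 1, 1, S), PAD_ID, dtype=np.int32)   # pad rows filled with <pad>
    mask = np.zeros((BATCH, 1, 1, S), dtype=np.int32)         # pad rows get all-zero masks
    for i, (row_ids, row_mask) in enumerate(rows):
        ids[i, 0, 0, :] = row_ids
        mask[i, 0, 0, :] = row_mask
    return ids, mask

# 3 real rows (stand-in content), 17 pad rows
row = (np.zeros(S, dtype=np.int32), np.ones(S, dtype=np.int32))
ids, mask = pad_batch([row] * 3)
print(ids.shape, int(mask.sum()))  # (20, 1, 1, 128) 384
```

After prediction, read only the first 3 of the 20 logits.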
Output tensor
| Name | Dtype | Shape | Interpretation |
|---|---|---|---|
| logit | Float32 | (20, 1) | Raw logit. Apply sigmoid to get relevance score in [0, 1]. |
Position-ID computation (informational)
Position IDs inside the model are computed as:
position_ids[i] = (arange(S) + 2) * attention_mask + 1 * (1 - attention_mask)
i.e. real tokens get positions starting at 2 (= pad_token_id + 1), pad tokens get position 1 (= pad_token_id). This is bit-exact with HF's create_position_ids_from_input_ids when input is right-padded, and it avoids cumsum (which doesn't lower cleanly to ANE). The Swift consumer does not pass position IDs as a model input.
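The formula evaluated on a toy right-padded row (S = 6, four real tokens):

```python
import numpy as np

S = 6
attention_mask = np.array([1, 1, 1, 1, 0, 0], dtype=np.int32)  # right-padded row

# position_ids = (arange(S) + 2) * mask + 1 * (1 - mask)
position_ids = (np.arange(S) + 2) * attention_mask + 1 * (1 - attention_mask)
print(position_ids.tolist())  # [2, 3, 4, 5, 1, 1]: real tokens from 2, pads at 1
```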
Performance
Measured by bench.py on the maintainer's machine (recorded under <variant>_provenance.json → machine). 50 warmup + 100 timed iterations per cell. per-pair p95 = p95 / batch.
Variant: ane
| batch | seq | p50 (ms) | p95 (ms) | per-pair p95 (ms) |
|---|---|---|---|---|
| 1 | 128 | 50.45 | 52.47 | 52.47 |
| 4 | 128 | 50.34 | 51.63 | 12.91 |
| 10 | 128 | 50.53 | 51.95 | 5.19 |
| 20 | 128 | 51.24 | 52.46 | 2.62 |
| 1 | 256 | 127.76 | 128.99 | 128.99 |
| 4 | 256 | 128.50 | 129.16 | 32.29 |
| 10 | 256 | 129.70 | 131.15 | 13.12 |
| 20 | 256 | 129.46 | 130.74 | 6.54 |
| 1 | 512 | 344.20 | 346.74 | 346.74 |
| 4 | 512 | 343.03 | 346.89 | 86.72 |
| 10 | 512 | 343.46 | 345.43 | 34.54 |
| 20 | 512 | 346.01 | 348.64 | 17.43 |
Variant: cpugpu
| batch | seq | p50 (ms) | p95 (ms) | per-pair p95 (ms) |
|---|---|---|---|---|
| 1 | 128 | 122.79 | 123.10 | 123.10 |
| 4 | 128 | 123.01 | 123.34 | 30.83 |
| 10 | 128 | 123.13 | 123.46 | 12.35 |
| 20 | 128 | 122.69 | 138.34 | 6.92 |
| 1 | 256 | 242.07 | 242.87 | 242.87 |
| 4 | 256 | 241.94 | 242.82 | 60.70 |
| 10 | 256 | 242.10 | 243.17 | 24.32 |
| 20 | 256 | 242.16 | 243.11 | 12.16 |
| 1 | 512 | 503.81 | 504.98 | 504.98 |
| 4 | 512 | 503.97 | 506.10 | 126.53 |
| 10 | 512 | 503.95 | 504.87 | 50.49 |
| 20 | 512 | 504.06 | 504.82 | 25.24 |
Pass criterion (ANE variant): p95(batch=20, seq=256) < 200 ms AND per-pair p95 < 15 ms. Matches the consumer's reranker latency budget.
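A quick arithmetic check of this criterion against the batch=20 / seq=256 row of the ANE table above:

```python
# batch=20, seq=256 row of the ANE benchmark table
p95_ms = 130.74
batch = 20

per_pair_p95 = p95_ms / batch  # 6.537 ms, reported as 6.54 in the table
assert p95_ms < 200            # whole-batch budget
assert per_pair_p95 < 15       # per-pair budget
print(round(per_pair_p95, 2))  # 6.54
```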
Quality regression eval
Validates that the FP32 → FP16 + Core ML conversion preserved upstream behavior. Scored by eval/quality_regression.py against the MTEB Reranking suite — the same benchmark family BAAI/bge-reranker-base is evaluated on. Pass criterion: |Δ nDCG@10| < 0.005 per task vs the FP32 reference. Variant equivalence: scores apply to both -ane and -cpugpu (the FP16 weights inside each .mlpackage are bit-identical; only compute_units differs at load).
MTEB Reranking — FP32 reference vs Core ML FP16
| Task | n queries | FP32 nDCG@10 | Core ML nDCG@10 | Δ nDCG@10 | FP32 MAP | Core ML MAP |
|---|---|---|---|---|---|---|
| scidocs-reranking | 3978 | 0.7410 | 0.7415 | +0.0005 | 0.6742 | 0.6743 |
Pass criterion: |Δ nDCG@10| < 0.005 per task. FP32 baseline is BAAI/bge-reranker-base loaded with attn_implementation="eager".
Note on absolute scale: the nDCG@10 reported here (0.74) reflects macro nDCG@10 over the test split's pre-ranked candidate pool (1 positive + ~29 negatives per query), which is structurally different from the full-corpus eval setup the BGE paper reports (0.84). Δ vs the FP32 reference on the same setup is the meaningful regression signal; the absolute number is not directly comparable to the upstream paper.
Failure modes the Swift consumer must handle
| Failure | Symptom | Recommended response |
|---|---|---|
| Download fails / hash mismatch | Hub.snapshot throws | Surface a one-line UI banner; reranker is bypassed; RRF order returned unchanged. |
| MLModel load fails (Intel Mac, missing ANE driver) | MLModelLoadError | Fall back to -cpugpu variant. If both fail, banner + RRF-only. |
| Per-query budget exceeded (>800 ms wall) | Cancel observed via Task.cancel | Return RRF order, log slow query. |
| Op fallback to CPU at runtime | Latency outliers in monitoring | Out of scope to detect from Swift; bench harness should catch this pre-publish via verify_ane.py. |
Known limitations
- Headline build is Apple Silicon only (-ane requires the Apple Neural Engine); Intel Macs must fall back to -cpugpu.
- Fixed batch size 20. Smaller batches waste compute on pad rows; larger batches need a re-conversion.
- English-language reranking only (the upstream model is English; XLM-R's vocab supports more languages but the reranker has not been fine-tuned for them).
- FP16 internally on the ANE path — extreme inputs may show small numerical drift from the FP32 PyTorch reference. Tested within 1e-3 absolute tolerance on 16 fixed pairs; see tests/test_numerical_equivalence.py.
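The shape of that tolerance check can be sketched as follows — the logit values below are made-up stand-ins, not the actual test fixtures from tests/test_numerical_equivalence.py:

```python
import numpy as np

# Stand-in FP32 reference logits and simulated Core ML FP16 output with small drift
ref_logits = np.array([3.1024, -2.0047, 0.5113], dtype=np.float32)
coreml_logits = ref_logits + np.float32(4e-4)  # drift well inside the 1e-3 budget

# Absolute tolerance only (rtol=0), matching the stated 1e-3 absolute criterion
ok = np.allclose(coreml_logits, ref_logits, atol=1e-3, rtol=0)
print(bool(ok))  # True
```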
References
- Source model: BAAI/bge-reranker-base
- BGE family papers: C-Pack: Packaged Resources To Advance General Chinese Embedding
- Apple Neural Engine + Core ML conversion: apple/ml-ane-transformers — the reference primitives (LayerNormANE, Conv2d-projection MultiHeadAttention) we vendor for the ANE rewrite.
- Apple Machine Learning Research — Deploying Transformers on the Apple Neural Engine.
How to reproduce
git clone https://github.com/tcashel/juice-bge-reranker-coreml
cd juice-bge-reranker-coreml
pixi install
pixi run convert # produces build/bge-reranker-base-{ane,cpugpu}.mlpackage
pixi run verify-ane build/bge-reranker-base-ane.mlpackage
pixi run bench --variants ane:build/bge-reranker-base-ane.mlpackage cpugpu:build/bge-reranker-base-cpugpu.mlpackage --update-model-card MODEL_CARD.md
pixi run test
Publishing requires HUGGINGFACE_TOKEN in env and --confirm:
pixi run python publish.py --variant both --tag v0.1 --confirm