bge-reranker-base: Core ML (.mlpackage) for Apple Silicon

Core ML port of BAAI/bge-reranker-base targeting the Apple Neural Engine on M-series Macs. Produced by the maintainer-side conversion tool at github.com/tcashel/juice-bge-reranker-coreml. Consumed by the Juice macOS app via swift-transformers.

This card is the integration contract. The Swift consumer relies on every section below; do not change a tensor name, shape, or token ID without bumping the variant tag (which the consumer pins in any per-model cache key, see below).

Requirements

  • Apple Silicon Mac (M1 / M2 / M3 / M4 / later). The headline -ane build requires the Apple Neural Engine.
  • macOS 15.0 (Sequoia) or later. This is the artifact's minimum_deployment_target; older macOS versions cannot load the .mlpackage.
  • Swift consumer: swift-transformers ≥ 1.3.0 for HubApi (snapshot download) and AutoTokenizer (XLM-R Unigram path). Direct MLModel load via CoreML also works.

Usage

End-to-end working examples live in the GitHub repo's examples/ directory; both load the artifact, score one (query, doc) pair, and print the sigmoid-mapped relevance.

Swift (swift-transformers + CoreML)

The canonical consumer pattern; mirrors what the Juice macOS app does. Full source at examples/swift/Sources/Predict/main.swift. Key steps:

import CoreML
import Hub
import Tokenizers

let repo = Hub.Repo(id: "tcashel/bge-reranker-base-coreml", type: .models)
let folder = try await HubApi.shared.snapshot(from: repo, revision: "v0.1-ane")
let tokenizer = try await AutoTokenizer.from(modelFolder: folder)

// XLM-R paired-input template (swift-transformers does not expose textPair for Unigram):
let bos: Int32 = 0, eos: Int32 = 2, pad: Int32 = 1
let q = tokenizer.encode(text: query, addSpecialTokens: false).map(Int32.init)
let d = tokenizer.encode(text: doc,   addSpecialTokens: false).map(Int32.init)
var ids: [Int32] = [bos] + q + [eos, eos] + d + [eos]
// ... pad to seq ∈ {128, 256, 512}, fill 20 batch rows with <pad>, then:

let config = MLModelConfiguration()
config.computeUnits = .cpuAndNeuralEngine
let model = try MLModel(contentsOf: folder.appendingPathComponent("model.mlpackage"), configuration: config)
// `provider` is an MLDictionaryFeatureProvider wrapping the padded (20, 1, 1, S) input_ids / attention_mask arrays
let prediction = try await model.prediction(from: provider)
let logit = Double(truncating: prediction.featureValue(for: "logit")!.multiArrayValue![[0, 0]])
let score = 1.0 / (1.0 + exp(-logit))

Run:

cd examples/swift
swift run Predict --tag v0.1-ane --query "what is the capital of france?" --doc "Paris is the capital of France."

Python (coremltools + transformers for tokenization)

For verifying the artifact end-to-end on macOS without a Swift toolchain. Full source at examples/predict.py:

import math, numpy as np
from coremltools.models import MLModel
from huggingface_hub import snapshot_download
from transformers import AutoTokenizer

folder = snapshot_download(repo_id="tcashel/bge-reranker-base-coreml", revision="v0.1-ane")
tokenizer = AutoTokenizer.from_pretrained(folder, use_fast=True)
model = MLModel(f"{folder}/model.mlpackage")

# Python's transformers builds the paired-input template internally:
enc = tokenizer(query, doc, padding="max_length", truncation=True, max_length=128, return_tensors="np")

# Pad up to the fixed batch=20 (read row 0 of the output, discard the rest):
ids  = np.full ((20, 1, 1, 128), 1, dtype=np.int32); ids [0, 0, 0, :] = enc["input_ids"][0]
mask = np.zeros((20, 1, 1, 128),    dtype=np.int32); mask[0, 0, 0, :] = enc["attention_mask"][0]

logit = float(model.predict({"input_ids": ids, "attention_mask": mask})["logit"][0, 0])
score = 1.0 / (1.0 + math.exp(-logit))

Run:

pixi run python examples/predict.py --source hub --tag v0.1-ane

Identity

  • Source model: BAAI/bge-reranker-base @ <source_revision_sha> (set by convert.py).
  • Conversion type: PyTorch (FP32) → Core ML .mlpackage (FP16) with the apple/ml-ane-transformers primitives (Conv2d 1×1 projections, BC1S layout, LayerNormANE) so the encoder lowers to the Apple Neural Engine. This is a precision reduction (FP32 → FP16) and format conversion, not integer quantization; there is no INT8/INT4 mapping. No fine-tuning, distillation, or weight pruning was applied; weights are bit-equivalent up to FP16 rounding.
  • Conversion stack: see <variant>_provenance.json published alongside the artifact (records exact torch / transformers / coremltools versions and host machine).
  • License: MIT (inherited from the upstream model).

Variants

  • v{X}-ane (computeUnits: cpuAndNeuralEngine). Headline build. The 12-layer encoder backbone (924 ops: einsum, conv, softmax, layer_norm, gelu, transpose, residual add/mul) runs on the Apple Neural Engine. 31 boundary ops (embedding gather over the 250k vocab, position-id arithmetic, mask construction, casts) dispatch to CPU; this is the Pareto frontier for XLM-RoBERTa-class models with very large vocabularies. verify_ane.py enforces this exact 924/31 residency fingerprint as a regression gate; any drift fails. M-series Macs only.
  • v{X}-cpugpu (computeUnits: cpuAndGPU). Known-good fallback: the same ANE port converted with compute_units=CPU_AND_GPU. Used by Swift if the -ane build fails to load (e.g. driver or macOS version mismatch).

The Swift caller pins the tag in HubApi.shared.snapshot(from: repo, revision: "<tag>") against tcashel/bge-reranker-base-coreml and embeds the same <tag> in any consumer-side cache key tied to model identity; rotating the tag invalidates downstream caches.
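
For illustration only (the key scheme and helper name below are hypothetical, not part of the contract), one way the consumer can fold the pinned tag into its cache key:

import CryptoKit
import Foundation

// Hypothetical sketch: including the pinned variant tag in the cache key means
// publishing a new tag automatically invalidates previously cached reranker scores.
let modelTag = "v0.1-ane"   // whatever revision the caller pins in snapshot(from:revision:)

func rerankCacheKey(query: String, docID: String) -> String {
    let digest = SHA256.hash(data: Data("\(query)\u{1F}\(docID)".utf8))
    let hex = digest.map { String(format: "%02x", $0) }.joined()
    return "rerank:\(modelTag):\(hex)"
}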

Repository layout. This repo uses git tags (not subdirectories or sibling repos) to distinguish variants: v{X}-ane and v{X}-cpugpu point to different commits, each containing exactly one variant's files at the repo root (one model.mlpackage, one set of tokenizer files, one provenance.json). The main branch reflects whichever variant was published last, so consumers should always pin to a specific tag rather than reading from main. This layout optimizes for the Swift consumer: HubApi.shared.snapshot(from:revision:) with a pinned tag returns a flat, ready-to-use directory.

Architecture

Heads-up: XLM-RoBERTa, not BERT. The encoder geometry is BERT-like (12L / 768H / 12 heads, GELU, post-LN), so a casual reader may pattern-match it as a BERT cross-encoder. It isn't. The upstream config.json declares model_type: xlm-roberta, architectures: ["XLMRobertaForSequenceClassification"]. The tokenizer and special-token IDs differ accordingly (see below); don't reach for [CLS]/[SEP].

  • 12 transformer encoder layers, hidden 768, 12 attention heads, intermediate FFN 3072.
  • Single-segment model (type_vocab_size = 1).
  • Classification head reads the <s> token (position 0): dense(768→768) → tanh → out_proj(768→1). No pooler.
  • Output: a single logit per pair. Apply sigmoid on the Swift side to get a relevance score in [0, 1].

Tokenizer

  • Class: XLMRobertaTokenizer (SentencePiece-Unigram). Consumed in Swift via swift-transformers' AutoTokenizer.from(modelFolder:), which dispatches to UnigramTokenizer for this tokenizer_class.

  • Files in this repo (under tokenizer/): tokenizer.json (the fast-tokenizer file Swift consumes), tokenizer_config.json, special_tokens_map.json, sentencepiece.bpe.model. All four are required: tokenizer.json is the load path; the others are belt and braces.

  • Special tokens:

    Token ID
    <s> (BOS / CLS-equivalent) 0
    <pad> 1
    </s> (EOS / SEP-equivalent) 2
    <unk> 3
    <mask> 250001
  • Padding: right-side pad with <pad> (id 1).

  • Vocab size: 250,002.

  • Max position embeddings: 514 (= 512 max content tokens + padding_idx + 1 offset).
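
For reference, a small Swift sketch mirroring the tokenizer contract above as consumer-side constants (the enum name is illustrative; the values are the ones listed in this section):

// Illustrative constants; only the numeric values come from the tokenizer contract above.
enum XLMRSpecial {
    static let bos: Int32  = 0         // <s>  (BOS / CLS-equivalent)
    static let pad: Int32  = 1         // <pad>, used for right-side padding
    static let eos: Int32  = 2         // </s> (EOS / SEP-equivalent)
    static let unk: Int32  = 3         // <unk>
    static let mask: Int32 = 250_001   // <mask>
    static let vocabSize    = 250_002
    static let maxPositions = 514      // 512 content tokens + padding_idx + 1 offset
}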

Paired-input template (must be constructed by the Swift consumer)

<s> {query} </s></s> {document} </s>

The doubled </s></s> separator is XLM-RoBERTa-specific (NOT the single BERT [SEP] you might expect from the encoder geometry). swift-transformers does not expose encode(text:textPair:) for the Unigram path, so the Swift consumer must assemble the pair itself: encode the query and the document separately with addSpecialTokens: false, then splice in the <s> / </s> token IDs exactly as in the Swift example above.

Truncation policy

If the tokenized template exceeds the target sequence length S, truncate the document side from the right. Never truncate the query; query terms drive both lexical and semantic match in the cross-encoder. Reserve 4 token slots for the special tokens (<s>, </s>, </s>, </s>):

max_doc_tokens = S - len(query_tokens) - 4

If max_doc_tokens <= 0, the query alone fills the budget: drop the document; the resulting score is essentially noise, so the consumer should down-weight or skip this candidate at the orchestrator.
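
A minimal Swift sketch of the template-plus-truncation step under the same assumptions as the Usage example (the tokenizer comes from AutoTokenizer.from(modelFolder:); the helper name is illustrative):

import Tokenizers

// Builds <s> {query} </s></s> {document} </s> as token IDs, right-truncating the
// document so the whole template fits the target length S. Returns nil when the
// query alone exhausts the budget (caller should skip or down-weight the candidate).
func pairTokenIds(query: String, doc: String, tokenizer: Tokenizer, targetLength S: Int) -> [Int32]? {
    let bos: Int32 = 0, eos: Int32 = 2
    let q = tokenizer.encode(text: query, addSpecialTokens: false).map(Int32.init)
    var d = tokenizer.encode(text: doc, addSpecialTokens: false).map(Int32.init)

    let maxDocTokens = S - q.count - 4          // reserve 4 slots: <s>, </s>, </s>, </s>
    guard maxDocTokens > 0 else { return nil }  // query fills the budget; drop the document
    if d.count > maxDocTokens { d = Array(d.prefix(maxDocTokens)) }  // never truncate the query

    return [bos] + q + [eos, eos] + d + [eos]   // caller right-pads with <pad> (id 1) up to S
}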

Input tensors (Core ML)

Both variants share the same input shape contract; they are the same architecture (the ANE-friendly port) converted with different compute_units. The (1, 1) middle dims are constant on the cpuAndGPU path (no overhead) and required by ANE's BC1S layout on the ANE path.

Name Dtype Shape Notes
input_ids Int32 (20, 1, 1, S) S ∈ {128, 256, 512} via EnumeratedShapes. Token IDs in [0, 250001].
attention_mask Int32 (20, 1, 1, S) 1 for real tokens, 0 for <pad>.

There is no token_type_ids input: type_vocab_size = 1, so the token-type embedding is constant and folded internally.

Batch is fixed at 20 (sized for the consumer's typical post-RRF candidate pool). Smaller actual batches must be padded with <pad> rows on the Swift side; the corresponding attention_mask rows should be all-zeros. The classification head still emits 20 logits; the consumer reads the first actual_batch of them and discards the rest.
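
A sketch of that padding step (assuming right-padded token-id rows from the template step above; flat row-major indexing into the (20, 1, 1, S) arrays):

import CoreML

// Packs up to 20 token-id sequences into the fixed-batch inputs. Unused rows are
// all-<pad> ids with an all-zero attention mask.
func makeInputs(rows: [[Int32]], seqLen S: Int) throws -> MLDictionaryFeatureProvider {
    precondition(rows.count <= 20, "batch is fixed at 20")
    let pad: Int32 = 1
    let shape = [20, 1, 1, S].map { NSNumber(value: $0) }
    let ids  = try MLMultiArray(shape: shape, dataType: .int32)
    let mask = try MLMultiArray(shape: shape, dataType: .int32)

    for r in 0..<20 {
        let row = r < rows.count ? rows[r] : []            // pad-only row for unused slots
        for c in 0..<S {
            let flat = r * S + c                           // row-major offset in (20, 1, 1, S)
            ids[flat]  = NSNumber(value: c < row.count ? row[c] : pad)
            mask[flat] = NSNumber(value: c < row.count ? 1 : 0)
        }
    }
    return try MLDictionaryFeatureProvider(dictionary: ["input_ids": ids, "attention_mask": mask])
}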

Output tensor

Name Dtype Shape Interpretation
logit Float32 (20, 1) Raw logit. Apply sigmoid to get relevance score in [0, 1].
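
A matching Swift sketch for reading the output (the function name is illustrative):

import CoreML
import Foundation

// Maps the (20, 1) logit output to relevance scores for the first actualBatch
// candidates; the remaining rows correspond to pad rows and are discarded.
func relevanceScores(from output: MLFeatureProvider, actualBatch: Int) -> [Double] {
    guard let logits = output.featureValue(for: "logit")?.multiArrayValue else { return [] }
    return (0..<actualBatch).map { row in
        let logit = logits[[NSNumber(value: row), 0]].doubleValue
        return 1.0 / (1.0 + exp(-logit))   // sigmoid: relevance in [0, 1]
    }
}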

Position-ID computation (informational)

Position IDs inside the model are computed as:

position_ids[i] = (arange(S) + 2) * attention_mask + 1 * (1 - attention_mask)

That is, real tokens get positions starting at 2 (= pad_token_id + 1) and pad tokens get position 1 (= pad_token_id). This is bit-exactly equivalent to HF's create_position_ids_from_input_ids when the input is right-padded, and avoids cumsum (which doesn't lower cleanly to ANE). The Swift consumer does not pass position IDs as a model input.
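
Also informational, a small pure-Swift sketch checking the closed form against the cumsum definition for a right-padded mask:

// Closed form used inside the model: real tokens get index + 2, pads get 1.
func positionIdsClosedForm(mask: [Int32]) -> [Int32] {
    mask.enumerated().map { (i, m) in m == 1 ? Int32(i) + 2 : 1 }
}

// HF-style definition: cumsum(mask) * mask + padding_idx (1).
func positionIdsCumsum(mask: [Int32]) -> [Int32] {
    var running: Int32 = 0
    return mask.map { m in
        running += m
        return running * m + 1
    }
}

let rightPadded: [Int32] = [1, 1, 1, 1, 0, 0, 0, 0]
// Both give [2, 3, 4, 5, 1, 1, 1, 1] for a right-padded mask.
assert(positionIdsClosedForm(mask: rightPadded) == positionIdsCumsum(mask: rightPadded))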

Performance

Measured by bench.py on the maintainer's machine (recorded under <variant>_provenance.json → machine). 50 warmup + 100 timed iterations per cell. Per-pair p95 = p95 / batch.

Variant: ane

batch seq p50 (ms) p95 (ms) per-pair p95 (ms)
1 128 50.45 52.47 52.47
4 128 50.34 51.63 12.91
10 128 50.53 51.95 5.19
20 128 51.24 52.46 2.62
1 256 127.76 128.99 128.99
4 256 128.50 129.16 32.29
10 256 129.70 131.15 13.12
20 256 129.46 130.74 6.54
1 512 344.20 346.74 346.74
4 512 343.03 346.89 86.72
10 512 343.46 345.43 34.54
20 512 346.01 348.64 17.43

Variant: cpugpu

batch seq p50 (ms) p95 (ms) per-pair p95 (ms)
1 128 122.79 123.10 123.10
4 128 123.01 123.34 30.83
10 128 123.13 123.46 12.35
20 128 122.69 138.34 6.92
1 256 242.07 242.87 242.87
4 256 241.94 242.82 60.70
10 256 242.10 243.17 24.32
20 256 242.16 243.11 12.16
1 512 503.81 504.98 504.98
4 512 503.97 506.10 126.53
10 512 503.95 504.87 50.49
20 512 504.06 504.82 25.24

Pass criterion (ANE variant): p95(batch=20, seq=256) < 200 ms AND per-pair p95 < 15 ms. Matches the consumer's reranker latency budget.

Quality regression eval

Validates that the FP32 → FP16 + Core ML conversion preserved upstream behavior. Scored by eval/quality_regression.py against the MTEB Reranking suite (the same benchmark family BAAI/bge-reranker-base is evaluated on). Pass criterion: |Δ nDCG@10| < 0.005 per task vs the FP32 reference. Variant equivalence: scores apply to both -ane and -cpugpu (the FP16 weights inside each .mlpackage are bit-identical; only compute_units differs at load).

MTEB Reranking: FP32 reference vs Core ML FP16

Task n queries FP32 nDCG@10 Core ML nDCG@10 Δ nDCG@10 FP32 MAP Core ML MAP
scidocs-reranking 3978 0.7410 0.7415 +0.0005 0.6742 0.6743

Pass criterion: |Δ nDCG@10| < 0.005 per task. FP32 baseline is BAAI/bge-reranker-base loaded with attn_implementation="eager".

Note on absolute scale: the nDCG@10 reported here (0.74) reflects macro nDCG@10 over the test split's pre-ranked candidate pool (1 positive + ~29 negatives per query), which is structurally different from the full-corpus eval setup the BGE paper reports (0.84). Δ vs the FP32 reference on the same setup is the meaningful regression signal; the absolute number is not directly comparable to the upstream paper.

Failure modes the Swift consumer must handle

  • Download fails / hash mismatch. Symptom: Hub.snapshot throws. Response: surface a one-line UI banner; the reranker is bypassed and the RRF order is returned unchanged.
  • MLModel load fails (Intel Mac, missing ANE driver). Symptom: MLModelLoadError. Response: fall back to the -cpugpu variant; if both fail, banner + RRF-only.
  • Per-query budget exceeded (> 800 ms wall). Symptom: cancellation observed via Task.cancel. Response: return the RRF order and log the slow query.
  • Op fallback to CPU at runtime. Symptom: latency outliers in monitoring. Response: out of scope to detect from Swift; the bench harness catches this pre-publish via verify_ane.py.
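
A sketch of that fallback chain (the tag strings and helper are illustrative; the repo id and call signatures match the Usage section):

import CoreML
import Hub

// Try the -ane build on the Neural Engine first, then the -cpugpu build; if both
// fail, return nil and let the caller keep the RRF order (reranker bypassed).
func loadReranker() async -> MLModel? {
    let repo = Hub.Repo(id: "tcashel/bge-reranker-base-coreml", type: .models)
    let attempts: [(tag: String, units: MLComputeUnits)] = [
        ("v0.1-ane", .cpuAndNeuralEngine),
        ("v0.1-cpugpu", .cpuAndGPU),
    ]
    for attempt in attempts {
        do {
            let folder = try await HubApi.shared.snapshot(from: repo, revision: attempt.tag)
            let config = MLModelConfiguration()
            config.computeUnits = attempt.units
            return try MLModel(contentsOf: folder.appendingPathComponent("model.mlpackage"),
                               configuration: config)
        } catch {
            continue   // download or load failed; fall through to the next variant
        }
    }
    return nil
}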

Known limitations

  • Apple Silicon only (-ane requires the Apple Neural Engine; Intel Macs must use -cpugpu).
  • Fixed batch size 20. Smaller batches waste compute on pad rows; larger batches need a re-conversion.
  • English-language reranking only (the upstream model is English; XLM-R's vocab supports more languages but the reranker has not been fine-tuned for them).
  • FP16 internally on the ANE path: extreme inputs may show small numerical drift from the FP32 PyTorch reference. Tested within 1e-3 absolute tolerance on 16 fixed pairs; see tests/test_numerical_equivalence.py.

How to reproduce

git clone https://github.com/tcashel/juice-bge-reranker-coreml
cd juice-bge-reranker-coreml
pixi install
pixi run convert         # produces build/bge-reranker-base-{ane,cpugpu}.mlpackage
pixi run verify-ane build/bge-reranker-base-ane.mlpackage
pixi run bench --variants ane:build/bge-reranker-base-ane.mlpackage cpugpu:build/bge-reranker-base-cpugpu.mlpackage --update-model-card MODEL_CARD.md
pixi run test

Publishing requires HUGGINGFACE_TOKEN in env and --confirm:

pixi run python publish.py --variant both --tag v0.1 --confirm