bge-reranker-base: Core ML (.mlpackage) for Apple Silicon

Core ML port of BAAI/bge-reranker-base targeting the Apple Neural Engine on M-series Macs. Produced by the maintainer-side conversion tool at github.com/tcashel/juice-bge-reranker-coreml. Consumed by the Juice macOS app via swift-transformers.

This card is the integration contract. The Swift consumer relies on every section below; do not change a tensor name, shape, or token ID without bumping the variant tag (which the consumer pins in any per-model cache key, see below).

Requirements

  • Apple Silicon Mac (M1 / M2 / M3 / M4 / later). The headline -ane build requires the Apple Neural Engine.
  • macOS 15.0 (Sequoia) or later. This is the artifact's minimum_deployment_target; older macOS versions cannot load the .mlpackage.
  • Swift consumer: swift-transformers ≥ 1.3.0 for HubApi (snapshot download) and AutoTokenizer (XLM-R Unigram path). Direct MLModel load via CoreML also works.

Usage

End-to-end working examples live in the GitHub repo's examples/ directory; both load the artifact, score one (query, doc) pair, and print the sigmoid-mapped relevance.

Swift (swift-transformers + CoreML)

The canonical consumer pattern; mirrors what the Juice macOS app does. Full source at examples/swift/Sources/Predict/main.swift. Key steps:

import CoreML
import Hub
import Tokenizers

let repo = Hub.Repo(id: "tcashel/bge-reranker-base-coreml", type: .models)
let folder = try await HubApi.shared.snapshot(from: repo, revision: "v0.1-ane")
let tokenizer = try await AutoTokenizer.from(modelFolder: folder)

// XLM-R paired-input template (swift-transformers does not expose textPair for Unigram):
let bos: Int32 = 0, eos: Int32 = 2, pad: Int32 = 1
let q = tokenizer.encode(text: query, addSpecialTokens: false).map(Int32.init)
let d = tokenizer.encode(text: doc,   addSpecialTokens: false).map(Int32.init)
var ids: [Int32] = [bos] + q + [eos, eos] + d + [eos]
// ... pad to seq ∈ {128, 256, 512}, fill 20 batch rows with <pad>, then:

let config = MLModelConfiguration()
config.computeUnits = .cpuAndNeuralEngine
let model = try MLModel(contentsOf: folder.appendingPathComponent("model.mlpackage"), configuration: config)
// `provider` is an MLDictionaryFeatureProvider wrapping the padded (20, 1, 1, S) input_ids / attention_mask arrays
let prediction = try await model.prediction(from: provider)
let logit = Double(truncating: prediction.featureValue(for: "logit")!.multiArrayValue![[0, 0]])
let score = 1.0 / (1.0 + exp(-logit))

Run:

cd examples/swift
swift run Predict --tag v0.1-ane --query "what is the capital of france?" --doc "Paris is the capital of France."

Python (coremltools + transformers for tokenization)

For verifying the artifact end-to-end on macOS without a Swift toolchain. Full source at examples/predict.py:

import math, numpy as np
from coremltools.models import MLModel
from huggingface_hub import snapshot_download
from transformers import AutoTokenizer

folder = snapshot_download(repo_id="tcashel/bge-reranker-base-coreml", revision="v0.1-ane")
tokenizer = AutoTokenizer.from_pretrained(folder, use_fast=True)
model = MLModel(f"{folder}/model.mlpackage")

# Python's transformers builds the paired-input template internally:
enc = tokenizer(query, doc, padding="max_length", truncation=True, max_length=128, return_tensors="np")

# Pad up to the fixed batch=20 (read row 0 of the output, discard the rest):
ids  = np.full ((20, 1, 1, 128), 1, dtype=np.int32); ids [0, 0, 0, :] = enc["input_ids"][0]
mask = np.zeros((20, 1, 1, 128),    dtype=np.int32); mask[0, 0, 0, :] = enc["attention_mask"][0]

logit = float(model.predict({"input_ids": ids, "attention_mask": mask})["logit"][0, 0])
score = 1.0 / (1.0 + math.exp(-logit))

Run:

pixi run python examples/predict.py --source hub --tag v0.1-ane

Identity

  • Source model: BAAI/bge-reranker-base @ <source_revision_sha> (set by convert.py).
  • Conversion type: PyTorch (FP32) → Core ML .mlpackage (FP16) with the apple/ml-ane-transformers primitives (Conv2d 1×1 projections, BC1S layout, LayerNormANE) so the encoder lowers to the Apple Neural Engine. This is a precision reduction (FP32 → FP16) and format conversion, not integer quantization; there is no INT8/INT4 mapping. No fine-tuning, distillation, or weight pruning was applied; weights are bit-equivalent up to FP16 rounding.
  • Conversion stack: see <variant>_provenance.json published alongside the artifact (records exact torch / transformers / coremltools versions and host machine).
  • License: MIT (inherited from the upstream model).

Variants

  • v{X}-ane (computeUnits: cpuAndNeuralEngine). Headline build. The 12-layer encoder backbone (924 ops: einsum, conv, softmax, layer_norm, gelu, transpose, residual add/mul) runs on the Apple Neural Engine. 31 boundary ops (embedding gather over the 250k vocab, position-id arithmetic, mask construction, casts) dispatch to CPU; this is the Pareto frontier for XLM-RoBERTa-class models with very large vocabularies. verify_ane.py enforces this exact 924/31 residency fingerprint as a regression gate; any drift fails. M-series Macs only.
  • v{X}-cpugpu (computeUnits: cpuAndGPU). Known-good fallback: the same ANE port converted with compute_units=CPU_AND_GPU. Used by Swift if the -ane build fails to load (e.g. driver or macOS version mismatch).

The Swift caller pins the tag in HubApi.shared.snapshot(from: repo, revision: "<tag>") against tcashel/bge-reranker-base-coreml and embeds the same <tag> in any consumer-side cache key tied to model identity; rotating the tag invalidates downstream caches.
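
For illustration only (the key scheme and helper name below are hypothetical, not part of the contract), one way the consumer can fold the pinned tag into its cache key:

import CryptoKit
import Foundation

// Hypothetical sketch: including the pinned variant tag in the cache key means
// publishing a new tag automatically invalidates previously cached reranker scores.
let modelTag = "v0.1-ane"   // whatever revision the caller pins in snapshot(from:revision:)

func rerankCacheKey(query: String, docID: String) -> String {
    let digest = SHA256.hash(data: Data("\(query)\u{1F}\(docID)".utf8))
    let hex = digest.map { String(format: "%02x", $0) }.joined()
    return "rerank:\(modelTag):\(hex)"
}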

Repository layout. This repo uses git tags (not subdirectories or sibling repos) to distinguish variants: v{X}-ane and v{X}-cpugpu point to different commits, each containing exactly one variant's files at the repo root (one model.mlpackage, one set of tokenizer files, one provenance.json). The main branch reflects whichever variant was published last, so consumers should always pin to a specific tag rather than reading from main. This layout optimizes for the Swift consumer: HubApi.shared.snapshot(from:revision:) with a pinned tag returns a flat, ready-to-use directory.

Architecture

Heads-up: XLM-RoBERTa, not BERT. The encoder geometry is BERT-like (12L / 768H / 12 heads, GELU, post-LN), so a casual reader may pattern-match it as a BERT cross-encoder. It isn't. The upstream config.json declares model_type: xlm-roberta, architectures: ["XLMRobertaForSequenceClassification"]. The tokenizer and special-token IDs differ accordingly (see below); don't reach for [CLS]/[SEP].

  • 12 transformer encoder layers, hidden 768, 12 attention heads, intermediate FFN 3072.
  • Single-segment model (type_vocab_size = 1).
  • Classification head reads the <s> token (position 0): dense(768→768) → tanh → out_proj(768→1). No pooler.
  • Output: a single logit per pair. Apply sigmoid on the Swift side to get a relevance score in [0, 1].

Tokenizer

  • Class: XLMRobertaTokenizer (SentencePiece-Unigram). Consumed in Swift via swift-transformers' AutoTokenizer.from(modelFolder:), which dispatches to UnigramTokenizer for this tokenizer_class.

  • Files in this repo (under tokenizer/): tokenizer.json (the fast-tokenizer file Swift consumes), tokenizer_config.json, special_tokens_map.json, sentencepiece.bpe.model. All four are required: tokenizer.json is the load path; the others are belt and braces.

  • Special tokens:

    Token ID
    <s> (BOS / CLS-equivalent) 0
    <pad> 1
    </s> (EOS / SEP-equivalent) 2
    <unk> 3
    <mask> 250001
  • Padding: right-side pad with <pad> (id 1).

  • Vocab size: 250,002.

  • Max position embeddings: 514 (= 512 max content tokens + padding_idx + 1 offset).
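
For reference, a small Swift sketch mirroring the tokenizer contract above as consumer-side constants (the enum name is illustrative; the values are the ones listed in this section):

// Illustrative constants; only the numeric values come from the tokenizer contract above.
enum XLMRSpecial {
    static let bos: Int32  = 0         // <s>  (BOS / CLS-equivalent)
    static let pad: Int32  = 1         // <pad>, used for right-side padding
    static let eos: Int32  = 2         // </s> (EOS / SEP-equivalent)
    static let unk: Int32  = 3         // <unk>
    static let mask: Int32 = 250_001   // <mask>
    static let vocabSize    = 250_002
    static let maxPositions = 514      // 512 content tokens + padding_idx + 1 offset
}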

Paired-input template (must be constructed by the Swift consumer)

<s> {query} </s></s> {document} </s>

The doubled </s></s> separator is XLM-RoBERTa-specific (NOT the single BERT [SEP] you might expect from the encoder geometry). swift-transformers does not expose encode(text:textPair:) for the Unigram path, so the Swift consumer must assemble the pair itself: encode the query and the document separately with addSpecialTokens: false, then splice in the <s> / </s> token IDs exactly as in the Swift example above.

Truncation policy

If the tokenized template exceeds the target sequence length S, truncate the document side from the right. Never truncate the query; query terms drive both lexical and semantic match in the cross-encoder. Reserve 4 token slots for the special tokens (<s>, </s>, </s>, </s>):

max_doc_tokens = S - len(query_tokens) - 4

If max_doc_tokens <= 0, the query alone fills the budget: drop the document; the resulting score is essentially noise, so the consumer should down-weight or skip this candidate at the orchestrator.
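
A minimal Swift sketch of the template-plus-truncation step under the same assumptions as the Usage example (the tokenizer comes from AutoTokenizer.from(modelFolder:); the helper name is illustrative):

import Tokenizers

// Builds <s> {query} </s></s> {document} </s> as token IDs, right-truncating the
// document so the whole template fits the target length S. Returns nil when the
// query alone exhausts the budget (caller should skip or down-weight the candidate).
func pairTokenIds(query: String, doc: String, tokenizer: Tokenizer, targetLength S: Int) -> [Int32]? {
    let bos: Int32 = 0, eos: Int32 = 2
    let q = tokenizer.encode(text: query, addSpecialTokens: false).map(Int32.init)
    var d = tokenizer.encode(text: doc, addSpecialTokens: false).map(Int32.init)

    let maxDocTokens = S - q.count - 4          // reserve 4 slots: <s>, </s>, </s>, </s>
    guard maxDocTokens > 0 else { return nil }  // query fills the budget; drop the document
    if d.count > maxDocTokens { d = Array(d.prefix(maxDocTokens)) }  // never truncate the query

    return [bos] + q + [eos, eos] + d + [eos]   // caller right-pads with <pad> (id 1) up to S
}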

Input tensors (Core ML)

Both variants share the same input shape contract; they are the same architecture (the ANE-friendly port) converted with different compute_units. The (1, 1) middle dims are constant on the cpuAndGPU path (no overhead) and required by ANE's BC1S layout on the ANE path.

Name Dtype Shape Notes
input_ids Int32 (20, 1, 1, S) S ∈ {128, 256, 512} via EnumeratedShapes. Token IDs in [0, 250001].
attention_mask Int32 (20, 1, 1, S) 1 for real tokens, 0 for <pad>.

There is no token_type_ids input: type_vocab_size = 1, so the token-type embedding is constant and folded internally.

Batch is fixed at 20 (sized for the consumer's typical post-RRF candidate pool). Smaller actual batches must be padded with <pad> rows on the Swift side; the corresponding attention_mask rows should be all-zeros. The classification head still emits 20 logits; the consumer reads the first actual_batch of them and discards the rest.
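
A sketch of that padding step (assuming right-padded token-id rows from the template step above; flat row-major indexing into the (20, 1, 1, S) arrays):

import CoreML

// Packs up to 20 token-id sequences into the fixed-batch inputs. Unused rows are
// all-<pad> ids with an all-zero attention mask.
func makeInputs(rows: [[Int32]], seqLen S: Int) throws -> MLDictionaryFeatureProvider {
    precondition(rows.count <= 20, "batch is fixed at 20")
    let pad: Int32 = 1
    let shape = [20, 1, 1, S].map { NSNumber(value: $0) }
    let ids  = try MLMultiArray(shape: shape, dataType: .int32)
    let mask = try MLMultiArray(shape: shape, dataType: .int32)

    for r in 0..<20 {
        let row = r < rows.count ? rows[r] : []            // pad-only row for unused slots
        for c in 0..<S {
            let flat = r * S + c                           // row-major offset in (20, 1, 1, S)
            ids[flat]  = NSNumber(value: c < row.count ? row[c] : pad)
            mask[flat] = NSNumber(value: c < row.count ? 1 : 0)
        }
    }
    return try MLDictionaryFeatureProvider(dictionary: ["input_ids": ids, "attention_mask": mask])
}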

Output tensor

Name Dtype Shape Interpretation
logit Float32 (20, 1) Raw logit. Apply sigmoid to get relevance score in [0, 1].
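
A matching Swift sketch for reading the output (the function name is illustrative):

import CoreML
import Foundation

// Maps the (20, 1) logit output to relevance scores for the first actualBatch
// candidates; the remaining rows correspond to pad rows and are discarded.
func relevanceScores(from output: MLFeatureProvider, actualBatch: Int) -> [Double] {
    guard let logits = output.featureValue(for: "logit")?.multiArrayValue else { return [] }
    return (0..<actualBatch).map { row in
        let logit = logits[[NSNumber(value: row), 0]].doubleValue
        return 1.0 / (1.0 + exp(-logit))   // sigmoid: relevance in [0, 1]
    }
}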

Position-ID computation (informational)

Position IDs inside the model are computed as:

position_ids[i] = (arange(S) + 2) * attention_mask + 1 * (1 - attention_mask)

That is, real tokens get positions starting at 2 (= pad_token_id + 1) and pad tokens get position 1 (= pad_token_id). This is bit-exactly equivalent to HF's create_position_ids_from_input_ids when the input is right-padded, and avoids cumsum (which doesn't lower cleanly to ANE). The Swift consumer does not pass position IDs as a model input.
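
Also informational, a small pure-Swift sketch checking the closed form against the cumsum definition for a right-padded mask:

// Closed form used inside the model: real tokens get index + 2, pads get 1.
func positionIdsClosedForm(mask: [Int32]) -> [Int32] {
    mask.enumerated().map { (i, m) in m == 1 ? Int32(i) + 2 : 1 }
}

// HF-style definition: cumsum(mask) * mask + padding_idx (1).
func positionIdsCumsum(mask: [Int32]) -> [Int32] {
    var running: Int32 = 0
    return mask.map { m in
        running += m
        return running * m + 1
    }
}

let rightPadded: [Int32] = [1, 1, 1, 1, 0, 0, 0, 0]
// Both give [2, 3, 4, 5, 1, 1, 1, 1] for a right-padded mask.
assert(positionIdsClosedForm(mask: rightPadded) == positionIdsCumsum(mask: rightPadded))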

Performance

Measured by bench.py on the maintainer's machine (recorded under <variant>_provenance.json → machine). 50 warmup + 100 timed iterations per cell. Per-pair p95 = p95 / batch.

Variant: ane

batch seq p50 (ms) p95 (ms) per-pair p95 (ms)
1 128 50.45 52.47 52.47
4 128 50.34 51.63 12.91
10 128 50.53 51.95 5.19
20 128 51.24 52.46 2.62
1 256 127.76 128.99 128.99
4 256 128.50 129.16 32.29
10 256 129.70 131.15 13.12
20 256 129.46 130.74 6.54
1 512 344.20 346.74 346.74
4 512 343.03 346.89 86.72
10 512 343.46 345.43 34.54
20 512 346.01 348.64 17.43

Variant: cpugpu

batch seq p50 (ms) p95 (ms) per-pair p95 (ms)
1 128 122.79 123.10 123.10
4 128 123.01 123.34 30.83
10 128 123.13 123.46 12.35
20 128 122.69 138.34 6.92
1 256 242.07 242.87 242.87
4 256 241.94 242.82 60.70
10 256 242.10 243.17 24.32
20 256 242.16 243.11 12.16
1 512 503.81 504.98 504.98
4 512 503.97 506.10 126.53
10 512 503.95 504.87 50.49
20 512 504.06 504.82 25.24

Pass criterion (ANE variant): p95(batch=20, seq=256) < 200 ms AND per-pair p95 < 15 ms. Matches the consumer's reranker latency budget.

Quality regression eval

Validates that the FP32 → FP16 + Core ML conversion preserved upstream behavior. Scored by eval/quality_regression.py against the MTEB Reranking suite (the same benchmark family BAAI/bge-reranker-base is evaluated on). Pass criterion: |Δ nDCG@10| < 0.005 per task vs the FP32 reference. Variant equivalence: scores apply to both -ane and -cpugpu (the FP16 weights inside each .mlpackage are bit-identical; only compute_units differs at load).

MTEB Reranking: FP32 reference vs Core ML FP16

Task n queries FP32 nDCG@10 Core ML nDCG@10 Δ nDCG@10 FP32 MAP Core ML MAP
scidocs-reranking 3978 0.7410 0.7415 +0.0005 0.6742 0.6743

Pass criterion: |Δ nDCG@10| < 0.005 per task. FP32 baseline is BAAI/bge-reranker-base loaded with attn_implementation="eager".

Note on absolute scale: the nDCG@10 reported here (0.74) reflects macro nDCG@10 over the test split's pre-ranked candidate pool (1 positive + ~29 negatives per query), which is structurally different from the full-corpus eval setup the BGE paper reports (0.84). Δ vs the FP32 reference on the same setup is the meaningful regression signal; the absolute number is not directly comparable to the upstream paper.

Failure modes the Swift consumer must handle

  • Download fails / hash mismatch. Symptom: Hub.snapshot throws. Response: surface a one-line UI banner; the reranker is bypassed and the RRF order is returned unchanged.
  • MLModel load fails (Intel Mac, missing ANE driver). Symptom: MLModelLoadError. Response: fall back to the -cpugpu variant; if both fail, banner + RRF-only.
  • Per-query budget exceeded (> 800 ms wall). Symptom: cancellation observed via Task.cancel. Response: return the RRF order and log the slow query.
  • Op fallback to CPU at runtime. Symptom: latency outliers in monitoring. Response: out of scope to detect from Swift; the bench harness catches this pre-publish via verify_ane.py.
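
A sketch of that fallback chain (the tag strings and helper are illustrative; the repo id and call signatures match the Usage section):

import CoreML
import Hub

// Try the -ane build on the Neural Engine first, then the -cpugpu build; if both
// fail, return nil and let the caller keep the RRF order (reranker bypassed).
func loadReranker() async -> MLModel? {
    let repo = Hub.Repo(id: "tcashel/bge-reranker-base-coreml", type: .models)
    let attempts: [(tag: String, units: MLComputeUnits)] = [
        ("v0.1-ane", .cpuAndNeuralEngine),
        ("v0.1-cpugpu", .cpuAndGPU),
    ]
    for attempt in attempts {
        do {
            let folder = try await HubApi.shared.snapshot(from: repo, revision: attempt.tag)
            let config = MLModelConfiguration()
            config.computeUnits = attempt.units
            return try MLModel(contentsOf: folder.appendingPathComponent("model.mlpackage"),
                               configuration: config)
        } catch {
            continue   // download or load failed; fall through to the next variant
        }
    }
    return nil
}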

Known limitations

  • Apple Silicon only (-ane requires the Apple Neural Engine; Intel Macs must use -cpugpu).
  • Fixed batch size 20. Smaller batches waste compute on pad rows; larger batches need a re-conversion.
  • English-language reranking only (the upstream model is English; XLM-R's vocab supports more languages but the reranker has not been fine-tuned for them).
  • FP16 internally on the ANE path: extreme inputs may show small numerical drift from the FP32 PyTorch reference. Tested within 1e-3 absolute tolerance on 16 fixed pairs; see tests/test_numerical_equivalence.py.

How to reproduce

git clone https://github.com/tcashel/juice-bge-reranker-coreml
cd juice-bge-reranker-coreml
pixi install
pixi run convert         # produces build/bge-reranker-base-{ane,cpugpu}.mlpackage
pixi run verify-ane build/bge-reranker-base-ane.mlpackage
pixi run bench --variants ane:build/bge-reranker-base-ane.mlpackage cpugpu:build/bge-reranker-base-cpugpu.mlpackage --update-model-card MODEL_CARD.md
pixi run test

Publishing requires HUGGINGFACE_TOKEN in env and --confirm:

pixi run python publish.py --variant both --tag v0.1 --confirm