Instructions to use lthn/lemer-mlx-bf16 with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- MLX
How to use lthn/lemer-mlx-bf16 with MLX:
# Make sure mlx-vlm is installed # pip install --upgrade mlx-vlm from mlx_vlm import load, generate from mlx_vlm.prompt_utils import apply_chat_template from mlx_vlm.utils import load_config # Load the model model, processor = load("lthn/lemer-mlx-bf16") config = load_config("lthn/lemer-mlx-bf16") # Prepare input image = ["http://images.cocodataset.org/val2017/000000039769.jpg"] prompt = "Describe this image." # Apply chat template formatted_prompt = apply_chat_template( processor, config, prompt, num_images=1 ) # Generate output output = generate(model, processor, formatted_prompt, image) print(output) - Notebooks
- Google Colab
- Kaggle
- Local Apps
- LM Studio
- Pi new
How to use lthn/lemer-mlx-bf16 with Pi:
Start the MLX server
# Install MLX LM: uv tool install mlx-lm # Start a local OpenAI-compatible server: mlx_lm.server --model "lthn/lemer-mlx-bf16"
Configure the model in Pi
# Install Pi: npm install -g @mariozechner/pi-coding-agent # Add to ~/.pi/agent/models.json: { "providers": { "mlx-lm": { "baseUrl": "http://localhost:8080/v1", "api": "openai-completions", "apiKey": "none", "models": [ { "id": "lthn/lemer-mlx-bf16" } ] } } }Run Pi
# Start Pi in your project directory: pi
- Hermes Agent new
How to use lthn/lemer-mlx-bf16 with Hermes Agent:
Start the MLX server
# Install MLX LM: uv tool install mlx-lm # Start a local OpenAI-compatible server: mlx_lm.server --model "lthn/lemer-mlx-bf16"
Configure Hermes
# Install Hermes: curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash hermes setup # Point Hermes at the local server: hermes config set model.provider custom hermes config set model.base_url http://127.0.0.1:8080/v1 hermes config set model.default lthn/lemer-mlx-bf16
Run Hermes
hermes
Lemer (MLX BF16) — Gemma 4 E2B + LEK
Full-precision MLX reference build of lemer — Gemma 4 E2B with the Lethean Ethical Kernel (LEK) merged into the text attention weights, converted to native MLX bfloat16 format for Apple Silicon. This is the reference model: no quantisation, full precision, maximum fidelity. The source from which lemer-mlx-q8 and lemer-mlx (Q4) are quantised.
Other formats in the Lemma family:
| Repo | Format | Size | Use case |
|---|---|---|---|
| lthn/lemer | HF + GGUF + MLX Q4 bundled | 3–9 GB per variant | Main consumer repo — everything in one place |
| lthn/lemer-mlx-bf16 | MLX BF16 | 10.2 GB | You are here — full-precision reference |
| lthn/lemer-mlx-q8 | MLX Q8 | 5.9 GB | Near-lossless quantised |
| lthn/lemer-mlx | MLX Q4 | 4.1 GB | On-device default |
| LetheanNetwork/lemer | HF BF16 (unmodified base) | 10.2 GB | Raw Google Gemma 4 E2B fork, no LEK |
What This Is
The Lethean Ethical Kernel (LEK) has been merged directly into the text attention projections (100 q/k/v/o_proj layers) of Gemma 4 E2B via LoRA finetune, then folded into the base weights so inference uses a single standalone model with no PEFT runtime required. The vision tower and audio tower are preserved unmodified from Google's upstream — LEK only shifts text reasoning.
This variant is the native MLX BF16 conversion of the merged model. Full precision, no quantisation error, 3 safetensor shards totalling ~10.2 GB. Loads directly via mlx-lm and mlx-vlm for native Apple Silicon inference.
Use this variant when:
- You want the highest-fidelity reference for research, benchmarking, or as the source for your own quantisation
- You have the memory budget (~12 GB runtime peak on M-series)
- You're comparing against other full-precision models and don't want quantisation error in the measurement
For everyday on-device inference, prefer lemer-mlx (Q4) at 4.1 GB. See the main lthn/lemer card for the full story.
Quick Start
mlx-lm (text)
uv tool install mlx-lm
mlx_lm.chat --model lthn/lemer-mlx-bf16
mlx_lm.generate --model lthn/lemer-mlx-bf16 --prompt "Hello, how are you?"
mlx-vlm (vision + audio multimodal)
uv tool install mlx-vlm
from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template
from mlx_vlm.utils import load_config
model, processor = load("lthn/lemer-mlx-bf16")
config = load_config("lthn/lemer-mlx-bf16")
image = ["http://images.cocodataset.org/val2017/000000039769.jpg"]
prompt = "Describe this image in one sentence."
formatted_prompt = apply_chat_template(
processor, config, prompt, num_images=1
)
output = generate(model, processor, formatted_prompt, image)
print(output.text)
mlx-vlm server (OpenAI-compatible API)
mlx_vlm.server --model lthn/lemer-mlx-bf16 --port 8080
Then any OpenAI-compatible client can hit http://localhost:8080/v1/chat/completions.
Note: use
mlx_vlm.server(notmlx_lm.server) because lemer is multimodal. The text-onlymlx_lm.serverdoes not correctly route the vision/audio tensors for Gemma 4.
Recommended Sampling
Per Google's Gemma 4 model card, use these across all use cases. Gemma 4 is calibrated for temperature=1.0 — greedy / temperature=0 is NOT recommended and will measurably underperform.
| Parameter | Value |
|---|---|
temperature |
1.0 |
top_p |
0.95 |
top_k |
64 |
These defaults are pre-configured in generation_config.json and will be picked up automatically by mlx-lm, mlx-vlm, and any OpenAI-compatible client.
Model Details
| Property | Value |
|---|---|
| Architecture | Gemma 4 E2B |
| Format | MLX BF16 |
| Parameters | 5.1B total, 2.3B effective (Per-Layer Embeddings) |
| Layers | 35 text decoder layers |
| Context Length | 128K tokens |
| Vocabulary | 262K tokens |
| Modalities | Text, Image, Audio |
| Vision Encoder | ~150M params (preserved unmodified from Google) |
| Audio Encoder | ~300M params (preserved unmodified from Google) |
| Weight files | 3 shards (model-00001-of-00003.safetensors, …, 10.2 GB total) |
| LEK delta | LoRA rank 8 merged into 100 text attention projections |
| Base fork | LetheanNetwork/lemer (unmodified Google fork) |
| Licence | EUPL-1.2 |
Full Model Card
Detailed documentation — Lemma family overview, GGUF variants, capability map, benchmarks, the "why EUPL-1.2" framing, and the Roadmap — lives on the main repo:
About Lethean
Lethean is a social enterprise building ethical AI infrastructure. The Lemma model family is part of the LEM (Lethean Ethical Model) project — training protocol and tooling for intrinsic ethical alignment of language models via consent-based LoRA finetunes, shipped EUPL-1.2 so the ethical layer stays in the open.
- Website: lthn.ai
- GitHub: LetheanNetwork
- Axioms (public domain): Snider/ai-ethics
- Licence: EUPL-1.2
- Downloads last month
- 4
Quantized
Model tree for lthn/lemer-mlx-bf16
Base model
google/gemma-4-E2B