# EmberNet-Trial – BitNet b1.58 MoE VLM

**Status:** Trial run – Stage 2/2, Epoch 1/1, Loss 4.4079
EmberNet is a tiny but capable Vision-Language Model built for edge deployment and domain-expert reasoning. It combines a frozen SigLIP vision backbone with a BitNet b1.58 ternary-quantized Mixture-of-Experts language decoder, achieving ~3× memory reduction over a full-precision equivalent while preserving strong visual understanding across 8 specialised domains.
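The b1.58 scheme constrains every weight to {−1, 0, +1}. A minimal sketch of the absmean ternary quantisation described in the BitNet b1.58 literature is shown below; this is an illustration of the technique, not the repository's actual kernel, and the function name is hypothetical.

```python
import torch

def ternary_quantize(w: torch.Tensor, eps: float = 1e-5):
    """Absmean ternary quantisation (BitNet b1.58 style).

    Each weight is divided by the mean absolute value of the tensor,
    then rounded and clipped to {-1, 0, +1}. Returns the ternary
    tensor and the scale needed to dequantise at inference time.
    """
    scale = w.abs().mean().clamp(min=eps)
    w_q = (w / scale).round().clamp(-1, 1)
    return w_q, scale

w = torch.tensor([[0.8, -0.05, -1.2], [0.3, 0.0, 2.0]])
w_q, scale = ternary_quantize(w)
# w_q == [[1., 0., -1.], [0., 0., 1.]], scale == 0.725
```

Because each weight needs fewer than 2 bits plus one shared FP scale per tensor, this is where the ~3× memory reduction over FP16 comes from.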
## Model Details
| Property | Value |
|---|---|
| Model type | Vision-Language Model (VLM) |
| Quantisation | BitNet b1.58 (ternary weights: −1, 0, +1) |
| Total parameters | 840.8 M |
| Trainable parameters | 723.3 M |
| Active parameters / forward | ~235.4 M (top-2 routing) |
| Carbon footprint | 0.0624 kg CO₂eq |
| Training stage | Stage 2/2 – Expert SFT |
| Epoch | 1/1 |
| Best loss | 4.4079 |
| Last updated | 2026-02-25 15:21 UTC |
## Architecture

```
EmberNet VLM
├── Vision Encoder (frozen)
│   ├── SigLIP-base-patch16-224                      ~92.9 M params
│   ├── Token Compressor (pixel-shuffle + pooling)   → 64 tokens
│   ├── Spatial Pooler                               ~2.4 M params
│   └── BitLinear Projector                          ~10.1 M params
│
└── BitNet b1.58 MoE Decoder
    ├── Layers: 16   Hidden: 768   Heads: 12 (GQA kv=6)
    ├── Experts: 8 domain specialists + 1 shared expert (always active)
    ├── Routing: Top-2 per token
    └── Quantisation: ternary weights, 4-bit activations
```
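The decoder's routing can be pictured as follows: each token goes to its top-2 experts by router score, while the shared expert processes every token unconditionally. This is an illustrative sketch only — `moe_forward` and the plain `Linear` experts are stand-ins, not the repository's actual modules.

```python
import torch

def moe_forward(x, router, experts, shared_expert, top_k: int = 2):
    """Illustrative top-2 MoE forward pass with an always-active shared expert.

    x: (tokens, hidden). Router scores pick top-2 experts per token; their
    outputs are mixed by renormalised softmax weights and added to the
    shared expert's output, which runs on every token.
    """
    probs = router(x).softmax(dim=-1)                      # (tokens, n_experts)
    weights, idx = torch.topk(probs, k=top_k, dim=-1)
    weights = weights / weights.sum(dim=-1, keepdim=True)  # renormalise top-2

    out = shared_expert(x)                                 # always active
    for k in range(top_k):
        for e, expert in enumerate(experts):
            mask = idx[:, k] == e                          # tokens routed here
            if mask.any():
                out[mask] = out[mask] + weights[mask, k:k + 1] * expert(x[mask])
    return out

hidden, n_experts = 768, 8
router = torch.nn.Linear(hidden, n_experts)
experts = [torch.nn.Linear(hidden, hidden) for _ in range(n_experts)]
shared = torch.nn.Linear(hidden, hidden)
y = moe_forward(torch.randn(4, hidden), router, experts, shared)
```

With top-2 routing, only 2 of the 8 specialist FFNs run per token (plus the shared expert), which is why only ~235.4 M of the 840.8 M parameters are active per forward pass.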
## Expert Domains

| ID | Expert | Trained on |
|---|---|---|
| 0 | `vision_ocr` | TextVQA, DocVQA, OCR-VQA, InfoVQA |
| 1 | `vision_diagram` | AI2D, InfoVQA diagrams |
| 2 | `code_math_chart` | ChartQA, PlotQA, FigureQA, DVQA |
| 3 | `code_math_formula` | MathVista, math formula datasets |
| 4 | `spatial_scene` | VQAv2, GQA, Visual Genome |
| 5 | `spatial_reasoning` | RefCOCO, GQA spatial splits |
| 6 | `agentic_knowledge` | OK-VQA, A-OKVQA |
| 7 | `agentic_reasoning` | ScienceQA, CLEVR |
| – | `shared` | All domains (always active) |
## Training

### Configuration

```yaml
stage_1_projector_alignment:
  epochs: 3
  batch_size: 8            # effective 32 with grad-accum 4
  learning_rate: 1e-4
  trainable: vision projector + compressor + pooler only

stage_2_expert_sft:
  epochs: 10
  batch_size: 4            # effective 16 with grad-accum 4
  learning_rate: 3e-4
  trainable: router + all 8 expert FFNs + shared expert
  expert_supervision_weight: 0.1
```
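The `expert_supervision_weight: 0.1` suggests Stage 2 adds a weighted auxiliary term that steers the router toward each sample's annotated domain expert. A hedged sketch of one plausible composition follows — `stage2_loss` and its argument shapes are assumptions, and the actual loss may differ.

```python
import torch
import torch.nn.functional as F

def stage2_loss(lm_logits, labels, router_logits, expert_labels,
                expert_supervision_weight: float = 0.1):
    """Hypothetical Stage-2 objective: language-modelling cross-entropy plus
    a weighted expert-supervision term nudging the router toward each
    sample's labelled domain expert (weight 0.1 per the config above)."""
    lm_loss = F.cross_entropy(lm_logits.view(-1, lm_logits.size(-1)),
                              labels.view(-1), ignore_index=-100)
    expert_loss = F.cross_entropy(router_logits, expert_labels)
    return lm_loss + expert_supervision_weight * expert_loss

lm_logits = torch.randn(2, 5, 100)        # (batch, seq, vocab) – dummy shapes
labels = torch.randint(0, 100, (2, 5))
router_logits = torch.randn(2, 8)         # one score per specialist expert
expert_labels = torch.randint(0, 8, (2,))
loss = stage2_loss(lm_logits, labels, router_logits, expert_labels)
```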
### Optimiser

- BitNetStableOptimizer – custom Adam with FP32 master weights
- Two-phase LR: full LR for 60 % of training, then 0.1 × LR
- Warmup: 100 steps
- Weight clamp: [−3, 3] (maps cleanly to −1 / 0 / +1 at inference)
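The two-phase schedule above (100 warmup steps, full LR until 60 % of training, then a flat drop to 0.1×) can be sketched as a `LambdaLR`-style multiplier; `lr_factor` is an illustrative name, not the repository's implementation.

```python
def lr_factor(step: int, total_steps: int, warmup_steps: int = 100) -> float:
    """Two-phase LR multiplier: linear warmup, full LR until 60 % of
    training, then a flat drop to 0.1x for the remainder."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)   # linear warmup
    if step < 0.6 * total_steps:
        return 1.0                           # phase 1: full LR
    return 0.1                               # phase 2: 0.1 x LR

# e.g. with 1000 total steps:
# step 50 -> 0.5 (warmup), step 300 -> 1.0, step 700 -> 0.1
```

This could be plugged into `torch.optim.lr_scheduler.LambdaLR` as the `lr_lambda` argument, assuming the custom optimiser exposes a standard PyTorch interface.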
## Usage

```python
import torch
from PIL import Image
from transformers import AutoTokenizer

# Clone the repo and add it to your Python path, then:
from models import EmberNetVLM
from models.vlm import EmberNetConfig

# Load
config = EmberNetConfig()
model = EmberNetVLM(config)
ckpt = torch.load("pytorch_model.bin", map_location="cpu", weights_only=False)
model.load_state_dict(ckpt["model_state_dict"])
model.eval()

tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-2", trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token

# Inference
image = Image.open("scene.jpg").convert("RGB")
prompt = "<image>\nDescribe what you see."
response = model.generate(
    image=image,
    prompt=prompt,
    tokenizer=tokenizer,
    max_new_tokens=256,
)
print(response)
```
## Intended Uses

- Edge & embedded deployment – ternary weights run efficiently on CPUs and NPUs
- Domain-aware visual reasoning – dedicated experts for OCR, charts, math, spatial, and agentic tasks
- Robotic / agentic pipelines – the `agentic_knowledge` and `agentic_reasoning` experts support multi-step planning
- Fine-tuning base – swap in domain datasets to specialise any of the 8 experts independently
## Limitations

- Optimised for efficiency; maximum single-task accuracy is lower than that of full-precision models of similar size
- Image resolution is fixed at 224 × 224; very fine-grained OCR may degrade
- Expert routing is learned; novel domains may activate sub-optimal experts until fine-tuned
- The tokeniser vocabulary (32,002 tokens) is Phi-2-derived; non-English performance is limited
## Citation

```bibtex
@software{embernet_vlm,
  title  = {EmberNet: Tiny BitNet b1.58 MoE Vision-Language Model},
  author = {Aman Euh},
  year   = {2026},
  url    = {https://huggingface.co/euhidaman/EmberNet-Trial}
}
```