# EmberNet-Trial: BitNet b1.58 MoE VLM

**Status:** Trial run (Stage 2/2, Epoch 1/1, loss 4.4079)

EmberNet is a tiny but capable Vision-Language Model built for edge deployment and domain-expert reasoning. It combines a frozen SigLIP vision backbone with a BitNet b1.58 ternary-quantised Mixture-of-Experts language decoder, achieving ~3× memory reduction over a full-precision equivalent while preserving strong visual understanding across 8 specialised domains.


## Model Details

| Property | Value |
|---|---|
| Model type | Vision-Language Model (VLM) |
| Quantisation | BitNet b1.58 (ternary weights: −1, 0, +1) |
| Total parameters | 840.8 M |
| Trainable parameters | 723.3 M |
| Active parameters / forward | ~235.4 M (top-2 routing) |
| Carbon footprint | 0.0624 kg CO₂eq |
| Training stage | Stage 2/2 (Expert SFT) |
| Epoch | 1/1 |
| Best loss | 4.4079 |
| Last updated | 2026-02-25 15:21 UTC |

## Architecture

```text
EmberNet VLM
├── Vision Encoder  (frozen)
│   ├── SigLIP-base-patch16-224       ~92.9 M params
│   ├── Token Compressor (pixel-shuffle + pooling) → 64 tokens
│   ├── Spatial Pooler                ~2.4 M params
│   └── BitLinear Projector           ~10.1 M params
│
└── BitNet b1.58 MoE Decoder
    ├── Layers: 16   Hidden: 768   Heads: 12 (GQA kv=6)
    ├── Experts: 8 domain specialists + 1 shared expert (always active)
    ├── Routing: Top-2 per token
    └── Quantisation: ternary weights, 4-bit activations
```
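BitNet b1.58 replaces each linear layer's weights with ternary values. A minimal sketch of one common scheme, per-tensor absmean ternarisation (an assumption for illustration; this repo's `BitLinear` may use a different scale or per-channel granularity):

```python
import torch

def absmean_ternarize(w: torch.Tensor, eps: float = 1e-5):
    """Quantise a weight tensor to ternary {-1, 0, +1} with an absmean scale.

    Returns (w_q, scale) such that w is approximated by scale * w_q.
    """
    scale = w.abs().mean().clamp(min=eps)   # per-tensor absmean scale
    w_q = (w / scale).round().clamp_(-1, 1) # snap each weight to {-1, 0, +1}
    return w_q, scale

w = torch.randn(768, 768)
w_q, scale = absmean_ternarize(w)
# Every quantised entry is one of three values, so the tensor can be
# packed into ~1.58 bits per weight at inference time.
```

Packing those three values instead of 16- or 32-bit floats is what enables the memory reduction quoted above.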

## Expert Domains

| ID | Expert | Trained on |
|---|---|---|
| 0 | vision_ocr | TextVQA, DocVQA, OCR-VQA, InfoVQA |
| 1 | vision_diagram | AI2D, InfoVQA diagrams |
| 2 | code_math_chart | ChartQA, PlotQA, FigureQA, DVQA |
| 3 | code_math_formula | MathVista, math formula datasets |
| 4 | spatial_scene | VQAv2, GQA, Visual Genome |
| 5 | spatial_reasoning | RefCOCO, GQA spatial splits |
| 6 | agentic_knowledge | OK-VQA, A-OKVQA |
| 7 | agentic_reasoning | ScienceQA, CLEVR |
| – | shared | All domains (always active) |
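Each token is routed to its top-2 domain experts, and the shared expert contributes to every token. A minimal sketch of that routing pattern (illustrative only; `router`, `experts`, and `shared_expert` are assumed names, not this repo's API):

```python
import torch
import torch.nn.functional as F

def moe_forward(x, router, experts, shared_expert, top_k=2):
    """Route each token to its top-k experts, plus an always-active shared expert.

    x: (num_tokens, hidden) activations.
    router: Linear(hidden -> num_experts) producing routing logits.
    experts / shared_expert: callables mapping (n, hidden) -> (n, hidden).
    """
    logits = router(x)                              # (tokens, num_experts)
    weights, idx = logits.topk(top_k, dim=-1)       # top-2 experts per token
    weights = F.softmax(weights, dim=-1)            # renormalise over the chosen 2
    out = shared_expert(x)                          # shared expert: every token
    for k in range(top_k):
        for e, expert in enumerate(experts):
            mask = idx[:, k] == e                   # tokens whose k-th pick is e
            if mask.any():
                out[mask] += weights[mask, k, None] * expert(x[mask])
    return out
```

With top-2 routing over 8 experts, only a fraction of the decoder's FFN parameters fire per token, which is how the active-parameter count (~235.4 M) stays well below the total (840.8 M).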

## Training

### Configuration

```yaml
stage_1_projector_alignment:
  epochs: 3
  batch_size: 8          # effective 32 with grad-accum 4
  learning_rate: 1e-4
  trainable: vision projector + compressor + pooler only

stage_2_expert_sft:
  epochs: 10
  batch_size: 4          # effective 16 with grad-accum 4
  learning_rate: 3e-4
  trainable: router + all 8 expert FFNs + shared expert
  expert_supervision_weight: 0.1
```
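One plausible reading of `expert_supervision_weight` is an auxiliary loss nudging the router toward the labelled domain expert, scaled by 0.1. A hedged sketch under that assumption (the function name and loss decomposition here are illustrative, not confirmed by this card):

```python
import torch
import torch.nn.functional as F

def stage2_loss(lm_logits, labels, router_logits, expert_labels,
                expert_supervision_weight=0.1):
    """Language-modelling loss plus a weighted expert-supervision term.

    lm_logits: (tokens, vocab); labels: (tokens,) next-token targets.
    router_logits: (tokens, num_experts); expert_labels: (tokens,) domain IDs.
    """
    lm_loss = F.cross_entropy(lm_logits, labels)
    # Auxiliary term: encourage the router to select the dataset's domain expert.
    expert_loss = F.cross_entropy(router_logits, expert_labels)
    return lm_loss + expert_supervision_weight * expert_loss
```

A small weight like 0.1 keeps the language-modelling objective dominant while still steering routing during SFT.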

### Optimiser

  • BitNetStableOptimizer β€” custom Adam with FP32 master weights
  • Two-phase LR: full LR for 60 % of training, then 0.1 Γ— LR
  • Warmup: 100 steps
  • Weight clamp: [βˆ’3, 3] (maps cleanly to βˆ’1 / 0 / +1 at inference)

## Usage

```python
import torch
from PIL import Image
from transformers import AutoTokenizer

# Clone the repo and add it to your Python path, then:
from models import EmberNetVLM
from models.vlm import EmberNetConfig

# Load the checkpoint
config = EmberNetConfig()
model = EmberNetVLM(config)
ckpt = torch.load("pytorch_model.bin", map_location="cpu", weights_only=False)
model.load_state_dict(ckpt["model_state_dict"])
model.eval()

tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-2", trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token

# Inference
image = Image.open("scene.jpg").convert("RGB")
prompt = "<image>\nDescribe what you see."

response = model.generate(
    image=image,
    prompt=prompt,
    tokenizer=tokenizer,
    max_new_tokens=256,
)
print(response)
```

## Intended Uses

  • Edge & embedded deployment β€” ternary weights run efficiently on CPUs and NPUs
  • Domain-aware visual reasoning β€” dedicated experts for OCR, charts, math, spatial, and agentic tasks
  • Robotic / agentic pipelines β€” agentic_knowledge + agentic_reasoning experts support multi-step planning
  • Fine-tuning base β€” swap in domain datasets to specialise any of the 8 experts independently

## Limitations

  • Optimised for efficiency; maximum single-task accuracy is lower than full-precision models of similar size
  • Image resolution fixed at 224 Γ— 224; very fine-grained OCR may degrade
  • Expert routing is learned; novel domains may activate sub-optimal experts until fine-tuned
  • Tokeniser vocabulary (32 002) is Phi-2 derived; non-English performance is limited

## Citation

```bibtex
@software{embernet_vlm,
  title  = {EmberNet: Tiny BitNet b1.58 MoE Vision-Language Model},
  author = {Aman Euh},
  year   = {2026},
  url    = {https://huggingface.co/euhidaman/EmberNet-Trial}
}
```