# EmberNet-Trial: BitNet b1.58 MoE VLM

**Status:** Trial run (Stage 2/2, Epoch 1/1, loss 4.4079)

EmberNet is a tiny but capable Vision-Language Model built for edge deployment and domain-expert reasoning. It combines a frozen SigLIP vision backbone with a BitNet b1.58 ternary-quantised Mixture-of-Experts language decoder, achieving ~3× memory reduction over a full-precision equivalent while preserving strong visual understanding across 8 specialised domains.


## Model Details

| Property | Value |
|---|---|
| Model type | Vision-Language Model (VLM) |
| Quantisation | BitNet b1.58 (ternary weights: −1, 0, +1) |
| Total parameters | 840.8 M |
| Trainable parameters | 723.3 M |
| Active parameters / forward | ~235.4 M (top-2 routing) |
| Carbon footprint | 0.0624 kg CO₂eq |
| Training stage | Stage 2/2 (Expert SFT) |
| Epoch | 1/1 |
| Best loss | 4.4079 |
| Last updated | 2026-02-25 15:21 UTC |

## Architecture

```text
EmberNet VLM
├── Vision Encoder  (frozen)
│   ├── SigLIP-base-patch16-224       ~92.9 M params
│   ├── Token Compressor (pixel-shuffle + pooling) → 64 tokens
│   ├── Spatial Pooler                ~2.4 M params
│   └── BitLinear Projector           ~10.1 M params
│
└── BitNet b1.58 MoE Decoder
    ├── Layers: 16   Hidden: 768   Heads: 12 (GQA kv=6)
    ├── Experts: 8 domain specialists + 1 shared expert (always active)
    ├── Routing: Top-2 per token
    └── Quantisation: ternary weights, 4-bit activations
```
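BitNet b1.58 replaces each linear layer's weights with ternary values. A minimal sketch of one common scheme, per-tensor absmean ternarisation (an assumption for illustration; this repo's `BitLinear` may use a different scale or per-channel granularity):

```python
import torch

def absmean_ternarize(w: torch.Tensor, eps: float = 1e-5):
    """Quantise a weight tensor to ternary {-1, 0, +1} with an absmean scale.

    Returns (w_q, scale) such that w is approximated by scale * w_q.
    """
    scale = w.abs().mean().clamp(min=eps)   # per-tensor absmean scale
    w_q = (w / scale).round().clamp_(-1, 1) # snap each weight to {-1, 0, +1}
    return w_q, scale

w = torch.randn(768, 768)
w_q, scale = absmean_ternarize(w)
# Every quantised entry is one of three values, so the tensor can be
# packed into ~1.58 bits per weight at inference time.
```

Packing those three values instead of 16- or 32-bit floats is what enables the memory reduction quoted above.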

## Expert Domains

| ID | Expert | Trained on |
|---|---|---|
| 0 | vision_ocr | TextVQA, DocVQA, OCR-VQA, InfoVQA |
| 1 | vision_diagram | AI2D, InfoVQA diagrams |
| 2 | code_math_chart | ChartQA, PlotQA, FigureQA, DVQA |
| 3 | code_math_formula | MathVista, math formula datasets |
| 4 | spatial_scene | VQAv2, GQA, Visual Genome |
| 5 | spatial_reasoning | RefCOCO, GQA spatial splits |
| 6 | agentic_knowledge | OK-VQA, A-OKVQA |
| 7 | agentic_reasoning | ScienceQA, CLEVR |
| – | shared | All domains (always active) |
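Each token is routed to its top-2 domain experts, and the shared expert contributes to every token. A minimal sketch of that routing pattern (illustrative only; `router`, `experts`, and `shared_expert` are assumed names, not this repo's API):

```python
import torch
import torch.nn.functional as F

def moe_forward(x, router, experts, shared_expert, top_k=2):
    """Route each token to its top-k experts, plus an always-active shared expert.

    x: (num_tokens, hidden) activations.
    router: Linear(hidden -> num_experts) producing routing logits.
    experts / shared_expert: callables mapping (n, hidden) -> (n, hidden).
    """
    logits = router(x)                              # (tokens, num_experts)
    weights, idx = logits.topk(top_k, dim=-1)       # top-2 experts per token
    weights = F.softmax(weights, dim=-1)            # renormalise over the chosen 2
    out = shared_expert(x)                          # shared expert: every token
    for k in range(top_k):
        for e, expert in enumerate(experts):
            mask = idx[:, k] == e                   # tokens whose k-th pick is e
            if mask.any():
                out[mask] += weights[mask, k, None] * expert(x[mask])
    return out
```

With top-2 routing over 8 experts, only a fraction of the decoder's FFN parameters fire per token, which is how the active-parameter count (~235.4 M) stays well below the total (840.8 M).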

## Training

### Configuration

```yaml
stage_1_projector_alignment:
  epochs: 3
  batch_size: 8          # effective 32 with grad-accum 4
  learning_rate: 1e-4
  trainable: vision projector + compressor + pooler only

stage_2_expert_sft:
  epochs: 10
  batch_size: 4          # effective 16 with grad-accum 4
  learning_rate: 3e-4
  trainable: router + all 8 expert FFNs + shared expert
  expert_supervision_weight: 0.1
```
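One plausible reading of `expert_supervision_weight` is an auxiliary loss nudging the router toward the labelled domain expert, scaled by 0.1. A hedged sketch under that assumption (the function name and loss decomposition here are illustrative, not confirmed by this card):

```python
import torch
import torch.nn.functional as F

def stage2_loss(lm_logits, labels, router_logits, expert_labels,
                expert_supervision_weight=0.1):
    """Language-modelling loss plus a weighted expert-supervision term.

    lm_logits: (tokens, vocab); labels: (tokens,) next-token targets.
    router_logits: (tokens, num_experts); expert_labels: (tokens,) domain IDs.
    """
    lm_loss = F.cross_entropy(lm_logits, labels)
    # Auxiliary term: encourage the router to select the dataset's domain expert.
    expert_loss = F.cross_entropy(router_logits, expert_labels)
    return lm_loss + expert_supervision_weight * expert_loss
```

A small weight like 0.1 keeps the language-modelling objective dominant while still steering routing during SFT.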

### Optimiser

  • BitNetStableOptimizer β€” custom Adam with FP32 master weights
  • Two-phase LR: full LR for 60 % of training, then 0.1 Γ— LR
  • Warmup: 100 steps
  • Weight clamp: [βˆ’3, 3] (maps cleanly to βˆ’1 / 0 / +1 at inference)

## Usage

```python
import torch
from PIL import Image
from transformers import AutoTokenizer

# Clone the repo and add it to your Python path, then:
from models import EmberNetVLM
from models.vlm import EmberNetConfig

# Load the checkpoint
config = EmberNetConfig()
model = EmberNetVLM(config)
ckpt = torch.load("pytorch_model.bin", map_location="cpu", weights_only=False)
model.load_state_dict(ckpt["model_state_dict"])
model.eval()

tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-2", trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token

# Inference
image = Image.open("scene.jpg").convert("RGB")
prompt = "<image>\nDescribe what you see."

response = model.generate(
    image=image,
    prompt=prompt,
    tokenizer=tokenizer,
    max_new_tokens=256,
)
print(response)
```

## Intended Uses

  • Edge & embedded deployment β€” ternary weights run efficiently on CPUs and NPUs
  • Domain-aware visual reasoning β€” dedicated experts for OCR, charts, math, spatial, and agentic tasks
  • Robotic / agentic pipelines β€” agentic_knowledge + agentic_reasoning experts support multi-step planning
  • Fine-tuning base β€” swap in domain datasets to specialise any of the 8 experts independently

## Limitations

  • Optimised for efficiency; maximum single-task accuracy is lower than full-precision models of similar size
  • Image resolution fixed at 224 Γ— 224; very fine-grained OCR may degrade
  • Expert routing is learned; novel domains may activate sub-optimal experts until fine-tuned
  • Tokeniser vocabulary (32 002) is Phi-2 derived; non-English performance is limited

## Citation

```bibtex
@software{embernet_vlm,
  title  = {EmberNet: Tiny BitNet b1.58 MoE Vision-Language Model},
  author = {Aman Euh},
  year   = {2026},
  url    = {https://huggingface.co/euhidaman/EmberNet-Trial}
}
```