---
language:
- en
license: apache-2.0
library_name: transformers
tags:
- text-generation
- llama
- small-language-model
- efficient
- edge-deployment
- speculative-decoding
- tiny-model
- 12m-parameters
- kaggle-trained
- educational
- research
- low-resource
- cpu-inference
- mobile-deployment
- preview
- stentor2
- tokenmonster
pipeline_tag: text-generation
datasets:
- HuggingFaceFW/fineweb-edu
thumbnail: https://huggingface.co/StentorLabs/Stentor2-12M-Preview/resolve/main/thumbnail.png
widget:
- text: "Once upon a time"
  example_title: "Story Generation"
- text: "Explain neural networks in simple terms."
  example_title: "Toy Explanation (Often Wrong)"
- text: "def fibonacci(n):"
  example_title: "Code Continuation"
- text: "The laws of thermodynamics describe"
  example_title: "Science Continuation"
model_card_authors:
- StentorLabs
model-index:
- name: Stentor2-12M-Preview
  results:
  - task:
      type: text-generation
    dataset:
      name: FineWeb-Edu (validation split)
      type: HuggingFaceFW/fineweb-edu
    metrics:
    - name: Best Validation Loss
      type: loss
      value: 3.9145
    - name: Best Perplexity (at best checkpoint)
      type: perplexity
      value: 50.07
    - name: Final Epoch Validation Loss
      type: loss
      value: 4.0083
    - name: Final Epoch Perplexity
      type: perplexity
      value: 55.05
---

# Stentor2-12M-Preview

![License](https://img.shields.io/badge/license-Apache%202.0-blue.svg)
![Model Size](https://img.shields.io/badge/parameters-12.3M-green.svg)
![Training Time](https://img.shields.io/badge/training-4.4h-orange.svg)
![Hardware](https://img.shields.io/badge/hardware-2x%20T4-red.svg)
![Context Length](https://img.shields.io/badge/context-1024%20tokens-purple.svg)
![Vocab Size](https://img.shields.io/badge/vocab-8064%20tokens-blue.svg)
[![Hugging Face](https://img.shields.io/badge/🤗-Hugging%20Face-yellow.svg)](https://huggingface.co/StentorLabs)
![Status](https://img.shields.io/badge/status-Research%20Artifact%20Only-red.svg)

> 🔬 **Research Artifact — Not a Production Model.** This is an early preview checkpoint released for research, experimentation, and community feedback. It is not suitable for deployment in any user-facing application. See [Intended Uses](#use-cases--intended-uses) for details.

> ⚠️ **This is a preview release.** Stentor2-12M-Preview is an early taste of the Stentor2 family — a substantially redesigned architecture over Stentor v1. Further improvements have already been identified and a refined final release is actively in progress. This checkpoint is **not** the ceiling of what Stentor2 will be.
>
> 🚫 **A Stentor2-30M-Preview will NOT be released.** This model exists solely to give the community an early look at the Stentor2 *direction* and design philosophy. It is not a stepping stone to larger preview drops. The next public release from StentorLabs will be the finished, polished Stentor2 model.
>
> 🙏 **A sincere apology about the brief private period.** Shortly after the initial release, the repo was temporarily made private. I want to be completely upfront about what happened: the `AutoModelForCausalLM.from_pretrained()` loading issue described in detail below was discovered *after* going public, and the repo needed to come down immediately to prevent more people from downloading a silently broken model. I'm a high school student working on this alone in my very limited free time, and tracking down exactly why the model was producing no output at all — or throwing an error — despite the weights loading without a visible crash took me an entire day of debugging. I know that if you downloaded the model before the fix, you may have spent hours staring at a prompt that returned nothing and had no idea where to even start. That's an awful experience and I'm genuinely sorry. The model is now fully public, stable, and loads correctly with the custom loader described in this README. Thank you for your patience. 🙏

---

## Table of Contents

1. [What Is This?](#what-is-this)
2. [The Core Design Insight: Vocabulary Efficiency](#the-core-design-insight-vocabulary-efficiency)
3. [Head-to-Head: Stentor v1 vs Stentor2 Preview](#head-to-head-stentor-v1-vs-stentor2-preview)
4. [Quick Start](#quick-start)
5. [Known Loading Issue — Please Read](#known-loading-issue--please-read)
6. [Important Limitations](#important-limitations)
7. [Model Architecture — Full Specification](#model-architecture--full-specification)
8. [The Tokenizer: TokenMonster](#the-tokenizer-tokenmonster)
9. [Training Infrastructure](#training-infrastructure)
10. [Training Hyperparameters — Complete Reference](#training-hyperparameters--complete-reference)
11. [The T4 Mixed-Precision Recipe — Deep Dive](#the-t4-mixed-precision-recipe--deep-dive)
12. [Data Pipeline](#data-pipeline)
13. [Weight Initialization](#weight-initialization)
14. [Evaluation & Results](#evaluation--results)
15. [Training Dynamics](#training-dynamics)
16. [Use Cases & Intended Uses](#use-cases--intended-uses)
17. [Out-of-Scope Uses](#out-of-scope-uses)
18. [Ethical Considerations & Societal Impact](#ethical-considerations--societal-impact)
19. [Inference Guide](#inference-guide)
20. [Real Model Responses](#real-model-responses)
21. [Quantization](#quantization)
22. [Format Conversion](#format-conversion)
23. [Speculative Decoding](#speculative-decoding)
24. [Bias, Risks & Limitations](#bias-risks--limitations)
25. [Related Work](#related-work)
26. [What's Next](#whats-next)
27. [Environmental Impact](#environmental-impact)
28. [Citation](#citation)

---

## What Is This?

Stentor2-12M-Preview is the first public checkpoint from the **Stentor2** model family, a ground-up redesign of the original Stentor v1 line. At ~12.3M parameters, it is a compact base language model trained entirely from scratch on free-tier Kaggle compute. The Kaggle instance provided two NVIDIA Tesla T4 GPUs, although only one was used for the actual compute (see [Training Infrastructure](#training-infrastructure)).

Like all Stentor models, this is a **base next-token predictor**, not a chat assistant. It will not reliably follow instructions, has no safety tuning, and is best used for research, prototyping, speculative decoding, and edge-deployment experimentation. The value of this model is not its conversational capability — it's what it represents architecturally: a dramatic efficiency gain over v1 at the same scale, achieved by fixing the root cause of v1's underperformance.

---

## The Core Design Insight: Vocabulary Efficiency

The most consequential change in Stentor2 is the replacement of v1's 32,768-token Mistral BPE vocabulary with a purpose-built **8,000-token English vocabulary** from the TokenMonster project (`english-8000-consistent-v1`, padded to 8,064 for hardware alignment).

This is not a minor tweak — it is the entire architectural story of Stentor2.

### Why Vocabulary Size Matters So Much at This Scale

In a transformer language model, the embedding table has shape `[vocab_size × hidden_size]`. When you tie word embeddings (share the embedding and output projection weights, which Stentor does), this table appears once in the parameter count. At 12M total parameters, the fraction consumed by this table dictates how much "brain" is left over for the actual transformer layers.

**Stentor-12M (v1)** used a 32,768-token vocabulary. At a hidden size of 192:

```
embedding_params = 32,768 × 192 = 6,291,456
total_params     = 12,047,040
embedding_share  = 52.2%
```

Over half of the model was a lookup table. The transformer stack — the part that actually *learns language patterns* — had fewer than 6 million parameters to work with. It was more dictionary than reasoner.

**Stentor2-12M-Preview** uses an 8,064-token vocabulary. At a hidden size of 256:

```
embedding_params = 8,064 × 256 = 2,064,384
total_params     = 12,294,400
embedding_share  = 16.8%
```

By shrinking the vocabulary, the embedding table was cut from 6.3M to 2.1M parameters — freeing up ~4.2M parameters that were redistributed into transformer depth (12 layers vs 9) and width (hidden size 256 vs 192), where they contribute directly to language modeling quality.
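
The arithmetic above is easy to re-check in a few lines of Python. This is a minimal sketch that uses only the numbers quoted in this section; the helper name `embedding_share` is made up for illustration and does not come from the training script.

```python
def embedding_share(vocab_size: int, hidden_size: int, total_params: int) -> float:
    """Fraction of all parameters held by a tied embedding table."""
    return vocab_size * hidden_size / total_params


v1 = embedding_share(32_768, 192, 12_047_040)   # Stentor-12M (v1)
v2 = embedding_share(8_064, 256, 12_294_400)    # Stentor2-12M-Preview

# Parameters moved out of the embedding table and into the transformer stack
freed = 32_768 * 192 - 8_064 * 256

print(f"v1: {v1:.1%}, v2: {v2:.1%}, freed: {freed:,}")
# v1: 52.2%, v2: 16.8%, freed: 4,227,072
```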

The result is a **~43.8% reduction in perplexity** (89.01 → ~50.07) compared to Stentor-12M. The comparison is close but not perfectly controlled: v1 trained on a mix of FineWeb-Edu and Cosmopedia v2, while Stentor2 trained on FineWeb-Edu only. Treat it as a meaningful side-by-side rather than a pure ablation.

---

## Head-to-Head: Stentor v1 vs Stentor2 Preview

| Property | Stentor-12M (v1) | Stentor2-12M-Preview |
|---|---|---|
| **Vocabulary** | 32,768 (Mistral BPE) | 8,064 (TokenMonster English) |
| **Hidden Size** | 192 | 256 |
| **Intermediate Size** | 576 | 768 |
| **Num Layers** | 9 | 12 |
| **Attention Heads** | 3 | 4 |
| **Head Dimension** | 64 | 64 |
| **Context Length** | 512 tokens | 1,024 tokens |
| **Total Parameters** | 12,047,040 | 12,294,400 |
| **Embedding Share** | 52.2% | 16.8% |
| **Non-Embedding Params** | ~5.76M | ~10.23M |
| **Token Budget** | 200M | 240M |
| **Training Time** | ~1.3h | ~4.4h |
| **Best Perplexity** | 89.01 | ~50.07 |
| **Perplexity Reduction** | — | **~43.8%** |
| **Tokenizer** | Mistral BPE | TokenMonster |
| **Architecture** | LlamaForCausalLM | LlamaForCausalLM |
| **Training Precision** | fp16 | fp16 + INT8 forward |

---

## 🚀 Quick Start

### 1. Install Dependencies

```bash
pip install transformers torch safetensors huggingface_hub
```

> `tokenmonster` will be installed automatically by the loader — you don't need to install it yourself.

### 2. Load the Model

This model needs a small custom loader script because of a quirk in how the checkpoint was saved during training. The loader is just a Python file (`load_stentor2.py`) that lives in this repo. You have two options for using it — pick whichever is easier for you:

---

**Option A — Pull it straight from the repo (easiest, no files to manage)**

The repo is fully public — no token or authentication is required. This downloads the loader file from HuggingFace into your local cache automatically, then runs it. The file is cached after the first download so it's fast on every run after that.

```python
from huggingface_hub import hf_hub_download
import importlib.util, sys, torch

# Download the loader from the HuggingFace repo (cached after first run)
path = hf_hub_download(repo_id="StentorLabs/Stentor2-12M-Preview", filename="load_stentor2.py")

# Import it as a Python module
spec = importlib.util.spec_from_file_location("load_stentor2", path)
mod  = importlib.util.module_from_spec(spec)
sys.modules["load_stentor2"] = mod
spec.loader.exec_module(mod)

# Load the model
model, tokenizer = mod.load_stentor2()
```

The `importlib` lines are just Python's way of loading a `.py` file that isn't in your current folder. After those lines, `mod` behaves exactly like a normal imported module and `mod.load_stentor2()` works exactly like a normal function call.

---

**Option B — Download the file once, import it normally**

Download `load_stentor2.py` from the **Files** tab on this page and put it in the same folder as your script. Then just import it like any normal Python file:

```python
from load_stentor2 import load_stentor2
import torch

model, tokenizer = load_stentor2()
```

If you move your project to a different folder, bring `load_stentor2.py` with it.

---

**Which should I use?**

| | Option A | Option B |
|---|---|---|
| Manual file download needed? | No | Yes (once) |
| Best for | Notebooks, Kaggle, Colab | Local projects |
| Code complexity | A few extra lines | Simple import |

---

**GPU (FP16) — recommended if you have a CUDA GPU:**
```python
model, tokenizer = mod.load_stentor2(dtype=torch.float16)  # Option A
model, tokenizer = load_stentor2(dtype=torch.float16)       # Option B
```

### 3. Generate Text

Once loaded, the model works like any standard HuggingFace model. Because this is a **base model**, it continues text rather than answering questions — give it the beginning of a sentence and it will complete it.

```python
# Encode the prompt and move it to the same device as the model
device         = next(model.parameters()).device
input_ids      = torch.tensor([tokenizer.encode("The history of computing")], dtype=torch.long).to(device)
attention_mask = torch.ones_like(input_ids)  # every input token is real (no padding)

with torch.inference_mode():
    output = model.generate(
        input_ids,
        attention_mask=attention_mask,
        max_new_tokens=60,       # short completions stay more coherent at this scale
        do_sample=True,
        temperature=1.1,
        top_p=0.55,
        repetition_penalty=1.15,
        pad_token_id=tokenizer.pad_token_id,
    )

print(tokenizer.decode(output[0].tolist()))
```

> **Why `attention_mask`?** The model's pad token and EOS token are the same ID. Without an explicit attention mask, HuggingFace throws a warning because it can't tell which tokens are real vs padding. Passing `torch.ones_like(input_ids)` tells the model that every token in the input is real — which is always true here since we never pad single-sequence inference.

### 4. Recommended Generation Settings

These settings were validated through hands-on testing and produce the best results for a base model at this scale:

| Parameter | Recommended Range | Notes |
|---|---|---|
| `temperature` | 0.65 – 1.2 | Lower = more focused, higher = more creative |
| `top_p` | 0.5 – 0.8 | Nucleus sampling cutoff |
| `max_new_tokens` | 10 – 60 | Keep outputs short to stay on topic |
| `repetition_penalty` | 1.1 – 1.2 | Helps prevent looping |

> ⚠️ **Keep `max_new_tokens` low.** This is a 12M parameter base model — it does not have robust long-range coherence. Short completions are significantly more coherent than long ones. Going beyond ~60 tokens will often result in the model wandering off topic or repeating itself.

---

## ⚠️ Known Loading Issue — Please Read

**`AutoModelForCausalLM.from_pretrained()` does NOT work with this model.** This section explains exactly what goes wrong, why, and how the loader fixes it. This is a preview-only issue — it will not exist in the final Stentor2-12M release.

### What Goes Wrong

If you try to load the model the normal way:

```python
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained("StentorLabs/Stentor2-12M-Preview")
```

You will get a load report showing a bunch of `UNEXPECTED` and `MISSING` keys for layers 2–8:

```
model.layers.{2,3,4,5,6,7,8}.self_attn.q_proj.weight_master  | UNEXPECTED
model.layers.{2,3,4,5,6,7,8}.self_attn.q_proj.weight         | MISSING
```

Those layers will be loaded with **uninitialized weights** — the checkpoint has the right data, it's just stored under the wrong name. The result is that the model either produces no output at all, or throws an error during generation. There is no clear indication of why. You can stare at a prompt that returns nothing and have no obvious place to start debugging — which is exactly what makes this failure so painful to track down.

### Why It Happens

Think of a model checkpoint as a dictionary where every layer's weights are stored under a name, like a filing cabinet. The standard name for a weight is `.weight`. HuggingFace opens the filing cabinet, looks for files labeled `.weight`, and loads them.

During training, layers 2–8 used a special training wrapper called `Int8LinearT4` that stored weights under `.weight_master` instead of `.weight`. When the training finished and the checkpoint was saved, those non-standard labels were written to disk exactly as-is.

So HuggingFace opens the filing cabinet, looks for `.weight` in layers 2–8, finds nothing (MISSING), then notices there are `.weight_master` labels it doesn't recognize (UNEXPECTED), and moves on, leaving those layers randomly initialized. Loading itself raises no error. The failure only shows up later, as empty or meaningless output or as an error during generation.

### How the Loader Fixes It

`load_stentor2.py` opens the raw checkpoint file itself before the model ever sees it, finds every `.weight_master` label, and renames it to `.weight`:

```
model.layers.3.self_attn.q_proj.weight_master  →  model.layers.3.self_attn.q_proj.weight
```

Here is exactly what that key-renaming logic looks like:

```python
sd = {}
masters = {k for k in raw_sd if k.endswith(".weight_master")}
skip    = {k[:-len("_master")] for k in masters}
for k, v in raw_sd.items():
    if k.endswith(".weight_master"):
        sd[k[:-len("_master")]] = v   # rename: drop "_master"
    elif k not in skip:
        sd[k] = v                      # keep everything else unchanged

model.load_state_dict(sd, strict=False)
```

Then it hands the corrected checkpoint to the model. The model just sees normal `.weight` labels and loads fine. From that point on it is a completely standard `LlamaForCausalLM` — no special handling needed for anything else.
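
If you want to convince yourself that the renaming behaves as described, you can run the same logic on a toy dictionary. The sketch below wraps it in a hypothetical helper, `rename_master_keys`, and uses placeholder strings where the real checkpoint has tensors; it is an illustration, not the loader's actual code.

```python
def rename_master_keys(raw_sd):
    """Return a copy of raw_sd with every '*.weight_master' key renamed to '*.weight'."""
    suffix = "_master"
    masters = {k for k in raw_sd if k.endswith(".weight_master")}
    # Plain '.weight' keys that would collide with a renamed master key are dropped
    shadowed = {k[: -len(suffix)] for k in masters}

    fixed = {}
    for key, value in raw_sd.items():
        if key in masters:
            fixed[key[: -len(suffix)]] = value   # rename: drop "_master"
        elif key not in shadowed:
            fixed[key] = value                   # keep everything else unchanged
    return fixed


# Toy checkpoint: strings stand in for tensors
toy = {
    "model.layers.0.self_attn.q_proj.weight": "fp32 layer",
    "model.layers.3.self_attn.q_proj.weight_master": "int8-wrapped layer",
}

fixed = rename_master_keys(toy)
leftover = [k for k in fixed if k.endswith(".weight_master")]
print(sorted(fixed))
print("leftover master keys:", leftover)
```

An empty `leftover` list means every key now uses the standard `.weight` name.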

> ✅ **This will not be an issue in Stentor2-12M.** The final release will save a clean checkpoint with standard key names that loads with `AutoModelForCausalLM.from_pretrained()` as normal. This is purely a preview artifact.

---

## ⚠️ Important Limitations

- **Not Instruction-Tuned:** This is a base model. It will often ignore prompts, continue in unexpected directions, or respond off-topic. The chat template in the tokenizer config is present for structural compatibility, not because the model knows how to use it.
- **No Safety Tuning:** No RLHF, no constitutional AI, no content filtering. Use with appropriate caution.
- **Limited World Knowledge:** ~12M parameters cannot store meaningful world knowledge. Do not treat outputs as factual.
- **Context Window:** Hard limit of 1,024 tokens. The model was trained exclusively on 1,024-token packed sequences; longer contexts are untested and likely to degrade.
- **English Only:** The TokenMonster `english-8000-consistent-v1` vocabulary is English-specific. Non-English text will tokenize very poorly.
- **Custom Tokenizer:** This model uses a TokenMonster adapter, **not** a standard Hugging Face fast tokenizer. The `tokenizer.json` format differs from typical models. The loader script installs `tokenmonster` automatically; if you load the tokenizer any other way, install the package first.
- **`skip_special_tokens` Not Supported:** The TokenMonster tokenizer does **not** support the `skip_special_tokens` argument in its decode method. Calling `tokenizer.decode(ids, skip_special_tokens=True)` will raise an error. Strip special tokens manually if needed — see the [Tokenizer section](#the-tokenizer-tokenmonster) for details.
- **Preview Quality:** Further architectural improvements have already been identified. This is not the final Stentor2 model.
- **Shared Tensor Warning:** When saving or loading this model, you may see: `Removed shared tensor {'lm_head.weight'} while saving`. This is expected behavior from tied word embeddings and is safe to ignore.

---

## Model Architecture — Full Specification

Stentor2-12M-Preview is a `LlamaForCausalLM` model. All architecture values below were derived directly from the training script and validated against the logged parameter counts.

### Core Configuration

| Component | Value | Derivation |
|---|---|---|
| **Architecture** | `LlamaForCausalLM` | Hard-coded in training script |
| **Hidden Size** | 256 | Inferred: embedding_params (2,064,384) ÷ vocab_size (8,064) = 256 ✓ |
| **Intermediate Size (FFN)** | 768 | Hidden × 3 (verified via total param count) |
| **Num Hidden Layers** | 12 | Verified via total param count formula |
| **Num Attention Heads** | 4 | Hidden ÷ head_dim = 256 ÷ 64 = 4 |
| **Num Key/Value Heads** | 4 | Full MHA (no GQA at this scale) |
| **Head Dimension** | 64 | Enforced by training script: `head_dim must be 64` |
| **Vocab Size** | 8,064 | TokenMonster 8K base + 62 padding tokens (multiple of 128) |
| **Max Position Embeddings** | 1,024 | `block_size` default in training script |
| **Hidden Activation** | SiLU | LlamaForCausalLM default |
| **Positional Encoding** | RoPE | `rope_theta = 10,000.0` |
| **RMS Norm Epsilon** | 1e-5 | Default in training script |
| **Tie Word Embeddings** | True | Shared embedding / LM head weights |
| **Attention Implementation** | SDPA | PyTorch Scaled Dot Product Attention |
| **Attention Pattern** | Full causal | No sliding window, no sparse patterns |

### Parameter Count Breakdown

The total parameter count can be reproduced exactly using the following formula from the training script:

```python
def estimate_llama_params(vocab_size, hidden_size, intermediate_size,
                          num_hidden_layers, num_attention_heads, num_key_value_heads):
    kv_dim = int(hidden_size * num_key_value_heads / num_attention_heads)
    # Q and O projections (hidden→hidden) + K and V projections (hidden→kv_dim; equals hidden for full MHA)
    attn = 2 * hidden_size * hidden_size + 2 * hidden_size * kv_dim
    # Gate, Up, Down projections
    mlp  = 3 * hidden_size * intermediate_size
    # Input norm + post-attention norm per layer
    norm = 2 * hidden_size
    # Embedding table + final RMS norm
    total = vocab_size * hidden_size + num_hidden_layers * (attn + mlp + norm) + hidden_size
    return total
```

Plugging in Stentor2 values:

```
kv_dim  = 256 * 4 / 4 = 256
attn    = 2×256×256 + 2×256×256 = 131,072 + 131,072 = 262,144
mlp     = 3×256×768 = 589,824
norm    = 2×256 = 512
per_layer = 262,144 + 589,824 + 512 = 852,480

embedding = 8,064 × 256  = 2,064,384
layers    = 12 × 852,480 = 10,229,760
final_norm = 256

total = 2,064,384 + 10,229,760 + 256 = 12,294,400 ✓
```

| Component | Parameters | % of Total |
|---|---|---|
| Embedding Table (tied with LM Head) | 2,064,384 | 16.8% |
| Transformer Layers × 12 | 10,229,760 | 83.2% |
| — Attention (per layer × 12) | 3,145,728 | 25.6% |
| — FFN/MLP (per layer × 12) | 7,077,888 | 57.6% |
| — Layer Norms (per layer × 12) | 6,144 | 0.05% |
| Final RMS Norm | 256 | 0.002% |
| **Total** | **12,294,400** | **100%** |

### Architecture Constraints Enforced by Training Script

The training pipeline enforces several hard constraints that directly shaped the final architecture:

1. **Head dimension must be exactly 64.** The script raises a `SystemExit` if `hidden_size / num_attention_heads ≠ 64`. The training script treats this as a T4 efficiency constraint, on the premise that a head dimension of 64 maps well onto the T4's tensor cores.

2. **KV heads ≤ attention heads, and attention heads divisible by KV heads.** This is the usual validity check for grouped-query attention. With 4 KV heads and 4 attention heads, the model is plain MHA (no GQA at this scale).

3. **Vocabulary padded to nearest multiple of 128.** `pad_vocab_to_multiple=128` for hardware alignment.
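
The first two constraints can be sketched as a small validation function. This is an illustration under assumptions, not the training script's code: the function name and most of the error messages are hypothetical, and only the `SystemExit` behavior and the `head_dim must be 64` rule come from this README.

```python
def validate_architecture(hidden_size, num_attention_heads, num_key_value_heads,
                          required_head_dim=64):
    """Return the head dimension, or exit if a constraint is violated."""
    if hidden_size % num_attention_heads != 0:
        raise SystemExit("hidden_size must be divisible by num_attention_heads")
    head_dim = hidden_size // num_attention_heads
    if head_dim != required_head_dim:
        raise SystemExit(f"head_dim must be {required_head_dim}, got {head_dim}")
    if num_key_value_heads > num_attention_heads:
        raise SystemExit("num_key_value_heads must not exceed num_attention_heads")
    if num_attention_heads % num_key_value_heads != 0:
        raise SystemExit("num_attention_heads must be divisible by num_key_value_heads")
    return head_dim


print(validate_architecture(256, 4, 4))  # 64 (Stentor2-12M-Preview)
print(validate_architecture(192, 3, 3))  # 64 (Stentor-12M v1)
```

A configuration such as `hidden_size=256` with 8 heads gives a head dimension of 32 and would be rejected.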

---

## The Tokenizer: TokenMonster

Stentor2 uses a custom tokenizer adapter wrapping the **TokenMonster** `english-8000-consistent-v1` vocabulary, rather than a standard BPE tokenizer from the Hugging Face ecosystem.

### What Is TokenMonster?

TokenMonster ([alasdairforsythe/tokenmonster](https://huggingface.co/alasdairforsythe/tokenmonster)) is an alternative tokenization approach optimized for compact English vocabulary sizes. The `english-8000-consistent-v1` vocabulary is a purpose-built ~8,000-token English vocabulary designed for efficiency at small model scales.

### ⚠️ `skip_special_tokens` Is Not Supported

The TokenMonster tokenizer **does not support the `skip_special_tokens` argument** in its decode method. If you call `tokenizer.decode(ids, skip_special_tokens=True)` you will get an error. Always decode without it and strip special tokens manually if needed:

```python
# ✅ Correct
text = tokenizer.decode(output_ids)

# ❌ This will raise an error
text = tokenizer.decode(output_ids, skip_special_tokens=True)
```

If you are using HuggingFace's `TextIteratorStreamer` or any wrapper that internally passes `skip_special_tokens=True` to the tokenizer, you will need to patch or replace that wrapper. The demo Space and the loader script both handle this correctly already.
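
One simple way to strip special tokens manually is to filter their IDs out before decoding. The helper below is a hypothetical sketch: `strip_special_ids` is not part of the loader, and the IDs in the demo are toy values.

```python
def strip_special_ids(token_ids, special_ids):
    """Return token_ids with every ID in special_ids removed, order preserved."""
    special = set(special_ids)
    return [int(t) for t in token_ids if int(t) not in special]


# Toy IDs for illustration only; real IDs come from the tokenizer.
cleaned = strip_special_ids([5, 17, 42, 2, 2], special_ids={2})
print(cleaned)  # [5, 17, 42]
```

With the real model you would pass something like `{tokenizer.pad_token_id}` as `special_ids` (pad and EOS share one ID here) and then call `tokenizer.decode(...)` on the filtered list. Whether the adapter also exposes a BOS ID attribute is an assumption you should check before relying on it.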

### Tokenizer Efficiency vs. v1

You may notice that this tokenizer produces **more tokens per word** than the one used by Stentor v1. This is expected and by design. The v1 models used a 32,768-token Mistral BPE vocabulary, which encodes many common English words as single tokens. Stentor2 uses an 8,064-token TokenMonster vocabulary, and a smaller vocabulary means more tokens per word on average. This is the direct tradeoff for freeing up ~4.2M parameters for the transformer layers. The ~43.8% perplexity improvement suggests the tradeoff was worth it, with the caveat that per-token perplexities measured under two different tokenizers are not strictly comparable.

### Vocabulary Construction

The tokenizer pipeline proceeds as follows:

1. Base vocabulary is loaded from `alasdairforsythe/tokenmonster` → `vocabs/english-8000-consistent-v1.vocab` via `hf_hub_download`.
2. Special tokens are added: `</s>` (EOS), `<s>` (BOS), `<pad>` (set equal to EOS).
3. A default chat template is injected for structural compatibility.
4. The vocabulary is padded to the nearest multiple of 128 using dummy tokens `<|extra_0|>`, `<|extra_1|>`, ..., resulting in a final vocabulary size of **8,064 tokens**.

```
Base TokenMonster vocab:  ~8,002 tokens
+ padding to 128-multiple: +62 tokens
= Final vocab size:         8,064 tokens
```
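
The padding step is plain round-up-to-a-multiple arithmetic. Here is a minimal sketch: the helper name is hypothetical, and the base count of 8,002 is the approximate figure quoted above rather than a value read from the vocabulary file.

```python
def pad_to_multiple(n: int, multiple: int) -> int:
    """Round n up to the nearest multiple of `multiple`."""
    return -(-n // multiple) * multiple


base_vocab = 8_002                              # approximate, as quoted above
final_vocab = pad_to_multiple(base_vocab, 128)  # 8064
num_extra = final_vocab - base_vocab            # 62

# Dummy token names follow the <|extra_N|> pattern described above
extra_tokens = [f"<|extra_{i}|>" for i in range(num_extra)]

print(final_vocab, num_extra, extra_tokens[0], extra_tokens[-1])
# 8064 62 <|extra_0|> <|extra_61|>
```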

### The TokenMonsterTokenizerAdapter

The training script wraps the TokenMonster vocabulary in a custom `TokenMonsterTokenizerAdapter` class that provides a HuggingFace-compatible interface. Key implementation details:

- **Tokenization:** Calls `vocab.tokenize(batch)` — batch or single-string input
- **Decoding:** Calls `vocab.decode(token_ids)`
- **No padding during tokenization itself** — padding is handled by the data collator
- **EOS appended** to each training sample in the tokenization function
- **`is_fast = True`** flag set to satisfy the training script's fast-tokenizer requirement
- **`save_pretrained`** saves a `tokenmonster.vocab` binary + `tokenizer_config.json` + `special_tokens_map.json`

### Tokenizer Configuration

```json
{
  "tokenizer_type": "tokenmonster",
  "vocab_file": "tokenmonster.vocab",
  "model_max_length": 1024,
  "eos_token": "</s>",
  "bos_token": "<s>",
  "pad_token": "</s>",
  "vocab_size": 8064
}
```

### Chat Template

A simple chat template is injected during tokenizer setup for structural compatibility with chat formatting tools, though the base model is **not trained to follow it**:

```jinja
{% for message in messages %}
<|{{ message['role'] }}|>
{{ message['content'] }}
{% endfor %}
{% if add_generation_prompt %}<|assistant|>
{% endif %}
```

### Loading the Tokenizer in Inference

Because the tokenizer is a custom type, standard `AutoTokenizer.from_pretrained` may require the `tokenmonster` Python package:

```bash
pip install tokenmonster
```

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "StentorLabs/Stentor2-12M-Preview",
    trust_remote_code=True  # may be needed depending on version
)
```

---

## Training Infrastructure

### Hardware

| Component | Specification |
|---|---|
| GPU Count | 2× NVIDIA Tesla T4 |
| VRAM per GPU | 15.64 GB |
| Total VRAM | ~31.3 GB |
| Platform | Kaggle Notebooks (free tier) |
| Accelerator Library | HuggingFace Accelerate |
| Active Processes | 1 (single-process despite 2 GPUs; T4 recipe runs on device 0) |

> **Note on Dual-GPU Setup:** The Kaggle environment exposed 2× T4 GPUs and Accelerate initialized successfully, but the training run executed as a single process (`num_processes: 1`), so only one GPU did the actual compute. The second GPU was available but unused for this run. The `device_map="auto"` setup was in place but not exercised.

### Software Stack

| Package | Role |
|---|---|
| PyTorch | Core tensor operations and autograd |
| HuggingFace Transformers | Model architecture (LlamaForCausalLM) |
| HuggingFace Accelerate | Training loop and device management |
| HuggingFace Datasets | Streaming data loading |
| bitsandbytes | INT8 quantization primitives |
| tokenmonster | Custom vocabulary |
| safetensors | Model serialization |

---

## Training Hyperparameters — Complete Reference

The following table represents the exact configuration used for this training run, sourced directly from the training script defaults and confirmed against the training logs.

### Core Training Parameters

| Hyperparameter | Value | Notes |
|---|---|---|
| `learning_rate` | 2e-4 | AdamW LR for all parameters |
| `weight_decay` | 0.01 | Applied to non-embedding, non-norm, non-bias params |
| `max_grad_norm` | 1.0 | Gradient clipping threshold |
| `optimizer` | AdamW | With `betas=(0.9, 0.95)`, `eps=1e-8` |
| `scheduler` | Cosine | Cosine decay with linear warmup |
| `warmup_ratio` | 0.05 | → 732 warmup steps |
| `stable_ratio` | 0.8 | Used only by the WSD scheduler (would give 11,719 stable steps); not used by the cosine schedule |
| `token_budget` | 240,000,000 | Hard stop at 240M tokens seen |
| `max_train_steps` | 14,649 | Computed from token budget |
| `seed` | 42 | Reproducibility seed |
| `mixed_precision` | fp16 | FP16 autocast; critical layers, norms, and INT8 master weights stay in FP32 (see the T4 recipe) |

### Batch & Sequence Parameters

| Hyperparameter | Value | Notes |
|---|---|---|
| `per_device_train_batch_size` | 4 | Per GPU per gradient accumulation step |
| `per_device_eval_batch_size` | 4 | Evaluation batch size |
| `gradient_accumulation_steps` | 4 | One optimizer step every 4 forward/backward passes |
| `total_batch_size` | 16 | `per_device × processes × grad_accum = 4×1×4` |
| `block_size` | 1,024 | Sequence length; training packed to this size |
| `tokens_per_optimizer_step` | 16,384 | `total_batch_size × block_size` |
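
The derived values in the two tables above follow from a few multiplications. The sketch below reproduces them; the use of `math.ceil` for the step count and integer truncation for the warmup steps is an assumption that happens to match the logged numbers, not something confirmed from the training script.

```python
import math

per_device_train_batch_size = 4
num_processes = 1
gradient_accumulation_steps = 4
block_size = 1_024
token_budget = 240_000_000
warmup_ratio = 0.05

total_batch_size = per_device_train_batch_size * num_processes * gradient_accumulation_steps
tokens_per_optimizer_step = total_batch_size * block_size

# Assumption: round up so the budget is fully covered, truncate the warmup count
max_train_steps = math.ceil(token_budget / tokens_per_optimizer_step)
warmup_steps = int(max_train_steps * warmup_ratio)

print(total_batch_size, tokens_per_optimizer_step, max_train_steps, warmup_steps)
# 16 16384 14649 732
```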

### Evaluation & Checkpointing

| Hyperparameter | Value | Notes |
|---|---|---|
| `eval_steps` | 375 | Eval every 375 optimizer steps |
| `save_every_minutes` | 30 | Time-based checkpoint cadence |
| `save_total_limit` | 2 | Keep only the 2 most recent checkpoints |
| `save_epochs` | 1 | Save at end of each epoch |
| `logging_steps` | 125 | Console log every 125 optimizer steps |
| `max_eval_samples` | 2,000 | Validation set size |

### AdamW Optimizer — Detailed

The optimizer uses a decoupled parameter group strategy:

- **Decay group:** All `nn.Linear` weight matrices (excludes bias, norm weights, embedding)
  - `weight_decay = 0.01`
- **No-decay group:** Bias terms, normalization parameters, embedding parameters
  - `weight_decay = 0.0`
- **Betas:** `(0.9, 0.95)` — the 0.95 β₂ is a modern LLM default (vs the 0.999 PyTorch default)
- **Epsilon:** `1e-8`
- **Fused kernel:** Enabled if available (`torch.optim.AdamW(fused=True)` when CUDA is present)

### Learning Rate Schedule

The cosine schedule with warmup has two phases (a third, separate decay phase exists only under the WSD scheduler):

```
Phase 1: Warmup (steps 0–732)
  LR ramps linearly from 0 → 2e-4

Phase 2: Cosine decay (steps 732–14,649)
  LR follows a cosine curve from 2e-4 → 0

Phase 3: N/A for cosine (the WSD decay phase applies only if scheduler=wsd)
```

Implemented via HuggingFace `get_cosine_schedule_with_warmup`.
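
For intuition, the schedule can be written as a plain function of the step number. This is a sketch of the standard linear-warmup plus half-cosine formula with the values from this README filled in as defaults; the function name `cosine_lr` is hypothetical, and the real run used the HuggingFace scheduler rather than this code.

```python
import math


def cosine_lr(step, peak_lr=2e-4, warmup_steps=732, total_steps=14_649):
    """Learning rate at a given optimizer step: linear warmup, then cosine decay to 0."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * max(0.0, 0.5 * (1.0 + math.cos(math.pi * progress)))


for step in (0, 366, 732, 7_690, 14_649):
    print(f"step {step:>6}: lr = {cosine_lr(step):.6f}")
```

The learning rate starts at 0, peaks at step 732, and returns to 0 at the final step.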

---

## The T4 Mixed-Precision Recipe — Deep Dive

The most technically interesting aspect of Stentor2's training pipeline is its custom **T4 Mixed-Precision Recipe** — a bespoke approach to stable mixed-precision training on NVIDIA Tesla T4 GPUs, which lack BF16 support and have known numerical instability issues with FP16 on certain operations.

This recipe involves four distinct techniques applied simultaneously:

### 1. INT8 Simulated-Quantization Linear (49 modules)

All non-critical transformer linear layers are wrapped in a custom `Int8LinearT4` module that performs **quantization-aware training (QAT)** with a straight-through estimator (STE).

**How it works:**
- The module stores a **FP32 master weight** (`weight_master`) — gradients always flow back to this full-precision copy
- On each forward pass, the weight is **quantized to INT8** (simulated, not actual int8 memory layout) and then **dequantized back to FP16** for the matmul
- Both weights and activations are independently quantized using a per-row/per-token absolute-max scale: `scale = abs(x).amax(dim=-1, keepdim=True).clamp_min(1e-8) / 127.0`
- **Stochastic rounding** is used during training (disabled at eval): instead of `round(x)`, each fractional part is probabilistically rounded up or down — this reduces systematic quantization bias
- The STE ensures gradients pass through the non-differentiable rounding operation unchanged

**Why this matters:** The quantization error acts as a regularizer and forces the model to learn representations that are robust to 8-bit precision — a desirable property for downstream deployment on quantized hardware.

```
INT8 QAT forward pass (simplified):
  scale  = |W|.row_max / 127
  W_q    = round(W / scale).clamp(-127, 127)   ← stochastic
  W_dq   = W_q × scale                          ← dequantize
  W_ste  = W + (W_dq - W).detach()             ← STE: gradient sees full W
  output = x_ste @ W_ste.T + bias
```
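
The pseudocode above can be sketched in PyTorch roughly as follows. This is a minimal illustration and not the `Int8LinearT4` source: the function name `fake_quant_int8` is hypothetical, and the stochastic-rounding form (`floor(x + u)` with uniform `u`) is one common implementation, assumed here.

```python
import torch

def fake_quant_int8(x: torch.Tensor, stochastic: bool) -> torch.Tensor:
    """Simulated INT8 quantization with a straight-through estimator (STE)."""
    # Per-row absolute-max scale, as described in the text.
    scale = x.abs().amax(dim=-1, keepdim=True).clamp_min(1e-8) / 127.0
    scaled = x / scale
    if stochastic:
        # Rounds up with probability equal to the fractional part.
        scaled = torch.floor(scaled + torch.rand_like(scaled))
    else:
        scaled = torch.round(scaled)
    x_dq = scaled.clamp(-127, 127) * scale  # dequantize
    # Forward value is x_dq; backward treats the whole operation as identity.
    return x + (x_dq - x).detach()

torch.manual_seed(0)
w = torch.randn(4, 8, requires_grad=True)
w_q = fake_quant_int8(w, stochastic=False)
w_q.sum().backward()

print((w_q - w).abs().max().item())              # largest quantization error
print(torch.equal(w.grad, torch.ones_like(w)))   # gradient passes through unchanged
```

With deterministic rounding the per-element error is bounded by half a quantization step; stochastic rounding trades a larger worst-case error for an unbiased estimate.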

### 2. FP32 Critical Layers (5 layers: first 2 + last 3)

The first 2 and last 3 transformer layers are designated as **critical layers** and run entirely in FP32:

- Their weights are cast to `.float()` at setup time
- Their `forward()` method is monkey-patched to cast all inputs to FP32 before the call and cast outputs back to the original dtype afterward
- `torch.amp.autocast("cuda", enabled=False)` context is used to prevent autocast from re-downcasting inside the layer

**Rationale:** The first layers are responsible for embedding projection and initial feature extraction; instability here corrupts the entire forward pass. The last layers handle final token prediction; numerical errors here directly impact loss. Running these in FP32 provides a stability floor at minimal compute cost.

### 3. FP32 Normalization Layers (25 modules)

All RMSNorm and LayerNorm modules are monkey-patched to run their computation in FP32 regardless of input dtype:

```python
def _fp32_norm_forward(hidden_states, *args, **kwargs):
    input_dtype = hidden_states.dtype
    output = original_forward(hidden_states.float().contiguous(), *args, **kwargs)
    return output.clone().to(input_dtype)
```

The `.clone()` call is critical: it prevents returning graph-managed buffers that can be overwritten across CUDAGraph replay steps under `torch.compile`. The inputs are also `.contiguous()` to prevent strided-tensor issues in FP32 norm ops.

**Why 25 modules:** With 12 transformer layers × 2 norms each (input norm + post-attention norm) + 1 final norm = 25 total.

**This is why `torch.compile` is disabled.** Even with the `.clone()` safeguard, the monkey-patched FP32 norm wrappers are not treated as safe under CUDAGraph replay, which `torch.compile` uses in `reduce-overhead` mode. Compilation is therefore turned off rather than risking silent correctness errors.

### 4. FP32 Attention Softmax (12 modules)

Each attention module's `forward()` is monkey-patched to replace `torch.nn.functional.softmax` with a version that upcasts FP16/BF16 inputs to FP32 before computing the softmax, then downcasts the result:

```python
def _softmax_fp32(input_tensor, *args, **kwargs):
    if input_tensor.dtype in (torch.float16, torch.bfloat16):
        output = original_softmax(input_tensor.float(), *args, **kwargs)
        return output.to(input_tensor.dtype)
    return original_softmax(input_tensor, *args, **kwargs)
```

**Why this matters:** Softmax over long attention rows in FP16 can produce NaN or Inf values when intermediate results leave the FP16 range, for example through overflow in the `exp()` operation. Running the softmax itself in FP32 removes this source of instability, which matters for stable attention at the full 1,024-token context.
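
A quick range check illustrates why half precision is fragile here. FP16 cannot represent finite values above 65,504, so `exp(x)` leaves the representable range once `x` exceeds roughly 11.09. The sketch below uses Python's `struct` half-precision format as a stand-in for FP16 storage; the helper name `fits_in_fp16` is hypothetical, and the code does not reproduce the patched attention module.

```python
import math
import struct

FP16_MAX = 65504.0  # largest finite IEEE 754 half-precision value

def fits_in_fp16(value: float) -> bool:
    """Return True if `value` can be stored as a finite half-precision float."""
    try:
        struct.pack("<e", value)  # "e" is the half-precision format code
        return True
    except OverflowError:
        return False

threshold = math.log(FP16_MAX)
print(f"exp(x) overflows FP16 once x > {threshold:.2f}")
for logit in (8.0, 11.0, 12.0, 20.0):
    print(logit, fits_in_fp16(math.exp(logit)))
```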

### T4 Recipe Summary Table

| Technique | Count | Scope |
|---|---|---|
| INT8 QAT linear modules | 49 | All non-critical linear layers |
| FP32 critical layers | 5 | Layers {0, 1, 9, 10, 11} |
| FP32 norm modules | 25 | All RMSNorm / LayerNorm |
| FP32 softmax modules | 12 | All attention modules |
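
The counts in the table can be cross-checked with simple arithmetic. The sketch assumes the standard Llama decoder layout of seven linear projections per layer (four in attention, three in the MLP); that figure is an assumption, not something stated in this card.

```python
NUM_LAYERS = 12
CRITICAL_LAYERS = {0, 1, 9, 10, 11}  # first 2 + last 3 layers

# Assumption: q, k, v, o projections plus gate, up, down projections per layer.
LINEARS_PER_LAYER = 7
NORMS_PER_LAYER = 2  # input norm + post-attention norm

int8_linears = (NUM_LAYERS - len(CRITICAL_LAYERS)) * LINEARS_PER_LAYER
fp32_norms = NUM_LAYERS * NORMS_PER_LAYER + 1  # +1 for the final norm
fp32_softmax = NUM_LAYERS                      # one attention module per layer

print(int8_linears, len(CRITICAL_LAYERS), fp32_norms, fp32_softmax)  # 49 5 25 12
```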

### Gradient Checkpointing

Gradient checkpointing is enabled using the non-reentrant implementation (`use_reentrant=False`, the variant recommended by current PyTorch releases) to reduce activation memory. `model.config.use_cache = False` is set to prevent KV cache allocation during training. `model.enable_input_require_grads()` is called to ensure gradients can flow through checkpoint boundaries.
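
The following sketch shows non-reentrant checkpointing on a toy module using `torch.utils.checkpoint.checkpoint` with `use_reentrant=False`. The module and tensor shapes are made up for illustration. With a Transformers model the usual route is `model.gradient_checkpointing_enable(gradient_checkpointing_kwargs={"use_reentrant": False})`; whether the training script uses exactly that call is an assumption.

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

torch.manual_seed(0)
block = nn.Sequential(nn.Linear(16, 64), nn.GELU(), nn.Linear(64, 16))
x = torch.randn(4, 16)

# Reference: ordinary forward/backward, intermediate activations kept in memory.
block(x).sum().backward()
ref_grad = block[0].weight.grad.clone()
block.zero_grad()

# Checkpointed: activations inside `block` are recomputed during backward.
checkpoint(block, x, use_reentrant=False).sum().backward()
print(torch.allclose(ref_grad, block[0].weight.grad))
```

Both paths produce the same gradients; checkpointing only trades extra forward computation for lower activation memory.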

---

## Data Pipeline

### Dataset

The model was trained exclusively on **FineWeb-Edu** ([HuggingFaceFW/fineweb-edu](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu)) — a large-scale web corpus filtered for educational content quality. Cosmopedia v2 was available in the pipeline (configurable via `--cosmopedia_weight`) but the default weight of 0.0 means it was not used in this run.

**Total tokens processed:** 240,001,024 (budget-limited run)

### Streaming Mode

The dataset was loaded in **streaming mode** (`streaming=True`), meaning:
- No data was pre-downloaded or pre-tokenized to disk
- Samples were tokenized on-the-fly during training
- `num_workers=0` was enforced (combining an `IterableDataset` with multiprocessing workers can deadlock in notebook environments)
- Shuffle buffer of 20,000 samples was applied

### Text Preprocessing

Each raw text sample undergoes the following cleaning pipeline before tokenization:

```python
def clean_text(text: str) -> str:
    text = unicodedata.normalize("NFKC", text)   # normalize unicode
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    text = " ".join(lines)                         # collapse newlines
    text = " ".join(text.split())                  # normalize whitespace
    return text
```

**Why these specific steps:**

- **NFKC normalization** maps visually equivalent Unicode characters to a single canonical form (e.g., full-width `Ａ` → `A`, ligature `ﬁ` → `fi`, superscript `²` → `2`). NFKC is defined in Unicode Standard Annex #15 and is a common choice in LLM preprocessing; for example, the SentencePiece tokenizer used by T5 (Raffel et al., 2020, [arXiv:1910.10683](https://arxiv.org/abs/1910.10683)) applies an NFKC-based normalization by default. BERT (Devlin et al., 2019, [arXiv:1810.04805](https://arxiv.org/abs/1810.04805)) also normalizes Unicode before tokenization, although its reference tokenizer uses NFD for accent stripping rather than NFKC. Without normalization, the model would see several different token sequences for what is semantically one character.

- **Whitespace collapse** (join lines, collapse spaces) ensures consistent tokenization of the same content regardless of how it was originally formatted. Web-scraped text commonly contains inconsistent line breaks, multiple spaces, and mixed newline styles. This is also standard practice in GPT-style pretraining pipelines. No ablation was performed on this step — it was adopted from established practice rather than experimentally derived.
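
A short usage example shows the effect of both steps on a deliberately messy string. The function body is a compact restatement of `clean_text` from above; the sample string is invented for illustration.

```python
import unicodedata

def clean_text(text: str) -> str:
    text = unicodedata.normalize("NFKC", text)                  # normalize unicode
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    return " ".join(" ".join(lines).split())                    # collapse whitespace

# Ligature, superscript, full-width characters, CRLF line breaks, tabs.
raw = "The ﬁrst  result:\r\n\r\n   x² ＝ ４\tapprox.\n"
print(clean_text(raw))  # The first result: x2 = 4 approx.
```

Note that NFKC is lossy: `x²` becomes `x2`, so mathematical notation in the source text loses its superscripts.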

### Tokenization

Each cleaned sample is tokenized using the TokenMonster adapter:
- `add_special_tokens=False` during tokenization
- EOS token (`</s>`) appended to every sample
- Attention mask generated (all 1s for real tokens)

### Sequence Packing

After tokenization, samples are **packed** into fixed 1,024-token blocks using a stateful packing function:

```
Sample 1:  [tok, tok, tok, ..., </s>]   (e.g., 347 tokens)
Sample 2:  [tok, tok, tok, ..., </s>]   (e.g., 891 tokens)
                                         ↓
Block 1:   [<sample1...>, <first 677 tokens of sample2>]   (1024 tokens)
Block 2:   [<remaining 214 tokens of sample2>, <sample3...>]
```

Packing removes almost all padding: every block except the final one consists entirely of real content tokens. The remainder buffer carries leftover tokens between batch iterations. At the end of the dataset, any leftover tokens are padded to 1,024 with the EOS token, and the labels for the padded positions are masked (`-100`).

**Labels** for packed sequences are identical to `input_ids` (causal LM: predict each token from all preceding tokens). There is no special boundary masking between packed samples in this pipeline — the model learns to cross document boundaries, which is standard practice.
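
The packing behaviour described above can be sketched in a few lines of plain Python. The function name `pack_blocks`, the token ids, and the EOS id are hypothetical; the sketch reproduces the 347-token and 891-token example from the diagram and leaves out the final-block padding.

```python
from typing import Iterable, Iterator, List

BLOCK_SIZE = 1024

def pack_blocks(samples: Iterable[List[int]],
                block_size: int = BLOCK_SIZE) -> Iterator[List[int]]:
    """Concatenate tokenized samples and emit fixed-size blocks."""
    buffer: List[int] = []  # remainder carried across samples
    for tokens in samples:
        buffer.extend(tokens)
        while len(buffer) >= block_size:
            yield buffer[:block_size]
            buffer = buffer[block_size:]
    # Leftover tokens remain in `buffer`; the real pipeline pads them with EOS
    # and masks the padded labels with -100 (omitted in this sketch).

EOS = 2  # hypothetical EOS id, for illustration only
sample1 = [10] * 346 + [EOS]   # 347 tokens
sample2 = [11] * 890 + [EOS]   # 891 tokens
sample3 = [12] * 900 + [EOS]   # 901 tokens

blocks = list(pack_blocks([sample1, sample2, sample3]))
print(len(blocks), [len(b) for b in blocks])  # 2 [1024, 1024]
```

Block 1 holds all of sample 1 plus the first 677 tokens of sample 2; block 2 starts with the remaining 214 tokens of sample 2.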

### Validation Split

A held-out validation set of 2,000 samples was used for evaluation, drawn from the streaming dataset via `.take(2000)` before training data was streamed.
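
The take-first semantics can be illustrated with a plain generator standing in for the streaming dataset. `itertools.islice` plays the role of `.take(2000)`; the matching skip on the training side is an assumption made here to show how a disjoint split would be obtained, since the text above only states the `.take(2000)` call.

```python
from itertools import islice

def stream():
    """Stand-in for a streaming dataset: yields samples lazily."""
    for i in range(10_000):
        yield {"id": i}

VAL_SIZE = 2_000
val = list(islice(stream(), VAL_SIZE))     # analogous to .take(2000)
train = islice(stream(), VAL_SIZE, None)   # analogous to .skip(2000), assumed

val_ids = {sample["id"] for sample in val}
first_train = next(train)
print(len(val), first_train["id"], first_train["id"] in val_ids)  # 2000 2000 False
```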

### Data Collation

The packed collator pads batches to the longest sequence in the batch (rounded up to the nearest multiple of 8 for hardware alignment):
- `input_ids`: padded with `pad_token_id`
- `labels`: padded with `-100` (ignored in loss computation)
- `attention_mask`: padded with `0`
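
A minimal list-based sketch of this padding rule follows. The function name `pad_batch` and the `pad_token_id=0` value are hypothetical, and the sketch uses `input_ids` as labels without handling labels that are already masked.

```python
from typing import Dict, List

def pad_batch(batch: List[List[int]], pad_token_id: int,
              multiple: int = 8) -> Dict[str, List[List[int]]]:
    """Pad every sequence to the longest length, rounded up to a multiple of 8."""
    longest = max(len(seq) for seq in batch)
    target = -(-longest // multiple) * multiple  # ceiling to the next multiple
    out: Dict[str, List[List[int]]] = {"input_ids": [], "labels": [], "attention_mask": []}
    for seq in batch:
        pad = target - len(seq)
        out["input_ids"].append(seq + [pad_token_id] * pad)
        out["labels"].append(seq + [-100] * pad)            # ignored by the loss
        out["attention_mask"].append([1] * len(seq) + [0] * pad)
    return out

batch = pad_batch([[5, 6, 7], list(range(8, 18))], pad_token_id=0)
print(len(batch["input_ids"][0]), batch["labels"][0][3:6], sum(batch["attention_mask"][1]))
# 16 [-100, -100, -100] 10
```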

---

## Weight Initialization

Linear and embedding weights are initialized from a **normal distribution** with `std=0.02`, the same scheme used in GPT-2 and many later LLMs. The code uses `normal_`, so the distribution is not truncated:

```python
import torch.nn as nn

def initialize_weights(model, std=0.02):
    for module in model.modules():
        name = type(module).__name__.lower()
        # nn.Embedding and Llama-style RMSNorm have no `bias` attribute,
        # so look it up defensively instead of accessing it directly.
        bias = getattr(module, "bias", None)
        if isinstance(module, (nn.Linear, nn.Embedding)):
            module.weight.data.normal_(mean=0.0, std=std)
            if bias is not None:
                bias.data.zero_()
        elif "layernorm" in name or "rmsnorm" in name:
            if getattr(module, "weight", None) is not None:
                module.weight.data.fill_(1.0)   # scale initialized to 1
            if bias is not None:
                bias.data.zero_()
```

**Key points:**
- Linear layers: normal(0, 0.02)
- Embeddings: normal(0, 0.02) — same as linear
- RMSNorm scale weights: initialized to 1.0 (identity transform at start)
- All biases: zero

This initialization is applied **before** the T4 recipe is applied. The T4 recipe then copies `nn.Linear.weight` into `Int8LinearT4.weight_master` as FP32, preserving the initialization.

---

## Evaluation & Results

### Training Curves

The charts below show validation loss and perplexity over the course of the training run. Both are plotted against optimizer steps. The best checkpoint (step 11,625) is visible as the lowest point before the slight uptick in the tail phase.

![Validation loss over training steps](loss_chart.png)

![Perplexity over training steps](perplexity_chart.png)

### Metrics

- **Validation Loss:** Cross-entropy loss over the held-out validation split (lower = better)
- **Perplexity (PPL):** `exp(loss)` — lower means the model is less "surprised" by unseen text

### Results Summary

| Checkpoint | Step | Eval Loss | Perplexity |
|---|---|---|---|
| Initial | 375 | 7.1108 | ~1,228 |
| Early | 1,500 | 5.4646 | ~236 |
| Mid | 3,375 | 4.6069 | ~100 |
| Mid-Late | 6,750 | 4.1789 | ~65 |
| Late | 9,375 | 4.0686 | ~58 |
| **Best Checkpoint** | **11,625** | **3.9145** | **~50.1** |
| Final Epoch | 14,649 | 4.0083 | 55.05 |
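
The perplexity column follows from the loss column through `exp(loss)`. The short check below recomputes it for three rows of the table; the dictionary name is hypothetical and the loss values are copied from the table.

```python
import math

# Step -> eval loss, copied from the results table.
eval_loss_by_step = {1_500: 5.4646, 6_750: 4.1789, 14_649: 4.0083}

perplexity_by_step = {step: math.exp(loss) for step, loss in eval_loss_by_step.items()}
for step, ppl in perplexity_by_step.items():
    print(step, round(ppl, 2))
```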

### Comparison to Stentor v1

| Model | Best Eval Loss | Best Perplexity | Improvement |
|---|---|---|---|
| Stentor-12M (v1) | 4.4887 | 89.01 | — |
| Stentor2-12M-Preview | 3.9145 | ~50.1 | **↓43.8% perplexity** |

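The ~43.8% perplexity reduction is not a controlled comparison. v1 was trained on a mix of FineWeb-Edu and Cosmopedia v2, while Stentor2 was trained on FineWeb-Edu only, and the vocabulary size (32,768 → 8,064), architecture configuration, and token budget (200M → 240M) all differ. Because perplexity is measured per token, the change of tokenizer alone shifts the number: a vocabulary that splits text into more, shorter tokens tends to report lower per-token perplexity on the same text. Treat the figure as a rough indicator of progress at a similar parameter count, not as a like-for-like measurement.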

---

## Training Dynamics

The training run proceeded for a single epoch over 14,649 optimizer steps, consuming exactly 240,001,024 tokens (budget-limited). Several observations from the training curve are worth noting for researchers:

**Early Phase (steps 0–2,250):** Loss drops rapidly from ~8.36 → ~4.97. The model quickly learns basic token co-occurrence statistics. Best eval checkpoints update frequently (steps 375, 750, 1125, 1500, 1875, 2250).

**Middle Phase (steps 2,250–8,625):** Loss continues declining but with more noise. Individual batch losses oscillate significantly (3.7–5.5 range) while eval loss steadily improves. This is characteristic of a model encountering varied document types in a shuffled stream.

**Late Phase (steps 8,625–11,625):** Eval loss reaches its lowest point at step 11,625 (3.9145). The model's best checkpoint is saved here.

**Tail Phase (steps 11,625–14,649):** Eval loss increases slightly to 4.0083 at the final epoch eval. This is consistent with cosine schedule tail behavior — the learning rate approaches zero and the model may slightly overfit to recent batches or experience minor distribution drift near the end of the dataset.

---

## Use Cases & Intended Uses

> 🔬 **Reminder:** This is a **research artifact**. It is a base language model with no safety tuning, no instruction following, and no factual grounding. Every intended use below assumes a researcher or developer context, not an end user.

### Intended Uses

| Use Case | Suitability | Notes |
|---|---|---|
| Studying transformer training dynamics | ✅ High | Small enough to train/fine-tune on free compute |
| Tokenization efficiency research | ✅ High | 8K vs 32K vocab tradeoff is directly observable |
| Speculative decoding experiments | ✅ High | Fast enough to serve as a draft model |
| Benchmarking CPU/edge inference latency | ✅ High | ~25 MB in FP16 (~12 MB in INT8), runs on most hardware |
| Testing quantization/conversion pipelines | ✅ High | GGUF, ONNX, INT8 pipeline validation |
| Teaching material for LLM courses | ✅ High | Architecture is simple enough to trace by hand |
| LoRA / QLoRA fine-tuning experiments | ✅ Moderate | Base model only; start from scratch for any task |
| Text continuation / creative prompting | ✅ Moderate | Works best on short completions ≤60 tokens |
| Domain-specific fine-tuning research | ✅ Moderate | Small enough to iterate rapidly |
| Factual Q&A | ❌ Not suitable | Model has no reliable world knowledge |
| Production deployment | ❌ Not suitable | No safety tuning; preview quality only |
| Non-English text | ❌ Not suitable | TokenMonster vocab is English-only |
| Long-document tasks (>512 tokens of coherent output) | ❌ Not suitable | Coherence degrades quickly |

---

## Out-of-Scope Uses

The following uses are explicitly out of scope and should not be attempted:

- **User-facing applications of any kind** — This model has no safety filtering, no alignment, and no factual reliability. Deploying it in a context where a real user receives its output without expert review is inappropriate regardless of the domain.
- **Medical, legal, or financial advice** — Even if prompted carefully, 12M parameters cannot store or reason over specialized knowledge reliably. All outputs should be treated as potentially wrong.
- **Generating content about real people** — The model has no awareness of who real people are or what they have said/done. Outputs mentioning real people are likely to be fabricated.
- **Automated content pipelines** — Do not use this model to generate content at scale without human review. The output quality and coherence are not sufficient for unreviewed publication.
- **Non-English use** — The 8,064-token TokenMonster vocabulary is built exclusively for English. Prompts in other languages will be tokenized very poorly and outputs will be unreliable.
- **Instruction following** — This is a base model. It does not reliably follow instructions, answer questions, or complete structured tasks. Prompting it as if it were a chat assistant will not work.

---

## Ethical Considerations & Societal Impact

### Inherited Data Biases

Stentor2-12M-Preview was trained on FineWeb-Edu, a filtered subset of Common Crawl. Despite quality filtering, this data inherits the biases present in English-language web text:

- **Western-centric perspective** — Educational content on the web skews heavily toward Western, primarily American and European, viewpoints and examples.
- **English monolingualism** — The training data and vocabulary are both English-only. The model has no meaningful capability in other languages.
- **Demographic underrepresentation** — Groups that are underrepresented in English-language educational web content will be underrepresented in the model's outputs.
- **Temporal cutoff** — FineWeb-Edu's data has a cutoff; the model has no knowledge of recent events.

### No Safety Tuning

This model has received **no safety training of any kind** — no RLHF, no DPO, no constitutional AI, no content filtering. It is a raw base model that predicts the next token based on statistical patterns. It should not be used in any context where harmful outputs would cause real-world harm.

### Positive Societal Aspects

- **Democratizing AI research** — Trained entirely on free-tier Kaggle compute, this model demonstrates that meaningful LLM research does not require significant financial resources. Students and independent researchers can reproduce, study, and build on this work.
- **Transparency** — Full training hyperparameters, architecture details, and training script are published. This is a contribution to reproducible ML research.
- **Minimal environmental footprint** — ~4.4 hours of single-GPU compute. Estimated carbon footprint under 0.5 kg CO₂e.

### Responsible Use Reminder

If you use this model in research, please document clearly that it is an unaligned base model and include appropriate caveats when reporting results. Do not present outputs from this model as factual without verification.

---

## Inference Guide

> ⚠️ **All examples below use the custom loader.** See the [Known Loading Issue](#known-loading-issue--please-read) section for why `AutoModelForCausalLM.from_pretrained()` cannot be used directly. Use either Option A (call from repo) or Option B (local file) from the Quick Start section to get `model` and `tokenizer`, then the code below works identically either way.

### Basic Generation

```python
# Load using Option A or B from Quick Start first, then:
import torch

device = next(model.parameters()).device

def generate(prompt, max_new_tokens=50, temperature=0.9, top_p=0.65):
    input_ids      = torch.tensor([tokenizer.encode(prompt)], dtype=torch.long).to(device)
    attention_mask = torch.ones_like(input_ids)
    with torch.inference_mode():
        output = model.generate(
            input_ids,
            attention_mask=attention_mask,
            max_new_tokens=max_new_tokens,
            do_sample=True,
            temperature=temperature,
            top_p=top_p,
            repetition_penalty=1.15,
            pad_token_id=tokenizer.pad_token_id,
        )
    new_ids = output[0][input_ids.shape[1]:].tolist()
    return tokenizer.decode(new_ids).strip()

print(generate("The history of computing began"))
```

### CPU (FP32)

```python
# Use ONE of the two calls, matching how the loader was imported in Quick Start
model, tokenizer = mod.load_stentor2(dtype=torch.float32)     # Option A
# model, tokenizer = load_stentor2(dtype=torch.float32)       # Option B
model = model.to("cpu")
```

### GPU (FP16)

```python
# Use ONE of the two calls, matching how the loader was imported in Quick Start
model, tokenizer = mod.load_stentor2(dtype=torch.float16)     # Option A
# model, tokenizer = load_stentor2(dtype=torch.float16)       # Option B
model = model.to("cuda")
```

### From a Local Checkpoint

```python
# Use ONE of the two calls, matching how the loader was imported in Quick Start
model, tokenizer = mod.load_stentor2("./path/to/local/checkpoint")     # Option A
# model, tokenizer = load_stentor2("./path/to/local/checkpoint")       # Option B
```

---

## Real Model Responses

These are actual unedited outputs from the model. All examples use the custom loader described above.

---

**Prompt:** `Some sicknesses are`
**Settings:** max_new_tokens=50, temperature=0.7, top_p=0.65
**Output:**
> often associated with high blood pressure. The cause of depression is associated with a decrease in blood pressure, and may increase infections such as atrophy. The symptoms may also include: - The symptom

*(Stopped at the 50-token limit, not because the model ran out of ideas)*
**Stats:** 50 tokens · 1.06s · 47.2 t/s

---

**Prompt:** `In the early 20th century`
**Settings:** max_new_tokens=45, temperature=0.85, top_p=0.75
**Output:**
> , the Middle Ages had become popularized by many, thought to be the most prominent and most popular world. In the midst of the 20th century, a study of the Western Pyrami

*(Cut off by the token limit)*
**Stats:** 43 tokens · 0.91s · 47.4 t/s

---

**Prompt:** `In Egypt there were massive sand cones called`
**Settings:** max_new_tokens=10, temperature=0.65, top_p=0.6
**Output:**
> Pyramids (which

*(Cut off at 10 tokens — the model correctly identified Pyramids immediately)*
**Stats:** 10 tokens · 0.14s · 71.4 t/s

---

**Key observations from testing:**
- The model responds best to prompts that are the **beginning of a sentence or paragraph** — it is a text *continuer*, not a question *answerer*. Give it a strong opening and it will follow the pattern.
- Speed on CPU is approximately **47–71 t/s** depending on prompt length and hardware.
- Keeping `max_new_tokens` at 60 or below produces noticeably more coherent completions.
- The TokenMonster tokenizer is less efficient per word than the 32K BPE vocabulary used in v1 — this is expected given the smaller vocab size and is the direct cost of the ~43.8% perplexity improvement.

---

## Quantization

> ⚠️ **Critical note for this preview:** `AutoModelForCausalLM.from_pretrained()` with `BitsAndBytesConfig` does **not** work for this checkpoint due to the `weight_master` key issue described in the [Known Loading Issue](#known-loading-issue--please-read) section. You must load with the custom loader first, then apply quantization afterward. The standard `from_pretrained()` + `BitsAndBytesConfig` pattern will work normally in the final Stentor2-12M release.

Despite the model already being small (~49 MB in FP32, ~25 MB in FP16), quantization can further reduce memory for extremely constrained environments.
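
These sizes follow directly from the parameter count. The estimate below covers weights only and ignores quantization scales, buffers, and runtime overhead; the helper name `weight_megabytes` is hypothetical.

```python
PARAMS = 12.3e6  # parameter count stated for this model

BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1}

def weight_megabytes(dtype: str, params: float = PARAMS) -> float:
    """Approximate weight storage in MB (1 MB = 1e6 bytes)."""
    return params * BYTES_PER_PARAM[dtype] / 1e6

for dtype in BYTES_PER_PARAM:
    print(f"{dtype}: ~{weight_megabytes(dtype):.1f} MB")
```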

### FP16 — Recommended First Step

For GPU deployment, loading in FP16 halves memory to ~25 MB and is the simplest effective "quantization":

```python
model, tokenizer = mod.load_stentor2(dtype=torch.float16)  # Option A
model = model.to("cuda")
```

### Dynamic INT8 Quantization (CPU, PyTorch native — no extra install)

For CPU deployment, PyTorch's built-in dynamic quantization works after loading with the custom loader and requires no additional packages:

```python
import torch
from huggingface_hub import hf_hub_download
import importlib.util, sys

# Step 1: Load with custom loader
path = hf_hub_download(repo_id="StentorLabs/Stentor2-12M-Preview", filename="load_stentor2.py")
spec = importlib.util.spec_from_file_location("load_stentor2", path)
mod  = importlib.util.module_from_spec(spec)
sys.modules["load_stentor2"] = mod
spec.loader.exec_module(mod)

model, tokenizer = mod.load_stentor2(dtype=torch.float32)
model = model.to("cpu").eval()

# Step 2: Apply dynamic INT8 quantization (CPU only)
model_int8 = torch.quantization.quantize_dynamic(
    model,
    {torch.nn.Linear},
    dtype=torch.qint8,
)
# Approximate memory: ~12 MB — 75% reduction from FP32
# Note: dynamic quantization only affects inference; model stays on CPU
```

### Manual 8-bit via bitsandbytes (GPU)

For GPU deployment with bitsandbytes INT8, apply the conversion after loading:

```python
import torch
import bitsandbytes as bnb
from huggingface_hub import hf_hub_download
import importlib.util, sys

# Step 1: Load with custom loader
path = hf_hub_download(repo_id="StentorLabs/Stentor2-12M-Preview", filename="load_stentor2.py")
spec = importlib.util.spec_from_file_location("load_stentor2", path)
mod  = importlib.util.module_from_spec(spec)
sys.modules["load_stentor2"] = mod
spec.loader.exec_module(mod)

model, tokenizer = mod.load_stentor2(dtype=torch.float16)
model = model.to("cuda").eval()

# Step 2: Replace linear layers with INT8 equivalents
def replace_with_bnb_int8(module):
    for name, child in list(module.named_children()):
        if isinstance(child, torch.nn.Linear):
            new_layer = bnb.nn.Linear8bitLt(
                child.in_features,
                child.out_features,
                bias=child.bias is not None,
                has_fp16_weights=False,
                threshold=6.0,
            )
            new_layer.weight = bnb.nn.Int8Params(
                child.weight.data.cpu(),
                requires_grad=False,
            )
            if child.bias is not None:
                new_layer.bias = torch.nn.Parameter(child.bias.data)
            setattr(module, name, new_layer)
        else:
            replace_with_bnb_int8(child)

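replace_with_bnb_int8(model)
# The new layers hold CPU weights; moving them to the GPU is what triggers
# the actual INT8 quantization in bitsandbytes.
model = model.to("cuda")
# Approximate memory: ~12 MB (75% reduction from FP32 ~49 MB)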
```

Requires: `pip install bitsandbytes`

> **Practical note:** Given that FP16 is already only ~25 MB and the model runs at 47–71 t/s on CPU, aggressive quantization may not be necessary for most use cases. Dynamic INT8 is most useful in very memory-constrained CPU environments.

---

## Format Conversion

### Convert to GGUF (for llama.cpp)

```bash
# Clone and build llama.cpp
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
pip install -r requirements.txt
cmake -B build && cmake --build build --config Release

# Download model
huggingface-cli download StentorLabs/Stentor2-12M-Preview --local-dir stentor2-12m-preview

# Convert to GGUF (FP16)
python convert_hf_to_gguf.py stentor2-12m-preview/ \
  --outfile stentor2-12m-preview.gguf \
  --outtype f16

# Quantize to Q4_0 (optional, smallest file)
./build/bin/llama-quantize stentor2-12m-preview.gguf stentor2-12m-preview-q4_0.gguf q4_0

# Run
./build/bin/llama-cli -m stentor2-12m-preview-q4_0.gguf -p "The science of" -n 50
```

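> **Note on GGUF + TokenMonster:** The custom TokenMonster tokenizer may require manual vocabulary mapping when using llama.cpp. The standard `convert_hf_to_gguf.py` script expects a HuggingFace tokenizer format, so you may need to convert the vocabulary to a compatible format first. The script also reads the checkpoint's tensor names directly, so the `weight_master` keys described in the [Known Loading Issue](#known-loading-issue--please-read) section may need to be re-saved under standard names before conversion.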

### Convert to ONNX

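```bash
pip install "optimum[exporters]"   # quotes stop shells such as zsh from expanding the brackets

# Note: the exporter loads the checkpoint through the standard from_pretrained()
# path, so it is subject to the weight_master loading issue described earlier.
# Re-saving the model through the custom loader first may be necessary.
optimum-cli export onnx \
  --model StentorLabs/Stentor2-12M-Preview \
  --task text-generation-with-past \
  stentor2-12m-onnx/
```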

```python
from optimum.onnxruntime import ORTModelForCausalLM
from transformers import AutoTokenizer

model = ORTModelForCausalLM.from_pretrained("stentor2-12m-onnx")
tokenizer = AutoTokenizer.from_pretrained("StentorLabs/Stentor2-12M-Preview")

inputs = tokenizer("Hello world", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0]))
```

---

## Speculative Decoding

Stentor2-12M-Preview can serve as a fast **draft model** to accelerate inference from larger Llama-family target models.

```python
from huggingface_hub import hf_hub_download
import importlib.util, sys, torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load Stentor2 as draft model using the custom loader
path = hf_hub_download(repo_id="StentorLabs/Stentor2-12M-Preview", filename="load_stentor2.py")
spec = importlib.util.spec_from_file_location("load_stentor2", path)
mod  = importlib.util.module_from_spec(spec)
sys.modules["load_stentor2"] = mod
spec.loader.exec_module(mod)

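draft_model, draft_tokenizer = mod.load_stentor2(dtype=torch.float16)
draft_model                  = draft_model.to("cuda")

# Load target model normally
target_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-1B",
    torch_dtype=torch.float16,
    device_map="auto"
)
target_tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B")

prompt  = "Explain the concept of recursion"
# Move the inputs to the same device as the target model
inputs  = target_tokenizer(prompt, return_tensors="pt").to(target_model.device)

# Draft and target use different vocabularies, so generate() must be given both
# tokenizers (universal assisted decoding, transformers >= 4.46). Whether the
# TokenMonster adapter exposes the full tokenizer interface this path expects
# has not been verified here.
outputs = target_model.generate(
    **inputs,
    assistant_model=draft_model,
    tokenizer=target_tokenizer,
    assistant_tokenizer=draft_tokenizer,
    do_sample=True,
    max_new_tokens=100
)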

print(target_tokenizer.decode(outputs[0], skip_special_tokens=True))
```

**Important caveat:** Stentor2 uses a different vocabulary (8,064-token TokenMonster) than Llama-family targets (32,000-token SentencePiece BPE for Llama 2; roughly 128K tokens for Llama 3.x, including the Llama-3.2-1B target above). Classic speculative decoding compares draft and target probabilities token by token and therefore assumes a shared vocabulary. With mismatched vocabularies, drafted text has to be re-tokenized for the target, which adds overhead and tends to lower the acceptance rate. In practice, speedups depend heavily on how similar the generated text distribution is between draft and target.

For best results with speculative decoding, a vocabulary-matched draft model is preferable. Stentor v1 (with its 32,768-token Mistral vocabulary) may give better acceptance rates for targets that share that vocabulary, despite its higher perplexity, but it is not vocabulary-matched to Llama 3.x targets either.
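
To make the role of distribution similarity concrete: in standard speculative sampling with a shared vocabulary, a drafted token is accepted with probability `min(1, p/q)`, where `p` is the target probability and `q` the draft probability, so the chance that a drafted token survives is the sum of `min(p, q)` over the vocabulary. The sketch below evaluates that quantity for toy distributions; the distributions and the function name `expected_acceptance` are invented for illustration and do not describe measured behaviour of this model.

```python
from typing import Dict

def expected_acceptance(target: Dict[str, float], draft: Dict[str, float]) -> float:
    """Probability that one drafted token is accepted: sum over tokens of min(p, q)."""
    vocab = set(target) | set(draft)
    return sum(min(target.get(tok, 0.0), draft.get(tok, 0.0)) for tok in vocab)

# Toy next-token distributions over a three-token vocabulary.
target      = {"the": 0.7, "a": 0.2, "an": 0.1}
close_draft = {"the": 0.6, "a": 0.3, "an": 0.1}
poor_draft  = {"the": 0.1, "a": 0.1, "an": 0.8}

print(round(expected_acceptance(target, close_draft), 2))  # 0.9
print(round(expected_acceptance(target, poor_draft), 2))   # 0.3
```

A draft whose distribution tracks the target closely keeps most of its proposals; a poorly matched draft wastes most of the work spent generating them.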

---

## Bias, Risks & Limitations

### Known Limitations

The following limitations were observed and confirmed through hands-on testing:

- **Prompt Relevance:** Outputs are frequently off-topic for complex prompts. The model is pattern-completing, not comprehending.
- **Factual Accuracy:** All factual claims from this model should be treated as unreliable. 12M parameters cannot store meaningful world knowledge.
- **Context Boundary:** Hard limit of 1,024 tokens. Sequences approaching this limit may degrade in coherence.
- **Short Output Window for Coherence:** Even within the 1,024-token context limit, outputs beyond ~60 tokens tend to wander off-topic or become repetitive. Keeping `max_new_tokens` at 60 or below is strongly recommended.
- **English Bias:** The TokenMonster English vocabulary is optimized for English. Other languages will tokenize to many rare/unknown tokens and likely produce poor output.
- **Training Data Bias:** Inherits biases present in FineWeb-Edu filtered web data — primarily English-language, Western-centric educational content.
- **Hallucination:** Like all LLMs, this model may confidently produce plausible-sounding but entirely fabricated content.
- **No Alignment:** No RLHF, no DPO, no constitutional training. Raw base model behavior.
- **Preview Status:** This is not the final Stentor2 architecture. Known improvements are pending.
- **Tokenizer Efficiency:** The 8K TokenMonster vocabulary produces more tokens per word than standard 32K BPE vocabularies. This is expected given the architecture tradeoff and is not a bug.

### Shared Tensor Warning

When saving or reloading this model, you will see:

```
Removed shared tensor {'lm_head.weight'} while saving.
```

This is expected. The model uses `tie_word_embeddings=True`, meaning the input embedding (`model.embed_tokens.weight`) and the output projection (`lm_head.weight`) point to the same tensor. The safetensors format removes the duplicate during serialization and reconstructs it on load. This is safe and produces no accuracy difference.
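
Weight tying can be illustrated with two toy modules. The hidden width below is a placeholder and not the model's real dimension; the check simply confirms that both modules share a single underlying tensor.

```python
import torch.nn as nn

VOCAB, HIDDEN = 8064, 32  # HIDDEN is a placeholder value for illustration

embed = nn.Embedding(VOCAB, HIDDEN)
lm_head = nn.Linear(HIDDEN, VOCAB, bias=False)
lm_head.weight = embed.weight  # tie: both modules now share one Parameter

shared = lm_head.weight.data_ptr() == embed.weight.data_ptr()
print(shared, tuple(lm_head.weight.shape))
```

Because only one copy of the tensor exists, a serializer can store it once and restore the second reference on load.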

> This is a separate and unrelated issue from the `weight_master` loading problem. See the [Known Loading Issue](#known-loading-issue--please-read) section for that.

---

## What's Next

This is a preview. The training run for Stentor2-12M-Preview revealed several clear paths to further improvement that have not yet been implemented. Those improvements are the focus of the next training run, and when that model is ready, it will be released as **Stentor2-12M**.

If you find bugs, unexpected behavior, or have benchmarks or use cases worth sharing, please open a discussion on the model repository — community input before the final release is welcome.

> 🚫 **There will be no Stentor2-30M-Preview.** This preview exists to share the architectural direction of the Stentor2 family, not to establish a preview release cadence for every size. The next public drop from StentorLabs will be the finished Stentor2 model.

---

## Environmental Impact

| Factor | Value |
|---|---|
| Hardware | 2× NVIDIA Tesla T4 (1 active) |
| Active Training Duration | ~4.37 hours |
| Cloud Provider | Kaggle (free tier) |
| Compute Region | Western USA |
| Estimated Carbon | Minimal (< 0.5 kg CO₂e estimated) |

Training on free-tier cloud compute demonstrates that meaningful SLM research is accessible to independent researchers and students without significant hardware investment or carbon cost.

---

## Citation

If you use this model in research or a project, please cite it as follows. Note that this is a HuggingFace model card, not an arXiv paper, so there is no arXiv ID — the `howpublished` URL is the canonical reference.

```bibtex
@misc{izumoto2026stentor2_12m_preview,
  title        = {Stentor2-12M-Preview},
  author       = {Kai Izumoto},
  year         = {2026},
  publisher    = {StentorLabs},
  howpublished = {\url{https://huggingface.co/StentorLabs/Stentor2-12M-Preview}},
  note         = {Preview checkpoint of the Stentor2 model family.
                  12.3M parameter LlamaForCausalLM base model trained on
                  FineWeb-Edu with a TokenMonster 8K vocabulary.
                  Apache 2.0 license.}
}
```

---

## Related Work

This section compares Stentor2-12M-Preview to other publicly available models in the sub-50M parameter range, and to relevant research that informed design decisions.

### Comparable Sub-50M Models

| Model | Parameters | Perplexity | Vocab | Training Data | Notes |
|---|---|---|---|---|---|
| **Stentor2-12M-Preview** (this model) | 12.3M | ~50.1 (FineWeb-Edu val) | 8,064 | FineWeb-Edu 240M tokens | Base model, TokenMonster vocab |
| Stentor-12M (v1) | 12.0M | 89.01 (FineWeb-Edu val) | 32,768 | FineWeb-Edu + Cosmopedia 200M | Baseline this model improves on |
| Stentor-30M (v1) | 30.4M | 33.02 (FineWeb-Edu val) | 32,768 | FineWeb-Edu + Cosmopedia 600M | Larger v1 model |
| TinyStories-33M | ~33M | Varies by eval set | ~50K | TinyStories (synthetic) | Eldan & Li, 2023; focused on story generation |
| TinyStories-1M | ~1M | Very high | ~50K | TinyStories (synthetic) | Demonstrates story generation at 1M parameters |
| Pythia-14M | 14M | Varies by eval set (Pile) | 50,254 | The Pile 300B tokens | EleutherAI; well-studied scaling baseline |
| Pythia-70M | 70M | Varies by eval set (Pile) | 50,254 | The Pile 300B tokens | Next Pythia size up from Pythia-14M |
| BabyLlama | 58M | Varies by eval set | ~16K | BabyLM 10M-word corpus | BabyLM Challenge submission |

> **Comparison caveats:** Perplexity numbers are not directly comparable across models — different validation sets, vocabularies, and tokenizers all affect the number. The table is a rough orientation, not a rigorous benchmark. Stentor2's perplexity is measured on the FineWeb-Edu validation split using its own 8K TokenMonster tokenizer.

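One common way to put models with different tokenizers on a more even footing is to normalize loss by the byte length of the evaluated text instead of by token count (bits per byte). The sketch below shows the conversion. Only the 4.0083 final-epoch loss comes from this card; the token and byte counts are hypothetical placeholders, since bytes-per-token on the validation split is not reported here.

```python
import math


def perplexity(loss_nats_per_token: float) -> float:
    """Per-token perplexity from mean cross-entropy loss in nats."""
    return math.exp(loss_nats_per_token)


def bits_per_byte(loss_nats_per_token: float, n_tokens: int, n_bytes: int) -> float:
    """Tokenizer-independent loss: total bits divided by UTF-8 bytes of the text."""
    total_bits = loss_nats_per_token * n_tokens / math.log(2)
    return total_bits / n_bytes


FINAL_EPOCH_LOSS = 4.0083        # final-epoch validation loss reported in this card
HYPOTHETICAL_TOKENS = 1_000_000  # assumption: tokens in an evaluation sample
HYPOTHETICAL_BYTES = 4_000_000   # assumption: UTF-8 bytes in the same sample

print(f"perplexity: {perplexity(FINAL_EPOCH_LOSS):.2f}")
bpb = bits_per_byte(FINAL_EPOCH_LOSS, HYPOTHETICAL_TOKENS, HYPOTHETICAL_BYTES)
print(f"bits per byte: {bpb:.3f}")
```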
**Key differentiators of Stentor2 vs. comparable models:**
- **Vocabulary efficiency focus:** The deliberate reduction to 8K tokens, which frees parameter budget for non-embedding layers, is a design choice not seen in most small models.
- **T4-specific training recipe:** The combination of INT8 QAT, FP32 critical layers, and FP32 norms is a stability recipe designed specifically for training on low-cost T4 GPUs.
- **Educational data:** Unlike TinyStories models (trained on synthetic children's stories) or Pythia (trained on the general-domain Pile), Stentor2 is trained on quality-filtered educational web text.

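The vocabulary-size point above can be illustrated with a quick parameter count. This is a sketch under stated assumptions: the hidden size of 256 is a hypothetical value chosen for illustration, not taken from this section, and the input and output embeddings are assumed to be tied, so the embedding matrix is counted once.

```python
def embedding_params(vocab_size: int, hidden_size: int, tied: bool = True) -> int:
    """Parameters in the token embedding (and output head, if untied)."""
    matrices = 1 if tied else 2
    return vocab_size * hidden_size * matrices


ASSUMED_HIDDEN_SIZE = 256  # assumption: hypothetical hidden size for illustration

for label, vocab_size in [("8,064-token vocab", 8_064), ("32,768-token vocab", 32_768)]:
    n_params = embedding_params(vocab_size, ASSUMED_HIDDEN_SIZE)
    print(f"{label}: {n_params / 1e6:.2f}M embedding parameters")
```

At a fixed total budget of roughly 12M parameters, every parameter not spent on the embedding table is available to the attention and feed-forward layers.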
### Related Research Papers

| Paper | Relevance |
|---|---|
| [TinyStories](https://arxiv.org/abs/2305.07759) — Eldan & Li, 2023 | Demonstrates meaningful language generation from 1M–33M parameter models; closest comparator in scale |
| [Pythia](https://arxiv.org/abs/2304.01373) — Biderman et al., 2023 | Systematic study of small model scaling; Pythia-14M is a well-documented baseline |
| [Scaling Laws](https://arxiv.org/abs/2001.08361) — Kaplan et al., 2020 | Foundational work on compute-optimal training; informs token budget decisions |
| [Chinchilla](https://arxiv.org/abs/2203.15556) — Hoffmann et al., 2022 | Revised scaling laws; 240M tokens for 12M params is approximately compute-optimal under this analysis |
| [Model Cards](https://arxiv.org/abs/1810.03993) — Mitchell et al., 2018 | Methodology underlying this model card |
| [RoPE](https://arxiv.org/abs/2104.09864) — Su et al., 2021 | Positional encoding used in this model |
| [Speculative Decoding](https://arxiv.org/abs/2211.17192) — Leviathan et al., 2023 | Primary use case for a fast draft model like Stentor2 |
| [T5](https://arxiv.org/abs/1910.10683) — Raffel et al., 2020 | Source of NFKC text normalization approach used in data pipeline |

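The Chinchilla row can be checked with simple arithmetic. A widely quoted heuristic drawn from Hoffmann et al. is roughly 20 training tokens per parameter. The snippet below compares this model's token budget against that heuristic; the value 20 is an approximation of their result, not an exact constant, and the total parameter count reported in this card is used as-is.

```python
CHINCHILLA_TOKENS_PER_PARAM = 20.0  # approximate rule of thumb, not an exact constant


def tokens_per_parameter(n_tokens: float, n_params: float) -> float:
    """Training tokens seen per model parameter."""
    return n_tokens / n_params


TRAINING_TOKENS = 240e6  # FineWeb-Edu tokens, from the comparison table
MODEL_PARAMS = 12.3e6    # total parameters, from the comparison table

ratio = tokens_per_parameter(TRAINING_TOKENS, MODEL_PARAMS)
print(f"tokens per parameter: {ratio:.1f}")
print(f"fraction of the ~20x heuristic: {ratio / CHINCHILLA_TOKENS_PER_PARAM:.2f}")
```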
---

## Related Resources

### StentorLabs Models
- [Stentor-30M](https://huggingface.co/StentorLabs/Stentor-30M) — Larger v1 base model
- [Stentor-12M](https://huggingface.co/StentorLabs/Stentor-12M) — v1 baseline this model improves upon
- [Stentor-30M-Instruct](https://huggingface.co/StentorLabs/Stentor-30M-Instruct) — Instruction-tuned v1 model
- [Stentor-12M-Instruct](https://huggingface.co/StentorLabs/Stentor-12M-Instruct) — Instruction-tuned v1 model
- [StentorLabs Collection](https://huggingface.co/StentorLabs) — All models from StentorLabs

### Referenced Tools & Datasets
- [FineWeb-Edu](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu) — Training data
- [TokenMonster](https://huggingface.co/alasdairforsythe/tokenmonster) — Tokenizer vocabulary
- [HuggingFace Accelerate](https://github.com/huggingface/accelerate) — Training framework
- [bitsandbytes](https://github.com/TimDettmers/bitsandbytes) — Quantization library
- [mradermacher GGUF quantizations of Stentor-30M](https://huggingface.co/mradermacher/Stentor-30M-GGUF) — Community quantizations of v1

---

## Model Card Contact

Questions, benchmarks, or feedback: [StentorLabs@gmail.com](mailto:StentorLabs@gmail.com) or open a [discussion](https://huggingface.co/StentorLabs/Stentor2-12M-Preview/discussions).

---

<p align="center">
  Made with ❤️ by <a href="https://huggingface.co/StentorLabs">StentorLabs</a><br>
  <i>Democratizing AI through accessible, efficient models</i>
</p>