]
```
Packing eliminates all padding waste and ensures every training token is a real content token. The remainder buffer carries leftover tokens between batch iterations. At the end of the dataset, any leftover tokens are padded to 1,024 with the EOS token and labels masked (`-100`) for the padded positions.
**Labels** for packed sequences are identical to `input_ids` (causal LM: predict each token from all preceding tokens). There is no special boundary masking between packed samples in this pipeline: the model learns to cross document boundaries, which is standard practice.
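The end-of-dataset behavior described above can be illustrated at toy scale. This is a minimal sketch, not the training script: the function name `pack_with_remainder`, the `eos_id` default, and the tiny `seq_len` in the usage note are assumptions made for illustration.

```python
SEQ_LEN = 1024       # packed sequence length quoted in the text
IGNORE_INDEX = -100  # label value ignored by the loss


def pack_with_remainder(token_stream, seq_len=SEQ_LEN, eos_id=0):
    """Yield (input_ids, labels) pairs of exactly seq_len tokens.

    token_stream is an iterable of token-id lists (one list per document).
    eos_id is a placeholder; the real EOS id depends on the tokenizer.
    """
    buffer = []  # remainder carried over between documents
    for tokens in token_stream:
        buffer.extend(tokens)
        while len(buffer) >= seq_len:
            chunk, buffer = buffer[:seq_len], buffer[seq_len:]
            # Full sequences: labels are a copy of input_ids
            yield chunk, list(chunk)
    if buffer:
        # Final partial sequence: pad inputs with EOS, mask padded labels
        pad = seq_len - len(buffer)
        yield buffer + [eos_id] * pad, buffer + [IGNORE_INDEX] * pad
```

With `seq_len=4` and the documents `[[1, 2, 3], [4, 5], [6]]`, the first yielded sequence is full and the second is the padded remainder, which makes the carry-over easy to inspect by hand.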
### Validation Split
A held-out validation set of 2,000 samples was used for evaluation, drawn from the streaming dataset via `.take(2000)` before training data was streamed.
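The idea of taking the first samples of a stream for validation can be sketched with the standard library. This is an illustrative stand-in using `itertools.islice`, not the `datasets` streaming API itself; the helper name `split_stream` is hypothetical, and the source does not state whether the training stream skipped the held-out samples, so the disjoint split shown here is an assumption.

```python
from itertools import islice


def split_stream(samples, n_val=2000):
    """Take the first n_val samples as validation; return the rest lazily.

    Because both parts are drawn from the same iterator, the remaining
    stream can never yield a sample that was already taken for validation.
    """
    it = iter(samples)
    val = list(islice(it, n_val))  # materialize the held-out set
    return val, it                 # 'it' continues after the held-out samples
```

Calling `split_stream(range(10), n_val=3)` returns the first three items as the validation list and an iterator over the remaining seven.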
### Data Collation
The packed collator pads batches to the longest sequence in the batch (rounded up to the nearest multiple of 8 for hardware alignment):
- `input_ids`: padded with `pad_token_id`
- `labels`: padded with `-100` (ignored in loss computation)
- `attention_mask`: padded with `0`
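The padding rule above can be sketched in plain Python. This is a minimal illustration rather than the collator used in training: the names `round_up` and `collate` and the `pad_token_id=0` default are assumptions, and a real collator would return tensors instead of lists.

```python
def round_up(n, multiple=8):
    """Round n up to the nearest multiple (8 for hardware alignment)."""
    return ((n + multiple - 1) // multiple) * multiple


def collate(batch, pad_token_id=0, multiple=8):
    """Pad every example to the longest length in the batch, rounded up."""
    target = round_up(max(len(ex["input_ids"]) for ex in batch), multiple)
    out = {"input_ids": [], "labels": [], "attention_mask": []}
    for ex in batch:
        n = len(ex["input_ids"])
        pad = target - n
        out["input_ids"].append(ex["input_ids"] + [pad_token_id] * pad)
        out["labels"].append(ex["labels"] + [-100] * pad)  # ignored by the loss
        out["attention_mask"].append([1] * n + [0] * pad)
    return out
```

For a batch with lengths 5 and 9, the target length is 16, since 9 rounds up to the next multiple of 8.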
---
## Weight Initialization
Weight matrices are initialized from a **normal distribution** with `std=0.02`, the same scale used in GPT-2 and many modern LLMs. The snippet below calls `normal_`, so no truncation is applied:
```python
import torch.nn as nn

def initialize_weights(model, std=0.02):
    for module in model.modules():
        if isinstance(module, (nn.Linear, nn.Embedding)):
            module.weight.data.normal_(mean=0.0, std=std)
            # nn.Embedding has no bias attribute, so look it up defensively
            if getattr(module, "bias", None) is not None:
                module.bias.data.zero_()
        elif "layernorm" in type(module).__name__.lower() or \
                "rmsnorm" in type(module).__name__.lower():
            if getattr(module, "weight", None) is not None:
                module.weight.data.fill_(1.0)  # scale initialized to 1
            if getattr(module, "bias", None) is not None:
                module.bias.data.zero_()
```
**Key points:**
- Linear layers: normal(0, 0.02)
- Embeddings: normal(0, 0.02), same as linear
- RMSNorm scale weights: initialized to 1.0 (identity transform at start)
- All biases: zero
This initialization is applied **before** the T4 recipe is applied. The T4 recipe then copies `nn.Linear.weight` into `Int8LinearT4.weight_master` as FP32, preserving the initialization.
---
## Evaluation & Results
### Training Curves
The charts below show validation loss and perplexity over the course of the training run. Both are plotted against optimizer steps. The best checkpoint (step 11,625) is visible as the lowest point before the slight uptick in the tail phase.


### Metrics
- **Validation Loss:** Cross-entropy loss over the held-out validation split (lower = better)
- **Perplexity (PPL):** `exp(loss)`; lower means the model is less "surprised" by unseen text
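The relationship between the two metrics is a one-line computation. The sketch below only restates the `exp(loss)` definition given above; the function name `perplexity` is chosen for illustration.

```python
import math


def perplexity(eval_loss):
    """Perplexity is the exponential of the mean cross-entropy loss (in nats)."""
    return math.exp(eval_loss)


# Eval losses quoted in this card for the best and final checkpoints
for label, loss in [("best", 3.9145), ("final", 4.0083)]:
    print(f"{label}: loss={loss:.4f} ppl={perplexity(loss):.2f}")
```

Because the mapping is exponential, small differences in loss translate into proportionally larger differences in perplexity.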
### Results Summary
| Checkpoint | Step | Eval Loss | Perplexity |
|---|---|---|---|
| Initial | 375 | 7.1108 | ~1,228 |
| Early | 1,500 | 5.4646 | ~236 |
| Mid | 3,375 | 4.6069 | ~100 |
| Mid-Late | 6,750 | 4.1789 | ~65 |
| Late | 9,375 | 4.0686 | ~58 |
| **Best Checkpoint** | **11,625** | **3.9145** | **~50.1** |
| Final Epoch | 14,649 | 4.0083 | 55.05 |
### Comparison to Stentor v1
| Model | Best Eval Loss | Best Perplexity | Improvement |
|---|---|---|---|
| Stentor-12M (v1) | 4.4887 | 89.01 | n/a |
| Stentor2-12M-Preview | 3.9145 | ~50.1 | **-43.8% perplexity** |
The ~43.8% perplexity reduction comes from a close but not fully controlled comparison. Both models are trained on educational-quality text at roughly 12M parameters, but v1 was trained on a mix of FineWeb-Edu and Cosmopedia v2 while Stentor2 was trained on FineWeb-Edu only, and the vocabulary size, architecture configuration, and token budget (200M vs. 240M) all differ. Because perplexity is measured per token, the change of tokenizer alone shifts the number, so treat the figure as indicative rather than exact.
---
## Training Dynamics
The training run proceeded for a single epoch over 14,649 optimizer steps, consuming exactly 240,001,024 tokens (budget-limited). Several observations from the training curve are worth noting for researchers:
**Early Phase (steps 0–2,250):** Loss drops rapidly from ~8.36 to ~4.97. The model quickly learns basic token co-occurrence statistics. Best eval checkpoints update frequently (steps 375, 750, 1125, 1500, 1875, 2250).
**Middle Phase (steps 2,250–8,625):** Loss continues declining but with more noise. Individual batch losses oscillate significantly (3.7–5.5 range) while eval loss steadily improves. This is characteristic of a model encountering varied document types in a shuffled stream.
**Late Phase (steps 8,625–11,625):** Eval loss reaches its lowest point at step 11,625 (3.9145). The model's best checkpoint is saved here.
**Tail Phase (steps 11,625–14,649):** Eval loss increases slightly to 4.0083 at the final epoch eval. This is consistent with cosine schedule tail behavior: the learning rate approaches zero and the model may slightly overfit to recent batches or experience minor distribution drift near the end of the dataset.
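To make the tail-phase remark concrete, here is a minimal sketch of a cosine learning-rate schedule. The formula is the standard cosine decay; the peak learning rate, warmup length, and minimum learning rate are hypothetical parameters, since this section does not state the values used in training.

```python
import math


def cosine_lr(step, total_steps, peak_lr, warmup_steps=0, min_lr=0.0):
    """Linear warmup (optional) followed by cosine decay to min_lr."""
    if warmup_steps > 0 and step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    # Fraction of the decay phase completed, clamped to [0, 1]
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, max(0.0, progress))
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# Fraction of the (hypothetical) peak LR remaining at the best checkpoint
print(cosine_lr(11_625, 14_649, peak_lr=1.0))
```

Under these assumptions the learning rate at step 11,625 is already a small fraction of its peak, which is in line with the observation that updates in the tail phase are small.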
---
## Use Cases & Intended Uses
> 🔬 **Reminder:** This is a **research artifact**. It is a base language model with no safety tuning, no instruction following, and no factual grounding. Every intended use below assumes a researcher or developer context, not an end user.
### Intended Uses
| Use Case | Suitability | Notes |
|---|---|---|
| Studying transformer training dynamics | High | Small enough to train/fine-tune on free compute |
| Tokenization efficiency research | High | 8K vs 32K vocab tradeoff is directly observable |
| Speculative decoding experiments | High | Fast enough to serve as a draft model |
| Benchmarking CPU/edge inference latency | High | ~25 MB in FP16, runs on almost any hardware |
| Testing quantization/conversion pipelines | High | GGUF, ONNX, INT8 pipeline validation |
| Teaching material for LLM courses | High | Architecture is simple enough to trace by hand |
| LoRA / QLoRA fine-tuning experiments | Moderate | Base model only; every task must be taught from scratch |
| Text continuation / creative prompting | Moderate | Works best on short completions of ≤60 tokens |
| Domain-specific fine-tuning research | Moderate | Small enough to iterate rapidly |
| Factual Q&A | Not suitable | Model has no reliable world knowledge |
| Production deployment | Not suitable | No safety tuning; preview quality only |
| Non-English text | Not suitable | TokenMonster vocab is English-only |
| Long-document tasks (>512 tokens of coherent output) | Not suitable | Coherence degrades quickly |
---
## Out-of-Scope Uses
The following uses are explicitly out of scope and should not be attempted:
- **User-facing applications of any kind:** This model has no safety filtering, no alignment, and no factual reliability. Deploying it in a context where a real user receives its output without expert review is inappropriate regardless of the domain.
- **Medical, legal, or financial advice:** Even if prompted carefully, 12M parameters cannot store or reason over specialized knowledge reliably. All outputs should be treated as potentially wrong.
- **Generating content about real people:** The model has no awareness of who real people are or what they have said/done. Outputs mentioning real people are likely to be fabricated.
- **Automated content pipelines:** Do not use this model to generate content at scale without human review. The output quality and coherence are not sufficient for unreviewed publication.
- **Non-English use:** The 8,064-token TokenMonster vocabulary is built exclusively for English. Prompts in other languages will be tokenized very poorly and outputs will be unreliable.
- **Instruction following:** This is a base model. It does not reliably follow instructions, answer questions, or complete structured tasks. Prompting it as if it were a chat assistant will not work.
---
## Ethical Considerations & Societal Impact
### Inherited Data Biases
Stentor2-12M-Preview was trained on FineWeb-Edu, a filtered subset of Common Crawl. Despite quality filtering, this data inherits the biases present in English-language web text:
- **Western-centric perspective:** Educational content on the web skews heavily toward Western, primarily American and European, viewpoints and examples.
- **English monolingualism:** The training data and vocabulary are both English-only. The model has no meaningful capability in other languages.
- **Demographic underrepresentation:** Groups that are underrepresented in English-language educational web content will be underrepresented in the model's outputs.
- **Temporal cutoff:** FineWeb-Edu's data has a cutoff; the model has no knowledge of recent events.
### No Safety Tuning
This model has received **no safety training of any kind**: no RLHF, no DPO, no constitutional AI, no content filtering. It is a raw base model that predicts the next token based on statistical patterns. It should not be used in any context where harmful outputs would cause real-world harm.
### Positive Societal Aspects
- **Democratizing AI research:** Trained entirely on free-tier Kaggle compute, this model demonstrates that meaningful LLM research does not require significant financial resources. Students and independent researchers can reproduce, study, and build on this work.
- **Transparency:** Full training hyperparameters, architecture details, and training script are published. This is a contribution to reproducible ML research.
- **Minimal environmental footprint:** ~4.4 hours of single-GPU compute. Estimated carbon footprint under 0.5 kg CO₂e.
### Responsible Use Reminder
If you use this model in research, please document clearly that it is an unaligned base model and include appropriate caveats when reporting results. Do not present outputs from this model as factual without verification.
---
## Inference Guide
> ⚠️ **All examples below use the custom loader.** See the [Known Loading Issue](#known-loading-issue--please-read) section for why `AutoModelForCausalLM.from_pretrained()` cannot be used directly. Use either Option A (call from repo) or Option B (local file) from the Quick Start section to get `model` and `tokenizer`, then the code below works identically either way.
### Basic Generation
```python
# Load using Option A or B from Quick Start first, then:
import torch

device = next(model.parameters()).device

def generate(prompt, max_new_tokens=50, temperature=0.9, top_p=0.65):
    input_ids = torch.tensor([tokenizer.encode(prompt)], dtype=torch.long).to(device)
    attention_mask = torch.ones_like(input_ids)
    with torch.inference_mode():
        output = model.generate(
            input_ids,
            attention_mask=attention_mask,
            max_new_tokens=max_new_tokens,
            do_sample=True,
            temperature=temperature,
            top_p=top_p,
            repetition_penalty=1.15,
            pad_token_id=tokenizer.pad_token_id,
        )
    # Keep only the newly generated tokens (drop the prompt)
    new_ids = output[0][input_ids.shape[1]:].tolist()
    return tokenizer.decode(new_ids).strip()

print(generate("The history of computing began"))
```
### CPU (FP32)
```python
# Use the line that matches how you obtained the loader (do not run both)
model, tokenizer = mod.load_stentor2(dtype=torch.float32)  # Option A
# model, tokenizer = load_stentor2(dtype=torch.float32)    # Option B
model = model.to("cpu")
```
### GPU (FP16)
```python
# Use the line that matches how you obtained the loader (do not run both)
model, tokenizer = mod.load_stentor2(dtype=torch.float16)  # Option A
# model, tokenizer = load_stentor2(dtype=torch.float16)    # Option B
model = model.to("cuda")
```
### From a Local Checkpoint
```python
# Use the line that matches how you obtained the loader (do not run both)
model, tokenizer = mod.load_stentor2("./path/to/local/checkpoint")  # Option A
# model, tokenizer = load_stentor2("./path/to/local/checkpoint")    # Option B
```
---
## Real Model Responses
These are actual unedited outputs from the model. All examples use the custom loader described above.
---
**Prompt:** `Some sicknesses are`
**Settings:** max_new_tokens=50, temperature=0.7, top_p=0.65
**Output:**
> often associated with high blood pressure. The cause of depression is associated with a decrease in blood pressure, and may increase infections such as atrophy. The symptoms may also include: - The symptom
*(Stopped at the 50-token limit, not because the model ran out of ideas)*
**Stats:** 50 tokens · 1.06s · 47.2 t/s
---
**Prompt:** `In the early 20th century`
**Settings:** max_new_tokens=45, temperature=0.85, top_p=0.75
**Output:**
> , the Middle Ages had become popularized by many, thought to be the most prominent and most popular world. In the midst of the 20th century, a study of the Western Pyrami
*(Cut off by the token limit)*
**Stats:** 43 tokens · 0.91s · 47.4 t/s
---
**Prompt:** `In Egypt there were massive sand cones called`
**Settings:** max_new_tokens=10, temperature=0.65, top_p=0.6
**Output:**
> Pyramids (which
*(Cut off at 10 tokens; the model correctly identified Pyramids immediately)*
**Stats:** 10 tokens · 0.14s · 71.4 t/s
---
**Key observations from testing:**
- The model responds best to prompts that are the **beginning of a sentence or paragraph**: it is a text *continuer*, not a question *answerer*. Give it a strong opening and it will follow the pattern.
- Speed on CPU is approximately **47–71 t/s** depending on prompt length and hardware.
- Keeping `max_new_tokens` at 60 or below produces noticeably more coherent completions.
- The TokenMonster tokenizer is less efficient per word than the 32K BPE vocabulary used in v1. This is expected given the smaller vocab size and is the tradeoff accepted in exchange for the ~43.8% perplexity improvement.
---
## Quantization
> ⚠️ **Critical note for this preview:** `AutoModelForCausalLM.from_pretrained()` with `BitsAndBytesConfig` does **not** work for this checkpoint due to the `weight_master` key issue described in the [Known Loading Issue](#known-loading-issue--please-read) section. You must load with the custom loader first, then apply quantization afterward. The standard `from_pretrained()` + `BitsAndBytesConfig` pattern will work normally in the final Stentor2-12M release.
Despite the model already being small (~49 MB in FP32, ~25 MB in FP16), quantization can further reduce memory for extremely constrained environments.
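The size figures quoted in this section follow directly from the parameter count. The sketch below is a back-of-the-envelope estimate that uses the 12.3M parameter count stated in this card and counts only raw weight storage; the helper name `weights_size_mb` is illustrative, and real files add a small amount of metadata.

```python
N_PARAMS = 12_300_000  # parameter count quoted in this card


def weights_size_mb(n_params, bytes_per_param):
    """Raw weight storage in megabytes (1 MB = 1e6 bytes)."""
    return n_params * bytes_per_param / 1e6


for name, width in [("FP32", 4), ("FP16", 2), ("INT8", 1)]:
    print(f"{name}: ~{weights_size_mb(N_PARAMS, width):.1f} MB")
```

The INT8 figure is an upper bound on the savings: layers that stay in higher precision (for example embeddings under dynamic quantization) keep their original width.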
### FP16: Recommended First Step
For GPU deployment, loading in FP16 halves memory to ~25 MB and is the simplest effective "quantization":
```python
model, tokenizer = mod.load_stentor2(dtype=torch.float16) # Option A
model = model.to("cuda")
```
### Dynamic INT8 Quantization (CPU, PyTorch native, no extra install)
For CPU deployment, PyTorch's built-in dynamic quantization works after loading with the custom loader and requires no additional packages:
```python
import torch
from huggingface_hub import hf_hub_download
import importlib.util, sys
# Step 1: Load with custom loader
path = hf_hub_download(repo_id="StentorLabs/Stentor2-12M-Preview", filename="load_stentor2.py")
spec = importlib.util.spec_from_file_location("load_stentor2", path)
mod = importlib.util.module_from_spec(spec)
sys.modules["load_stentor2"] = mod
spec.loader.exec_module(mod)
model, tokenizer = mod.load_stentor2(dtype=torch.float32)
model = model.to("cpu").eval()
# Step 2: Apply dynamic INT8 quantization (CPU only)
model_int8 = torch.quantization.quantize_dynamic(
    model,
    {torch.nn.Linear},
    dtype=torch.qint8,
)
# Approximate memory: ~12 MB, a ~75% reduction from FP32
# Note: dynamic quantization only affects inference; model stays on CPU
```
### Manual 8-bit via bitsandbytes (GPU)
For GPU deployment with bitsandbytes INT8, apply the conversion after loading:
```python
import torch
import bitsandbytes as bnb
from huggingface_hub import hf_hub_download
import importlib.util, sys
# Step 1: Load with custom loader
path = hf_hub_download(repo_id="StentorLabs/Stentor2-12M-Preview", filename="load_stentor2.py")
spec = importlib.util.spec_from_file_location("load_stentor2", path)
mod = importlib.util.module_from_spec(spec)
sys.modules["load_stentor2"] = mod
spec.loader.exec_module(mod)
model, tokenizer = mod.load_stentor2(dtype=torch.float16)
model = model.to("cuda").eval()
# Step 2: Replace linear layers with INT8 equivalents
def replace_with_bnb_int8(module):
    for name, child in list(module.named_children()):
        if isinstance(child, torch.nn.Linear):
            device = child.weight.device
            new_layer = bnb.nn.Linear8bitLt(
                child.in_features,
                child.out_features,
                bias=child.bias is not None,
                has_fp16_weights=False,
                threshold=6.0,
            )
            new_layer.weight = bnb.nn.Int8Params(
                child.weight.data.cpu(),
                requires_grad=False,
            )
            if child.bias is not None:
                new_layer.bias = torch.nn.Parameter(child.bias.data)
            # bitsandbytes quantizes the weights when the layer is moved to the GPU
            setattr(module, name, new_layer.to(device))
        else:
            replace_with_bnb_int8(child)

replace_with_bnb_int8(model)
# Approximate memory: ~12 MB (75% reduction from FP32 ~49 MB)
```
Requires: `pip install bitsandbytes`
> **Practical note:** Given that FP16 is already only ~25 MB and the model runs at 47–71 t/s on CPU, aggressive quantization may not be necessary for most use cases. Dynamic INT8 is most useful when targeting microcontrollers or very constrained embedded environments.
---
## Format Conversion
### Convert to GGUF (for llama.cpp)
```bash
# Clone llama.cpp
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
pip install -r requirements.txt
# Download model
huggingface-cli download StentorLabs/Stentor2-12M-Preview --local-dir stentor2-12m-preview
# Convert to GGUF (FP16)
python convert_hf_to_gguf.py stentor2-12m-preview/ \
--outfile stentor2-12m-preview.gguf \
--outtype f16
# Quantize to Q4_0 (optional, smallest file)
./llama-quantize stentor2-12m-preview.gguf stentor2-12m-preview-q4_0.gguf q4_0
# Run
./llama-cli -m stentor2-12m-preview-q4_0.gguf -p "The science of" -n 50
```
> **Note on GGUF + TokenMonster:** The custom TokenMonster tokenizer may require manual vocabulary mapping when using llama.cpp. The standard `convert_hf_to_gguf.py` script expects a HuggingFace tokenizer format. You may need to convert the vocabulary to a compatible format first.
### Convert to ONNX
```bash
pip install optimum[exporters]
optimum-cli export onnx \
--model StentorLabs/Stentor2-12M-Preview \
--task text-generation-with-past \
stentor2-12m-onnx/
```
```python
from optimum.onnxruntime import ORTModelForCausalLM
from transformers import AutoTokenizer
model = ORTModelForCausalLM.from_pretrained("stentor2-12m-onnx")
tokenizer = AutoTokenizer.from_pretrained("StentorLabs/Stentor2-12M-Preview")
inputs = tokenizer("Hello world", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0]))
```
---
## Speculative Decoding
Stentor2-12M-Preview can serve as a fast **draft model** to accelerate inference from larger Llama-family target models.
```python
from huggingface_hub import hf_hub_download
import importlib.util, sys, torch
from transformers import AutoModelForCausalLM, AutoTokenizer
# Load Stentor2 as draft model using the custom loader
path = hf_hub_download(repo_id="StentorLabs/Stentor2-12M-Preview", filename="load_stentor2.py")
spec = importlib.util.spec_from_file_location("load_stentor2", path)
mod = importlib.util.module_from_spec(spec)
sys.modules["load_stentor2"] = mod
spec.loader.exec_module(mod)
draft_model, _ = mod.load_stentor2(dtype=torch.float16)
draft_model = draft_model.to("cuda")
# Load target model normally
target_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-1B",
    torch_dtype=torch.float16,
    device_map="auto",
)
target_tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B")

prompt = "Explain the concept of recursion"
# Move the prompt tensors to the same device as the target model
inputs = target_tokenizer(prompt, return_tensors="pt").to(target_model.device)
outputs = target_model.generate(
    **inputs,
    assistant_model=draft_model,
    do_sample=True,
    max_new_tokens=100,
)
print(target_tokenizer.decode(outputs[0], skip_special_tokens=True))
```
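**Important caveat:** Stentor2 uses a different vocabulary (8,064-token TokenMonster) than Llama-family targets (32,000-token BPE for Llama 1 and 2; 128,256 tokens for the Llama 3 family, which includes the `Llama-3.2-1B` target above). Standard assisted generation compares draft and target token IDs directly, so it assumes a shared vocabulary. With mismatched vocabularies the call above may fail or accept very few draft tokens unless a tokenizer-aware variant is used; recent `transformers` releases accept `tokenizer` and `assistant_tokenizer` arguments to `generate()` for this purpose, but this card does not confirm that path works with the custom TokenMonster tokenizer. In practice, speedups also depend heavily on how similar the generated text distribution is between draft and target.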
For best results with speculative decoding, a vocabulary-matched draft model is preferable. If you need a drop-in speculative draft for a standard Llama target, Stentor v1 (with its 32,768-token Mistral vocabulary) may provide better token acceptance rates despite its higher perplexity.
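The notion of an acceptance rate can be made concrete with the accept/reject rule from Leviathan et al.: a token drafted with probability `q(x)` is accepted with probability `min(1, p(x)/q(x))`, where `p` is the target distribution, and on rejection a replacement is drawn from the normalized residual `max(0, p - q)`. The sketch below is an illustration of that rule on toy distributions, not code from `transformers`; the function names are hypothetical, and both distributions must be defined over the same vocabulary, which is exactly what a vocabulary mismatch breaks.

```python
def expected_acceptance(p_target, q_draft):
    """Probability that one drafted token is accepted.

    Summing q(x) * min(1, p(x) / q(x)) over the vocabulary
    simplifies to the sum of min(p(x), q(x)).
    """
    return sum(min(p, q) for p, q in zip(p_target, q_draft))


def residual_distribution(p_target, q_draft):
    """Distribution to resample from after a rejection."""
    residual = [max(0.0, p - q) for p, q in zip(p_target, q_draft)]
    total = sum(residual)
    if total == 0.0:
        # Identical distributions: rejection never happens
        return list(p_target)
    return [r / total for r in residual]


p = [0.5, 0.5]  # toy target distribution over a 2-token vocabulary
q = [0.9, 0.1]  # toy draft distribution over the same vocabulary
print(expected_acceptance(p, q), residual_distribution(p, q))
```

The closer the draft distribution is to the target, the closer the expected acceptance gets to 1, which is why a better-matched draft model yields larger speedups.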
---
## Bias, Risks & Limitations
### Known Limitations
The following limitations were observed and confirmed through hands-on testing:
- **Prompt Relevance:** Outputs are frequently off-topic for complex prompts. The model is pattern-completing, not comprehending.
- **Factual Accuracy:** All factual claims from this model should be treated as unreliable. 12M parameters cannot store meaningful world knowledge.
- **Context Boundary:** Hard limit of 1,024 tokens. Sequences approaching this limit may degrade in coherence.
- **Short Output Window for Coherence:** Even within the 1,024-token context limit, outputs beyond ~60 tokens tend to wander off-topic or become repetitive. Keeping `max_new_tokens` at 60 or below is strongly recommended.
- **English Bias:** The TokenMonster English vocabulary is optimized for English. Other languages will tokenize to many rare/unknown tokens and likely produce poor output.
- **Training Data Bias:** Inherits biases present in FineWeb-Edu filtered web data, which is primarily English-language, Western-centric educational content.
- **Hallucination:** Like all LLMs, this model may confidently produce plausible-sounding but entirely fabricated content.
- **No Alignment:** No RLHF, no DPO, no constitutional training. Raw base model behavior.
- **Preview Status:** This is not the final Stentor2 architecture. Known improvements are pending.
- **Tokenizer Efficiency:** The 8K TokenMonster vocabulary produces more tokens per word than standard 32K BPE vocabularies. This is expected given the architecture tradeoff and is not a bug.
### Shared Tensor Warning
When saving or reloading this model, you will see:
```
Removed shared tensor {'lm_head.weight'} while saving.
```
This is expected. The model uses `tie_word_embeddings=True`, meaning `model.embed_tokens.weight` and `lm_head.weight` point to the same tensor. The safetensors format drops the duplicate during serialization, and the tie is re-established on load. This is safe and produces no accuracy difference.
> This is a separate and unrelated issue from the `weight_master` loading problem. See the [Known Loading Issue](#known-loading-issue--please-read) section for that.
---
## What's Next
This is a preview. The training run for Stentor2-12M-Preview revealed several clear paths to further improvement that have not yet been implemented. Those improvements are the focus of the next training run, and when that model is ready, it will be released as **Stentor2-12M**.
If you find bugs, unexpected behavior, or have benchmarks or use cases worth sharing, please open a discussion on the model repository. Community input before the final release is welcome.
> 🚫 **There will be no Stentor2-30M-Preview.** This preview exists to share the architectural direction of the Stentor2 family, not to establish a preview release cadence for every size. The next public drop from StentorLabs will be the finished Stentor2 model.
---
## Environmental Impact
| Factor | Value |
|---|---|
| Hardware | 2× NVIDIA Tesla T4 (1 active) |
| Active Training Duration | ~4.37 hours |
| Cloud Provider | Kaggle (free tier) |
| Compute Region | Western USA |
| Estimated Carbon | Minimal (< 0.5 kg CO₂e estimated) |
Training on free-tier cloud compute demonstrates that meaningful SLM research is accessible to independent researchers and students without significant hardware investment or carbon cost.
---
## Citation
If you use this model in research or a project, please cite it as follows. Note that this is a HuggingFace model card, not an arXiv paper, so there is no arXiv ID; the `howpublished` URL is the canonical reference.
```bibtex
@misc{izumoto2026stentor2_12m_preview,
  title        = {Stentor2-12M-Preview},
  author       = {Kai Izumoto},
  year         = {2026},
  publisher    = {StentorLabs},
  howpublished = {\url{https://huggingface.co/StentorLabs/Stentor2-12M-Preview}},
  note         = {Preview checkpoint of the Stentor2 model family.
                  12.3M parameter LlamaForCausalLM base model trained on
                  FineWeb-Edu with a TokenMonster 8K vocabulary.
                  Apache 2.0 license.}
}
```
---
## Related Work
This section compares Stentor2-12M-Preview to other publicly available models in the sub-50M parameter range, and to relevant research that informed design decisions.
### Comparable Sub-50M Models
| Model | Parameters | Perplexity | Vocab | Training Data | Notes |
|---|---|---|---|---|---|
| **Stentor2-12M-Preview** (this model) | 12.3M | ~50.1 (FineWeb-Edu val) | 8,064 | FineWeb-Edu 240M tokens | Base model, TokenMonster vocab |
| Stentor-12M (v1) | 12.0M | 89.01 (FineWeb-Edu val) | 32,768 | FineWeb-Edu + Cosmopedia 200M | Baseline this model improves on |
| Stentor-30M (v1) | 30.4M | 33.02 (FineWeb-Edu val) | 32,768 | FineWeb-Edu + Cosmopedia 600M | Larger v1 model |
| TinyStories-33M | ~33M | ~varies | ~50K | TinyStories (synthetic) | Eldan & Li, 2023; focused on story generation |
| TinyStories-1M | ~1M | very high | ~50K | TinyStories (synthetic) | Demonstrates 1M param story capability |
| Pythia-14M | 14M | ~varies (Pile) | 50,254 | The Pile 300B tokens | EleutherAI; well-studied scaling baseline |
| Pythia-70M | 70M | ~varies (Pile) | 50,254 | The Pile 300B tokens | Closest Pythia model above this size |
| BabyLlama | 58M | ~varies | ~32K | TinyStories + Wikitext | BabyLM challenge submission |
> **Comparison caveats:** Perplexity numbers are not directly comparable across models β different validation sets, vocabularies, and tokenizers all affect the number. The table is a rough orientation, not a rigorous benchmark. Stentor2's perplexity is measured on the FineWeb-Edu validation split using its own 8K TokenMonster tokenizer.
**Key differentiators of Stentor2 vs. comparable models:**
- **Vocabulary efficiency focus:** The deliberate reduction to 8K tokens to maximize non-embedding parameter budget is a distinguishing design choice not seen in most small models.
- **T4-specific training recipe:** The INT8 QAT + FP32 critical layer + FP32 norm combination is a novel stability recipe specifically designed for consumer-grade GPU training.
- **Educational data:** Unlike TinyStories models (trained on synthetic children's stories) or Pythia (trained on the general-domain Pile), Stentor2 is trained on quality-filtered educational web text.
### Related Research Papers
| Paper | Relevance |
|---|---|
| [TinyStories](https://arxiv.org/abs/2305.07759) (Eldan & Li, 2023) | Demonstrates meaningful language generation from 1M–33M parameter models; closest comparator in scale |
| [Pythia](https://arxiv.org/abs/2304.01373) (Biderman et al., 2023) | Systematic study of small model scaling; Pythia-14M is a well-documented baseline |
| [Scaling Laws](https://arxiv.org/abs/2001.08361) (Kaplan et al., 2020) | Foundational work on compute-optimal training; informs token budget decisions |
| [Chinchilla](https://arxiv.org/abs/2203.15556) (Hoffmann et al., 2022) | Revised scaling laws; 240M tokens for 12M params is approximately compute-optimal under this analysis |
| [Model Cards](https://arxiv.org/abs/1810.03993) (Mitchell et al., 2018) | Methodology underlying this model card |
| [RoPE](https://arxiv.org/abs/2104.09864) (Su et al., 2021) | Positional encoding used in this model |
| [Speculative Decoding](https://arxiv.org/abs/2211.17192) (Leviathan et al., 2023) | Primary use case for a fast draft model like Stentor2 |
| [T5](https://arxiv.org/abs/1910.10683) (Raffel et al., 2020) | Source of NFKC text normalization approach used in data pipeline |
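The "approximately compute-optimal" remark in the Chinchilla row can be checked with simple arithmetic. The sketch below assumes the commonly quoted rule of thumb of roughly 20 training tokens per parameter associated with Hoffmann et al.; that ratio is an approximation used here for illustration, not a figure taken from this card.

```python
N_PARAMS = 12_300_000         # parameter count quoted in this card
TOKENS_TRAINED = 240_001_024  # token count quoted in Training Dynamics
TOKENS_PER_PARAM_RULE = 20    # assumed Chinchilla-style rule of thumb

tokens_per_param = TOKENS_TRAINED / N_PARAMS
rule_of_thumb_budget = N_PARAMS * TOKENS_PER_PARAM_RULE

print(f"tokens per parameter: {tokens_per_param:.1f}")
print(f"rule-of-thumb budget: {rule_of_thumb_budget / 1e6:.0f}M tokens")
```

Under that assumption, the actual budget lands within a few percent of the rule-of-thumb target.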
---
## Related Resources
### StentorLabs Models
- [Stentor-30M](https://huggingface.co/StentorLabs/Stentor-30M): Larger v1 base model
- [Stentor-12M](https://huggingface.co/StentorLabs/Stentor-12M): v1 baseline this model improves upon
- [Stentor-30M-Instruct](https://huggingface.co/StentorLabs/Stentor-30M-Instruct): Instruction-tuned v1 model
- [Stentor-12M-Instruct](https://huggingface.co/StentorLabs/Stentor-12M-Instruct): Instruction-tuned v1 model
- [StentorLabs Collection](https://huggingface.co/StentorLabs): All models from StentorLabs
### Referenced Tools & Datasets
- [FineWeb-Edu](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu): Training data
- [TokenMonster](https://huggingface.co/alasdairforsythe/tokenmonster): Tokenizer vocabulary
- [HuggingFace Accelerate](https://github.com/huggingface/accelerate): Training framework
- [bitsandbytes](https://github.com/TimDettmers/bitsandbytes): Quantization library
- [mradermacher GGUF quantizations of Stentor-30M](https://huggingface.co/mradermacher/Stentor-30M-GGUF): Community quantizations of v1
---
## Model Card Contact
Questions, benchmarks, or feedback: [StentorLabs@gmail.com](mailto:StentorLabs@gmail.com) or open a [discussion](https://huggingface.co/StentorLabs/Stentor2-12M-Preview/discussions).
---
Made with ❤️ by StentorLabs
Democratizing AI through accessible, efficient models