We are still completely blown away. In the week or so since we dropped our Supra-50M Instruct model, the open-source community has taken it and run with it. Thanks to your incredible support, we hit Page 1 of Trending Models in Text Generation and Page 4 across ALL categories on Hugging Face, crossed 7,000 downloads, and even got featured in a YouTube deep-dive! For a non-profit, 100% open-source garage project, this is unreal. Thank you.
But we aren't stopping there. The massive interest in small language models (SLMs) proves that the world wants highly efficient, reproducible computing. To push the boundaries of what tiny "brains" can do, SupraLabs is launching a massive, fully open systematic research initiative. We want to find the exact engineering sweet spots for SLMs, and we are open-sourcing every single pipeline, log, and weight along the way.
Here is the roadmap of the core experiments we are spinning up right now.
Experiment 1: The Ultimate Data-Mix Showdown
Everyone knows data quality is king, but what is the absolute best data recipe when your parameter budget is ultra-tight? We are pitting the top open-source datasets against each other to find the perfect synergy.
- The Setup: We are training an ultra-lean 5M parameter Llama model using Hugging Face Transformers.
- The Data: Exactly 100 Million tokens total per run, testing four configurations:
1. 100% FineWeb-Edu
2. 100% DCLM-Edu
3. 100% Cosmopedia-v2
4. Custom algorithmic token-level mixes of all three.
The Goal: Find out whether highly structured synthetic data outpaces heavily curated web scrapes at the 5M scale, or whether a hybrid mix yields the best downstream generalization.
• Language & Perplexity → wikitext, lambada
• Commonsense & Logic → hellaswag, piqa, winogrande, boolq
• Science & Knowledge → sciq, openbookqa, arc_easy, arc_challenge
• Grammar & Syntax → blimp
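To make the fixed 100M-token budget concrete, here is a minimal sketch of how a run's budget could be split across the three datasets by weight. It only illustrates budget-level splitting, not our token-level mixing algorithm. The helper `allocate_tokens`, the `MIXES` names, and the equal-weight mix are hypothetical placeholders, since the actual mixing ratios are not given in this post.

```python
TOKEN_BUDGET = 100_000_000  # 100M tokens per run, as stated in the setup


def allocate_tokens(weights, budget=TOKEN_BUDGET):
    """Return per-dataset token counts that sum exactly to `budget`."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    # Floor each share, then give leftover tokens to the largest remainders.
    raw = {name: budget * w / total for name, w in weights.items()}
    counts = {name: int(share) for name, share in raw.items()}
    leftover = budget - sum(counts.values())
    by_remainder = sorted(raw, key=lambda n: raw[n] - counts[n], reverse=True)
    for name in by_remainder[:leftover]:
        counts[name] += 1
    return counts


MIXES = {
    "fineweb_edu_only": {"FineWeb-Edu": 1.0},
    "dclm_edu_only": {"DCLM-Edu": 1.0},
    "cosmopedia_only": {"Cosmopedia-v2": 1.0},
    # Illustrative equal-weight mix (an assumption, not a published ratio).
    "equal_mix": {"FineWeb-Edu": 1, "DCLM-Edu": 1, "Cosmopedia-v2": 1},
}

for mix_name, mix_weights in MIXES.items():
    print(mix_name, allocate_tokens(mix_weights))
```

Handing the rounding leftovers to the largest remainders keeps every run at exactly the same token count, so differences between runs come from the data recipe rather than the budget.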
Experiment 2: Scaling Law Realities for Tiny Models
Chinchilla scaling laws tell us how to scale compute and data optimally for billion-parameter giants. But do those rules shatter when you scale down to the absolute edge? We are conducting a dedicated scaling study to map out the returns on parameter expansion.
- The Setup: Keeping dataset size fixed at exactly 2 Billion tokens of FineWeb-Edu (sample-10BT).
- The Core Matrix: We will train four distinct Llama architectures: 10M, 25M, 50M, and 100M parameters.
The Goal: Identify the exact point of diminishing returns. Does a 25M model fully utilize 2B tokens, or does the 100M model show a massive performance leap on the exact same token footprint? We want to chart the efficiency frontier.
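As a quick sanity check on the matrix above, the sketch below computes the tokens-per-parameter ratio for each model size at the fixed 2B-token budget. The figure of roughly 20 tokens per parameter is the commonly cited rule of thumb associated with Chinchilla, used here only as an approximate reference point. The function and constant names are illustrative.

```python
TOKENS = 2_000_000_000  # fixed dataset size from the setup
MODEL_SIZES = {
    "10M": 10_000_000,
    "25M": 25_000_000,
    "50M": 50_000_000,
    "100M": 100_000_000,
}
# Commonly cited approximate heuristic, not an exact law.
CHINCHILLA_TOKENS_PER_PARAM = 20


def tokens_per_param(n_params, n_tokens=TOKENS):
    """Training tokens available per model parameter."""
    return n_tokens / n_params


for name, n_params in MODEL_SIZES.items():
    ratio = tokens_per_param(n_params)
    multiple = ratio / CHINCHILLA_TOKENS_PER_PARAM
    print(f"{name}: {ratio:.0f} tokens/param ({multiple:.0f}x the ~20 heuristic)")
```

By this arithmetic the 100M model sits right around the heuristic, while the 10M model sees about 200 tokens per parameter, ten times more. Whether the heuristic carries over to models this small is exactly what the experiment is meant to test.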
Experiment 3: Is One Epoch Really All You Need for SLMs?
The standard convention for LLMs is "one epoch and move on" to avoid overfitting, popularized by several landmark papers. But small models training on high-quality educational data might be a completely different beast. Can they chew on the same high-signal data multiple times?
- The Setup: A 10M parameter Llama model trained on exactly 500 Million tokens of FineWeb-Edu.
- The Epoch Matrix: We are running 5 identical setups, changing only the epoch count: 1 Epoch vs. 2, 3, 4, and 5 Epochs.
The Goal: Pinpoint exactly where overfitting begins for an SLM. If performance on lm-eval keeps scaling up past epoch 2 or 3 without destroying perplexity, it could mean data-scarcity solutions for edge AI are much easier than we think.
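One thing worth spelling out is that changing only the epoch count also changes the total tokens seen and the number of optimizer steps. The sketch below tabulates both for the five runs. The sequence length and batch size are assumed values for illustration, since this post does not state them, and `run_plan` is a hypothetical helper.

```python
UNIQUE_TOKENS = 500_000_000  # unique training tokens, from the setup
EPOCHS = [1, 2, 3, 4, 5]

# Assumed values for illustration only; not specified in this post.
SEQ_LEN = 1024
SEQS_PER_BATCH = 64


def run_plan(epochs, unique_tokens=UNIQUE_TOKENS,
             seq_len=SEQ_LEN, seqs_per_batch=SEQS_PER_BATCH):
    """Total tokens seen and full optimizer steps for one run."""
    tokens_seen = unique_tokens * epochs
    tokens_per_step = seq_len * seqs_per_batch
    return {
        "epochs": epochs,
        "tokens_seen": tokens_seen,
        "optimizer_steps": tokens_seen // tokens_per_step,
    }


for e in EPOCHS:
    print(run_plan(e))
```

The 5-epoch run sees 2.5B tokens in total, so any gain over the 1-epoch run reflects both repetition and extra compute. Comparing against a single-epoch run on more unique data would be one way to separate the two.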
Expanding the Frontier: More Ideas We're Testing
While configuring our cluster for the three core studies above, we realized we have a golden opportunity to squeeze in even more architectural answers. We have officially added these four bonus dimensions to our upcoming research pipeline:
A. The Tokenizer Bottleneck
Modern tokenizers use massive vocabularies (like Llama 3's 128k). In a 10M parameter model, a huge vocabulary means the embedding layer eats up almost all your parameters, leaving nothing for the actual transformer layers. We will run identical 10M models comparing the Llama 3 tokenizer (128k), Llama 2 tokenizer (32k), and a custom-built 8k/16k vocabulary to see where the parameter balance lies.
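The arithmetic behind this bottleneck is easy to sketch. The snippet below counts embedding parameters for each vocabulary size and compares them with the 10M budget. The hidden size of 256 and the exact 8k and 16k vocabulary sizes are assumptions made for illustration, while 128,256 and 32,000 are the published Llama 3 and Llama 2 vocabulary sizes.

```python
PARAM_BUDGET = 10_000_000
HIDDEN_SIZE = 256  # assumed for illustration; not specified in this post

VOCABS = {
    "Llama 3 (128k)": 128_256,
    "Llama 2 (32k)": 32_000,
    "custom 16k": 16_384,  # assumed exact size
    "custom 8k": 8_192,    # assumed exact size
}


def embedding_params(vocab_size, hidden_size=HIDDEN_SIZE, tied=True):
    """Input embedding matrix, plus a separate output head when untied."""
    matrices = 1 if tied else 2
    return matrices * vocab_size * hidden_size


for name, vocab_size in VOCABS.items():
    n = embedding_params(vocab_size)
    print(f"{name}: {n:,} embedding params = {n / PARAM_BUDGET:.0%} of budget")
```

Under these assumptions, even with tied embeddings the Llama 3 vocabulary alone would need about 32.8M parameters, more than three times the whole budget, while an 8k vocabulary would take about 21% of it.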
B. Depth vs. Width (Architecture Tweaks)
If you have a strict budget of 25M parameters, how should you spend them? We're testing a "deep and narrow" configuration (e.g., 24 layers, smaller hidden dimensions) against a "shallow and wide" setup (e.g., 6 layers, massive hidden dimensions) to evaluate which layout reasons better on standard benchmarks.
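To get a feel for the trade-off, here is a rough parameter-count estimate for a Llama-style decoder, used to find the widest hidden size that fits the 25M budget at a given depth. It is an approximation under stated assumptions (tied embeddings, no grouped-query attention, a feed-forward width of about 8/3 of the hidden size, a 16,384-token vocabulary), not our final configurations. All function names are hypothetical.

```python
BUDGET = 25_000_000  # parameter budget from the text
VOCAB = 16_384       # assumed vocabulary size, not stated in this post


def llama_style_params(n_layers, d_model, vocab=VOCAB):
    """Approximate parameter count for a Llama-style decoder, tied embeddings."""
    d_ff = 8 * d_model // 3            # assumed SwiGLU width, roughly 8/3 * d_model
    attention = 4 * d_model * d_model  # q, k, v, o projections; no biases, no GQA
    mlp = 3 * d_model * d_ff           # gate, up, and down projections
    norms = 2 * d_model                # two RMSNorm weight vectors per block
    embeddings = vocab * d_model       # shared by input and output when tied
    final_norm = d_model
    return n_layers * (attention + mlp + norms) + embeddings + final_norm


def widest_hidden_size(n_layers, budget=BUDGET, step=64):
    """Largest d_model (a multiple of `step`) whose estimate fits the budget."""
    best = None
    d_model = step
    while llama_style_params(n_layers, d_model) <= budget:
        best = d_model
        d_model += step
    return best


for n_layers in (24, 6):
    d = widest_hidden_size(n_layers)
    print(f"{n_layers} layers -> d_model={d}, ~{llama_style_params(n_layers, d):,} params")
```

With these assumptions the 24-layer layout tops out at a hidden size of 256 and the 6-layer layout at 448. Note that the wider model also spends more of its budget on embeddings, which is one reason the comparison is not purely about depth.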
C. The Sequence Length Penalty
Does forcing a longer context window ruin a tiny model's general capability? We will train identical models across 512, 1024, and 2048 context windows to see if extending context capacity directly penalizes the model's core knowledge density.
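Part of what changes with context length is simple bookkeeping: for a fixed token budget, longer windows mean fewer training sequences and more attention work per token. The sketch below shows both quantities. The 1B-token budget is an assumed value for illustration, and `context_stats` is a hypothetical helper.

```python
TOKEN_BUDGET = 1_000_000_000  # assumed for illustration; not stated in this post
CONTEXTS = [512, 1024, 2048]


def context_stats(seq_len, token_budget=TOKEN_BUDGET):
    """Sequence count and average causal-attention span for one context length."""
    n_sequences = token_budget // seq_len
    # With causal attention, the token at position i attends to i + 1 positions,
    # so one full sequence has seq_len * (seq_len + 1) / 2 query-key pairs.
    pairs_per_sequence = seq_len * (seq_len + 1) // 2
    return {
        "seq_len": seq_len,
        "sequences": n_sequences,
        "avg_attended_positions": pairs_per_sequence / seq_len,
    }


for seq_len in CONTEXTS:
    print(context_stats(seq_len))
```

Doubling the window roughly doubles the average number of positions each token attends to, so equal-token runs are not equal-compute runs. That is worth keeping in mind when reading the benchmark differences.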
D. LR Schedule Optimization for Ultra-Short Runs
Standard cosine decay schedules are typically tuned for long runs on the order of trillions of tokens. For short 1B–2B token runs, we will experiment with aggressive linear decays and constant learning rates with sudden drops to find the fastest convergence paths for indie researchers.
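For reference, the three schedule families mentioned above can be written as plain functions of the training step. This is a minimal sketch: warmup is omitted, and the drop points (80% and 90% of training) and the 10x drop factor are illustrative assumptions rather than our chosen hyperparameters.

```python
import math


def cosine_decay(step, total_steps, peak_lr, min_lr=0.0):
    """Cosine decay from peak_lr at step 0 to min_lr at total_steps."""
    progress = min(step / total_steps, 1.0)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))


def linear_decay(step, total_steps, peak_lr, min_lr=0.0):
    """Straight-line decay from peak_lr to min_lr."""
    progress = min(step / total_steps, 1.0)
    return peak_lr + (min_lr - peak_lr) * progress


def constant_with_drops(step, total_steps, peak_lr,
                        drop_points=(0.8, 0.9), factor=0.1):
    """Constant LR, multiplied by `factor` at each drop point that has passed."""
    progress = step / total_steps
    drops = sum(1 for point in drop_points if progress >= point)
    return peak_lr * factor ** drops


for step in (0, 50, 85, 100):
    print(step,
          cosine_decay(step, 100, 1e-3),
          linear_decay(step, 100, 1e-3),
          constant_with_drops(step, 100, 1e-3))
```

Writing the schedules as pure functions of the step makes them easy to plot side by side before committing compute, and each one can be wrapped in whatever scheduler interface the training loop uses.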
Everything Will Be Open. Everything.
SupraLabs is entirely non-profit, and our commitment to open science means we won't just publish a PDF with pretty graphs. When these runs complete, we will be releasing:
- Every single checkpoint and weight file on Hugging Face.
- Complete, unedited lm-eval logs and raw data points.
- Our training configurations and custom setup code so anyone can replicate our work on their own hardware.
We're getting the compute nodes warmed up as you read this. Stay tuned for the raw data drops—we're about to find out exactly how much power we can pack into these tiny architectures.
Codebase, configs, and automation tools will be linked there as the runs kick off.
SupraLabs