Safetensors
qwen3
WilhelmT committed
Commit 881b4e8 · verified · 1 Parent(s): c4dab37

Upload folder using huggingface_hub
.gitattributes CHANGED
@@ -35,3 +35,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
  *tfevents* filter=lfs diff=lfs merge=lfs -text
  assets/assets/FlashHead.png filter=lfs diff=lfs merge=lfs -text
  tokenizer.json filter=lfs diff=lfs merge=lfs -text
+ assets/FlashHead.png filter=lfs diff=lfs merge=lfs -text
README.md CHANGED
@@ -1,186 +1,5 @@
- ---
- license: other
- license_name: embedl-models-community-licence-1.0
- license_link: https://github.com/embedl/embedl-models/blob/main/LICENSE
- ---
-
- # Qwen3-0.6B-FlashHead
-
- ![My model banner](assets/FlashHead.png)
-
- **Optimized version of Qwen3-0.6B using FlashHead, Embedl’s efficient replacement for the language model head, reducing size while preserving accuracy.**
- Designed for **low-latency inference** on **NVIDIA RTX GPUs**, leveraging:
-
- - FlashHead
- - Custom vLLM generation via `embedl-models`
-
- FlashHead matches the Qwen3-0.6B baseline within rounding error on common benchmarks (MMLU-Pro, HellaSwag, GSM8K, etc.) and, combined with quantization, delivers SOTA on-device latency.
-
- ### Quickstart
-
- Launch a chat window (with `/reset` and `/exit` commands) via:
-
- ```shell
- pip install embedl-models
- python3 -m embedl.models.vllm.demo --model embedl/Qwen3-0.6B-FlashHead
- ```
-
-
- ---
-
- ## Model Details
- | **Field** | **Value** |
- |------------|------------|
- | **Base Model** | Qwen3-0.6B |
- | **Input / Output** | Text → Text |
- | **Release Date** | 2025-12-08 |
- | **Version** | 1.0 |
- | **Optimizations** | FlashHead LM Head |
- | **Developers** | Embedl |
- | **Licenses** | Upstream: Apache 2.0. <br>Optimized components: Embedl Models Community Licence v1.0 *(no redistribution)* |
- | **Intended Use** | Text generation, reasoning, assistant-style interaction, and general-purpose NLP on NVIDIA RTX GPUs |
-
- ---
-
- ## Optimizations
-
- - **FlashHead LM Head** - a lightweight replacement for the dense LM head that significantly improves throughput.
- - **Custom Runtime Integration** - compatible with **vLLM (0.10.2)** via the `embedl-models` package.
-
- ---
-
- ## Installation
-
- ```bash
- pip install embedl-models
- ```
-
- The `embedl-models` package is required; it provides the optimized FlashHead implementation and the quantized model runtime.
-
- ---
-
- ## Usage Examples
- **Note (vLLM context length):** `max_model_len=131072` may fail on GPUs without enough free VRAM for the KV cache. If you see a KV cache memory error, lower `max_model_len` (or increase `gpu_memory_utilization`).
-
- ### vLLM Inference
-
- ```python
- from vllm import SamplingParams
- from transformers import AutoTokenizer
- from embedl.models.vllm import LLM
-
- model_id = "embedl/Qwen3-0.6B-FlashHead"
-
- if __name__ == "__main__":
-     tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
-
-     messages = [{"role": "user", "content": "Write a haiku about coffee."}]
-     text = tokenizer.apply_chat_template(
-         messages,
-         tokenize=False,
-         add_generation_prompt=True,
-         enable_thinking=True,
-     )
-
-     sampling = SamplingParams(
-         max_tokens=1024,
-         temperature=0.6,
-         top_p=0.95,
-         top_k=20,
-     )
-
-     llm = LLM(model=model_id, trust_remote_code=True, max_model_len=131072)
-     output = llm.generate([text], sampling)
-     print(output[0].outputs[0].text)
- ```
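The KV-cache note above can be made concrete with a back-of-envelope size estimate. The layer, head, and dtype values below come from this commit's `config.json`; `head_dim = 128` is an assumption (typical for the Qwen3 family, but not shown in this diff):

```python
# Rough KV-cache footprint for a full 131072-token context window.
num_hidden_layers = 28      # from config.json in this commit
num_key_value_heads = 8     # from config.json in this commit
head_dim = 128              # assumption: standard Qwen3 head dimension
dtype_bytes = 2             # bfloat16 keys/values
max_model_len = 131072

# Both keys and values are cached, per layer and per KV head.
bytes_per_token = 2 * num_hidden_layers * num_key_value_heads * head_dim * dtype_bytes
total_bytes = bytes_per_token * max_model_len

print(bytes_per_token)      # 114688 bytes (~112 KiB) per token
print(total_bytes / 2**30)  # 14.0 GiB for a full-length sequence
```

At roughly 14 GiB for the cache alone, before weights and activations, lowering `max_model_len` is usually the simpler fix on consumer RTX cards.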
-
- ---
-
- ### Interactive REPL Example
-
- The `run_repl()` coroutine launches an **interactive, streaming chat interface** using the vLLM backend with FlashHead enabled.
- It maintains an in-memory chat history and supports simple commands such as `/exit` to quit and `/reset` to clear context.
-
- ```python
- import asyncio
- from embedl.models.vllm.demo import run_repl
-
- model_id = "embedl/Qwen3-0.6B-FlashHead"
-
- if __name__ == "__main__":
-     asyncio.run(
-         run_repl(
-             model=model_id,
-             max_model_len=131072,
-         )
-     )
- ```
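The `/exit` and `/reset` behavior described above can be illustrated with a stdlib-only sketch. This is not Embedl's implementation (`run_repl` is the real entry point); `repl` and `echo` below are hypothetical names, and the model call is stubbed out:

```python
import asyncio

async def repl(lines, generate):
    """Minimal chat loop: '/exit' quits, '/reset' clears history,
    anything else is appended to history and answered."""
    history, replies = [], []
    for line in lines:
        if line == "/exit":
            break
        if line == "/reset":
            history.clear()
            continue
        history.append({"role": "user", "content": line})
        reply = await generate(history)          # model call would stream here
        history.append({"role": "assistant", "content": reply})
        replies.append(reply)
    return replies

async def echo(history):
    # Stand-in for the LLM: report history length and echo the last message.
    return f"({len(history)} msgs) " + history[-1]["content"]

out = asyncio.run(repl(["hi", "/reset", "there", "/exit", "ignored"], echo))
print(out)  # ['(1 msgs) hi', '(1 msgs) there']
```

Note how `/reset` empties the history before "there", so both replies see a one-message context, and nothing after `/exit` is processed.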
-
- ---
-
- ## ⚠️ Important Warning: Hugging Face Transformers Support
-
- > **FlashHead is currently not applied when using the Hugging Face `transformers` pipeline.**
- > Generation through `transformers` will fall back to the standard dense LM head, **disabling FlashHead acceleration**.
- >
- > For now, **we strongly recommend using the vLLM integration** (`embedl.models.vllm.LLM`) to ensure FlashHead is active and optimized for low-latency inference.
- >
- > Full support for the Hugging Face `transformers` pipeline with FlashHead integration will be released **in the coming days**.
-
132
- ---
133
-
134
- ## Limitations
135
-
136
- - Limited to **vLLM 0.10.2** (pinned dependency)
137
- - **Batch size = 1** (real-time generation)
138
- - Currently optimized for **NVIDIA RTX GPUs**
139
-
140
- ---
141
-
142
- ## Roadmap
143
-
144
- Planned improvements:
145
-
146
- - Advanced mixed precision quantization
147
- - Huggingface transformers generation
148
- - vLLM CLI benchmarking for detailed latency evaluation
149
- - `lm-eval-harness` integration for detailed accuracy evaluation
150
- - Upstream support in **Transformers** and **vLLM**
151
- - Compatibility with **GGUF**, **MLC**, **Llama.cpp**, **Ollama**, etc.
152
- - Broader model coverage (larger models, VLMs, VLAs)
153
-
154
- ---
155
-
156
- ## License
157
-
158
- - **Upstream:** Apache Licence 2.0.
159
- - **Optimized Components:** Embedl Models Community Licence v1.0 *(no redistribution)*
160
-
161
- ---
162
-
163
- ## Contact
164
-
165
- **Enterprise & Commercial Inquiries**
166
- [sales@embedl.com](mailto:sales@embedl.com)
167
-
168
- **Technical Issues & Early Access**
169
- [https://github.com/embedl/embedl-models](https://github.com/embedl/embedl-models)
170
-
171
- **More Information & Model Releases**
172
- [https://embedl.com](https://embedl.com)
173
-
174
- ---
175
-
176
- ### Partner & Developer Opportunities
177
-
178
- If you are evaluating on-device inference, building products on SLMs, or exploring custom model optimization, reach out for:
179
-
180
- - Embedl SDK - AI optimization tools & profiling
181
- - Embedl HUB - benchmarking platform
182
- - Engineering support for on-prem/edge deployments
183
- - Migration guidance (Llama / Qwen / Gemma)
184
- - Early access & partner co-marketing opportunities
185
-
186
- Contact: [sales@embedl.com](mailto:sales@embedl.com)
 
+ ---
+ license: other
+ license_name: embedl-models-community-licence-1.0
+ license_link: https://github.com/embedl/embedl-models/blob/main/LICENSE
+ ---
assets/FlashHead.png ADDED

Git LFS Details

  • SHA256: 307acba273a4c07eb62ea72ec170f29e7c14b58600064b3a2638c2f396f508c9
  • Pointer size: 131 bytes
  • Size of remote file: 188 kB
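The "Pointer size" above refers to the small Git LFS pointer file stored in-repo in place of the image. A sketch of the v1 pointer format follows; the exact byte count `188416` is hypothetical (chosen only as a plausible value consistent with "188 kB" — the real size is not shown on this page):

```python
# A Git LFS pointer file (spec v1) is three "key value" lines; the repo
# stores this small text file while the real ~188 kB image lives on LFS.
sha = "307acba273a4c07eb62ea72ec170f29e7c14b58600064b3a2638c2f396f508c9"
size_bytes = 188416  # hypothetical exact size, only the ~188 kB figure is given

pointer = (
    "version https://git-lfs.github.com/spec/v1\n"
    f"oid sha256:{sha}\n"
    f"size {size_bytes}\n"
)
print(len(pointer.encode("ascii")))  # 131, matching "Pointer size: 131 bytes"
```

Any six-digit file size produces a 131-byte pointer, which is why the reported pointer size is independent of the image's exact length within that range.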
config.json CHANGED
@@ -1,10 +1,15 @@
  {
    "architectures": [
-     "Qwen3ForCausalLM"
+     "FlashHeadQwen3ForCausalLM"
    ],
    "attention_bias": false,
    "attention_dropout": 0.0,
+   "auto_map": {
+     "AutoConfig": "configuration_flash_head_qwen3.FlashHeadQwen3Config",
+     "AutoModelForCausalLM": "modeling_flash_head_qwen3.FlashHeadQwen3ForCausalLM"
+   },
    "bos_token_id": 151643,
+   "creation_time": 1766068259.7505608,
    "dtype": "bfloat16",
    "eos_token_id": 151645,
    "flash_head_cache_dir": "flash_head_assets",
@@ -46,7 +51,8 @@
    ],
    "max_position_embeddings": 40960,
    "max_window_layers": 28,
-   "model_type": "qwen3",
+   "model_type": "flash_head_qwen3",
+   "n_clusters": 1,
    "num_attention_heads": 16,
    "num_hidden_layers": 28,
    "num_key_value_heads": 8,
@@ -59,4 +65,4 @@
    "use_cache": true,
    "use_sliding_window": false,
    "vocab_size": 151936
- }
+ }
configuration_flash_head_qwen3.py ADDED
@@ -0,0 +1 @@
+ from embedl.models.qwen.modeling_flash_head import FlashHeadQwen3Config
modeling_flash_head_qwen3.py ADDED
@@ -0,0 +1 @@
+ from embedl.models.qwen.modeling_flash_head import FlashHeadQwen3ForCausalLM
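The two one-line modules above exist so that the `auto_map` entries added to `config.json` can resolve when loading with `trust_remote_code=True`: `transformers` reads each value as `<module_in_repo>.<ClassName>`, where the module names a `.py` file shipped with the checkpoint. A minimal stdlib sketch of that naming convention (not the actual `transformers` implementation):

```python
# Each auto_map value is "<module_in_repo>.<ClassName>": the module is a .py
# file in the checkpoint (the two one-line files added in this commit), and
# the class is imported from it at load time.
auto_map = {
    "AutoConfig": "configuration_flash_head_qwen3.FlashHeadQwen3Config",
    "AutoModelForCausalLM": "modeling_flash_head_qwen3.FlashHeadQwen3ForCausalLM",
}

module_name, class_name = auto_map["AutoModelForCausalLM"].rsplit(".", 1)
print(module_name)  # modeling_flash_head_qwen3  -> modeling_flash_head_qwen3.py
print(class_name)   # FlashHeadQwen3ForCausalLM
```

This is why the stubs simply re-export the classes from the installed `embedl-models` package: the repo only needs files whose names match the `auto_map` module components.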