Safetensors
qwen3
WilhelmT committed
Commit 881b4e8 · verified · 1 Parent(s): c4dab37

Upload folder using huggingface_hub
.gitattributes CHANGED
@@ -35,3 +35,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
  *tfevents* filter=lfs diff=lfs merge=lfs -text
  assets/assets/FlashHead.png filter=lfs diff=lfs merge=lfs -text
  tokenizer.json filter=lfs diff=lfs merge=lfs -text
+ assets/FlashHead.png filter=lfs diff=lfs merge=lfs -text
README.md CHANGED
@@ -1,186 +1,5 @@
- ---
- license: other
- license_name: embedl-models-community-licence-1.0
- license_link: https://github.com/embedl/embedl-models/blob/main/LICENSE
- ---
-
- # Qwen3-0.6B-FlashHead
-
- ![My model banner](assets/FlashHead.png)
-
- **Optimized version of Qwen3-0.6B using FlashHead, Embedl’s efficient replacement for the language model head, reducing size while preserving accuracy.**
- Designed for **low-latency inference** on **NVIDIA RTX GPUs**, leveraging:
-
- - FlashHead
- - Custom vLLM generation via `embedl-models`
-
- FlashHead matches the Qwen3-0.6B baseline within rounding error on common benchmarks (MMLU-Pro, HellaSwag, GSM8K, etc.) and, combined with quantization, delivers SOTA on-device latency.
-
- ### Quickstart
-
- Launch a chat window (with `/reset` and `/exit` commands) via:
-
- ```shell
- pip install embedl-models
- python3 -m embedl.models.vllm.demo --model embedl/Qwen3-0.6B-FlashHead
- ```
-
-
- ---
-
- ## Model Details
- | **Field** | **Value** |
- |------------|------------|
- | **Base Model** | Qwen3-0.6B |
- | **Input / Output** | Text → Text |
- | **Release Date** | 2025-12-08 |
- | **Version** | 1.0 |
- | **Optimizations** | FlashHead LM Head |
- | **Developers** | Embedl |
- | **Licenses** | Upstream: Apache 2.0. <br>Optimized components: Embedl Models Community Licence v1.0 *(no redistribution)* |
- | **Intended Use** | Text generation, reasoning, assistant-style interaction, and general-purpose NLP on NVIDIA RTX GPUs |
-
- ---
-
- ## Optimizations
-
- - **FlashHead LM Head** - a lightweight replacement for the dense LM head that significantly improves throughput.
- - **Custom Runtime Integration** - compatible with **vLLM (0.10.2)** via the `embedl-models` package.
-
- ---
-
- ## Installation
-
- ```bash
- pip install embedl-models
- ```
-
- The `embedl-models` package is required; it provides the optimized FlashHead implementation and the quantized model runtime.
-
- ---
-
- ## Usage Examples
- **Note (vLLM context length):** `max_model_len=131072` may fail on GPUs without enough free VRAM for the KV cache. If you see a KV cache memory error, lower `max_model_len` (or increase `gpu_memory_utilization`).
-
- ### vLLM Inference
-
- ```python
- from vllm import SamplingParams
- from transformers import AutoTokenizer
- from embedl.models.vllm import LLM
-
- model_id = "embedl/Qwen3-0.6B-FlashHead"
-
- if __name__ == "__main__":
-     tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
-
-     messages = [{"role": "user", "content": "Write a haiku about coffee."}]
-     text = tokenizer.apply_chat_template(
-         messages,
-         tokenize=False,
-         add_generation_prompt=True,
-         enable_thinking=True,
-     )
-
-     sampling = SamplingParams(
-         max_tokens=1024,
-         temperature=0.6,
-         top_p=0.95,
-         top_k=20,
-     )
-
-     llm = LLM(model=model_id, trust_remote_code=True, max_model_len=131072)
-     output = llm.generate([text], sampling)
-     print(output[0].outputs[0].text)
- ```
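The KV-cache note above can be made concrete with a back-of-envelope size estimate. The layer, head, and dtype values below come from this commit's `config.json`; `head_dim = 128` is an assumption (typical for the Qwen3 family, but not shown in this diff):

```python
# Rough KV-cache footprint for a full 131072-token context window.
num_hidden_layers = 28      # from config.json in this commit
num_key_value_heads = 8     # from config.json in this commit
head_dim = 128              # assumption: standard Qwen3 head dimension
dtype_bytes = 2             # bfloat16 keys/values
max_model_len = 131072

# Both keys and values are cached, per layer and per KV head.
bytes_per_token = 2 * num_hidden_layers * num_key_value_heads * head_dim * dtype_bytes
total_bytes = bytes_per_token * max_model_len

print(bytes_per_token)      # 114688 bytes (~112 KiB) per token
print(total_bytes / 2**30)  # 14.0 GiB for a full-length sequence
```

At roughly 14 GiB for the cache alone, before weights and activations, lowering `max_model_len` is usually the simpler fix on consumer RTX cards.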
-
- ---
-
- ### Interactive REPL Example
-
- The `run_repl()` coroutine launches an **interactive, streaming chat interface** using the vLLM backend with FlashHead enabled.
- It maintains an in-memory chat history and supports simple commands such as `/exit` to quit and `/reset` to clear context.
-
- ```python
- import asyncio
- from embedl.models.vllm.demo import run_repl
-
- model_id = "embedl/Qwen3-0.6B-FlashHead"
-
- if __name__ == "__main__":
-     asyncio.run(
-         run_repl(
-             model=model_id,
-             max_model_len=131072,
-         )
-     )
- ```
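The `/exit` and `/reset` behavior described above can be illustrated with a stdlib-only sketch. This is not Embedl's implementation (`run_repl` is the real entry point); `repl` and `echo` below are hypothetical names, and the model call is stubbed out:

```python
import asyncio

async def repl(lines, generate):
    """Minimal chat loop: '/exit' quits, '/reset' clears history,
    anything else is appended to history and answered."""
    history, replies = [], []
    for line in lines:
        if line == "/exit":
            break
        if line == "/reset":
            history.clear()
            continue
        history.append({"role": "user", "content": line})
        reply = await generate(history)          # model call would stream here
        history.append({"role": "assistant", "content": reply})
        replies.append(reply)
    return replies

async def echo(history):
    # Stand-in for the LLM: report history length and echo the last message.
    return f"({len(history)} msgs) " + history[-1]["content"]

out = asyncio.run(repl(["hi", "/reset", "there", "/exit", "ignored"], echo))
print(out)  # ['(1 msgs) hi', '(1 msgs) there']
```

Note how `/reset` empties the history before "there", so both replies see a one-message context, and nothing after `/exit` is processed.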
-
- ---
-
- ## ⚠️ Important Warning: Hugging Face Transformers Support
-
- > **FlashHead is currently not applied when using the Hugging Face `transformers` pipeline.**
- > Generation through `transformers` will fall back to the standard dense LM head, **disabling FlashHead acceleration**.
- >
- > For now, **we strongly recommend using the vLLM integration** (`embedl.models.vllm.LLM`) to ensure FlashHead is active and optimized for low-latency inference.
- >
- > Full support for the Hugging Face `transformers` pipeline with FlashHead integration will be released **in the coming days**.
-
132
- ---
133
-
134
- ## Limitations
135
-
136
- - Limited to **vLLM 0.10.2** (pinned dependency)
137
- - **Batch size = 1** (real-time generation)
138
- - Currently optimized for **NVIDIA RTX GPUs**
139
-
140
- ---
141
-
142
- ## Roadmap
143
-
144
- Planned improvements:
145
-
146
- - Advanced mixed precision quantization
147
- - Huggingface transformers generation
148
- - vLLM CLI benchmarking for detailed latency evaluation
149
- - `lm-eval-harness` integration for detailed accuracy evaluation
150
- - Upstream support in **Transformers** and **vLLM**
151
- - Compatibility with **GGUF**, **MLC**, **Llama.cpp**, **Ollama**, etc.
152
- - Broader model coverage (larger models, VLMs, VLAs)
153
-
154
- ---
155
-
156
- ## License
157
-
158
- - **Upstream:** Apache Licence 2.0.
159
- - **Optimized Components:** Embedl Models Community Licence v1.0 *(no redistribution)*
160
-
161
- ---
162
-
163
- ## Contact
164
-
165
- **Enterprise & Commercial Inquiries**
166
- [sales@embedl.com](mailto:sales@embedl.com)
167
-
168
- **Technical Issues & Early Access**
169
- [https://github.com/embedl/embedl-models](https://github.com/embedl/embedl-models)
170
-
171
- **More Information & Model Releases**
172
- [https://embedl.com](https://embedl.com)
173
-
174
- ---
175
-
176
- ### Partner & Developer Opportunities
177
-
178
- If you are evaluating on-device inference, building products on SLMs, or exploring custom model optimization, reach out for:
179
-
180
- - Embedl SDK - AI optimization tools & profiling
181
- - Embedl HUB - benchmarking platform
182
- - Engineering support for on-prem/edge deployments
183
- - Migration guidance (Llama / Qwen / Gemma)
184
- - Early access & partner co-marketing opportunities
185
-
186
- Contact: [sales@embedl.com](mailto:sales@embedl.com)
 
+ ---
+ license: other
+ license_name: embedl-models-community-licence-1.0
+ license_link: https://github.com/embedl/embedl-models/blob/main/LICENSE
+ ---
assets/FlashHead.png ADDED

Git LFS Details

  • SHA256: 307acba273a4c07eb62ea72ec170f29e7c14b58600064b3a2638c2f396f508c9
  • Pointer size: 131 bytes
  • Size of remote file: 188 kB
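The "Pointer size" above refers to the small Git LFS pointer file stored in-repo in place of the image. A sketch of the v1 pointer format follows; the exact byte count `188416` is hypothetical (chosen only as a plausible value consistent with "188 kB" — the real size is not shown on this page):

```python
# A Git LFS pointer file (spec v1) is three "key value" lines; the repo
# stores this small text file while the real ~188 kB image lives on LFS.
sha = "307acba273a4c07eb62ea72ec170f29e7c14b58600064b3a2638c2f396f508c9"
size_bytes = 188416  # hypothetical exact size, only the ~188 kB figure is given

pointer = (
    "version https://git-lfs.github.com/spec/v1\n"
    f"oid sha256:{sha}\n"
    f"size {size_bytes}\n"
)
print(len(pointer.encode("ascii")))  # 131, matching "Pointer size: 131 bytes"
```

Any six-digit file size produces a 131-byte pointer, which is why the reported pointer size is independent of the image's exact length within that range.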
config.json CHANGED
@@ -1,10 +1,15 @@
  {
    "architectures": [
-     "Qwen3ForCausalLM"
+     "FlashHeadQwen3ForCausalLM"
    ],
    "attention_bias": false,
    "attention_dropout": 0.0,
+   "auto_map": {
+     "AutoConfig": "configuration_flash_head_qwen3.FlashHeadQwen3Config",
+     "AutoModelForCausalLM": "modeling_flash_head_qwen3.FlashHeadQwen3ForCausalLM"
+   },
    "bos_token_id": 151643,
+   "creation_time": 1766068259.7505608,
    "dtype": "bfloat16",
    "eos_token_id": 151645,
    "flash_head_cache_dir": "flash_head_assets",
@@ -46,7 +51,8 @@
    ],
    "max_position_embeddings": 40960,
    "max_window_layers": 28,
-   "model_type": "qwen3",
+   "model_type": "flash_head_qwen3",
+   "n_clusters": 1,
    "num_attention_heads": 16,
    "num_hidden_layers": 28,
    "num_key_value_heads": 8,
@@ -59,4 +65,4 @@
    "use_cache": true,
    "use_sliding_window": false,
    "vocab_size": 151936
- }
+ }
configuration_flash_head_qwen3.py ADDED
@@ -0,0 +1 @@
+ from embedl.models.qwen.modeling_flash_head import FlashHeadQwen3Config
modeling_flash_head_qwen3.py ADDED
@@ -0,0 +1 @@
+ from embedl.models.qwen.modeling_flash_head import FlashHeadQwen3ForCausalLM
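The two one-line modules above exist so that the `auto_map` entries added to `config.json` can resolve when loading with `trust_remote_code=True`: `transformers` reads each value as `<module_in_repo>.<ClassName>`, where the module names a `.py` file shipped with the checkpoint. A minimal stdlib sketch of that naming convention (not the actual `transformers` implementation):

```python
# Each auto_map value is "<module_in_repo>.<ClassName>": the module is a .py
# file in the checkpoint (the two one-line files added in this commit), and
# the class is imported from it at load time.
auto_map = {
    "AutoConfig": "configuration_flash_head_qwen3.FlashHeadQwen3Config",
    "AutoModelForCausalLM": "modeling_flash_head_qwen3.FlashHeadQwen3ForCausalLM",
}

module_name, class_name = auto_map["AutoModelForCausalLM"].rsplit(".", 1)
print(module_name)  # modeling_flash_head_qwen3  -> modeling_flash_head_qwen3.py
print(class_name)   # FlashHeadQwen3ForCausalLM
```

This is why the stubs simply re-export the classes from the installed `embedl-models` package: the repo only needs files whose names match the `auto_map` module components.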