Thank you! Running it on llama.cpp on an M2 Max now :-)
Thanks for that. It's the biggest model I can run at a usable speed on an M2 Max with 96 GB of unified memory; at 107B-A7B4 it's about the largest I can make use of comfortably. (The next size up, like MiniMax-M2.7, drops TG to ~20 tok/s even at the smallest quant because of its A10B active parameters - too slow.)
After lots of prodding, coding agents (Codex, OpenCode with GLM/DeepSeek/Kimi) adapted a llama.cpp fork to run a 4-bit GGUF quant for me. This is great, I'm loving it. :-)
I uploaded one 4-bit quant of the model here:
https://huggingface.co/ljupco/Ling-2.6-flash-GGUF
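If you want to pull the quant with the Hugging Face CLI, a minimal sketch is below. The filename is the one used in the benchmark command further down; check the repo's file list in case it differs.

```bash
# Sketch: download the 4-bit quant from the repo above.
# Filename taken from the benchmark command later in this post;
# verify against the repo's actual file list before running.
pip install -U "huggingface_hub[cli]"
huggingface-cli download ljupco/Ling-2.6-flash-GGUF \
  Ling-2.6-flash-IQ4_NL-bailing_hybrid-20260505-LJ.gguf \
  --local-dir ~/llama.cpp/models
```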
The code to run it is in this llama.cpp branch:
https://github.com/ljubomirj/llama.cpp/tree/LJ-Ling-2.6-flash-r2
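A minimal sketch of building that branch and doing a quick smoke test (standard llama.cpp CMake build; Metal is enabled by default on Apple Silicon, so no extra flags should be needed there):

```bash
# Sketch: build the LJ-Ling-2.6-flash-r2 branch and run a short prompt.
git clone --branch LJ-Ling-2.6-flash-r2 https://github.com/ljubomirj/llama.cpp
cd llama.cpp
cmake -B build
cmake --build build --config Release -j

# Smoke test: short generation against the downloaded quant.
build/bin/llama-cli \
  -m ~/llama.cpp/models/Ling-2.6-flash-IQ4_NL-bailing_hybrid-20260505-LJ.gguf \
  -p "Hello" -n 64 -c 8192
```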
It holds the generation speed (S_TG tok/s) quite well with increasing context depth, which makes it usable:
build/bin/llama-batched-bench -m ~/llama.cpp/models/Ling-2.6-flash-IQ4_NL-bailing_hybrid-20260505-LJ.gguf -npp 512,1024,2048,4096,8192,16384,32768 -ntg 128 -npl 1 -c 36000
main: n_kv_max = 36096, n_batch = 2048, n_ubatch = 512, flash_attn = -1, is_pp_shared = 0, is_tg_separate = 0, n_gpu_layers = -1, n_threads = 8, n_threads_batch = 8
| PP | TG | B | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s | T s | S t/s |
|---|---|---|---|---|---|---|---|---|---|
| 512 | 128 | 1 | 640 | 1.201 | 426.29 | 2.829 | 45.25 | 4.030 | 158.83 |
| 1024 | 128 | 1 | 1152 | 2.804 | 365.22 | 3.682 | 34.76 | 6.486 | 177.62 |
| 2048 | 128 | 1 | 2176 | 6.085 | 336.54 | 3.691 | 34.67 | 9.777 | 222.56 |
| 4096 | 128 | 1 | 4224 | 12.587 | 325.41 | 3.794 | 33.74 | 16.381 | 257.87 |
| 8192 | 128 | 1 | 8320 | 26.703 | 306.78 | 4.023 | 31.82 | 30.726 | 270.78 |
| 16384 | 128 | 1 | 16512 | 58.853 | 278.39 | 4.358 | 29.37 | 63.211 | 261.22 |
| 32768 | 128 | 1 | 32896 | 134.525 | 243.58 | 4.932 | 25.95 | 139.457 | 235.89 |
llama_perf_context_print: load time = 4333.31 ms
llama_perf_context_print: prompt eval time = 242936.45 ms / 65040 tokens ( 3.74 ms per token, 267.72 tokens per second)
llama_perf_context_print: eval time = 27298.86 ms / 896 runs ( 30.47 ms per token, 32.82 tokens per second)
llama_perf_context_print: total time = 274404.15 ms / 65936 tokens
llama_perf_context_print: graphs reused = 889
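For interactive use, one option (a sketch, not part of the benchmark above) is to serve the same GGUF with llama-server and talk to its OpenAI-compatible endpoint:

```bash
# Sketch: serve the quant locally; llama-server exposes an OpenAI-compatible API.
build/bin/llama-server \
  -m ~/llama.cpp/models/Ling-2.6-flash-IQ4_NL-bailing_hybrid-20260505-LJ.gguf \
  -c 32768 --port 8080

# In another shell: query the chat completions endpoint.
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Say hello in one sentence."}]}'
```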
Wow, this is amazing @ljupco! Thanks SO much for sharing your wonderful work. As an open model builder, this is quite a revelation moment for me. :)