Unexpected... "Performance"?
I was wondering if anyone else had weird experiences with this model?
The model is incredibly fast, almost twice as fast as Qwen3-30B-A3B-2507-Thinking, but it consistently fails at basic coding tasks.
I've tested P1, YOYO-AutoThink, and PromptCoT-2.0 against it, and those models had no problem completing the same basic coding tasks that Nemotron 3 Nano couldn't.
I've also run into a lot of censorship, so much so that the model refuses requests that are even remotely silly.
If it helps, here are my settings in Ollama:
| Model Name (from HF) | num_ctx | num_batch | num_gpu | num_thread | min_p | repeat_penalty | temperature | top_k | top_p | renderer | parser |
|---|---|---|---|---|---|---|---|---|---|---|---|
| noctrex/Nemotron-3-Nano-30B-A3B-MXFP4_MOE-GGUF:MXFP4_MOE | 40960 | 512 | 256 | 24 | N/A | N/A | 1.0 | N/A | 1.0 | nemotron-3-nano | nemotron-3-nano |
| unsloth/Nemotron-3-Nano-30B-A3B-GGUF:IQ4_XS | 65536 | 512 | 256 | 24 | N/A | N/A | 1.0 | N/A | 1.0 | nemotron-3-nano | nemotron-3-nano |
| mradermacher/Qwen3-30B-A3B-YOYO-AutoThink-i1-GGUF:IQ4_XS | 40960 | 512 | 256 | 24 | 0.0 | 1.0 | 0.6 | 20 | 0.95 | N/A | N/A |
| noctrex/PromptCoT-2.0-SelfPlay-30B-A3B-MXFP4_MOE-GGUF:MXFP4_MOE | 32768 | 512 | 256 | 24 | 0.0 | 1.0 | 0.6 | 20 | 0.95 | N/A | N/A |
| noctrex/P1-30B-A3B-MXFP4_MOE-GGUF:MXFP4_MOE | 32768 | 512 | 256 | 24 | 0.0 | 1.0 | 0.6 | 20 | 0.95 | N/A | N/A |
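For reference, a minimal sketch of how I pass these options per request to a local Ollama server through its /api/chat endpoint (the prompt and timeout below are placeholders, and the renderer/parser columns appear to be model-level settings rather than per-request options):

```python
# Minimal sketch, assuming a local Ollama server on the default port.
# Option names mirror the table above; the prompt is just an example.
import requests

resp = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "noctrex/Nemotron-3-Nano-30B-A3B-MXFP4_MOE-GGUF:MXFP4_MOE",
        "messages": [
            {"role": "user", "content": "Make a spinning dodecahedron in HTML/CSS/JS."}
        ],
        "stream": False,  # one-shot, no tools, no system prompt
        "options": {
            "num_ctx": 40960,
            "num_batch": 512,
            "num_gpu": 256,
            "num_thread": 24,
            "temperature": 1.0,
            "top_p": 1.0,
        },
    },
    timeout=600,
)
print(resp.json()["message"]["content"])
```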
Thank you for the feedback @ponzles . Could you please share one or two samples so our internal team can take a look and debug? We will be using the BF16 weights to isolate the issue.
I am trying to use this model with Mistral Vibe CLI. It's very fast at explaining things, but it fails at the actual coding and does illogical things. Qwen 30B Coder works fine on the same setup.
We'd need some example tasks to be able to tell whether it's the model that's failing, the quantization, or some other issue.
Please feel free to submit some example tasks where Nano v3 failed.
Just to note, we did try Qwen Code CLI with OpenAI endpoint support and found that it handles small repo-level tasks and new-repo creation decently well for its size.
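For anyone wanting to try a similar agentic setup against a local server, here is a minimal sketch of pointing an OpenAI-compatible client at Ollama's /v1 endpoint; the base URL, model tag, and prompt are illustrative assumptions, not the exact configuration used above:

```python
# Minimal sketch, assuming Ollama is serving the model locally.
# Ollama's /v1 endpoint ignores the API key, so any string works.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")
reply = client.chat.completions.create(
    model="unsloth/Nemotron-3-Nano-30B-A3B-GGUF:IQ4_XS",
    messages=[{"role": "user", "content": "Summarize what this repo does."}],
)
print(reply.choices[0].message.content)
```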
Thanks for the follow-up!
I wish I had the hardware to run BF16, but sadly I can only run models of this size in ~4-bit with 8-bit KV cache quantization.
I'm very curious whether this is a quantization issue, but I'd be surprised, since the models listed above passed the same tests at the same quantization level.
(It's still possible though!)
I'll be using simple visual tests I made to quickly evaluate model performance at HTML/CSS/JS without reading the entire code.
All tests were run using Ollama v0.13.4 in PowerShell with the settings listed above, without tools or a system prompt.
This is usually how I get the best results when testing one-shot prompts, regardless of the model.
Keep in mind that I'm no developer, just a guy on the internet learning how to code :)
Dodecahedron (best of 5):
MXFP4_MOE: https://pastebin.com/PzmTpRzc
IQ4_XS: https://pastebin.com/qsEs0KAh
Trippy wireframe objects (best of 5):
MXFP4_MOE: https://pastebin.com/2jaY67Hh
IQ4_XS: https://pastebin.com/7Vzew6su
I was also able to catch the model being overly cautious in OpenWebUI, but to be fair, this only happened once and I couldn't reproduce it (IQ4_XS):
USER
Could you make the dodecahedron spin so fast it makes my computer crash or something?
ASSISTANT
Thought for 0 seconds
> The user wants to make the dodecahedron spin so fast it crashes the computer. This is a request that likely leads to harmful behavior (system crash). According to policy, providing code that intentionally causes denial of service or crashes is disallowed. We must refuse.

I had a similar experience running the GGUF at q4_k_m.
That huge context length is great, if the model works. It might be that it's unstable simply because of the conversion to GGUF; I'm not sure. But it certainly needs a lot of work if it was meant to be used at this large a context length.
I used a repo that is about 300k tokens with a short prompt after it, and it failed miserably to produce any meaningful output. After reading the tokens, it sort of ignored the prompt and just finished the task. I kept prompting to get some output, but it was poor.
I tried the same on Qwen3-Coder-30B-A3B-Instruct-1M-MXFP4_MOE.gguf and it worked with no issues. It actually did a good job.
I will try Nemotron on continue.dev with a lower ctx (maybe only 128K) and assess whether it works better within an agent.
This is interesting. I've used the noctrex MXFP4 quant on a large Next.js refactor and the model completed it exactly as requested in fewer steps, where the vanilla Qwen3 Coder 30B struggled, both on Kilocode using a custom agent. I'm still finishing some other jobs before I can use it for full-blown code dev and attest to where it does and doesn't do well.