Hallucination when parsing JSON responses containing numerical data such as 1234.00

#10
by jeisonvendetta - opened

While testing the model, I've noticed that it has trouble parsing JSON responses containing numerical data such as 1234.00, but only when that data is in a tool_response.
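For anyone trying to reproduce this, it may help to first rule out the host-side JSON layer. The sketch below uses a made-up payload (the field names are illustrative, not from the original report) to show that Python's standard parser drops the trailing zero when it builds a float, while `parse_float=Decimal` keeps the literal exactly as written. Whether the model sees `1234.00` or `1234.0` therefore depends on how the tool_response string is produced before it is injected.

```python
import json
from decimal import Decimal

# Hypothetical tool_response payload; field names are illustrative only.
raw_tool_response = '{"symbol": "XYZ", "close": 1234.00}'

# Default parsing converts the literal to a binary float.
as_float = json.loads(raw_tool_response)

# parse_float=Decimal preserves the digits exactly as they appear in the text.
as_decimal = json.loads(raw_tool_response, parse_float=Decimal)

print(as_float["close"])    # 1234.0
print(as_decimal["close"])  # 1234.00
```

If the string is re-serialized after parsing, comparing both forms against what actually reaches the context is a cheap first check.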

Google org

Hey @jeisonvendetta ,

Thank you for reporting this issue. Floating point numbers with trailing zeros can sometimes behave unpredictably, but it's unusual for this to occur strictly within a tool_response block. This might be related either to tokenisation quirks or to how the reasoning parser handles JSON data.

To help us reproduce and diagnose the behavior, could you please provide a bit more detail about your setup? Specifically, it would be helpful to know:

  1. The exact system prompt and tool schema you are using.
  2. The raw tool_response string being injected back into the context.
  3. The inference framework you’re using (Hugging Face Transformers, LiteRT, vLLM).
  4. Your generation parameters, such as temperature, top-p, and whether skip_special_tokens is enabled.

With these details, we can run a reproduction on our side and identify whether this is a model side issue or an artifact of JSON parsing.

I wasn't able to extract the raw conversation from the conversation flow, but I've already found the cause:
Most output errors occur on the GPU when Think is enabled. The model generates malformed numbers like 1.0,293 or 5,90000 and wrong dates like 202026 or 20333.

But this behavior doesn't occur on the CPU. I suppose it's due to quantization.
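One way to quantify the GPU vs CPU difference is to flag every numeric literal in the model's answer that never appeared in the tool_response. This is only a rough sketch: the helper names and the example strings are hypothetical, and the regex deliberately ignores signs and exponents.

```python
import re

# Matches plain integers and decimals such as 293 or 1234.00.
NUMBER_RE = re.compile(r"\d+(?:\.\d+)?")

def extract_numbers(text):
    """Return the set of numeric literals found in text."""
    return set(NUMBER_RE.findall(text))

def unexpected_numbers(tool_response, model_output):
    """Numeric literals in model_output that never occur in tool_response."""
    return sorted(extract_numbers(model_output) - extract_numbers(tool_response))

# Hypothetical example values, chosen to resemble the mangled tokens above.
tool_response = '{"close": 1234.00, "volume": 590000}'
mangled = "The close was 1.0,293 on volume 5,90000."
clean = "The close was 1234.00 on volume 590000."

print(unexpected_numbers(tool_response, mangled))  # ['1.0', '293', '5', '90000']
print(unexpected_numbers(tool_response, clean))    # []
```

Running the same prompts on both backends and counting non-empty results would give a simple mangling rate per backend.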

We are currently using LiteRT-LM with the Conversation API.

We have pushed the model to its maximum capacity on the CPU using chained skill instructions and instructing the model to think harder, and it performs impressively well.

In this video, we ask the model to load a skill that contains instructions for using tools from other skills.

First video.

  • get historical data (OHLCV data)
  • web search (Tavily)
  • file manager (a custom command-line tool that allows you to use grep, regex, and edit lines in files, as well as read, write, and modify them)

Second video.
We ask it to do exactly the same thing, but in a different language, with a different set of symbols, and with temperature set to 0 and topK set to 1. At these values, it performs better and follows instructions without thinking, which takes less time.
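For context on why these settings behave so consistently: with topK set to 1 there is only one candidate token left, so decoding becomes greedy (argmax) and repeatable. The sketch below is a generic top-k sampler written for illustration; it is an assumption about how such samplers typically work, not LiteRT-LM's actual implementation.

```python
import math
import random

def sample_top_k(logits, k, temperature, rng):
    """Generic top-k sampler (illustrative; not LiteRT-LM's actual code)."""
    # Indices of the k largest logits, best first.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # With a single candidate, or temperature 0, the choice is the argmax.
    if k == 1 or temperature == 0:
        return top[0]
    # Otherwise sample from a temperature-scaled softmax over the k candidates.
    scaled = [logits[i] / temperature for i in top]
    peak = max(scaled)
    weights = [math.exp(s - peak) for s in scaled]
    return rng.choices(top, weights=weights, k=1)[0]

rng = random.Random(0)
logits = [0.1, 2.5, 1.7]  # made-up scores for three candidate tokens
picks = {sample_top_k(logits, k=1, temperature=0.0, rng=rng) for _ in range(20)}
print(picks)  # {1}
```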

If we could achieve the same level of precision on the GPU as on the CPU, we could do much more with more complex instructions.

jeisonvendetta changed discussion status to closed
jeisonvendetta changed discussion status to open

We're hitting this too on LiteRT-LM 0.10.x. Filed a broader field report here: https://github.com/google-ai-edge/LiteRT-LM/issues/2202

The latest model with MTP is less stable on both the GPU and CPU in terms of tool calling and numerical precision.

The model in this commit is more stable and accurate on the CPU
https://huggingface.co/litert-community/gemma-4-E4B-it-litert-lm/commit/9695417f248178c63a9f318c6e0c56cb917cb837

We hope they can fix it, or release a notebook showing how to quantize properly and generate quantized .litertlm files.

Thanks for the pointer to the MTP rollout — went and tested it directly. Sharing the numbers because the result was not what either of us expected.

Setup: pulled the pre-MTP .litertlm from HF commit 7fa1d78 (April 1) and re-ran my full eval (357 single-turn cases × 5 styles + 29 multi-turn scenarios, fixture + post-execution verification) against the same Pixel 10a / Mali-G715 / GPU backend / sampler config. Same rig, same device, models swapped.

Topline:

| Metric | MTP | Pre-MTP |
| --- | --- | --- |
| Overall pass | 71.4% | 72.3% |
| bad_json | 8.9% | 8.4% |
| wrong_args | 4.9% | 4.5% |
| Latency p50 / p95 | 9.3s / 26.4s | 8.8s / 19.7s |

High-arity zones (where I'd expect MTP to hurt most if it were the cause):

| Category | MTP | Pre-MTP |
| --- | --- | --- |
| notes-edit | 47.4% | 58.3% |
| notes-create | 43.5% | 48.1% |
| contacts | 52.0% | 52.0% |
| multi-intent | 62.5% | 62.5% |

So:

  • Contacts and multi-intent — exactly the categories I figured would recover if the drafter heads were the issue — moved zero.
  • notes-edit / notes-create scored ~5–11 pp higher pre-MTP, but on n=24–27 that's inside run-to-run noise on this rig.
  • bad_json rate is essentially identical. Same within-category clustering (notes-create still 44% bad_json, edge still ~24%).
  • Latency p95 IS meaningfully better pre-MTP (−6.7s) — which makes sense, no drafter-verify overhead. Real win for that build, just orthogonal to the correctness story.
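A quick back-of-the-envelope check on the noise claim for the notes categories: the standard error of a difference between two pass rates can be estimated from the binomial formula. The per-build sample size of 24 below is an assumption (the post only gives n=24–27), and this treats cases as independent, which is a simplification.

```python
import math

def diff_standard_error(p1, n1, p2, n2):
    """Standard error of (p2 - p1) for two independent binomial proportions."""
    return math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)

# notes-edit pass rates from the table: 47.4% (MTP) vs 58.3% (pre-MTP).
# n=24 per build is an assumed value, not a reported one.
p_mtp, p_pre, n = 0.474, 0.583, 24

se = diff_standard_error(p_mtp, n, p_pre, n)
gap = p_pre - p_mtp

print(f"gap = {gap * 100:.1f} pp, standard error = {se * 100:.1f} pp")
print("gap is under one standard error:", gap < se)
```

With these assumed sizes the standard error comes out around 14 pp, so an 11 pp gap is under one standard error and hard to distinguish from noise.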

Reading: the failure modes I documented (number mangling, bad_json drops, arity-correlated collapse) appear to be structural — i.e. they survive the model rollback. Most likely they're driven by the GPU-precision issue tracked in HF discussion #10 plus the generation-budget-vs-arity ceiling, not by MTP.

The single biggest lever I've found in my own work is still the arity collapse: splitting a 4-arg tool into four 2-arg per-operation tools moved that category from 0% to 47%. That is roughly 10× the impact of swapping which .litertlm build is loaded.
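To make the arity split concrete, here is a minimal sketch of the pattern. Every tool name and argument below is hypothetical (the post does not list the real schemas); the point is only that each narrow tool exposes two arguments to the model while forwarding to the same underlying handler.

```python
import inspect

def notes_tool(operation, note_id=None, title=None, body=None):
    """Hypothetical wide tool: one entry point, four arguments."""
    return {"operation": operation, "note_id": note_id, "title": title, "body": body}

# Narrow per-operation wrappers: two arguments each, operation implied by name.
def create_note(title, body):
    return notes_tool("create", title=title, body=body)

def rename_note(note_id, title):
    return notes_tool("rename", note_id=note_id, title=title)

def replace_note_body(note_id, body):
    return notes_tool("replace_body", note_id=note_id, body=body)

def append_to_note(note_id, body):
    return notes_tool("append", note_id=note_id, body=body)

NARROW_TOOLS = {
    f.__name__: f
    for f in (create_note, rename_note, replace_note_body, append_to_note)
}

def dispatch(call):
    """Route a parsed tool call {"name": ..., "arguments": {...}} to its wrapper."""
    return NARROW_TOOLS[call["name"]](**call["arguments"])

arities = {name: len(inspect.signature(f).parameters) for name, f in NARROW_TOOLS.items()}
result = dispatch({"name": "rename_note", "arguments": {"note_id": "n1", "title": "Groceries"}})
print(result["operation"], result["title"])  # rename Groceries
```

The model then only has to emit two arguments per call, and the host code keeps a single implementation behind the wrappers.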

Posting the negative result in case anyone else was about to chase the same hypothesis. Happy to share the per-category JSON if useful.
