Hallucination when parsing JSON responses containing numerical data such as 1234.00
While testing the model, I've noticed that it has trouble parsing JSON responses containing numerical data such as `1234.00`, but only when that data is in a `tool_response`.
Hey @jeisonvendetta ,
Thank you for reporting this issue. Floating point numbers with trailing zeros can sometimes behave unpredictably, but it's unusual for this to occur strictly within a `tool_response` block. This might be related either to tokenisation quirks or to how the reasoning parser handles JSON data.
To help us reproduce and diagnose the behavior, could you please provide a bit more detail about your setup? Specifically, it would be helpful to know:
- The exact system prompt and tool schema you are using.
- The raw `tool_response` string being injected back into the context.
- The inference framework you're using (Hugging Face Transformers, LiteRT, vLLM).
- Your generation parameters, such as temperature, top-p, and whether `skip_special_tokens` is enabled.
With these details, we can run a reproduction on our side and identify whether this is a model-side issue or an artifact of JSON parsing.
I wasn't able to extract the conversation from the conversation flow.
But I've already found the cause:
Most output errors occur on the GPU when Think is enabled. The model generates tokens like `1.0,293` or `5,90000` and wrong dates like `202026` and `20333`. This behavior doesn't occur on the CPU, so I suppose it's due to quantization.
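One way to catch corruptions like these automatically is to compare every number-like token in the model's answer against the numbers actually present in the tool response. The following is a minimal sketch, assuming the tool response is JSON and the answer is plain text. The function names are hypothetical and are not part of LiteRT-LM or any other library.

```python
import json
import re
from decimal import Decimal, InvalidOperation

# Runs of digits that may contain '.' or ',' inside, e.g. "1234.00" or "1.0,293".
NUMBER_LIKE = re.compile(r"\d[\d.,]*\d|\d")


def numbers_in_json(payload: str) -> set[Decimal]:
    """Collect every numeric value found anywhere in a JSON document."""
    found: set[Decimal] = set()

    def walk(node):
        if isinstance(node, bool):
            return  # bool is a subclass of int; it is not a number here
        if isinstance(node, (int, Decimal)):
            found.add(Decimal(node))
        elif isinstance(node, dict):
            for value in node.values():
                walk(value)
        elif isinstance(node, list):
            for value in node:
                walk(value)

    # parse_float=Decimal keeps "1234.00" exact instead of a binary float.
    walk(json.loads(payload, parse_float=Decimal))
    return found


def suspicious_numbers(answer: str, tool_response: str) -> list[str]:
    """Return number-like tokens in the answer that match no tool value."""
    allowed = numbers_in_json(tool_response)
    flagged = []
    for token in NUMBER_LIKE.findall(answer):
        try:
            value = Decimal(token)
        except InvalidOperation:
            flagged.append(token)  # e.g. "1.0,293" is not a valid number at all
            continue
        if value not in allowed:
            flagged.append(token)
    return flagged
```

With a tool response of `{"close": 1234.00, "volume": 5900}`, the answer "The close was 1.0,293 on volume 5,90000." is flagged for both tokens, while "The close was 1234.00 on volume 5900." passes. Numbers that live inside JSON strings (dates, for example) are not collected by this sketch, so it would need extending before being used on date fields.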
We are currently using LiteRT-LM with the Conversation API.
We have pushed the model to its maximum capacity on the CPU using chained skill instructions and instructing the model to think harder, and it’s actually performing very well, just impressive.
In this video, we ask the model to load a skill that contains instructions for using tools from other skills.
First video.
- get historical data (OHLCV data)
- web search (Tavily)
- file manager (a custom command-line tool that allows you to use grep, regex, and edit lines in files, as well as read, write, and modify them)
Second video.
We ask it to do exactly the same thing, but in a different language and using a different set of symbols... but with temperature set to 0 and topK set to 1. At these values, it performs better and follows instructions without thinking, which takes less time.
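For context on why these settings make the output deterministic: with topK set to 1, only the highest-scoring token survives filtering, so decoding becomes greedy and the temperature stops mattering. The following is a minimal sketch of generic top-k sampling in plain Python. It illustrates the idea only and is not LiteRT-LM's actual sampler.

```python
import math
import random


def sample_top_k(logits: list[float], top_k: int, temperature: float,
                 rng: random.Random) -> int:
    """Sample a token index from the top_k highest logits."""
    # Indices of the top_k largest logits, best first.
    ranked = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:top_k]
    if top_k == 1 or temperature == 0.0:
        # Greedy decoding: always pick the single best token.
        return ranked[0]
    # Softmax over the surviving logits, scaled by temperature.
    scaled = [logits[i] / temperature for i in ranked]
    peak = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scaled]
    return rng.choices(ranked, weights=weights, k=1)[0]


logits = [0.1, 2.5, 2.4, -1.0]
rng = random.Random(0)
# top_k=1 always returns index 1, the argmax, whatever the temperature is.
print({sample_top_k(logits, 1, 1.0, rng) for _ in range(100)})
```

With `top_k=2` and `temperature=1.0` the same call can return either index 1 or index 2, which is where near-tied digit tokens could diverge between runs.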
If we could achieve the same level of precision on the GPU as on the CPU, we could do much more with more complex instructions.
We're hitting this too on LiteRT-LM 0.10.x. Filed a broader field report here: https://github.com/google-ai-edge/LiteRT-LM/issues/2202
The latest model with MTP is less stable on both the GPU and CPU in terms of tool calling and numerical precision.
The model in this commit is more stable and accurate on the CPU:
https://huggingface.co/litert-community/gemma-4-E4B-it-litert-lm/commit/9695417f248178c63a9f318c6e0c56cb917cb837
We hope they can fix it, or release a notebook on how to quantize properly to generate a quantized .litertlm file.
Thanks for the pointer to the MTP rollout — went and tested it directly. Sharing the numbers because the result was not what either of us expected.
Setup: pulled the pre-MTP .litertlm from HF commit 7fa1d78 (April 1) and re-ran my full eval (357 single-turn cases × 5 styles + 29 multi-turn scenarios, fixture + post-execution verification) against the same Pixel 10a / Mali-G715 / GPU backend / sampler config. Same rig, same device, models swapped.
Topline:
| | MTP | Pre-MTP |
|---|---|---|
| Overall pass | 71.4% | 72.3% |
| `bad_json` | 8.9% | 8.4% |
| `wrong_args` | 4.9% | 4.5% |
| Latency p50 / p95 | 9.3s / 26.4s | 8.8s / 19.7s |
High-arity zones (where I'd expect MTP to hurt most if it were the cause):
| | MTP | Pre-MTP |
|---|---|---|
| notes-edit | 47.4% | 58.3% |
| notes-create | 43.5% | 48.1% |
| contacts | 52.0% | 52.0% |
| multi-intent | 62.5% | 62.5% |
So:
- Contacts and multi-intent — exactly the categories I figured would recover if the drafter heads were the issue — moved zero.
- notes-edit / notes-create gained ~5–11 pp, but on n=24–27 that's inside run-to-run noise on this rig.
- `bad_json` rate is essentially identical. Same within-category clustering (notes-create still 44% `bad_json`, edge still ~24%).
- Latency p95 IS meaningfully better pre-MTP (−6.7s), which makes sense: no drafter-verify overhead. Real win for that build, just orthogonal to the correctness story.
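To put a rough number on the "inside noise" judgment for the notes categories, a confidence interval for a pass rate at this sample size helps. The following is a minimal sketch using the Wilson score interval. The counts (14 passes out of 24) are an assumption chosen to match 58.3%, since the exact per-category counts are not given in this thread, and the interval only captures sampling noise, not rig-specific variance.

```python
import math


def wilson_interval(passes: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is about 95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = passes / total
    denom = 1 + z * z / total
    centre = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return centre - half, centre + half


# Assumed counts: 14/24 is 58.3%; the thread only says n is 24 to 27.
low, high = wilson_interval(14, 24)
print(f"{low:.3f} .. {high:.3f}")
```

Under that assumption the interval runs from roughly 39% to 76%, which comfortably contains the 47.4% measured on the MTP build, so a gap of this size at n≈24 cannot be separated from chance.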
Reading: the failure modes I documented (number mangling, bad_json drops, arity-correlated collapse) appear to be structural — i.e. they survive the model rollback. Most likely they're driven by the GPU-precision issue tracked in HF discussion #10 plus the generation-budget-vs-arity ceiling, not by MTP.
The single biggest lever I've found in my own work is still the arity collapse: splitting a 4-arg tool into four 2-arg per-operation tools moved that category from 0% to 47%. That is roughly 10× the impact of swapping which .litertlm build is loaded.
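The arity split can be illustrated with tool declarations. The following is a sketch with hypothetical tool and parameter names (the actual tools from the eval are not shown in this thread), written as JSON-Schema-style function declarations in Python dicts.

```python
import json


def function_tool(name: str, description: str, params: dict[str, str],
                  required: list[str]) -> dict:
    """Build a JSON-Schema-style function declaration (string params only)."""
    return {
        "name": name,
        "description": description,
        "parameters": {
            "type": "object",
            "properties": {
                key: {"type": "string", "description": text}
                for key, text in params.items()
            },
            "required": required,
        },
    }


# Before: one 4-arg tool. The model must choose the operation AND work out
# which of the remaining arguments apply to it.
COMBINED_TOOL = function_tool(
    "notes",
    "Create or edit a note.",
    {
        "operation": "One of: create, rename, replace_body, append_body.",
        "note_id": "Target note (edit operations only).",
        "title": "Note title (create and rename only).",
        "body": "Note text (create, replace_body, append_body only).",
    },
    required=["operation"],
)

# After: four 2-arg tools, one per operation, every argument always required.
SPLIT_TOOLS = [
    function_tool("notes_create", "Create a note.",
                  {"title": "Note title.", "body": "Note text."}, ["title", "body"]),
    function_tool("notes_rename", "Rename a note.",
                  {"note_id": "Target note.", "title": "New title."}, ["note_id", "title"]),
    function_tool("notes_replace_body", "Replace a note's text.",
                  {"note_id": "Target note.", "body": "New text."}, ["note_id", "body"]),
    function_tool("notes_append_body", "Append text to a note.",
                  {"note_id": "Target note.", "body": "Text to append."}, ["note_id", "body"]),
]


def arity(tool: dict) -> int:
    """Number of arguments the model has to consider for a tool."""
    return len(tool["parameters"]["properties"])


print(arity(COMBINED_TOOL), [arity(t) for t in SPLIT_TOOLS])
print(json.dumps(SPLIT_TOOLS[0], indent=2))
```

The trade-off is a longer tool list in the prompt in exchange for calls where every argument is required and none is conditional on another.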
Posting the negative result in case anyone else was about to chase the same hypothesis. Happy to share the per-category JSON if useful.