Instructions for using google/gemma-4-26B-A4B-it with libraries, inference providers, notebooks, and local apps.

Libraries

How to use google/gemma-4-26B-A4B-it with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="google/gemma-4-26B-A4B-it")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
pipe(text=messages)

# Load model directly
from transformers import AutoProcessor, AutoModelForImageTextToText

processor = AutoProcessor.from_pretrained("google/gemma-4-26B-A4B-it")
model = AutoModelForImageTextToText.from_pretrained("google/gemma-4-26B-A4B-it")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(processor.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Inference: HuggingChat
Notebooks: Google Colab, Kaggle, AMD Developer Cloud
Local Apps

vLLM

How to use google/gemma-4-26B-A4B-it with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "google/gemma-4-26B-A4B-it"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "google/gemma-4-26B-A4B-it",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'
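
For readers who prefer Python over curl, the same request can be sketched with the standard library alone. This is a minimal sketch, not an official client: the helper name `build_chat_request` is hypothetical, and the base URL assumes the server started above is listening on its default port (use `http://localhost:30000` for a server launched on port 30000). The request is only built here; sending it requires a running server.

```python
import json
import urllib.request


def build_chat_request(base_url, model, prompt, image_url):
    """Build (but do not send) an OpenAI-style chat completion request.

    Hypothetical helper: it mirrors the JSON body of the curl example.
    """
    payload = {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {"type": "image_url", "image_url": {"url": image_url}},
                ],
            }
        ],
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_chat_request(
    base_url="http://localhost:8000",  # assumed local server address
    model="google/gemma-4-26B-A4B-it",
    prompt="Describe this image in one sentence.",
    image_url="https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg",
)

# Sending is left commented out because it needs a running server:
# with urllib.request.urlopen(request) as response:
#     reply = json.load(response)
#     print(reply["choices"][0]["message"]["content"])
```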

Use Docker

docker model run hf.co/google/gemma-4-26B-A4B-it

SGLang

How to use google/gemma-4-26B-A4B-it with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "google/gemma-4-26B-A4B-it" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "google/gemma-4-26B-A4B-it",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "google/gemma-4-26B-A4B-it" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "google/gemma-4-26B-A4B-it",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Docker Model Runner
How to use google/gemma-4-26B-A4B-it with Docker Model Runner:
```
docker model run hf.co/google/gemma-4-26B-A4B-it
```

fix: function calling formatting in chat template

#20

by RyanMullins - opened Apr 9

base: refs/heads/main ← from: refs/pr/20

+87 -15

RyanMullins

Google org Apr 9

No description provided.

fix: function calling formatting in chat template (75802dbc)

Sadmank

Apr 10

The updated one tested on llama.cpp doesn't use all tools properly, Gemini fixed it a bit: https://pastebin.com/raw/hnPGq0ht. Works better somehow?

kukalikuk

Apr 11

> The updated one tested on llama.cpp doesn't use all tools properly, Gemini fixed it a bit: https://pastebin.com/raw/hnPGq0ht. Works better somehow?

Error rendering prompt with jinja template: "Unknown test: sequence".

Sadmank

Apr 11

> > The updated one tested on llama.cpp doesn't use all tools properly, Gemini fixed it a bit: https://pastebin.com/raw/hnPGq0ht. Works better somehow?
>
> Error rendering prompt with jinja template: "Unknown test: sequence".

Maybe you are using LM Studio? Try: https://pastebin.com/raw/qc1FTAcG

aldehir

Apr 11 (edited Apr 11)

When an assistant message contains both content and tool_calls, the rendered prompt ends with <turn|>\n but does not append <|turn>model\n, even when add_generation_prompt = true.

Example run on https://huggingface.co/spaces/huggingfacejs/chat-template-playground?modelId=google%2Fgemma-4-26B-A4B-it

Parameters

{
  "messages": [
    {
      "role": "user",
      "content": "What's the weather like in Austin today?"
    },
    {
      "role": "assistant",
      "content": "Let me call the weather tool",
      "reasoning_content": "The user is asking about the weather in Austin today. I should check the available tools to see if there's a way to get weather information. The `get_weather` tool seems appropriate for this task. It requires a `location` parameter. The user provided \"Austin\", which I can use. I'll assume the default unit is Fahrenheit unless specified otherwise, which is fine.\n\nPlan:\n1. Call `get_weather` with `location='Austin, TX'`.\n2. Present the result to the user.",
      "tool_calls": [
        {
          "type": "function",
          "function": {
            "name": "get_weather",
            "arguments": "{\"location\":\"Austin, TX\"}"
          },
          "id": "6wBvNMNHDg2cBAgr9Ob1N9i3tM2algb5"
        }
      ]
    },
    {
      "role": "tool",
      "content": "It is 78F and sunny",
      "tool_call_id": "6wBvNMNHDg2cBAgr9Ob1N9i3tM2algb5"
    }
  ],
  "add_generation_prompt": true
}

Rendered prompt

<|turn>user
What's the weather like in Austin today?<turn|>
<|turn>model
<|channel>thought
The user is asking about the weather in Austin today. I should check the available tools to see if there's a way to get weather information. The `get_weather` tool seems appropriate for this task. It requires a `location` parameter. The user provided "Austin", which I can use. I'll assume the default unit is Fahrenheit unless specified otherwise, which is fine.

Plan:
1. Call `get_weather` with `location='Austin, TX'`.
2. Present the result to the user.
<channel|><|tool_call>call:get_weather{{"location":"Austin, TX"}}<tool_call|><|tool_response>response:get_weather{value:<|"|>It is 78F and sunny<|"|>}<tool_response|>Let me call the weather tool<turn|>

What should happen here? Should the content be treated as reasoning so that the agentic flow is preserved? Or should <|turn>model\n be appended to signal the model's turn? At the moment, this causes odd behavior on the 26B variant, which starts generating:

thought\n<channel|>

Instead of:

<|channel>thought\n<channel|>
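
The condition described above can be detected with a plain string check on the rendered prompt. The following is a minimal sketch for use in a test, not a fix for the template: the function names are hypothetical, the marker strings are copied from the rendered prompt in this comment, and the sample is only an abbreviated tail of that prompt. Appending `<|turn>model\n` is shown as one of the options raised above, not as the confirmed intended behavior.

```python
# Markers copied from the rendered prompt quoted in this thread.
GENERATION_PROMPT = "<|turn>model\n"
TURN_END = "<turn|>\n"


def ends_with_generation_prompt(rendered: str) -> bool:
    """Return True if the prompt ends by opening a model turn."""
    return rendered.endswith(GENERATION_PROMPT)


def ends_with_closed_turn(rendered: str) -> bool:
    """Return True if the prompt ends on a closed turn, as reported above."""
    return rendered.endswith(TURN_END)


# Abbreviated tail of the rendered prompt from the report.
reported_tail = "<tool_response|>Let me call the weather tool<turn|>\n"

# One option raised above: open a model turn after the closed turn.
with_model_turn = reported_tail + GENERATION_PROMPT

print(ends_with_generation_prompt(reported_tail))    # False
print(ends_with_closed_turn(reported_tail))          # True
print(ends_with_generation_prompt(with_model_turn))  # True
```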


Cannot merge

This branch has merge conflicts in the following files:

chat_template.jinja
