Granite Speech 4.0 1B - GGUF

GGUF conversion of ibm-granite/granite-4.0-1b-speech for use with llama.cpp audio-multimodal support (see PR #22101).

| File | Quant | Size |
|---|---|---|
| granite-speech-4.0-1b-Q8_0.gguf | Q8_0 | ~1.8 GB |
| mmproj-granite-speech-4.0-1b-f16.gguf | f16 | ~1.1 GB |

The LLM is quantized to Q8_0; the audio encoder + QFormer projector (mmproj) stays at f16 because llama-quantize does not support the clip architecture used by audio projectors.

Usage

One-shot CLI

llama-mtmd-cli -hf staghado/granite-speech-4.0-1b-GGUF \
  --audio my.wav \
  -p "can you transcribe the speech into a written format?" \
  --jinja --temp 0

Server (OpenAI-compatible)

llama-server -hf staghado/granite-speech-4.0-1b-GGUF --jinja

Then POST to /v1/chat/completions:

{
  "messages": [{
    "role": "user",
    "content": [
      {"type": "text", "text": "can you transcribe the speech into a written format?"},
      {"type": "input_audio", "input_audio": {"data": "<base64 wav>", "format": "wav"}}
    ]
  }]
}
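
The request body above can also be assembled programmatically. The following is a minimal sketch using only the Python standard library; the helper names `build_payload` and `transcribe`, the `http://localhost:8080` base URL, and the OpenAI-style response shape (`choices[0].message.content`) are assumptions for illustration, so adjust them to whatever your llama-server instance reports.

```python
import base64
import json
import urllib.request

PROMPT = "can you transcribe the speech into a written format?"


def build_payload(wav_bytes: bytes, prompt: str = PROMPT) -> dict:
    """Build a chat-completions body with one text part and one audio part."""
    audio_b64 = base64.b64encode(wav_bytes).decode("ascii")
    return {
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "input_audio",
                 "input_audio": {"data": audio_b64, "format": "wav"}},
            ],
        }]
    }


def transcribe(wav_path: str, base_url: str = "http://localhost:8080") -> str:
    """POST a WAV file to the server and return the reply text.

    base_url is an assumption; use the address your server prints at startup.
    """
    with open(wav_path, "rb") as f:
        payload = build_payload(f.read())
    req = urllib.request.Request(
        base_url + "/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    # Assumes the usual OpenAI-style response layout.
    return body["choices"][0]["message"]["content"]
```

With the server running, `print(transcribe("my.wav"))` would send the file and print the returned text.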

Audio input requirements: 16 kHz mono WAV (or MP3). Preprocessing parameters, as reported by load_hparams when the GGUF is loaded: n_mel_bins=160, audio_n_fft=512, audio_window_len=400, audio_hop_len=160, audio_sample_rate=16000.
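
To check a WAV file against these requirements before sending it, something like the sketch below may help. It uses Python's standard `wave` module; `describe_wav` is a hypothetical helper name. It also works out what the window and hop lengths mean in milliseconds at 16 kHz, which is plain arithmetic on the values above.

```python
import wave

SAMPLE_RATE = 16000  # audio_sample_rate
WINDOW_LEN = 400     # audio_window_len, in samples
HOP_LEN = 160        # audio_hop_len, in samples


def describe_wav(path: str) -> dict:
    """Report sample rate, channel count, duration, and whether the file is 16 kHz mono."""
    with wave.open(path, "rb") as w:
        rate = w.getframerate()
        channels = w.getnchannels()
        n_frames = w.getnframes()
    return {
        "sample_rate": rate,
        "channels": channels,
        "seconds": n_frames / rate,
        "ok": rate == SAMPLE_RATE and channels == 1,
    }


# Derived timing at 16 kHz: a 25 ms analysis window advanced every 10 ms,
# i.e. about 100 feature frames per second of audio.
window_ms = 1000 * WINDOW_LEN / SAMPLE_RATE
hop_ms = 1000 * HOP_LEN / SAMPLE_RATE
frames_per_second = SAMPLE_RATE / HOP_LEN
```

If `describe_wav("my.wav")["ok"]` is false, the file can be resampled or downmixed with any audio tool before use.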

Conversion provenance

Produced with llama.cpp master (commit 2e97c5f96, build 9100) using:

python convert_hf_to_gguf.py granite-src --outtype f16 --outfile granite-speech-4.0-1b-f16.gguf
python convert_hf_to_gguf.py granite-src --outtype f16 --mmproj --outfile mmproj-granite-speech-4.0-1b-f16.gguf
llama-quantize granite-speech-4.0-1b-f16.gguf granite-speech-4.0-1b-Q8_0.gguf Q8_0
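
After running these steps, a quick sanity check on the output files can catch a truncated or mis-named file early. The sketch below reads only the fixed-size GGUF header; `read_gguf_header` is a hypothetical helper, and the layout it assumes (4-byte `GGUF` magic, little-endian uint32 version, then uint64 tensor and metadata counts) is the one used by GGUF version 2 and later.

```python
import struct


def read_gguf_header(path: str) -> dict:
    """Read the magic, version, and counts from the start of a GGUF file.

    Assumes the GGUF v2+ little-endian layout: magic, uint32 version,
    uint64 tensor count, uint64 metadata key/value count.
    """
    with open(path, "rb") as f:
        head = f.read(24)
    if len(head) < 24 or head[:4] != b"GGUF":
        raise ValueError(f"{path} does not look like a GGUF file")
    version, n_tensors, n_kv = struct.unpack("<IQQ", head[4:24])
    return {"version": version, "tensors": n_tensors, "metadata_kv": n_kv}
```

For example, `read_gguf_header("granite-speech-4.0-1b-Q8_0.gguf")` should return a small dictionary rather than raise.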

License

Inherits Apache 2.0 from the base model.
