I’d first check the tokenizer files tbh.
True…
Hmm, in the simplest case, it’s possible that you called model.save_pretrained() but forgot to call tokenizer.save_pretrained(). This would result in a model folder missing only the tokenizer-related files, which would explain the symptoms. If you don’t get any warnings when converting the official Qwen 3.5, then this or a similar issue is likely the culprit.
However, since there are cases where Qwen 3.5 and GGUF do not work as expected, I think it’s best to suspect an issue specific to the Qwen 3.5 series. If converting models from other series to GGUF works fine, then a specific issue with this series is likely the cause.
convert_hf_to_gguf.py: “The BPE pre-tokenizer was not recognized” after fine-tuning Qwen3.5-4B
I would treat this as a tokenizer / GGUF metadata compatibility problem, not primarily a transformers problem.
Your traceback gets all the way to the vocabulary/tokenizer phase:
prepare_metadata()
set_vocab()
_set_vocab_gpt2()
get_vocab_base()
get_vocab_base_pre(tokenizer)
raise NotImplementedError(...)
That means convert_hf_to_gguf.py has already started processing the model and is now trying to encode the tokenizer contract into GGUF metadata. The failure happens because llama.cpp cannot recognize the BPE pre-tokenizer behavior loaded from your model folder.
The key line is:
chkhsh: 1444df51289cfa8063b96f0e62b1125440111bc79a52003ea14b6eac7016fd5f
That chkhsh is a tokenizer fingerprint. llama.cpp’s convert_hf_to_gguf_update.py generates hash-to-pre-tokenizer mappings for get_vocab_base_pre(). If your tokenizer produces a hash that is not in that mapping, the converter refuses to guess.
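Conceptually, the fingerprint is just a hash of how the tokenizer encodes a fixed battery of test text. A minimal illustration of the idea (the real test string and details live in convert_hf_to_gguf.py, so treat this as a sketch, not the actual implementation):

import hashlib
from transformers import AutoTokenizer

# Illustrative test text only; llama.cpp uses its own, much longer multilingual string.
test_text = "Hello world\n你好,世界\n🙂🚀 café naïve\ndef f(x):\n    return x + 1\n"

tok = AutoTokenizer.from_pretrained("<merged_model_dir>", trust_remote_code=True)
ids = tok.encode(test_text, add_special_tokens=False)

# Any change in pre-tokenization, vocab, or merges changes the token IDs, and therefore the hash.
print(hashlib.sha256(str(ids).encode()).hexdigest())

Anything that changes that encoding, even slightly, produces an unknown fingerprint and triggers exactly the error you are seeing.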
So the short version is:
The model weights may be fine. The folder you are converting contains a tokenizer configuration that your llama.cpp converter cannot map to a known GGUF pre-tokenizer type.
Why upgrading transformers did not fix it
Upgrading transformers helps when Transformers cannot load a model, config, or tokenizer. But this failure is inside llama.cpp’s conversion code.
The Hugging Face llama.cpp integration docs describe the conversion process as roughly:
- load config.json with AutoConfig,
- load tokenizer information with AutoTokenizer,
- select a converter class from the model architecture,
- map tensors,
- write GGUF weights, tokenizer metadata, and model metadata.
Your failure happens after the tokenizer is loaded, when llama.cpp tries to classify the tokenizer’s BPE pre-tokenization behavior.
So this is not simply:
Transformers is too old.
It is more like:
llama.cpp does not recognize the tokenizer behavior in <model_dir>.
That is also why running convert_hf_to_gguf_update.py may not help automatically. That script is mainly a converter-maintenance tool: it regenerates known pre-tokenizer hashes from models listed in the script. It does not magically repair a local fine-tuned folder whose tokenizer files changed or are incomplete.
Relevant source: convert_hf_to_gguf_update.py.
Background: what a “BPE pre-tokenizer” is
A tokenizer is not only a vocabulary file.
A simplified Hugging Face tokenizer pipeline is:
raw text
-> normalizer
-> pre-tokenizer
-> BPE model / merges
-> post-processor / special-token handling
-> token IDs
The Hugging Face Tokenizers docs describe the PreTokenizer as the component that splits text before the tokenizer model applies BPE/WordPiece/Unigram rules.
This matters because two tokenizers can have:
- the same vocabulary size,
- the same model architecture,
- similar-looking special tokens,
but still produce different token IDs if the pre-tokenizer differs.
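You can see which pre-tokenizer a folder actually declares, because the fast-tokenizer tokenizer.json stores it as plain JSON. A small sketch, assuming both folders contain a tokenizer.json:

import json
from pathlib import Path

for folder in ["<base_model_dir>", "<merged_model_dir>"]:
    data = json.loads((Path(folder) / "tokenizer.json").read_text(encoding="utf-8"))
    # The "pre_tokenizer" block describes how text is split before the BPE model runs.
    print(folder)
    print(json.dumps(data.get("pre_tokenizer"), indent=2, ensure_ascii=False))

If the two pre_tokenizer blocks differ, that alone can explain a changed fingerprint.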
Examples where pre-tokenization differences can matter:
"Hello world"
" Hello world"
"Hello\nworld"
"你好,世界"
"こんにちは世界"
"🙂🚀 café naïve"
"<|im_start|>user\nHello<|im_end|>\n<|im_start|>assistant\n"
If llama.cpp wrote the wrong tokenizer.ggml.pre metadata, the resulting GGUF could load but tokenize prompts differently from Transformers. That can cause bad output, broken Unicode handling, broken chat markers, or high perplexity. So llama.cpp stops instead of guessing.
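If you do end up with a GGUF at some point and want to confirm which pre-tokenizer got recorded, the gguf Python package installed by llama.cpp's requirements.txt ships a dump utility. The command name and output format can vary between versions, so treat this as a hint rather than gospel:

gguf-dump <some_model.gguf> | grep -i "tokenizer.ggml"

The tokenizer.ggml.pre value in that output is what llama.cpp will use to pre-tokenize prompts at runtime.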
Why Qwen3.5 makes this easier to hit
Qwen3.5 support in llama.cpp is relatively recent and commit-sensitive.
There are recent llama.cpp issues around Qwen3.5 conversion support, including reports of Qwen3_5ForCausalLM not being supported in some converter paths.
Your error is not exactly the same as those architecture errors, because your traceback reaches tokenizer handling. But the lesson is still important:
“Qwen-ish support exists” does not necessarily mean “my exact Qwen3.5 variant, my exact tokenizer files, and my exact llama.cpp commit are supported.”
Also, the official Qwen/Qwen3.5-4B repo contains several important tokenizer/config/processor files. The file list includes things like:
tokenizer.json
tokenizer_config.json
vocab.json
merges.txt
chat_template.jinja
preprocessor_config.json
video_preprocessor_config.json
config.json
See the repo file listing here: Qwen/Qwen3.5-4B/tree/main.
For Qwen3.5, I would treat tokenizer and processor files as part of the model contract, not as disposable side files.
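A quick way to spot support files that exist on the hub but are missing from your local folder is to compare file listings. A sketch, assuming Qwen/Qwen3.5-4B is the repo you trained from and that you have network access:

from pathlib import Path
from huggingface_hub import list_repo_files

repo_files = set(list_repo_files("Qwen/Qwen3.5-4B"))
local_files = {p.name for p in Path("<merged_model_dir>").iterdir() if p.is_file()}

# Weight shards will legitimately differ; focus on tokenizer/config/processor files.
missing = [f for f in sorted(repo_files - local_files) if not f.endswith(".safetensors")]
print("on the hub but missing locally:", missing)

Missing tokenizer or processor files in that list are exactly the kind of drift discussed below.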
Most likely causes, ranked
1. Your fine-tuned or merged folder has tokenizer drift
This is the most likely case if:
- the original base model converts with the same llama.cpp commit,
- your fine-tuned/merged folder fails,
- you did not intentionally add tokens,
- your training/export tool saved or regenerated tokenizer files.
Tokenizer drift means that files such as these differ from the base model:
tokenizer.json
tokenizer_config.json
vocab.json
merges.txt
special_tokens_map.json
added_tokens.json
chat_template.jinja
preprocessor_config.json
video_preprocessor_config.json
This can happen even if you never manually edited tokenizer files. Fine-tuning tools often call save_pretrained(), copy partial artifacts, rewrite tokenizer_config.json, alter chat templates, or omit files that the original base repo had.
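You can demonstrate this to yourself: simply round-tripping the base tokenizer through save_pretrained() with your current transformers version can already produce byte-different files, even though no behavior was intentionally changed. A sketch (the output directory is a placeholder):

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("<base_model_dir>", trust_remote_code=True)
# Hypothetical scratch directory; compare its files against <base_model_dir> afterwards.
tok.save_pretrained("<resaved_tokenizer_dir>")
# Re-serialization alone can reorder keys, move the chat template, or drop files
# such as vocab.json/merges.txt that the original repo shipped separately.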
If the tokenizer was not intentionally changed during training, the safest practical fix is often to copy the tokenizer-related files from the exact base model revision back into the merged folder.
2. Your llama.cpp checkout is too old for the exact Qwen3.5 path
If the original base model also fails with the same kind of error, then your fine-tune is probably not the main issue.
In that case, update llama.cpp itself, not just Python packages:
cd <llama_cpp_dir>
git pull --rebase
python -m pip install -U -r requirements.txt
python convert_hf_to_gguf_update.py
Then retry converting the base model.
Qwen3.5-related converter support has changed recently, so the exact llama.cpp commit matters.
3. You added or changed tokens during fine-tuning
If your training code did anything like:
tokenizer.add_tokens(...)
tokenizer.add_special_tokens(...)
model.resize_token_embeddings(len(tokenizer))
then copying base tokenizer files can be wrong.
Why? Because the model’s embedding matrix may now contain rows for new token IDs. If you overwrite the tokenizer with the base tokenizer, token IDs and embedding rows can disagree.
In that case, first verify:
len(tokenizer) == config.vocab_size == embedding rows
If these do not match, fix the merged Transformers folder before trying GGUF conversion.
4. You are mixing Qwen3.5 base / instruct / text-only / multimodal artifacts
Qwen3.5-4B does not follow the old-style, plain text-only repo layout. Some Qwen3.5 workflows involve multimodal files, processor configs, chat templates, or separate projector handling.
Be careful not to mix files from:
Qwen/Qwen3.5-4B
Qwen/Qwen3.5-4B-Base
an Unsloth Qwen3.5 repo
a text-only derivative
a LoRA adapter folder
a merged full model folder
a GGUF repo
Use tokenizer files from the exact model and revision you trained from, not from a “nearby” Qwen model.
The decisive diagnostic test
Before editing anything, test the original base model with the same llama.cpp commit.
Step 1: record your environment
cd <llama_cpp_dir>
git rev-parse HEAD
python --version
python -m pip show transformers tokenizers huggingface_hub gguf sentencepiece protobuf
Also record:
base model: <base_model_name>
base revision: <base_model_revision_or_unknown>
fine-tuning method: <lora_qlora_full_finetune>
merged folder: <merged_model_dir>
did you add tokens: <yes_or_no>
did you change chat_template: <yes_or_no>
target: <text_only_or_multimodal>
Step 2: download the exact base model
If the base was Qwen/Qwen3.5-4B:
hf download Qwen/Qwen3.5-4B \
--local-dir <base_model_dir> \
--include "*.safetensors" \
--include "*.json" \
--include "*.txt" \
--include "*.jinja"
If you know the exact revision you trained from, pin it:
hf download Qwen/Qwen3.5-4B \
--revision <base_model_revision> \
--local-dir <base_model_dir> \
--include "*.safetensors" \
--include "*.json" \
--include "*.txt" \
--include "*.jinja"
Step 3: try converting the base model
python <llama_cpp_dir>/convert_hf_to_gguf.py \
<base_model_dir> \
--outtype bf16 \
--outfile <base_model_dir>/base-bf16.gguf
Use BF16/F16 for debugging. Do not make your first target a 4-bit quant.
The normal Qwen flow is:
Transformers folder -> high-precision GGUF -> quantized GGUF
See the official Qwen llama.cpp quantization guide: llama.cpp - Qwen.
How to interpret the result
| Result | Meaning |
| --- | --- |
| Base model converts | llama.cpp probably supports the base tokenizer. Your fine-tuned/merged folder likely drifted. |
| Base model fails with the same BPE pre-tokenizer hash | Your llama.cpp checkout probably does not support that exact tokenizer state. |
| Base model fails with an architecture error | You likely need newer llama.cpp Qwen3.5 architecture support. |
| Base converts, fine-tuned model fails | Compare and probably restore tokenizer files, unless you added tokens. |
This test is the most important one.
Compare tokenizer files
Run this against the base folder and your merged/fine-tuned folder:
from pathlib import Path
import hashlib
base = Path("<base_model_dir>")
ft = Path("<merged_model_dir>")
files = [
"tokenizer.json",
"tokenizer_config.json",
"vocab.json",
"merges.txt",
"chat_template.jinja",
"special_tokens_map.json",
"added_tokens.json",
"config.json",
"processor_config.json",
"preprocessor_config.json",
"video_preprocessor_config.json",
]
def sha(p):
if not p.exists():
return "MISSING"
return hashlib.sha256(p.read_bytes()).hexdigest()
for name in files:
b = sha(base / name)
f = sha(ft / name)
print(f"{name:32} {'same' if b == f else 'DIFF'}")
print(f" base: {b}")
print(f" ft: {f}")
Suspicious results if you did not add tokens:
tokenizer.json DIFF
tokenizer_config.json DIFF
vocab.json MISSING
merges.txt MISSING
added_tokens.json added or changed
special_tokens_map.json changed
chat_template.jinja missing or changed
processor/preprocessor files missing
Check whether tokenization actually changed
Hashes are useful, but direct token-ID comparison is even more concrete.
from transformers import AutoTokenizer
base_tok = AutoTokenizer.from_pretrained("<base_model_dir>", trust_remote_code=True)
ft_tok = AutoTokenizer.from_pretrained("<merged_model_dir>", trust_remote_code=True)
tests = [
"Hello world",
" Hello world",
"Hello\nworld",
"a b c",
"你好,世界",
"こんにちは世界",
"🙂🚀 café naïve",
"<|im_start|>user\nHello<|im_end|>\n<|im_start|>assistant\n",
"def f(x):\n return x + 1",
]
for s in tests:
b = base_tok.encode(s, add_special_tokens=False)
f = ft_tok.encode(s, add_special_tokens=False)
print("\nTEXT:", repr(s))
print("same:", b == f)
if b != f:
print("base:", b[:100])
print("ft: ", f[:100])
If these differ and you did not intentionally change the tokenizer, that strongly points to tokenizer drift.
If you did not add tokens: likely fix
If all of this is true:
- base model converts,
- fine-tuned/merged model fails,
- you did not add tokens,
- tokenizer files differ,
then copy tokenizer/config support files from the exact base model revision into your merged folder.
cp <base_model_dir>/tokenizer.json <merged_model_dir>/
cp <base_model_dir>/tokenizer_config.json <merged_model_dir>/
cp <base_model_dir>/vocab.json <merged_model_dir>/
cp <base_model_dir>/merges.txt <merged_model_dir>/
cp <base_model_dir>/chat_template.jinja <merged_model_dir>/
cp <base_model_dir>/special_tokens_map.json <merged_model_dir>/ 2>/dev/null || true
cp <base_model_dir>/added_tokens.json <merged_model_dir>/ 2>/dev/null || true
cp <base_model_dir>/processor_config.json <merged_model_dir>/ 2>/dev/null || true
cp <base_model_dir>/preprocessor_config.json <merged_model_dir>/ 2>/dev/null || true
cp <base_model_dir>/video_preprocessor_config.json <merged_model_dir>/ 2>/dev/null || true
Then rerun conversion:
python <llama_cpp_dir>/convert_hf_to_gguf.py \
<merged_model_dir> \
--outtype bf16 \
--outfile <output_bf16_gguf>
After that succeeds, quantize:
<llama_cpp_dir>/build/bin/llama-quantize \
<output_bf16_gguf> \
<output_q4_k_m_gguf> \
Q4_K_M
This is the fix I would try first in your case, assuming no tokens were added.
If you added tokens: do not copy blindly
If you added tokens, check tokenizer/model consistency first:
from transformers import AutoTokenizer, AutoConfig, AutoModelForCausalLM
path = "<merged_model_dir>"
tok = AutoTokenizer.from_pretrained(path, trust_remote_code=True)
cfg = AutoConfig.from_pretrained(path, trust_remote_code=True)
print("len(tokenizer):", len(tok))
print("config.vocab_size:", getattr(cfg, "vocab_size", None))
print("added vocab size:", len(tok.get_added_vocab()))
print("added vocab:", tok.get_added_vocab())
model = AutoModelForCausalLM.from_pretrained(
path,
torch_dtype="auto",
device_map="cpu",
trust_remote_code=True,
)
print("embedding rows:", model.get_input_embeddings().weight.shape[0])
if model.get_output_embeddings() is not None:
print("output rows:", model.get_output_embeddings().weight.shape[0])
You want:
len(tokenizer) == config.vocab_size == embedding rows
If that does not hold, fix the merged Transformers model first.
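The right repair depends on how the counts got out of sync. A minimal sketch for the common case where the tokenizer is the source of truth (caution: if there are more tokens than embedding rows, resize_token_embeddings creates freshly initialized rows for the extra tokens, which is only acceptable if those tokens were never actually trained):

from transformers import AutoModelForCausalLM, AutoTokenizer

path = "<merged_model_dir>"
tok = AutoTokenizer.from_pretrained(path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(path, torch_dtype="auto", trust_remote_code=True)

n_tok = len(tok)
n_emb = model.get_input_embeddings().weight.shape[0]
print("len(tokenizer):", n_tok, "embedding rows:", n_emb)

if n_tok != n_emb:
    # Align the embedding matrix (and config.vocab_size) with the tokenizer.
    model.resize_token_embeddings(n_tok)

# Hypothetical output directory; re-save both halves so the folder is consistent again.
model.save_pretrained("<fixed_model_dir>")
tok.save_pretrained("<fixed_model_dir>")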
If the tokenizer is intentionally modified and internally consistent, then llama.cpp may genuinely need support for that tokenizer fingerprint. In that case, copying the base tokenizer would hide the real issue and may break the model.
If the original base model also fails
If the base model fails too, stop debugging the fine-tuned folder. Use a fresh current llama.cpp checkout:
git clone https://github.com/ggml-org/llama.cpp <llama_cpp_clean_dir>
cd <llama_cpp_clean_dir>
python -m pip install -U -r requirements.txt
cmake -S . -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release
python convert_hf_to_gguf_update.py
Then retry:
python <llama_cpp_clean_dir>/convert_hf_to_gguf.py \
<base_model_dir> \
--outtype bf16 \
--outfile <base_model_dir>/base-bf16.gguf
If it still fails with the same chkhsh, then it is probably an upstream llama.cpp support issue for that exact tokenizer/model revision.
A good report should include:
base model: <base_model_name>
base revision: <base_model_revision>
fine-tuned model: <fine_tuned_model_or_local_only>
llama.cpp commit: <commit_hash>
python version: <python_version>
transformers version: <transformers_version>
tokenizers version: <tokenizers_version>
did you add tokens: <yes_or_no>
did you change chat_template: <yes_or_no>
did you merge LoRA: <yes_or_no>
target: <text_only_or_multimodal>
full converter command: <command>
chkhsh: 1444df51289cfa8063b96f0e62b1125440111bc79a52003ea14b6eac7016fd5f
Also include hashes:
sha256sum \
<merged_model_dir>/tokenizer.json \
<merged_model_dir>/tokenizer_config.json \
<merged_model_dir>/vocab.json \
<merged_model_dir>/merges.txt \
<merged_model_dir>/special_tokens_map.json \
<merged_model_dir>/added_tokens.json \
<merged_model_dir>/chat_template.jinja \
2>/dev/null
Why manual hash patching is risky
You may be tempted to edit convert_hf_to_gguf.py and add something like:
if chkhsh == "1444df51289cfa8063b96f0e62b1125440111bc79a52003ea14b6eac7016fd5f":
res = "qwen35"
or:
res = "qwen2"
I would not do that as the first fix.
The hash is only a fingerprint. The actual GGUF needs a correct tokenizer.ggml.pre value that llama.cpp can reproduce at runtime. If you map the hash to the wrong pre-tokenizer, the conversion may succeed but inference can be subtly broken.
This is worse than a clean failure.
Only consider a manual mapping if you can prove:
- the base tokenizer and fine-tuned tokenizer encode a broad set of test strings identically,
- the tokenizer JSON pre-tokenizer is equivalent to an existing llama.cpp pre-tokenizer,
- llama.cpp runtime tokenizer code supports that behavior,
- generated text and/or perplexity look sane after conversion.
Relevant source: convert_hf_to_gguf_update.py.
Conversion and quantization order
Do not debug this by jumping straight to Q4_K_M.
Use the standard two-step route:
Transformers model folder
-> high-precision GGUF: BF16/F16/F32
-> quantized GGUF: Q4_K_M, Q5_K_M, Q8_0, etc.
For Qwen models, the Qwen docs show converting first, often with --outtype bf16, then quantizing with llama-quantize. See: Qwen llama.cpp quantization guide.
Example:
python <llama_cpp_dir>/convert_hf_to_gguf.py \
<merged_model_dir> \
--outtype bf16 \
--outfile <model_bf16_gguf>
<llama_cpp_dir>/build/bin/llama-quantize \
<model_bf16_gguf> \
<model_q4_k_m_gguf> \
Q4_K_M
If quality matters, consider an importance matrix later, but only after the BF16/F16 GGUF conversion works.
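For reference, the importance-matrix route usually looks roughly like the following (tool and flag names as in recent llama.cpp builds; double-check them against your checkout):

# Build an importance matrix from representative text, then quantize using it.
<llama_cpp_dir>/build/bin/llama-imatrix \
  -m <model_bf16_gguf> \
  -f <calibration_text_file> \
  -o <imatrix_dat>

<llama_cpp_dir>/build/bin/llama-quantize \
  --imatrix <imatrix_dat> \
  <model_bf16_gguf> \
  <model_q4_k_m_gguf> \
  Q4_K_M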
My best guess for your case
Given the exact traceback and the fact that upgrading transformers plus running convert_hf_to_gguf_update.py did not fix it, my best guess is:
Your fine-tuned/merged Qwen3.5-4B folder has tokenizer drift or missing tokenizer-side files.
The fix I would try first is:
- Convert the original base model with the same llama.cpp commit.
- If the base converts, compare tokenizer files.
- If you did not add tokens, copy the exact base tokenizer/config/processor files into the merged folder.
- Convert to BF16/F16 GGUF.
- Quantize only after conversion succeeds.
If the original base model also fails, then this is probably not your fine-tune. It is more likely a llama.cpp support issue for that exact Qwen3.5 tokenizer/model revision.
Short checklist
- Record the llama.cpp commit.
- Record transformers, tokenizers, huggingface_hub, and Python versions.
- Confirm the exact base model and revision.
- Confirm whether tokens were added.
- Try converting the original base model.
- Compare tokenizer.json, tokenizer_config.json, vocab.json, merges.txt, chat_template.jinja, and special-token files.
- If no tokens were added, restore tokenizer files from the exact base model revision.
- Convert to BF16/F16 GGUF first.
- Quantize to Q4_K_M only after the high-precision GGUF conversion succeeds.
- Do not manually map the hash unless tokenizer equivalence and runtime support are verified.