ValueError loading Helsinki-NLP tokenizers

Python 3.12.3
torch==2.10.0
transformers==5.3.0

My aim is to load the tokenizer(s) for Helsinki-NLP/opus-mt-ru-en and Helsinki-NLP/opus-mt-zh-en, which I’ve been able to do for a couple years until I upgraded to Ubuntu 24, hence Python 3.12.

Do I need to downgrade to an older version of transformers?

Running with the example code from the Model Card I get the exception that is frustrating me:

# Load model directly
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
tokenizer = AutoTokenizer.from_pretrained("Helsinki-NLP/opus-mt-ru-en")
model = AutoModelForSeq2SeqLM.from_pretrained("Helsinki-NLP/opus-mt-ru-en")
Traceback (most recent call last):
  File "/usr/local/lib/python3.12/dist-packages/IPython/core/interactiveshell.py", line 3747, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "", line 4, in
    tokenizer = AutoTokenizer.from_pretrained("Helsinki-NLP/opus-mt-ru-en")
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/transformers/models/auto/tokenization_auto.py", line 789, in from_pretrained
    raise ValueError(
ValueError: Unrecognized configuration class <class 'transformers.models.marian.configuration_marian.MarianConfig'> to build an AutoTokenizer.
Model type should be one of Aimv2Config, AlbertConfig, AlignConfig, AudioFlamingo3Config, AyaVisionConfig, BarkConfig, BartConfig, BertConfig, BertGenerationConfig, BigBirdConfig, BigBirdPegasusConfig, BioGptConfig, BlenderbotConfig, BlenderbotSmallConfig, BlipConfig, Blip2Config, BridgeTowerConfig, BrosConfig, CamembertConfig, CanineConfig, ChineseCLIPConfig, ClapConfig, CLIPConfig, CLIPSegConfig, ClvpConfig, LlamaConfig, CodeGenConfig, CohereConfig, Cohere2Config, ColQwen2Config, ConvBertConfig, CpmAntConfig, CTRLConfig, Data2VecAudioConfig, Data2VecTextConfig, DbrxConfig, DebertaConfig, DebertaV2Config, DeepseekVLConfig, DeepseekVLHybridConfig, DiaConfig, DistilBertConfig, DPRConfig, ElectraConfig, Emu3Config, ErnieConfig, EsmConfig, FalconMambaConfig, FastSpeech2ConformerConfig, FlaubertConfig, FlavaConfig, FlexOlmoConfig, Florence2Config, FNetConfig, FSMTConfig, FunnelConfig, FuyuConfig, GemmaConfig, Gemma2Config, Gemma3Config, Gemma3TextConfig, Gemma3nConfig, Gemma3nTextConfig, GitConfig, GlmConfig, Glm4Config, Glm4MoeConfig, Glm4MoeLiteConfig, Glm4vConfig, Glm4vMoeConfig, GlmImageConfig, GlmAsrConfig, GotOcr2Config, GPT2Config, GPT2Config, GPTBigCodeConfig, GPTNeoConfig, GPTNeoXConfig, GPTNeoXJapaneseConfig, GPTJConfig, GraniteConfig, GraniteMoeConfig, GraniteMoeHybridConfig, GraniteMoeSharedConfig, GroundingDinoConfig, GroupViTConfig, HubertConfig, IBertConfig, IdeficsConfig, Idefics2Config, InstructBlipConfig, InstructBlipVideoConfig, InternVLConfig, Jais2Config, JambaConfig, JanusConfig, Kosmos2Config, LasrCTCConfig, LasrEncoderConfig, LayoutLMConfig, LayoutLMv2Config, LayoutLMv3Config, LayoutXLMConfig, LEDConfig, LightOnOcrConfig, LiltConfig, LlavaConfig, LlavaNextConfig, LongformerConfig, LukeConfig, LxmertConfig, M2M100Config, MambaConfig, Mamba2Config, MarianConfig, MarkupLMConfig, MBartConfig, MegatronBertConfig, MetaClip2Config, MgpstrConfig, MinistralConfig, Ministral3Config, MistralConfig, Mistral3Config, MixtralConfig, MMGroundingDinoConfig, 
MobileBertConfig, MPNetConfig, MptConfig, MraConfig, MT5Config, MusicgenConfig, MusicgenMelodyConfig, MvpConfig, NllbMoeConfig, VisionEncoderDecoderConfig, NystromformerConfig, OlmoConfig, Olmo2Config, Olmo3Config, OlmoHybridConfig, OlmoeConfig, OmDetTurboConfig, OneFormerConfig, OpenAIGPTConfig, OPTConfig, Ovis2Config, Owlv2Config, OwlViTConfig, PegasusConfig, PegasusXConfig, PerceiverConfig, PhiConfig, Phi3Config, Pix2StructConfig, PixtralVisionConfig, PLBartConfig, ProphetNetConfig, Qwen2Config, Qwen2_5OmniConfig, Qwen2_5_VLConfig, Qwen2AudioConfig, Qwen2MoeConfig, Qwen2VLConfig, Qwen3Config, Qwen3_5Config, Qwen3_5MoeConfig, Qwen3MoeConfig, Qwen3NextConfig, Qwen3OmniMoeConfig, Qwen3VLConfig, Qwen3VLMoeConfig, RagConfig, RecurrentGemmaConfig, ReformerConfig, RemBertConfig, RobertaConfig, RobertaPreLayerNormConfig, RoCBertConfig, RoFormerConfig, RwkvConfig, Sam3Config, Sam3VideoConfig, SeamlessM4TConfig, SeamlessM4Tv2Config, ShieldGemma2Config, SiglipConfig, Siglip2Config, Speech2TextConfig, SpeechT5Config, SplinterConfig, SqueezeBertConfig, StableLmConfig, Starcoder2Config, SwitchTransformersConfig, T5Config, T5GemmaConfig, TapasConfig, TrOCRConfig, TvpConfig, UdopConfig, UMT5Config, UniSpeechConfig, UniSpeechSatConfig, ViltConfig, VipLlavaConfig, VisualBertConfig, VitsConfig, VoxtralConfig, VoxtralRealtimeConfig, Wav2Vec2Config, Wav2Vec2BertConfig, Wav2Vec2ConformerConfig, WhisperConfig, XCLIPConfig, XGLMConfig, XLMConfig, XLMRobertaConfig, XLMRobertaXLConfig, XLNetConfig, xLSTMConfig, XmodConfig, YosoConfig.


Okay, so I was able to get past this "opportunity" by using the transformers.MarianTokenizer class in lieu of the (SHOULD have worked) AutoTokenizer class.

But the mystery remains . . . WHY didn’t it work as advertised?!?!?


But the mystery remains . . . WHY didn’t it work as advertised?!?!?

In Transformers, models and tokenizers may implicitly require backends beyond Transformers and PyTorch in some cases.

In this case, uninstalling sentencepiece allowed me to reproduce the issue in my environment. Explicitly installing sentencepiece would be a quick workaround.


No, you probably do not need to downgrade transformers. Current Transformers supports Python 3.10+, so Python 3.12 itself is not the problem, and Marian is still a supported AutoTokenizer family in the docs. (PyPI)

What is happening is that Helsinki-NLP/opus-mt-* uses MarianTokenizer, and MarianTokenizer is SentencePiece-based. In current Transformers, the auto-mapping for Marian is effectively: use MarianTokenizer only if sentencepiece is available. The Marian tokenizer source also explicitly requires the sentencepiece backend. (GitHub)
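To make that conditional dispatch concrete, here is a minimal, illustrative sketch. This is not the actual transformers source; the mapping structure and the helper names (make_tokenizer_mapping, resolve_tokenizer) are simplified assumptions. It only shows why gating a tokenizer class on backend availability can surface as an "unrecognized configuration" error instead of a "missing backend" one:

```python
def make_tokenizer_mapping(sp_available: bool) -> dict:
    # Simplified stand-in for the AutoTokenizer mapping: the Marian entry
    # only resolves to a tokenizer class name when the sentencepiece
    # backend is importable.
    return {"marian": "MarianTokenizer" if sp_available else None}

def resolve_tokenizer(model_type: str, mapping: dict) -> str:
    cls = mapping.get(model_type)
    if cls is None:
        # A missing backend and a genuinely unknown model type collapse
        # into the same branch, which is why the real error message calls
        # MarianConfig "unrecognized" even though Marian is supported.
        raise ValueError(f"Unrecognized configuration for model type {model_type!r}")
    return cls

# With sentencepiece "installed" the lookup succeeds:
print(resolve_tokenizer("marian", make_tokenizer_mapping(True)))

# Without it, the lookup fails with the confusing error:
try:
    resolve_tokenizer("marian", make_tokenizer_mapping(False))
except ValueError as e:
    print(e)
```

The point of the sketch is only the shape of the failure: the dispatch table has no separate path for "entry exists but its backend is missing."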

So when sentencepiece is missing or not visible in the current runtime, AutoTokenizer.from_pretrained(...) can fail with that misleading error:

Unrecognized configuration class ... MarianConfig ...

even though MarianConfig is listed. That is a known class of bug/confusing behavior; there is a recent Transformers issue specifically about cryptic AutoTokenizer errors for SentencePiece tokenizers when sentencepiece is not installed. (GitHub)
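A quick, stdlib-only way to check whether sentencepiece is visible to the interpreter you are actually running (useful when a notebook kernel uses a different environment than the pip you installed into). backend_visible is just an illustrative helper name:

```python
import importlib.util
import sys

def backend_visible(name: str) -> bool:
    # True if `import name` would succeed in this interpreter,
    # without actually importing it.
    return importlib.util.find_spec(name) is not None

print("interpreter:", sys.executable)
print("sentencepiece visible:", backend_visible("sentencepiece"))
```

If this prints False in the same kernel that raises the error, installing sentencepiece into that interpreter (and restarting the kernel) should resolve it.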

That also explains why MarianTokenizer.from_pretrained(...) helped: it bypasses the auto-dispatch layer. If sentencepiece is available, direct MarianTokenizer works. If sentencepiece is truly missing, direct MarianTokenizer should fail too, but with a much clearer "install SentencePiece" error. (GitHub)

Use this fix:

pip install -U sentencepiece

Then restart the Python process / Jupyter kernel and retry:

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("Helsinki-NLP/opus-mt-ru-en")
model = AutoModelForSeq2SeqLM.from_pretrained("Helsinki-NLP/opus-mt-ru-en")

Or, as a workaround, use:

from transformers import MarianTokenizer
tokenizer = MarianTokenizer.from_pretrained("Helsinki-NLP/opus-mt-ru-en")

You can also be explicit in v5:

tokenizer = AutoTokenizer.from_pretrained(
    "Helsinki-NLP/opus-mt-ru-en",
    backend="sentencepiece",
)

That only helps if sentencepiece is actually installed. (GitHub)

So the short explanation is:

  • Not "Python 3.12 broke Marian."
  • Not necessarily "downgrade Transformers."
  • Most likely: sentencepiece was missing or not visible in that runtime, and AutoTokenizer surfaced it as a confusing MarianConfig error. (GitHub)