ValueError loading Helsinki-NLP tokenizers

Python 3.12.3
torch==2.10.0
transformers==5.3.0

My aim is to load the tokenizer(s) for Helsinki-NLP/opus-mt-ru-en and Helsinki-NLP/opus-mt-zh-en, which I’ve been able to do for a couple years until I upgraded to Ubuntu 24, hence Python 3.12.

Do I need to downgrade to an older version of transformers?

Running with the example code from the Model Card I get the exception that is frustrating me:

# Load model directly
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
tokenizer = AutoTokenizer.from_pretrained("Helsinki-NLP/opus-mt-ru-en")
model = AutoModelForSeq2SeqLM.from_pretrained("Helsinki-NLP/opus-mt-ru-en")
Traceback (most recent call last):
  File "/usr/local/lib/python3.12/dist-packages/IPython/core/interactiveshell.py", line 3747, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "", line 4, in
    tokenizer = AutoTokenizer.from_pretrained("Helsinki-NLP/opus-mt-ru-en")
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/transformers/models/auto/tokenization_auto.py", line 789, in from_pretrained
    raise ValueError(
ValueError: Unrecognized configuration class <class 'transformers.models.marian.configuration_marian.MarianConfig'> to build an AutoTokenizer.
Model type should be one of Aimv2Config, AlbertConfig, AlignConfig, AudioFlamingo3Config, AyaVisionConfig, BarkConfig, BartConfig, BertConfig, BertGenerationConfig, BigBirdConfig, BigBirdPegasusConfig, BioGptConfig, BlenderbotConfig, BlenderbotSmallConfig, BlipConfig, Blip2Config, BridgeTowerConfig, BrosConfig, CamembertConfig, CanineConfig, ChineseCLIPConfig, ClapConfig, CLIPConfig, CLIPSegConfig, ClvpConfig, LlamaConfig, CodeGenConfig, CohereConfig, Cohere2Config, ColQwen2Config, ConvBertConfig, CpmAntConfig, CTRLConfig, Data2VecAudioConfig, Data2VecTextConfig, DbrxConfig, DebertaConfig, DebertaV2Config, DeepseekVLConfig, DeepseekVLHybridConfig, DiaConfig, DistilBertConfig, DPRConfig, ElectraConfig, Emu3Config, ErnieConfig, EsmConfig, FalconMambaConfig, FastSpeech2ConformerConfig, FlaubertConfig, FlavaConfig, FlexOlmoConfig, Florence2Config, FNetConfig, FSMTConfig, FunnelConfig, FuyuConfig, GemmaConfig, Gemma2Config, Gemma3Config, Gemma3TextConfig, Gemma3nConfig, Gemma3nTextConfig, GitConfig, GlmConfig, Glm4Config, Glm4MoeConfig, Glm4MoeLiteConfig, Glm4vConfig, Glm4vMoeConfig, GlmImageConfig, GlmAsrConfig, GotOcr2Config, GPT2Config, GPT2Config, GPTBigCodeConfig, GPTNeoConfig, GPTNeoXConfig, GPTNeoXJapaneseConfig, GPTJConfig, GraniteConfig, GraniteMoeConfig, GraniteMoeHybridConfig, GraniteMoeSharedConfig, GroundingDinoConfig, GroupViTConfig, HubertConfig, IBertConfig, IdeficsConfig, Idefics2Config, InstructBlipConfig, InstructBlipVideoConfig, InternVLConfig, Jais2Config, JambaConfig, JanusConfig, Kosmos2Config, LasrCTCConfig, LasrEncoderConfig, LayoutLMConfig, LayoutLMv2Config, LayoutLMv3Config, LayoutXLMConfig, LEDConfig, LightOnOcrConfig, LiltConfig, LlavaConfig, LlavaNextConfig, LongformerConfig, LukeConfig, LxmertConfig, M2M100Config, MambaConfig, Mamba2Config, MarianConfig, MarkupLMConfig, MBartConfig, MegatronBertConfig, MetaClip2Config, MgpstrConfig, MinistralConfig, Ministral3Config, MistralConfig, Mistral3Config, MixtralConfig, MMGroundingDinoConfig, 
MobileBertConfig, MPNetConfig, MptConfig, MraConfig, MT5Config, MusicgenConfig, MusicgenMelodyConfig, MvpConfig, NllbMoeConfig, VisionEncoderDecoderConfig, NystromformerConfig, OlmoConfig, Olmo2Config, Olmo3Config, OlmoHybridConfig, OlmoeConfig, OmDetTurboConfig, OneFormerConfig, OpenAIGPTConfig, OPTConfig, Ovis2Config, Owlv2Config, OwlViTConfig, PegasusConfig, PegasusXConfig, PerceiverConfig, PhiConfig, Phi3Config, Pix2StructConfig, PixtralVisionConfig, PLBartConfig, ProphetNetConfig, Qwen2Config, Qwen2_5OmniConfig, Qwen2_5_VLConfig, Qwen2AudioConfig, Qwen2MoeConfig, Qwen2VLConfig, Qwen3Config, Qwen3_5Config, Qwen3_5MoeConfig, Qwen3MoeConfig, Qwen3NextConfig, Qwen3OmniMoeConfig, Qwen3VLConfig, Qwen3VLMoeConfig, RagConfig, RecurrentGemmaConfig, ReformerConfig, RemBertConfig, RobertaConfig, RobertaPreLayerNormConfig, RoCBertConfig, RoFormerConfig, RwkvConfig, Sam3Config, Sam3VideoConfig, SeamlessM4TConfig, SeamlessM4Tv2Config, ShieldGemma2Config, SiglipConfig, Siglip2Config, Speech2TextConfig, SpeechT5Config, SplinterConfig, SqueezeBertConfig, StableLmConfig, Starcoder2Config, SwitchTransformersConfig, T5Config, T5GemmaConfig, TapasConfig, TrOCRConfig, TvpConfig, UdopConfig, UMT5Config, UniSpeechConfig, UniSpeechSatConfig, ViltConfig, VipLlavaConfig, VisualBertConfig, VitsConfig, VoxtralConfig, VoxtralRealtimeConfig, Wav2Vec2Config, Wav2Vec2BertConfig, Wav2Vec2ConformerConfig, WhisperConfig, XCLIPConfig, XGLMConfig, XLMConfig, XLMRobertaConfig, XLMRobertaXLConfig, XLNetConfig, xLSTMConfig, XmodConfig, YosoConfig.


Okay, so I was able to get past this "opportunity" by using the transformers.MarianTokenizer class in lieu of the (SHOULD have worked) AutoTokenizer class.

But the mystery remains . . . WHY didn’t it work as advertised?!?!?


But the mystery remains . . . WHY didn’t it work as advertised?!?!?

In Transformers, models and tokenizers may implicitly require backends beyond Transformers and PyTorch in some cases.

In this case, uninstalling sentencepiece allowed me to reproduce the issue in my environment. Explicitly installing sentencepiece would be a quick workaround.


No, you probably do not need to downgrade transformers. Current Transformers supports Python 3.10+, so Python 3.12 itself is not the problem, and Marian is still a supported AutoTokenizer family in the docs. (PyPI)

What is happening is that Helsinki-NLP/opus-mt-* uses MarianTokenizer, and MarianTokenizer is SentencePiece-based. In current Transformers, the auto-mapping for Marian is effectively: use MarianTokenizer only if sentencepiece is available. The Marian tokenizer source also explicitly requires the sentencepiece backend. (GitHub)
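To make that conditional dispatch concrete, here is a minimal, illustrative sketch. This is not the actual transformers source; the mapping structure and the helper names (make_tokenizer_mapping, resolve_tokenizer) are simplified assumptions. It only shows why gating a tokenizer class on backend availability can surface as an "unrecognized configuration" error instead of a "missing backend" one:

```python
def make_tokenizer_mapping(sp_available: bool) -> dict:
    # Simplified stand-in for the AutoTokenizer mapping: the Marian entry
    # only resolves to a tokenizer class name when the sentencepiece
    # backend is importable.
    return {"marian": "MarianTokenizer" if sp_available else None}

def resolve_tokenizer(model_type: str, mapping: dict) -> str:
    cls = mapping.get(model_type)
    if cls is None:
        # A missing backend and a genuinely unknown model type collapse
        # into the same branch, which is why the real error message calls
        # MarianConfig "unrecognized" even though Marian is supported.
        raise ValueError(f"Unrecognized configuration for model type {model_type!r}")
    return cls

# With sentencepiece "installed" the lookup succeeds:
print(resolve_tokenizer("marian", make_tokenizer_mapping(True)))

# Without it, the lookup fails with the confusing error:
try:
    resolve_tokenizer("marian", make_tokenizer_mapping(False))
except ValueError as e:
    print(e)
```

The point of the sketch is only the shape of the failure: the dispatch table has no separate path for "entry exists but its backend is missing."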

So when sentencepiece is missing or not visible in the current runtime, AutoTokenizer.from_pretrained(...) can fail with that misleading error:

Unrecognized configuration class ... MarianConfig ...

even though MarianConfig is listed. That is a known class of bug/confusing behavior; there is a recent Transformers issue specifically about cryptic AutoTokenizer errors for SentencePiece tokenizers when sentencepiece is not installed. (GitHub)
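A quick, stdlib-only way to check whether sentencepiece is visible to the interpreter you are actually running (useful when a notebook kernel uses a different environment than the pip you installed into). backend_visible is just an illustrative helper name:

```python
import importlib.util
import sys

def backend_visible(name: str) -> bool:
    # True if `import name` would succeed in this interpreter,
    # without actually importing it.
    return importlib.util.find_spec(name) is not None

print("interpreter:", sys.executable)
print("sentencepiece visible:", backend_visible("sentencepiece"))
```

If this prints False in the same kernel that raises the error, installing sentencepiece into that interpreter (and restarting the kernel) should resolve it.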

That also explains why MarianTokenizer.from_pretrained(...) helped: it bypasses the auto-dispatch layer. If sentencepiece is available, direct MarianTokenizer works. If sentencepiece is truly missing, direct MarianTokenizer should fail too, but with a much clearer "install SentencePiece" error. (GitHub)

Use this fix:

pip install -U sentencepiece

Then restart the Python process / Jupyter kernel and retry:

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("Helsinki-NLP/opus-mt-ru-en")
model = AutoModelForSeq2SeqLM.from_pretrained("Helsinki-NLP/opus-mt-ru-en")

Or, as a workaround, use:

from transformers import MarianTokenizer
tokenizer = MarianTokenizer.from_pretrained("Helsinki-NLP/opus-mt-ru-en")

You can also be explicit in v5:

tokenizer = AutoTokenizer.from_pretrained(
    "Helsinki-NLP/opus-mt-ru-en",
    backend="sentencepiece",
)

That only helps if sentencepiece is actually installed. (GitHub)

So the short explanation is:

  • Not "Python 3.12 broke Marian."
  • Not necessarily "downgrade Transformers."
  • Most likely: sentencepiece was missing or not visible in that runtime, and AutoTokenizer surfaced it as a confusing MarianConfig error. (GitHub)