Pinyin-Code Masked LM

This repository contains a custom Transformers masked language model. Load it with trust_remote_code=True.

Dependencies

Install the runtime dependencies before loading the model:

pip install torch transformers safetensors sentencepiece pypinyin jieba

sentencepiece is required for AutoTokenizer. pypinyin is required for raw Mandarin-to-pinyin tokenization. jieba is required when use_jieba is true; this export was created with use_jieba=true.

Loading

from transformers import AutoConfig, AutoModel, AutoModelForMaskedLM, AutoTokenizer

model_path = "PATH_OR_REPO_ID"  # local folder or Hugging Face repo ID

config = AutoConfig.from_pretrained(model_path, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
base_model = AutoModel.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForMaskedLM.from_pretrained(model_path, trust_remote_code=True)

Evaluation

Configure external evaluators with:

  • model path: this local folder or Hugging Face repo ID
  • backend: masked_language_modeling
  • trust remote code: enabled

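For BLiMP-style sentence-pair scoring, use pseudo-log-likelihood rather than left-to-right probability: mask each scored token in turn and sum the log-probabilities the model assigns to the original tokens. This requires one forward pass per scored token.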
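The scoring described above can be sketched roughly as follows. This is a minimal illustration, not the evaluator's actual implementation: the function name pseudo_log_likelihood and its arguments are hypothetical, and it assumes the model returns an object with a .logits tensor of shape (batch, seq_len, vocab), as standard Transformers masked LM heads do. The per-token passes are stacked into one batch for convenience.

```python
import torch


def pseudo_log_likelihood(model, input_ids, mask_token_id, special_ids=()):
    """Sum of log P(token_i | all other tokens), masking one position at a time.

    input_ids: 1-D LongTensor for a single sentence (special tokens included).
    special_ids: ids (e.g. CLS/SEP/PAD) that are never masked or scored.
    Hypothetical helper; assumes model(input_ids=...) returns an object with .logits.
    """
    special = {int(i) for i in special_ids}
    # Positions that are actually scored.
    positions = [i for i, t in enumerate(input_ids.tolist()) if t not in special]
    if not positions:
        return 0.0

    # One masked copy of the sentence per scored position, stacked into a batch.
    # repeat() copies the data, so the caller's input_ids is left unchanged.
    batch = input_ids.repeat(len(positions), 1)
    rows = torch.arange(len(positions))
    cols = torch.tensor(positions)
    batch[rows, cols] = mask_token_id

    with torch.no_grad():
        logits = model(input_ids=batch).logits  # (n_positions, seq_len, vocab)

    # For each copy, read the distribution at its masked position only.
    log_probs = torch.log_softmax(logits[rows, cols], dim=-1)
    targets = input_ids[cols]
    return log_probs[rows, targets].sum().item()
```

With the objects from the Loading section, a call might look like pseudo_log_likelihood(model, tokenizer(text, return_tensors="pt")["input_ids"][0], tokenizer.mask_token_id, tokenizer.all_special_ids), assuming the custom tokenizer exposes those standard attributes; call model.eval() first. For a sentence pair, the sentence with the higher score is the one the model prefers.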

The tokenizer accepts raw text through standard calls such as tokenizer(text), tokenizer(text, add_special_tokens=False), and tokenizer(texts, padding=True, truncation=True, return_tensors="pt"). It also accepts return_offsets_mapping=True for compatibility with completion-ranking evaluators that need suffix masks. The model supports output_hidden_states=True for representation extraction tasks.
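As an illustration of how a completion-ranking evaluator might turn an offsets mapping into a suffix mask, here is a small sketch. The helper suffix_token_mask is hypothetical, and it assumes that offsets are (start, end) character spans into the raw input text and that special tokens get zero-width spans such as (0, 0), which is the usual Transformers convention but is not confirmed for this tokenizer.

```python
def suffix_token_mask(offsets, prefix_chars):
    """Return True for tokens whose character span starts at or after prefix_chars.

    offsets: iterable of (start, end) character spans, one per token.
    prefix_chars: length in characters of the shared prefix (the context).
    Hypothetical helper; zero-width spans are treated as special tokens.
    """
    mask = []
    for start, end in offsets:
        is_special = start == end  # e.g. (0, 0) for CLS/SEP/PAD
        mask.append((not is_special) and start >= prefix_chars)
    return mask
```

In use, offsets would come from tokenizer(prefix + completion, return_offsets_mapping=True)["offset_mapping"] and prefix_chars from len(prefix); only the positions marked True would then be scored.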

This export sets patch_pathlib_utf8_open=true in config.json. When loaded with trust_remote_code=True, the config installs a narrow Windows compatibility shim so later text-mode Path.open("r") calls without an explicit encoding default to UTF-8. Set PINYIN_CODE_DISABLE_UTF8_OPEN_PATCH=1 before loading the model to disable that shim.
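To disable the shim from Python rather than from the shell, the variable can be set at the top of the script, before any from_pretrained call. This is a minimal sketch; the loading lines are shown as comments because they need the model files.

```python
import os

# Must be set before the config is loaded, since the shim is installed at load time.
os.environ["PINYIN_CODE_DISABLE_UTF8_OPEN_PATCH"] = "1"

# Then load as usual, for example:
# from transformers import AutoModelForMaskedLM
# model = AutoModelForMaskedLM.from_pretrained("PATH_OR_REPO_ID", trust_remote_code=True)
```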

Export metadata:

  • transliteration: pinyin-code
  • training_model_type: bert
  • use_jieba: true

Model size: 34M params (Safetensors, tensor type F32)