Pinyin-Code Masked LM

This repository contains a custom Transformers masked language model. Load it with trust_remote_code=True.

Dependencies

Install the runtime dependencies before loading the model:

pip install torch transformers safetensors sentencepiece pypinyin jieba

sentencepiece is required for AutoTokenizer. pypinyin is required for raw Mandarin-to-pinyin tokenization. jieba is required when use_jieba is true; this export was created with use_jieba=true.
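
Before loading the model, it can help to confirm that every package from the pip command is importable. The snippet below is a minimal sketch that uses only the standard library; the helper name missing_dependencies is illustrative and not part of this repository.

```python
from importlib.util import find_spec

# Import names of the packages listed in the pip command above.
REQUIRED = ["torch", "transformers", "safetensors", "sentencepiece", "pypinyin", "jieba"]


def missing_dependencies(modules=REQUIRED):
    """Return the module names that cannot be found in this environment."""
    return [name for name in modules if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_dependencies()
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All runtime dependencies are importable.")
```

An empty result only means the packages are installed; it does not check their versions.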

Loading

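from transformers import AutoConfig, AutoModel, AutoModelForMaskedLM, AutoTokenizer

# Local folder or Hugging Face repo ID of this export.
model_path = "PATH_OR_REPO_ID"

# trust_remote_code=True is required because the model classes live in this repository.
config = AutoConfig.from_pretrained(model_path, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
# Encoder without the masked-LM head, for representation extraction.
base_model = AutoModel.from_pretrained(model_path, trust_remote_code=True)
# Encoder with the masked-LM head, for token prediction and scoring.
model = AutoModelForMaskedLM.from_pretrained(model_path, trust_remote_code=True)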

Evaluation

Configure external evaluators with:

model path: this local folder or Hugging Face repo ID
backend: masked_language_modeling
trust remote code: enabled

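For BLiMP-style sentence-pair scoring, use pseudo-log-likelihood (PLL) rather than left-to-right probability, since a masked LM does not define a left-to-right factorization. PLL masks each scored token in turn, so it requires one forward pass per scored token.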
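
The pseudo-log-likelihood scoring described above can be sketched as follows. This is an illustration under stated assumptions, not the scoring code of any particular evaluator: pseudo_log_likelihood and make_hf_log_prob_fn are hypothetical names, and the adapter assumes the model accepts input_ids, returns an output with a logits attribute (the usual Transformers masked-LM convention), is in eval mode, and runs on CPU.

```python
def pseudo_log_likelihood(token_ids, mask_id, log_prob_fn, skip_positions=()):
    """Sum log P(token_i | all other tokens), masking one position at a time.

    log_prob_fn(masked_ids, position, target_id) is a caller-supplied callable
    that returns the log-probability of target_id at the masked position.
    """
    total = 0.0
    for pos, target in enumerate(token_ids):
        if pos in skip_positions:      # e.g. special tokens that should not be scored
            continue
        masked = list(token_ids)
        masked[pos] = mask_id          # hide exactly one token
        total += log_prob_fn(masked, pos, target)  # one forward pass per scored token
    return total


def make_hf_log_prob_fn(model):
    """Hypothetical adapter for a Transformers-style masked LM (needs torch)."""
    import torch  # imported lazily so the scorer above stays dependency-free

    def log_prob_fn(masked_ids, position, target_id):
        with torch.no_grad():
            # Assumption: the model returns logits of shape (batch, sequence, vocab).
            logits = model(input_ids=torch.tensor([masked_ids])).logits
        return torch.log_softmax(logits[0, position], dim=-1)[target_id].item()

    return log_prob_fn
```

For a sentence pair, the sentence with the higher total is the one the model prefers. Passing the positions of special tokens as skip_positions keeps them out of the sum.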

The tokenizer accepts raw text through standard calls such as tokenizer(text), tokenizer(text, add_special_tokens=False), and tokenizer(texts, padding=True, truncation=True, return_tensors="pt"). It also accepts return_offsets_mapping=True for compatibility with completion-ranking evaluators that need suffix masks. The model supports output_hidden_states=True for representation extraction tasks.
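
As an illustration of how a completion-ranking evaluator might turn offsets into a suffix mask, the sketch below marks the tokens that belong to the completion. The helper suffix_mask is hypothetical, and it assumes that special tokens are reported with zero-width spans, as is common in Transformers tokenizers; this has not been checked against this tokenizer.

```python
def suffix_mask(offsets, prefix_len):
    """Return 1 for tokens that lie in the completion, 0 otherwise.

    offsets: list of (start, end) character spans, one per token.
    prefix_len: number of characters in the shared prefix.
    """
    mask = []
    for start, end in offsets:
        is_real_token = end > start   # zero-width spans are treated as special tokens
        mask.append(1 if is_real_token and start >= prefix_len else 0)
    return mask


# Made-up offsets for a 4-character prefix followed by a 2-character completion.
example_offsets = [(0, 0), (0, 2), (2, 4), (4, 6), (0, 0)]
print(suffix_mask(example_offsets, prefix_len=4))  # [0, 0, 0, 1, 0]
```

A token that straddles the boundary between prefix and completion is counted as prefix here. An evaluator could just as well count it as completion, so treat that as a design choice. In Transformers tokenizers the offsets are usually returned under the offset_mapping key.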

This export sets patch_pathlib_utf8_open=true in config.json. When loaded with trust_remote_code=True, the config installs a narrow Windows compatibility shim so later text-mode Path.open("r") calls without an explicit encoding default to UTF-8. Set PINYIN_CODE_DISABLE_UTF8_OPEN_PATCH=1 before loading the model to disable that shim.
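
A minimal sketch of disabling the shim from Python. The only thing taken from this export is the environment variable name; the commented loading call repeats the pattern from the Loading section.

```python
import os

# Set this before the config is loaded, because the shim is installed at load time.
os.environ["PINYIN_CODE_DISABLE_UTF8_OPEN_PATCH"] = "1"

# Loading follows as usual, for example:
# from transformers import AutoConfig
# config = AutoConfig.from_pretrained("PATH_OR_REPO_ID", trust_remote_code=True)
```

Setting the variable in the shell before starting Python has the same effect, as long as it is in place before the first from_pretrained call.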

Export metadata:

transliteration: pinyin-code
training_model_type: bert
use_jieba: true

Model size: 34M params (Safetensors, tensor type F32)