muril-lang-id-v15
Language identification for Indian banking chatbot messages. Detects 16 Indian languages plus English (17 language labels) in both native script and Romanized form (Hinglish, Tanglish, Tenglish, Manglish, Kanglish, etc.), with an explicit undetermined class for out-of-distribution input.
Fine-tuned from google/muril-base-cased on a banking-flavoured mix that includes Bhasha-Abhijnaanam, Aksharantar, SST-2, Dakshina, Hinglish corpora, MASSIVE, OffensEval-Dravidian, Bitext retail-banking, plus synthetic banking sentences, banking noun phrases, greeting-led English sentences, conversational shorts, and OOD calibration sources (FLORES-200, Aksharantar non-target Indic, banking-style European, gibberish).
Labels
18 classes. Index → ISO code:
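| Index | Code | Index | Code | Index | Code |
|---|---|---|---|---|---|
| 0 | as | 6 | ks | 12 | sa |
| 1 | bn | 7 | ml | 13 | sd |
| 2 | en | 8 | mr | 14 | ta |
| 3 | gu | 9 | ne | 15 | te |
| 4 | hi | 10 | or | 16 | ur |
| 5 | kn | 11 | pa | 17 | undetermined (OOD / non-target / gibberish) |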
undetermined is a learned class, trained directly on non-target Indic, FLORES non-target, banking-style European, and synthetic gibberish. An energy-based gate (-logsumexp(logits) > -6.2) is a backup safety net for inputs that the learned class doesn't catch.
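To make the gate concrete, here is a minimal sketch of the energy score in plain Python. The two logit vectors are made-up three-class examples (the real model emits 18 logits), chosen only to show that peaked logits give a strongly negative energy while flat logits sit above the -6.2 threshold and would be flagged.

```python
import math

ENERGY_THRESHOLD = -6.2  # value quoted in this card

def energy(logits):
    """Return -logsumexp(logits), computed in a numerically stable way."""
    m = max(logits)
    return -(m + math.log(sum(math.exp(z - m) for z in logits)))

def is_ood(logits, threshold=ENERGY_THRESHOLD):
    """The gate fires when the energy is above the threshold."""
    return energy(logits) > threshold

# Illustrative logits only, not taken from the model.
peaked = [12.0, 0.0, 0.0]   # one class clearly dominates
flat = [1.0, 1.0, 1.0]      # no class stands out

print(energy(peaked), is_ood(peaked))
print(energy(flat), is_ood(flat))
```

Because the energy depends on the magnitude of the logits and not only on their ranking, an input can have a target-language argmax and still be gated.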
Test metrics
| Split | Accuracy | F1 (weighted) | F1 (macro) |
|---|---|---|---|
| test | 0.9735 | 0.9735 | 0.9684 |
Held-out test split from the assembled training mix (78 163 examples, stratified by label, seed=42).
Usage
The model is intended to be used inside a two-stage cascade — Unicode script analysis short-circuits any input dominated by an Indic native script (Devanagari, Tamil, Bengali, Gurmukhi, etc.), and only Latin / mixed-script input is sent to MuRIL. Reorder the stages and the semantics change.
Quick standalone use:

    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    LABELS = [
        "as", "bn", "en", "gu", "hi", "kn", "ks", "ml", "mr",
        "ne", "or", "pa", "sa", "sd", "ta", "te", "ur", "undetermined",
    ]
    ENERGY_THRESHOLD = -6.2  # per-checkpoint calibration; see below

    tok = AutoTokenizer.from_pretrained("dnivra26/muril-lang-id-v15")
    model = AutoModelForSequenceClassification.from_pretrained("dnivra26/muril-lang-id-v15").eval()

    @torch.inference_mode()
    def classify(text: str) -> tuple[str, float]:
        inputs = tok(text, return_tensors="pt", truncation=True, max_length=128, padding=True)
        logits = model(**inputs).logits.squeeze(0)
        energy = -torch.logsumexp(logits, dim=0).item()
        label = LABELS[int(torch.softmax(logits, dim=0).argmax())]
        if energy > ENERGY_THRESHOLD or label == "undetermined":
            return "undetermined", energy
        if label == "ur":  # Romanized Urdu/Hindi are linguistically identical
            label = "hi"
        return label, energy

    print(classify("Loan approval"))            # ('en', ~ -12.7)
    print(classify("Hey, my card is blocked"))  # ('en', ~ -12.9)
    print(classify("namaste"))                  # ('hi', ~ -11.1)
    print(classify("vanakkam"))                 # ('ta', ...)
    # Native-script inputs should ideally be handled by a script-detect short-circuit upstream.
    print(classify("नमस्ते"))                    # ('hi', ...)
For the full cascade including a Unicode-block dominant-script short-circuit, see the project repo (script analysis is sub-microsecond and 100% accurate for native-script input, so the transformer is only invoked on Latin / mixed-script messages).
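As a rough illustration of the first cascade stage, the sketch below counts letters and combining marks per Unicode block and reports a script only when it clearly dominates. This is not the project's implementation: the function names, the subset of blocks, and the `min_share` cut-off of 0.5 are all assumptions made for this example.

```python
import unicodedata
from collections import Counter
from typing import Optional

# Inclusive Unicode block ranges for a subset of scripts (illustrative, not exhaustive).
SCRIPT_BLOCKS = {
    "Arabic": (0x0600, 0x06FF),
    "Devanagari": (0x0900, 0x097F),
    "Bengali": (0x0980, 0x09FF),
    "Gurmukhi": (0x0A00, 0x0A7F),
    "Gujarati": (0x0A80, 0x0AFF),
    "Oriya": (0x0B00, 0x0B7F),
    "Tamil": (0x0B80, 0x0BFF),
    "Telugu": (0x0C00, 0x0C7F),
    "Kannada": (0x0C80, 0x0CFF),
    "Malayalam": (0x0D00, 0x0D7F),
}

def script_of(ch: str) -> Optional[str]:
    """Return the block name for one character, or None if it is outside the table."""
    cp = ord(ch)
    for name, (lo, hi) in SCRIPT_BLOCKS.items():
        if lo <= cp <= hi:
            return name
    return None

def dominant_script(text: str, min_share: float = 0.5) -> Optional[str]:
    """Return the dominant non-Latin script, or None for Latin / mixed / empty input."""
    counts = Counter()
    total = 0
    for ch in text:
        # Count letters (L*) and combining marks (M*); skip digits, spaces, punctuation.
        if unicodedata.category(ch)[0] not in ("L", "M"):
            continue
        total += 1
        name = script_of(ch)
        if name is not None:
            counts[name] += 1
    if not counts:
        return None
    name, n = counts.most_common(1)[0]
    return name if n / total > min_share else None

print(dominant_script("नमस्ते"))   # Devanagari
print(dominant_script("namaste"))  # None
```

A `None` result would be the case that falls through to the transformer. Mapping a detected script to a language label is left out on purpose, since several labels share a script (Devanagari is used for hi, mr, ne and sa, for example) and this card does not describe how the project resolves that.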
Energy threshold
DEFAULT_ENERGY_THRESHOLD = -6.2 — calibrated on test_all.csv (n=1882, banking-flavoured): plateau optimum from −6.2 to −4.6 at 95.32% overall accuracy. The bare-noun-phrase and cased-EN training data added in v12-v15 push real banking-phrase energies to −10..−12, well below the threshold, so the gate primarily catches genuine OOD where the argmax is a target language but the logits aren't peaked.
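A threshold sweep of the kind described here could look roughly like the sketch below. It assumes you have already collected, for each evaluation example, the energy, the model's argmax label, and the gold label; the five records are invented for illustration and are not drawn from test_all.csv.

```python
def accuracy_at(threshold, records):
    """Accuracy after applying the energy gate at the given threshold.

    records: iterable of (energy, predicted_label, gold_label) tuples.
    """
    correct = 0
    for energy, pred, gold in records:
        final = "undetermined" if energy > threshold else pred
        correct += final == gold
    return correct / len(records)

def sweep(records, thresholds):
    """Map each candidate threshold to the accuracy it yields."""
    return {t: accuracy_at(t, records) for t in thresholds}

# Hypothetical evaluation records (energy, predicted, gold).
records = [
    (-12.7, "en", "en"),
    (-11.1, "hi", "hi"),
    (-6.8, "hi", "hi"),
    (-5.0, "ta", "undetermined"),
    (-3.9, "en", "undetermined"),
]

results = sweep(records, [-8.0, -6.2, -4.6, -3.0])
for threshold, acc in results.items():
    print(f"{threshold:+.1f}  {acc:.2f}")
```

On this toy data a threshold that is too tight (-8.0) gates a correct in-distribution prediction, while a loose one (-3.0) lets both OOD inputs through, which is the trade-off the calibration is balancing.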
v15 change log (relative to v11)
- v12: bare English banking noun phrases (3 000 rows × 4 case variants). Closes the gap where 2-4 word phrases like "loan approval" / "EMI bounce" tripped the energy gate even though softmax preferred en.
- v13: ALL CAPS variant in noun-phrase casers; case randomisation added to English conversational shorts. Closes "LOAN APPROVAL" → ml and capital-leading single-word greetings ("Hi", "Hello", "HEY") routing to hi.
- v14 (skipped, not published): drop the en-only guard so capitalised Hindi-Roman shorts ("Accha", "Shukriya") route to hi. Held capital fixes but regressed the dominant lowercase shape ("accha" → en, "Dhanyavaad" → pa).
- v15: keep the v14 greeting-sentences loader (2 500 short English greeting-led sentences with case variants; closes "Hey how are you" and "Hey, my loan approval is pending" routing to hi); revert the en-only guard drop. Lowercase Hindi-Roman shorts are back to the correct hi; capitalised Hindi-Roman shorts revert to v13 behaviour (an acceptable trade-off, since capitals are rare in real chat input).
Limitations
- Romanized Hindi/Urdu/Punjabi overlap is linguistic, not a bug. Romanized Urdu is post-mapped to Hindi at inference (Hindustani is one spoken language). Romanized Punjabi shares heavy lexical overlap with Hindustani.
- Native-script input should be handled by a Unicode script-detect short-circuit upstream, not sent to this model. The model handles native-script reasonably but a script-detect step is faster and 100 % accurate.
- Capitalised Hindi-Roman shorts like "Accha", "Shukriya" (rare in real chat) may revert to other Indic labels or get OOD-flagged. Lowercase forms classify correctly.
- Energy threshold is a calibration knob. Tightening to -7.0 increases the undetermined rate on short banking-only phrases; loosening reduces OOD recall. Sweep on a representative test set before changing.
Citation
If you use this model, please consider citing the underlying datasets (Bhasha-Abhijnaanam, Aksharantar, SST-2, Dakshina, Hinglish, MASSIVE, OffensEval-Dravidian, Bitext, FLORES-200) and the MuRIL paper (arxiv:2103.10730).