muril-lang-id-v15

Language identification for Indian banking chatbot messages. Detects 16 Indian languages plus English in both native script and Romanized form (Hinglish, Tanglish, Tenglish, Manglish, Kanglish, etc.), with an explicit undetermined class for out-of-distribution input.

Fine-tuned from google/muril-base-cased on a banking-flavoured mix that includes Bhasha-Abhijnaanam, Aksharantar, SST-2, Dakshina, Hinglish corpora, MASSIVE, OffensEval-Dravidian, Bitext retail-banking, plus synthetic banking sentences, banking noun phrases, greeting-led English sentences, conversational shorts, and OOD calibration sources (FLORES-200, Aksharantar non-target Indic, banking-style European, gibberish).

Labels

18 classes. Index → ISO code:

0 as  1 bn  2 en  3 gu  4 hi  5 kn  6 ks  7 ml  8 mr
9 ne  10 or  11 pa  12 sa  13 sd  14 ta  15 te  16 ur
17 undetermined  (OOD / non-target / gibberish)

undetermined is a learned class, trained directly on non-target Indic, FLORES non-target, banking-style European, and synthetic gibberish data. As a backup safety net for inputs the learned class misses, an energy-based gate also returns undetermined whenever -logsumexp(logits) > -6.2.
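To make the gate concrete, here is a minimal pure-Python sketch of the energy score. The two logit vectors are hypothetical values chosen for illustration, not outputs of this model; only the formula and the -6.2 threshold come from this card.

```python
import math

ENERGY_THRESHOLD = -6.2  # value quoted in this card


def energy(logits):
    """Return -logsumexp(logits), computed stably by subtracting the max."""
    m = max(logits)
    return -(m + math.log(sum(math.exp(z - m) for z in logits)))


def gated(logits):
    """True when the backup gate would override the argmax with 'undetermined'."""
    return energy(logits) > ENERGY_THRESHOLD


# Hypothetical 18-class logits: one confident prediction, one flat one.
peaked = [12.0] + [0.0] * 17
flat = [1.0] * 18

print(round(energy(peaked), 2), gated(peaked))  # -12.0 False
print(round(energy(flat), 2), gated(flat))      # -3.89 True
```

A single large logit drives the energy far below the threshold, while a flat logit vector stays above it and is gated, regardless of which class the argmax picks.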

Test metrics

Split	Accuracy	F1 (weighted)	F1 (macro)
test	0.9735	0.9735	0.9684

Held-out test split from the assembled training mix (78 163 examples, stratified by label, seed=42).
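The card does not say which tool produced these numbers. As a sketch, assuming the standard definitions, macro F1 averages per-class F1 equally while weighted F1 weights each class by its support. The toy labels below are made up for illustration.

```python
from collections import Counter


def f1_scores(y_true, y_pred):
    """Return (macro_f1, weighted_f1) for two equal-length label lists."""
    labels = sorted(set(y_true) | set(y_pred))
    support = Counter(y_true)
    per_class = {}
    for lab in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == lab and p == lab for t, p in pairs)
        fp = sum(t != lab and p == lab for t, p in pairs)
        fn = sum(t == lab and p != lab for t, p in pairs)
        denom = 2 * tp + fp + fn
        per_class[lab] = 2 * tp / denom if denom else 0.0
    macro = sum(per_class.values()) / len(labels)
    weighted = sum(per_class[lab] * support[lab] for lab in labels) / len(y_true)
    return macro, weighted


# Toy data: a frequent class scored well, a rare class scored poorly.
y_true = ["en"] * 8 + ["ks"] * 2
y_pred = ["en"] * 8 + ["en", "ks"]
macro, weighted = f1_scores(y_true, y_pred)
print(round(macro, 4), round(weighted, 4))
```

Under these definitions, a macro score below the weighted score (as in the table above) is what one would expect if some smaller classes score lower than the larger ones.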

Usage

The model is intended to run as the second stage of a two-stage cascade: Unicode script analysis first short-circuits any input dominated by an Indic native script (Devanagari, Tamil, Bengali, Gurmukhi, etc.), and only Latin or mixed-script input is sent to MuRIL. The order matters: reordering the stages changes the semantics of the cascade.

Quick standalone use:

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

LABELS = [
    "as","bn","en","gu","hi","kn","ks","ml","mr",
    "ne","or","pa","sa","sd","ta","te","ur","undetermined",
]
ENERGY_THRESHOLD = -6.2  # per-checkpoint calibration; see below

tok = AutoTokenizer.from_pretrained("dnivra26/muril-lang-id-v15")
model = AutoModelForSequenceClassification.from_pretrained("dnivra26/muril-lang-id-v15").eval()

@torch.inference_mode()
def classify(text: str) -> tuple[str, float]:
    inputs = tok(text, return_tensors="pt", truncation=True, max_length=128, padding=True)
    logits = model(**inputs).logits.squeeze(0)
    energy = -torch.logsumexp(logits, dim=0).item()
    label = LABELS[int(torch.softmax(logits, dim=0).argmax())]
    if energy > ENERGY_THRESHOLD or label == "undetermined":
        return "undetermined", energy
    if label == "ur":  # Romanized Urdu/Hindi are linguistically identical
        label = "hi"
    return label, energy

print(classify("Loan approval"))            # ('en', ~ -12.7)
print(classify("Hey, my card is blocked"))  # ('en', ~ -12.9)
print(classify("namaste"))                  # ('hi', ~ -11.1)
print(classify("vanakkam"))                 # ('ta', ...)
print(classify("नमस्ते"))                    # ('hi', ...) — but native-script inputs should ideally be handled by a script-detect short-circuit upstream

For the full cascade including a Unicode-block dominant-script short-circuit, see the project repo (script analysis is sub-microsecond and 100% accurate for native-script input, so the transformer is only invoked on Latin / mixed-script messages).
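The first stage could look roughly like the sketch below. This is not the project repo's implementation: the function names, the 0.5 share cutoff, and the choice of blocks are assumptions made for illustration. It only reports the dominant script and whether the transformer should be called; how the repo maps a script to a language label is not described in this card.

```python
# Unicode block ranges for a few scripts (from the Unicode block charts).
SCRIPT_BLOCKS = {
    "Arabic": (0x0600, 0x06FF),
    "Devanagari": (0x0900, 0x097F),
    "Bengali": (0x0980, 0x09FF),
    "Gurmukhi": (0x0A00, 0x0A7F),
    "Gujarati": (0x0A80, 0x0AFF),
    "Oriya": (0x0B00, 0x0B7F),
    "Tamil": (0x0B80, 0x0BFF),
    "Telugu": (0x0C00, 0x0C7F),
    "Kannada": (0x0C80, 0x0CFF),
    "Malayalam": (0x0D00, 0x0D7F),
}


def dominant_script(text, min_share=0.5):
    """Return the script covering more than min_share of counted characters, else None."""
    counts = {}
    for ch in text:
        if ch.isascii():
            # ASCII letters count as Latin; digits, spaces, punctuation are ignored.
            if ch.isalpha():
                counts["Latin"] = counts.get("Latin", 0) + 1
            continue
        cp = ord(ch)
        for name, (lo, hi) in SCRIPT_BLOCKS.items():
            if lo <= cp <= hi:
                counts[name] = counts.get(name, 0) + 1
                break
    total = sum(counts.values())
    if total == 0:
        return None
    name, n = max(counts.items(), key=lambda kv: kv[1])
    return name if n / total > min_share else None


def needs_model(text):
    """True when the text is Latin, mixed, or unrecognised and should go to MuRIL."""
    return dominant_script(text) in (None, "Latin")


print(dominant_script("नमस्ते"), needs_model("नमस्ते"))
print(dominant_script("namaste"), needs_model("namaste"))
```

Counting code points per block, rather than per word, keeps combining vowel signs in the same bucket as their base letters, so a native-script word is not diluted by its own marks.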

Energy threshold

DEFAULT_ENERGY_THRESHOLD = -6.2 (ENERGY_THRESHOLD in the snippet above) was calibrated on test_all.csv (n=1882, banking-flavoured): overall accuracy plateaus at its optimum of 95.32% for thresholds from −6.2 to −4.6. The bare-noun-phrase and cased-EN training data added in v12-v15 push the energies of real banking phrases down to between −10 and −12, well below the threshold, so the gate primarily catches genuine OOD input where the argmax is a target language but the logits aren't peaked.
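A calibration sweep of this kind can be sketched as follows. The record format, the function names, and the toy values are hypothetical; the gating rule mirrors the classify function in the Usage section (without the Urdu-to-Hindi post-mapping).

```python
def sweep(records, thresholds):
    """Map each threshold to accuracy.

    records: (energy, argmax_label, gold_label) triples, one per test message.
    """
    results = {}
    for th in thresholds:
        correct = 0
        for energy, pred, gold in records:
            final = "undetermined" if energy > th or pred == "undetermined" else pred
            correct += final == gold
        results[th] = correct / len(records)
    return results


def plateau(results, tol=1e-9):
    """Return (lowest, highest) threshold reaching the best accuracy.

    Assumes the best-scoring thresholds are contiguous; check the full
    results dict if that may not hold.
    """
    best = max(results.values())
    hits = [th for th, acc in results.items() if best - acc <= tol]
    return min(hits), max(hits)


# Toy records, not real model outputs.
records = [
    (-12.7, "en", "en"),
    (-11.1, "hi", "hi"),
    (-9.0, "ta", "ta"),
    (-5.0, "ml", "undetermined"),
    (-3.9, "hi", "undetermined"),
    (-8.0, "undetermined", "undetermined"),
]
results = sweep(records, [-10.0, -8.0, -6.0, -4.0])
print(results)
print(plateau(results))  # (-8.0, -6.0)
```

Picking the tight (most negative) end of the plateau gates more borderline input, while the loose end lets more of it through; on this toy data both ends score the same.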

v15 change log (relative to v11)

v12: added bare English banking noun phrases (3 000 rows × 4 case variants). Closes the gap where 2-4 word phrases like "loan approval" / "EMI bounce" tripped the energy gate even though softmax preferred en.
v13: added an ALL CAPS variant to the noun-phrase casers and case randomisation to the English conversational shorts. Fixes "LOAN APPROVAL" being classified as ml and capital-leading single-word greetings ("Hi", "Hello", "HEY") being routed to hi.
v14 (skipped, not published): dropped the en-only guard so that capitalised Hindi-Roman shorts ("Accha", "Shukriya") route to hi. This kept the capitalisation fixes but regressed the dominant lowercase shape ("accha" → en, "Dhanyavaad" → pa).
v15: keeps the v14 greeting-sentences loader (2 500 short English greeting-led sentences with case variants, which fixes "Hey how are you" and "Hey, my loan approval is pending" being routed to hi) and reverts the en-only guard drop. Lowercase Hindi-Roman shorts are back to the correct hi; capitalised Hindi-Roman shorts revert to v13 behaviour (an acceptable trade-off, since capitalisation is rare in real chat input).
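Case augmentation of this kind can be sketched as below. The card does not list which case variants the training pipeline generates, so the four used here (lowercase, sentence case, title case, ALL CAPS) and the function name are assumptions for illustration.

```python
def case_variants(phrase):
    """Return de-duplicated case variants of a phrase, in a fixed order."""
    candidates = [
        phrase.lower(),       # "loan approval"
        phrase.capitalize(),  # "Loan approval"
        phrase.title(),       # "Loan Approval"
        phrase.upper(),       # "LOAN APPROVAL"
    ]
    # dict.fromkeys keeps first occurrences and preserves insertion order.
    return list(dict.fromkeys(candidates))


print(case_variants("loan approval"))
print(case_variants("hi"))
```

Note that this simple version flattens acronyms ("EMI bounce" becomes "Emi bounce" in sentence case), so a real pipeline would probably want to keep the original casing as an extra variant.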

Limitations

Romanized Hindi/Urdu/Punjabi overlap is linguistic, not a bug. Romanized Urdu is post-mapped to Hindi at inference (Hindustani is one spoken language). Romanized Punjabi shares heavy lexical overlap with Hindustani.
Native-script input should be handled by a Unicode script-detect short-circuit upstream, not sent to this model. The model handles native-script input reasonably well, but a script-detect step is faster and 100% accurate.
Capitalised Hindi-Roman shorts like "Accha", "Shukriya" (rare in real chat) may revert to other Indic labels or get OOD-flagged. Lowercase forms classify correctly.
Energy threshold is a calibration knob. Tightening to -7.0 increases undetermined rate on short banking-only phrases; loosening reduces OOD recall. Sweep on a representative test set before changing.

Citation

If you use this model, please consider citing the underlying datasets (Bhasha-Abhijnaanam, Aksharantar, SST-2, Dakshina, Hinglish, MASSIVE, OffensEval-Dravidian, Bitext, FLORES-200) and the MuRIL paper (arXiv:2103.10730).
