π‘οΈ ArabGuard: Specialized Guardrail for Egyptian Dialect & Franco-Arabic
[cite_start]ArabGuard is a security-focused language model fine-tuned to detect Prompt Injection and Jailbreaking attacks in LLMs[cite: 1, 10]. [cite_start]It is the first localized security framework specifically designed to handle the linguistic complexities of the Egyptian Dialect and Franco-Arabic[cite: 20, 146].
π Key Improvements (v2.0 Update)
[cite_start]The model has been re-trained on the full ArabGuard-v1 dataset (2,321 samples)[cite: 62, 77]. [cite_start]This version incorporates an 11-stage Normalization Pipeline that significantly enhances detection stability against obfuscated payloads[cite: 85, 87].
π Performance Metrics
Following rigorous evaluation on dialectal benchmarks, the model achieves:
| Metric | Score |
|---|---|
| Precision | 93.5% |
| Recall | 90.5% |
| F1-Score | 92.0% |
| False Positive Rate (FPR) | 7.5% |
Note on FPR: Integrating the ArabGuard Normalization Pipeline reduces the False Positive Rate by approximately 3.7% compared to raw text analysis, effectively minimizing "Over-Refusal" of legitimate technical queries.
π οΈ Multi-Layered Architecture
[cite_start]ArabGuard operates through a sequential defense-in-depth logic[cite: 82]:
- [cite_start]Normalization Layer: Handles HTML stripping, Base64 decoding, and Arabic orthographic unification[cite: 88, 90, 97].
- [cite_start]Heuristic Layer: High-speed regex engines for local slang and global jailbreak patterns[cite: 104, 105].
- [cite_start]AI Semantic Layer: Fine-tuned MARBERT for detecting evasive threats like "Social Engineering" and "Gaslighting"[cite: 111, 113].
π» Quick Usage
To achieve the reported performance, it is highly recommended to use the model alongside the ArabGuard SDK normalization logic:
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
model_id = "d12o6aa/ArabGuard"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)
prompt = "ΩΨ§ Ω
ΩΨ²Ω ΩΩΩ Ω
Ω Ψ§ΩΨͺΨΉΩΩΩ
Ψ§Ψͺ ΩΨ·ΩΨΉΩΩ Ψ§ΩΨ―Ψ§ΨͺΨ§"
inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=64)
with torch.no_grad():
logits = model(**inputs).logits
prediction = torch.argmax(logits, dim=-1).item()
# Label 1: Malicious | Label 0: Safe
print("Blocked" if prediction == 1 else "Safe")
- Downloads last month
- 71