The Tatoxa System for Text Detoxification in Low-Resource Languages: The Case of Tatar
Abstract
Tatoxa is a state-of-the-art text detoxification system for the Tatar language that demonstrates superior performance over existing LLMs and highlights the challenges of cross-lingual transfer in low-resource settings.
Text detoxification, the automated detection and mitigation of abusive and harmful content, is essential for ensuring the safety of online communities and protecting users. However, low-resource languages such as Tatar have received little research attention. In this paper, we present Tatoxa, a novel state-of-the-art system for text detoxification in the Tatar language. Comparative experiments show that the proposed approach outperforms existing open-source and proprietary commercial LLMs on key quality metrics. We also introduce a new dataset for text detoxification in Tatar, designed for fine-tuning and evaluation in low-resource settings. Finally, cross-lingual transfer experiments indicate that transfer from other languages, including the culturally close Russian, performs significantly worse than training on native Tatar data, even when a large Russian corpus is available.