---
library_name: transformers
language:
- en
- fr
- it
- es
- ru
- uk
- tt
- ar
- hi
- ja
- zh
- he
- am
- de
license: openrail++
datasets:
- textdetox/multilingual_toxicity_dataset
metrics:
- f1
base_model:
- google-bert/bert-base-multilingual-cased
pipeline_tag: text-classification
tags:
- toxic
---

## Multilingual Toxicity Classifier for 15 Languages (2025)

This is an instance of [bert-base-multilingual-cased](https://huggingface.co/google-bert/bert-base-multilingual-cased) fine-tuned on a binary toxicity classification task using our updated (2025) dataset [textdetox/multilingual_toxicity_dataset](https://huggingface.co/datasets/textdetox/multilingual_toxicity_dataset).

The model now covers 15 languages from various language families:

| Language  | Code | F1 Score |
|-----------|------|---------|
| English   | en   | 0.9035  |
| Russian   | ru   | 0.9224  |
| Ukrainian | uk   | 0.9461  |
| German    | de   | 0.5181  |
| Spanish   | es   | 0.7291  |
| Arabic    | ar   | 0.5139  |
| Amharic   | am   | 0.6316  |
| Hindi     | hi   | 0.7268  |
| Chinese   | zh   | 0.6703  |
| Italian   | it   | 0.6485  |
| French    | fr   | 0.9125  |
| Hinglish  | hin  | 0.6850  |
| Hebrew    | he   | 0.8686  |
| Japanese  | ja   | 0.8644  |
| Tatar     | tt   | 0.6170  |
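The scores above are binary F1 on the toxic class. As a quick illustration of the metric (with made-up labels, not the actual evaluation data), it can be computed directly:

```python
def binary_f1(y_true, y_pred, positive=1):
    """Binary F1 on the positive (toxic) class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Dummy gold labels and predictions (1 = toxic, 0 = neutral)
score = binary_f1([0, 1, 1, 0, 1], [0, 1, 0, 0, 1])  # → 0.8
```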

## How to use

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained('textdetox/bert-multilingual-toxicity-classifier')
model = AutoModelForSequenceClassification.from_pretrained('textdetox/bert-multilingual-toxicity-classifier')

inputs = tokenizer("You are amazing!", return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

# idx 0 for neutral, idx 1 for toxic
prediction = logits.argmax(dim=-1).item()
```
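To turn the model's raw logits into class probabilities, a softmax over the two classes can be applied. The sketch below uses dummy logits in place of the model's actual output (the tensor values and variable names are illustrative only):

```python
import torch

# Dummy logits standing in for the classifier's output for one input;
# real values would come from the model's forward pass
logits = torch.tensor([[2.0, -1.5]])

# Softmax over the two classes: index 0 = neutral, index 1 = toxic
probs = torch.softmax(logits, dim=-1)
neutral_prob, toxic_prob = probs[0].tolist()
label = "toxic" if toxic_prob > neutral_prob else "neutral"
```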

## Citation
The model was prepared for the [TextDetox 2025 Shared Task](https://pan.webis.de/clef25/pan25-web/text-detoxification.html) evaluation.

```
@inproceedings{dementieva2025overview,
  title     = {Overview of the Multilingual Text Detoxification Task at PAN 2025},
  author    = {Dementieva, Daryna and
               Protasov, Vitaly and
               Babakov, Nikolay and
               Rizwan, Naquee and
               Alimova, Ilseyar and
               Brune, Caroline and
               Konovalov, Vasily and
               Muti, Arianna and
               Liebeskind, Chaya and
               Litvak, Marina and
               Nozza, Debora and
               Shah Khan, Shehryaar and
               Takeshita, Sotaro and
               Vanetik, Natalia and
               Ayele, Abinew Ali and
               Schneider, Florian and
               Wang, Xintong and
               Yimam, Seid Muhie and
               Elnagar, Ashraf and
               Mukherjee, Animesh and
               Panchenko, Alexander},
  booktitle = {Working Notes of CLEF 2025 -- Conference and Labs of the Evaluation Forum},
  editor    = {Guglielmo Faggioli and Nicola Ferro and Paolo Rosso and Damiano Spina},
  month     = sep,
  year      = {2025},
  publisher = {CEUR-WS.org},
  series    = {CEUR Workshop Proceedings},
  address   = {Vienna, Austria},
  url       = {https://ceur-ws.org/Vol-4038/paper_278.pdf}
}
```