How to use SebOchs/canine-c-lang-id with Transformers:
# Use a pipeline as a high-level helper
from transformers import pipeline
pipe = pipeline("text-classification", model="SebOchs/canine-c-lang-id")

# Load model directly
from transformers import AutoTokenizer, AutoModelForSequenceClassification
tokenizer = AutoTokenizer.from_pretrained("SebOchs/canine-c-lang-id")
model = AutoModelForSequenceClassification.from_pretrained("SebOchs/canine-c-lang-id")

Canine model trained on the WiLI-2018 dataset to identify the language of a text.
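Loading the model directly returns raw logits instead of labels. A minimal sketch of turning those logits into ranked label ids follows. The helper names `softmax`, `top_k` and `classify` are hypothetical and not part of this model card or of Transformers; `classify` assumes PyTorch is installed and that `tokenizer` and `model` were loaded as shown above.

```python
import math


def softmax(scores):
    """Convert a list of raw scores into probabilities that sum to 1."""
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def top_k(scores, k=3):
    """Return the k most probable (label_id, probability) pairs, best first."""
    probs = softmax(scores)
    ranked = sorted(enumerate(probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


def classify(text, tokenizer, model, k=3):
    """Hypothetical helper: run one text through the model and rank label ids.

    Assumes `tokenizer` and `model` come from AutoTokenizer and
    AutoModelForSequenceClassification, and that PyTorch is available.
    """
    import torch  # imported lazily so the helpers above work without PyTorch

    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():  # inference only, no gradients needed
        logits = model(**inputs).logits
    return top_k(logits[0].tolist(), k)
```

Calling `classify("Dies ist ein Satz.", tokenizer, model)` would then return label ids with probabilities; the ids can be mapped to language names with the dictionary described below.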
The following function builds a dictionary that maps each label id to the English name of its language:
import datasets
import pycountry
def int_to_lang():
    dataset = datasets.load_dataset('wili_2018')
    # names for languages not in iso-639-3 from wikipedia
    non_iso_languages = {'roa-tara': 'Tarantino', 'zh-yue': 'Cantonese', 'map-bms': 'Banyumasan',
                         'nds-nl': 'Dutch Low Saxon', 'be-tarask': 'Belarusian'}
    # create dictionary from data set labels to language names
    lab_to_lang = {}
    for i, lang in enumerate(dataset['train'].features['label'].names):
        full_lang = pycountry.languages.get(alpha_3=lang)
        if full_lang:
            lab_to_lang[i] = full_lang.name
        else:
            lab_to_lang[i] = non_iso_languages[lang]
    return lab_to_lang
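The dictionary can be combined with the pipeline output to show readable language names. The sketch below assumes the pipeline reports labels in the Transformers default form `LABEL_<id>`; if the model config defines its own label names, those are returned unchanged. The helper `label_to_name` is hypothetical and not part of this model card.

```python
def label_to_name(label, lab_to_lang):
    """Map a pipeline label such as 'LABEL_12' to a language name.

    Assumption: labels follow the default 'LABEL_<id>' pattern. Anything
    else, or an id missing from the dictionary, is returned as-is.
    """
    prefix = "LABEL_"
    suffix = label[len(prefix):]
    if label.startswith(prefix) and suffix.isdigit():
        return lab_to_lang.get(int(suffix), label)
    return label
```

With `lab_to_lang = int_to_lang()` and `result = pipe("Dies ist ein Satz.")`, the name for the top prediction would be `label_to_name(result[0]["label"], lab_to_lang)`.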
Citations for the Canine model and the WiLI-2018 dataset:
@article{clark-etal-2022-canine,
title = "Canine: Pre-training an Efficient Tokenization-Free Encoder for Language Representation",
author = "Clark, Jonathan H. and
Garrette, Dan and
Turc, Iulia and
Wieting, John",
journal = "Transactions of the Association for Computational Linguistics",
volume = "10",
year = "2022",
address = "Cambridge, MA",
publisher = "MIT Press",
url = "https://aclanthology.org/2022.tacl-1.5",
doi = "10.1162/tacl_a_00448",
pages = "73--91",
abstract = "Pipelined NLP systems have largely been superseded by end-to-end neural modeling, yet nearly all commonly used models still require an explicit tokenization step. While recent tokenization approaches based on data-derived subword lexicons are less brittle than manually engineered tokenizers, these techniques are not equally suited to all languages, and the use of any fixed vocabulary may limit a model{'}s ability to adapt. In this paper, we present Canine, a neural encoder that operates directly on character sequences{---}without explicit tokenization or vocabulary{---}and a pre-training strategy that operates either directly on characters or optionally uses subwords as a soft inductive bias. To use its finer-grained input effectively and efficiently, Canine combines downsampling, which reduces the input sequence length, with a deep transformer stack, which encodes context. Canine outperforms a comparable mBert model by 5.7 F1 on TyDi QA, a challenging multilingual benchmark, despite having fewer model parameters.",
}
@dataset{thoma_martin_2018_841984,
author = {Thoma, Martin},
title = {{WiLI-2018 - Wikipedia Language Identification
database}},
month = jan,
year = 2018,
publisher = {Zenodo},
version = {1.0.0},
doi = {10.5281/zenodo.841984},
url = {https://doi.org/10.5281/zenodo.841984}
}