---
language:
  - tt
license: apache-2.0
tags:
  - tatar
  - tokenizers
  - nlp
  - language-modeling
  - sentencepiece
  - bpe
  - unigram
  - wordpiece
---

TatarTokenizers - Tatar Subword Tokenizers

High-quality pretrained tokenizers for the Tatar language

This repository contains four specialized tokenizers for Tatar, trained on a cleaned 103M-token corpus with different algorithms. On Tatar text they segment far more efficiently than generic multilingual tokenizers, and they are optimized for Tatar NLP tasks and language model training.

🏆 Model Performance

Tokenizer Comparison

| Algorithm | Vocabulary Size | Best For | HF AutoTokenizer |
|---|---|---|---|
| BPE | 8000 | General purpose, fast inference | ✅ Yes |
| WordPiece | 8000 | Stable behavior, balanced performance | ✅ Yes |
| Unigram | 16000 | LLM training, smooth distributions | ✅ Yes |
| SentencePiece | 32000 | Morphological coverage, OOV handling | ⚠️ T5Tokenizer |

📈 Training Results

Final Training Metrics

============================================================
BPE          | Run: v8000_mf2       | OOV: 0.00% | AvgLen: 96.0 | Time: 105.9s
WORDPIECE    | Run: v8000_mf1       | OOV: 0.00% | AvgLen: 95.4 | Time: 124.3s
UNIGRAM      | Run: v16000          | OOV: 0.00% | AvgLen: 90.9 | Time: 614.1s
SPM          | Run: v32000          | OOV: 0.00% | AvgLen: 86.7 | Time: 249.8s
============================================================

Metric Explanation:

  • OOV: Out-of-Vocabulary rate (0% = perfect coverage)
  • AvgLen: Average sequence length in tokens (lower = better compression)
  • Time: Training time in seconds

Key Findings

  • All tokenizers achieved 0% OOV on the test corpus, demonstrating complete vocabulary coverage
  • SentencePiece provides the best compression (lowest AvgLen), helped by its larger vocabulary
  • BPE is the fastest to train while maintaining excellent compression
  • Unigram offers balanced compression despite its longer training time
  • All models show consistent behavior across different text domains
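
The evaluation corpus and exact averaging scheme behind these numbers are not included in this card, but the two metrics are straightforward to reproduce on your own data. A minimal sketch, assuming the BPE tokenizer and two sample sentences reused from the usage examples further down:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "arabovs-ai-lab/TatarTokenizers", subfolder="bpe"
)

sample_texts = [
    "Татарча текстларны эшкәртү — кызыклы бурыч.",
    "Мин татарча сөйлим.",
]

total_tokens = 0
unk_count = 0
for text in sample_texts:
    tokens = tokenizer.tokenize(text)
    total_tokens += len(tokens)
    # Count tokens that fall back to the unknown token, if one is defined
    unk_count += sum(1 for t in tokens if t == tokenizer.unk_token)

print(f"OOV rate: {100 * unk_count / total_tokens:.2f}%")
print(f"AvgLen:   {total_tokens / len(sample_texts):.1f} tokens per text")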

📊 Model Details

BPE Tokenizer

Vocabulary: 8000 | Compression: 2.8x

  • Architecture: Byte-Pair Encoding
  • Best for: General purpose NLP, fast inference
  • Format: tokenizer.json + HuggingFace compatible

WordPiece Tokenizer

Vocabulary: 8000 | Compression: 2.7x

  • Architecture: WordPiece (BERT-style)
  • Best for: Stable training, balanced performance
  • Format: tokenizer.json + HuggingFace compatible

Unigram Tokenizer

Vocabulary: 16000 | Compression: 3.1x

  • Architecture: Unigram Language Model
  • Best for: LLM training, smooth length distributions
  • Format: tokenizer.json + HuggingFace compatible

SentencePiece Tokenizer

Vocabulary: 32000 | Compression: 3.4x

  • Architecture: SentencePiece with Unigram
  • Best for: Morphological coverage, OOV handling
  • Format: spiece.model (requires T5Tokenizer)
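
The compression ratios above are not formally defined in this card; a common proxy is characters per token, which you can check directly on your own text. A minimal sketch, assuming the BPE tokenizer and one sample sentence from the usage examples below:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "arabovs-ai-lab/TatarTokenizers", subfolder="bpe"
)

text = "Татарча текстларны эшкәртү — кызыклы бурыч."
tokens = tokenizer.tokenize(text)

# Characters per token as a rough compression proxy
print(f"{len(text)} chars -> {len(tokens)} tokens "
      f"(~{len(text) / len(tokens):.1f} chars per token)")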

📚 Training Corpus

  • Total Tokens: 207.02M
  • Unique Words: 2.1M
  • Vocabulary: 637.7K words
  • Models Analyzed: 22

Corpus Domains

| Domain | Documents |
|---|---|
| belgech.ru | 46 |
| intertat.tatar | 19.5K |
| matbugat.ru | 44.9K |
| azatliq.org | 8.1K |
| tatar-inform.tatar | 1.5K |
| mamadysh-rt | 1.2K |
| vk.com | 6.5K |
| shahrikazan.ru | 2.4K |
| vatantat.ru | 119 |
| Wikipedia | 456.1K |
| Books | 876 |

🚀 Quick Start

Installation

pip install transformers huggingface_hub sentencepiece

The sentencepiece package is only needed if you load the SentencePiece tokenizer via T5Tokenizer.

Load BPE/WordPiece/Unigram Tokenizers

from transformers import AutoTokenizer

# Load BPE tokenizer (recommended for general use)
tokenizer = AutoTokenizer.from_pretrained(
    "arabovs-ai-lab/TatarTokenizers",
    subfolder="bpe"
)

# Or load WordPiece
tokenizer = AutoTokenizer.from_pretrained(
    "arabovs-ai-lab/TatarTokenizers", 
    subfolder="wordpiece"
)

# Or load Unigram
tokenizer = AutoTokenizer.from_pretrained(
    "arabovs-ai-lab/TatarTokenizers",
    subfolder="unigram" 
)

Load SentencePiece Tokenizer

from transformers import T5Tokenizer

# SentencePiece requires T5Tokenizer
tokenizer = T5Tokenizer.from_pretrained(
    "arabovs-ai-lab/TatarTokenizers",
    subfolder="sentencepiece"
)

💡 Usage Examples

Basic Text Processing

text = "Татарча текстларны эшкәртү — кызыклы бурыч."

# Encode text
ids = tokenizer.encode(text)
print("Token IDs:", ids)

# Decode back to text
decoded = tokenizer.decode(ids)
print("Decoded:", decoded)

# Get tokens
tokens = tokenizer.tokenize(text)
print("Tokens:", tokens)

Batch Processing

texts = [
    "Мин татарча сөйлим.",
    "Без модельләр төзибез.",
    "Тел эшкәртү технологияләре үсеш ала."
]

# Batch encode with padding
batch = tokenizer(
    texts,
    padding=True,
    truncation=True,
    max_length=128,
    return_tensors="pt"
)

print("Batch input IDs:", batch["input_ids"])
print("Attention mask:", batch["attention_mask"])

Vocabulary Analysis

# Check vocabulary size
vocab_size = tokenizer.vocab_size
print(f"Vocabulary size: {vocab_size}")

# Get special tokens
special_tokens = tokenizer.special_tokens_map
print("Special tokens:", special_tokens)

# Look up the ID of a single vocabulary entry
# (returns the unk token ID if the string is not in the vocabulary as one token)
token_id = tokenizer.convert_tokens_to_ids("татарча")
print(f"'татарча' token ID: {token_id}")

Different Tokenizers Comparison

from transformers import AutoTokenizer, T5Tokenizer

def compare_tokenizers(text):
    """Compare different tokenizers on the same text"""
    
    tokenizers = {
        "BPE": AutoTokenizer.from_pretrained("arabovs-ai-lab/TatarTokenizers", subfolder="bpe"),
        "WordPiece": AutoTokenizer.from_pretrained("arabovs-ai-lab/TatarTokenizers", subfolder="wordpiece"), 
        "Unigram": AutoTokenizer.from_pretrained("arabovs-ai-lab/TatarTokenizers", subfolder="unigram"),
        "SentencePiece": T5Tokenizer.from_pretrained("arabovs-ai-lab/TatarTokenizers", subfolder="sentencepiece")
    }
    
    print(f"Text: {text}")
    print("=" * 50)
    
    for name, tok in tokenizers.items():
        tokens = tok.tokenize(text)
        ids = tok.encode(text)
        print(f"{name:12} | Tokens: {len(tokens):2d} | IDs: {ids}")
        print(f"{'':12} | {tokens}")

# Test with different texts
test_texts = [
    "Татар теле морфологик бай тел.",
    "Безнең модельләр яхшы эшли.",
    "Синтетик телләрдә токенизация катлаулырак."
]

for text in test_texts:
    compare_tokenizers(text)
    print("\n")

Advanced Features

# Save and load local copy
tokenizer.save_pretrained("./my-tatar-tokenizer")
loaded_tokenizer = AutoTokenizer.from_pretrained("./my-tatar-tokenizer")

# Add new tokens (add_tokens returns how many were actually added)
new_tokens = ["GPT", "Transformer", "BERT"]
num_added = tokenizer.add_tokens(new_tokens)
print(f"Added {num_added} new tokens")

# Text generation preparation
prompt = "Татарстанда "
inputs = tokenizer(prompt, return_tensors="pt")
print("Generation inputs:", inputs)

Language Model Training Ready

from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained(
    "arabovs-ai-lab/TatarTokenizers",
    subfolder="unigram"  # Recommended for LLM training
)

# Data collator for masked language modeling
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=False,  # Set to True for BERT-style training
    return_tensors="pt"
)

# Example training batch
batch = data_collator([{"input_ids": [0, 1, 2, 3, 4]}] * 8)
print("Training batch ready:", batch.keys())

🎯 Model Recommendations

| Use Case | Recommended Tokenizer | Reason |
|---|---|---|
| General NLP | BPE | Balanced performance, fast |
| BERT-style Training | WordPiece | Stable, proven architecture |
| LLM Training | Unigram | Smooth distributions, 16K vocab |
| Research | SentencePiece | Best morphological coverage |
| Production | BPE/WordPiece | HF native, easy deployment |
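
The mapping in this table can be wrapped in a small loader. This helper is not part of the repository, just a hypothetical convenience assuming the subfolder names shown in the repository structure below:

from transformers import AutoTokenizer, T5Tokenizer

# Hypothetical mapping from the use cases above to repository subfolders
SUBFOLDER_BY_USE_CASE = {
    "general": "bpe",
    "bert": "wordpiece",
    "llm": "unigram",
    "research": "sentencepiece",
}

def load_tatar_tokenizer(use_case: str = "general"):
    subfolder = SUBFOLDER_BY_USE_CASE[use_case]
    cls = T5Tokenizer if subfolder == "sentencepiece" else AutoTokenizer
    return cls.from_pretrained("arabovs-ai-lab/TatarTokenizers", subfolder=subfolder)

tokenizer = load_tatar_tokenizer("llm")
print(type(tokenizer).__name__, tokenizer.vocab_size)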

📦 Repository Structure

TatarTokenizers/
├── bpe/
│   ├── tokenizer.json
│   ├── tokenizer_config.json
│   └── special_tokens_map.json
├── wordpiece/
│   ├── tokenizer.json
│   ├── tokenizer_config.json
│   └── special_tokens_map.json
├── unigram/
│   ├── tokenizer.json
│   ├── tokenizer_config.json
│   └── special_tokens_map.json
└── sentencepiece/
    ├── spiece.model
    ├── spiece.vocab
    └── tokenizer_config.json
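
If you only need one tokenizer from this tree rather than the whole repository, huggingface_hub's snapshot_download can filter files by pattern. A minimal sketch for the bpe/ subfolder:

from huggingface_hub import snapshot_download

# Download only the files under bpe/ shown in the tree above
local_dir = snapshot_download(
    repo_id="arabovs-ai-lab/TatarTokenizers",
    allow_patterns=["bpe/*"],
)
print("Downloaded to:", local_dir)

The local copy can then be loaded offline with AutoTokenizer.from_pretrained(local_dir, subfolder="bpe").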

📜 Citation

@misc{TatarTokenizers2025,
  title = {TatarTokenizers: High-quality Tatar Subword Tokenizers},
  author = {Arabovs AI Lab},
  year = 2025,
  publisher = {Hugging Face},
  url = {https://huggingface.co/arabovs-ai-lab/TatarTokenizers}
}

📄 License

Apache 2.0 License


Last updated: 2025-11-20
Training corpus: 103M tokens
OOV rate: 0% on test data
Best for: Tatar NLP and LLM training