---
language:
- tt
license: apache-2.0
tags:
- tatar
- tokenizers
- nlp
- language-modeling
- sentencepiece
- bpe
- unigram
- wordpiece
---

# TatarTokenizers - Tatar Subword Tokenizers

**High-quality pretrained tokenizers for the Tatar language**

This repository contains 4 specialized tokenizers for Tatar, trained on a cleaned 103M-token corpus using different algorithms. These tokenizers significantly outperform generic multilingual tokenizers and are optimized for Tatar NLP tasks and language model training.

## 🏆 Model Performance

### Tokenizer Comparison

| Algorithm | Vocabulary Size | Best For | HF AutoTokenizer |
|-----------|-----------------|----------|------------------|
| **BPE** | 8000 | General purpose, fast inference | ✅ Yes |
| **WordPiece** | 8000 | Stable behavior, balanced performance | ✅ Yes |
| **Unigram** | 16000 | LLM training, smooth distributions | ✅ Yes |
| **SentencePiece** | 32000 | Morphological coverage, OOV handling | ⚠️ T5Tokenizer |

## 📈 Training Results

### Final Training Metrics

```
============================================================
BPE       | Run: v8000_mf2 | OOV: 0.00% | AvgLen: 96.0 | Time: 105.9s
WORDPIECE | Run: v8000_mf1 | OOV: 0.00% | AvgLen: 95.4 | Time: 124.3s
UNIGRAM   | Run: v16000    | OOV: 0.00% | AvgLen: 90.9 | Time: 614.1s
SPM       | Run: v32000    | OOV: 0.00% | AvgLen: 86.7 | Time: 249.8s
============================================================
```

**Metric Explanation** (a reproduction sketch follows the key findings below):

- **OOV**: Out-of-Vocabulary rate (0% = perfect coverage)
- **AvgLen**: Average sequence length in tokens (lower = better compression)
- **Time**: Training time in seconds

### Key Findings

- **All tokenizers achieved 0% OOV** on the test corpus, demonstrating perfect vocabulary coverage
- **SentencePiece provides the best compression** (lowest AvgLen) thanks to its larger vocabulary
- **BPE is the fastest to train** while maintaining excellent performance
- **Unigram offers balanced compression** despite the longest training time
- **All models show consistent behavior** across different text domains
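The evaluation script is not published with this card, but the OOV and AvgLen figures can be approximated for any of the tokenizers. A minimal sketch, assuming a held-out file `test_corpus.txt` with one document per line and tokens-per-document as the AvgLen convention (both the file and the convention are assumptions, so exact numbers may differ):

```python
from transformers import AutoTokenizer

# Hypothetical held-out file, one document per line; the actual test
# corpus behind the numbers above is not distributed with this repo.
CORPUS_PATH = "test_corpus.txt"

tokenizer = AutoTokenizer.from_pretrained(
    "arabovs-ai-lab/TatarTokenizers", subfolder="bpe"
)
unk_id = tokenizer.unk_token_id  # may be None; then OOV is 0 by construction

total_tokens = unk_tokens = total_chars = num_docs = 0
with open(CORPUS_PATH, encoding="utf-8") as f:
    for line in f:
        line = line.strip()
        if not line:
            continue
        ids = tokenizer.encode(line, add_special_tokens=False)
        total_tokens += len(ids)
        unk_tokens += sum(1 for i in ids if i == unk_id)
        total_chars += len(line)
        num_docs += 1

print(f"OOV:    {100 * unk_tokens / total_tokens:.2f}%")
print(f"AvgLen: {total_tokens / num_docs:.1f} tokens per document")
print(f"Chars per token: {total_chars / total_tokens:.2f}")  # compression proxy
```

The chars-per-token ratio is one plausible reading of the "Compression" figures in the model details below; the card does not state the exact definition.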
## 📊 Model Details

### BPE Tokenizer

**Vocabulary**: 8000 | **Compression**: 2.8x

- **Architecture**: Byte-Pair Encoding
- **Best for**: General purpose NLP, fast inference
- **Format**: `tokenizer.json` + HuggingFace compatible

### WordPiece Tokenizer

**Vocabulary**: 8000 | **Compression**: 2.7x

- **Architecture**: WordPiece (BERT-style)
- **Best for**: Stable training, balanced performance
- **Format**: `tokenizer.json` + HuggingFace compatible

### Unigram Tokenizer

**Vocabulary**: 16000 | **Compression**: 3.1x

- **Architecture**: Unigram Language Model
- **Best for**: LLM training, smooth length distributions
- **Format**: `tokenizer.json` + HuggingFace compatible

### SentencePiece Tokenizer

**Vocabulary**: 32000 | **Compression**: 3.4x

- **Architecture**: SentencePiece with Unigram
- **Best for**: Morphological coverage, OOV handling
- **Format**: `spiece.model` (requires T5Tokenizer)

## 📚 Training Corpus

- **Total Tokens**: 207.02M
- **Unique Words**: 2.1M
- **Vocabulary**: 637.7K words
- **Models Analyzed**: 22

### Corpus Domains

| Domain | Documents |
|--------|-----------|
| belgech.ru | 46 |
| intertat.tatar | 19.5K |
| matbugat.ru | 44.9K |
| azatliq.org | 8.1K |
| tatar-inform.tatar | 1.5K |
| mamadysh-rt | 1.2K |
| vk.com | 6.5K |
| shahrikazan.ru | 2.4K |
| vatantat.ru | 119 |
| Wikipedia | 456.1K |
| Books | 876 |
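The training pipeline itself is not included in this repository. As a rough illustration, a comparable 8K-vocabulary BPE tokenizer can be trained on such a corpus with the Hugging Face `tokenizers` library; the corpus file name, pre-tokenizer, and special-token set below are illustrative assumptions, and `min_frequency=2` is only a guess based on the `v8000_mf2` run name:

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# Hypothetical corpus file; the cleaned training corpus is not distributed.
files = ["tatar_corpus.txt"]

# BPE model with an explicit unknown token
tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

trainer = trainers.BpeTrainer(
    vocab_size=8000,  # matches the published BPE vocabulary size
    min_frequency=2,  # assumption inferred from the "v8000_mf2" run name
    special_tokens=["[UNK]", "[PAD]", "[CLS]", "[SEP]", "[MASK]"],
)
tokenizer.train(files, trainer)
tokenizer.save("tokenizer.json")
```

The saved `tokenizer.json` can be wrapped with `transformers.PreTrainedTokenizerFast(tokenizer_file="tokenizer.json")` to get the same interface as the tokenizers loaded in the Quick Start below.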
## 🚀 Quick Start

### Installation

```bash
pip install transformers huggingface_hub sentencepiece
```

(`sentencepiece` is only needed for the T5Tokenizer-based SentencePiece model.)

### Load BPE/WordPiece/Unigram Tokenizers

```python
from transformers import AutoTokenizer

# Load BPE tokenizer (recommended for general use)
tokenizer = AutoTokenizer.from_pretrained(
    "arabovs-ai-lab/TatarTokenizers",
    subfolder="bpe"
)

# Or load WordPiece
tokenizer = AutoTokenizer.from_pretrained(
    "arabovs-ai-lab/TatarTokenizers",
    subfolder="wordpiece"
)

# Or load Unigram
tokenizer = AutoTokenizer.from_pretrained(
    "arabovs-ai-lab/TatarTokenizers",
    subfolder="unigram"
)
```

### Load SentencePiece Tokenizer

```python
from transformers import T5Tokenizer

# The SentencePiece model requires T5Tokenizer
tokenizer = T5Tokenizer.from_pretrained(
    "arabovs-ai-lab/TatarTokenizers",
    subfolder="sentencepiece"
)
```

## 💡 Usage Examples

### Basic Text Processing

```python
text = "Татарча текстларны эшкәртү — кызыклы бурыч."

# Encode text
ids = tokenizer.encode(text)
print("Token IDs:", ids)

# Decode back to text
decoded = tokenizer.decode(ids)
print("Decoded:", decoded)

# Get tokens
tokens = tokenizer.tokenize(text)
print("Tokens:", tokens)
```

### Batch Processing

```python
texts = [
    "Мин татарча сөйлим.",
    "Без модельләр төзибез.",
    "Тел эшкәртү технологияләре үсеш ала."
]

# Batch encode with padding
batch = tokenizer(
    texts,
    padding=True,
    truncation=True,
    max_length=128,
    return_tensors="pt"
)

print("Batch input IDs:", batch["input_ids"])
print("Attention mask:", batch["attention_mask"])
```

### Vocabulary Analysis

```python
# Check vocabulary size
vocab_size = tokenizer.vocab_size
print(f"Vocabulary size: {vocab_size}")

# Get special tokens
special_tokens = tokenizer.special_tokens_map
print("Special tokens:", special_tokens)

# Check the token ID for a specific word
token_id = tokenizer.convert_tokens_to_ids("татарча")
print(f"'татарча' token ID: {token_id}")
```

### Comparing the Tokenizers

```python
from transformers import AutoTokenizer, T5Tokenizer

def compare_tokenizers(text):
    """Compare the four tokenizers on the same text."""
    tokenizers = {
        "BPE": AutoTokenizer.from_pretrained("arabovs-ai-lab/TatarTokenizers", subfolder="bpe"),
        "WordPiece": AutoTokenizer.from_pretrained("arabovs-ai-lab/TatarTokenizers", subfolder="wordpiece"),
        "Unigram": AutoTokenizer.from_pretrained("arabovs-ai-lab/TatarTokenizers", subfolder="unigram"),
        "SentencePiece": T5Tokenizer.from_pretrained("arabovs-ai-lab/TatarTokenizers", subfolder="sentencepiece")
    }

    print(f"Text: {text}")
    print("=" * 50)

    for name, tok in tokenizers.items():
        tokens = tok.tokenize(text)
        ids = tok.encode(text)
        print(f"{name:12} | Tokens: {len(tokens):2d} | IDs: {ids}")
        print(f"{'':12} | {tokens}")

# Test with different texts
test_texts = [
    "Татар теле морфологик бай тел.",
    "Безнең модельләр яхшы эшли.",
    "Синтетик телләрдә токенизация катлаулырак."
]

for text in test_texts:
    compare_tokenizers(text)
    print("\n")
```

### Advanced Features

```python
# Save and reload a local copy
tokenizer.save_pretrained("./my-tatar-tokenizer")
loaded_tokenizer = AutoTokenizer.from_pretrained("./my-tatar-tokenizer")

# Add new tokens (add_tokens returns how many were actually added)
new_tokens = ["GPT", "Transformer", "BERT"]
num_added = tokenizer.add_tokens(new_tokens)
print(f"Added {num_added} new tokens")

# Prepare inputs for text generation
prompt = "Татарстанда "
inputs = tokenizer(prompt, return_tensors="pt")
print("Generation inputs:", inputs)
```

### Language Model Training

```python
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained(
    "arabovs-ai-lab/TatarTokenizers",
    subfolder="unigram"  # Recommended for LLM training
)

# Data collator for causal language modeling
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=False,  # Set to True for BERT-style masked-LM training
    return_tensors="pt"
)

# Example training batch
batch = data_collator([{"input_ids": [0, 1, 2, 3, 4]}] * 8)
print("Training batch ready:", batch.keys())
```

## 🎯 Model Recommendations

| Use Case | Recommended Tokenizer | Reason |
|----------|-----------------------|--------|
| **General NLP** | BPE | Balanced performance, fast |
| **BERT-style Training** | WordPiece | Stable, proven architecture |
| **LLM Training** | Unigram | Smooth distributions, 16K vocab |
| **Research** | SentencePiece | Best morphological coverage |
| **Production** | BPE/WordPiece | HF native, easy deployment |

## 📦 Repository Structure

```
TatarTokenizers/
├── bpe/
│   ├── tokenizer.json
│   ├── tokenizer_config.json
│   └── special_tokens_map.json
├── wordpiece/
│   ├── tokenizer.json
│   ├── tokenizer_config.json
│   └── special_tokens_map.json
├── unigram/
│   ├── tokenizer.json
│   ├── tokenizer_config.json
│   └── special_tokens_map.json
└── sentencepiece/
    ├── spiece.model
    ├── spiece.vocab
    └── tokenizer_config.json
```

## 📜 Citation

```bibtex
@misc{TatarTokenizers2025,
  title     = {TatarTokenizers: High-quality Tatar Subword Tokenizers},
  author    = {Arabovs AI Lab},
  year      = {2025},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/arabovs-ai-lab/TatarTokenizers}
}
```

## 📄 License

Apache 2.0 License

---

*Last updated: 2025-11-20*
*Training corpus: 103M tokens*
*OOV rate: 0% on test data*
*Best for: Tatar NLP and LLM training*