---
language:
- en
- az
license: apache-2.0
tags:
- sentencepiece
- unigram
- tokenizer
- azerbaijani
- english
- bilingual
---

# Bilingual Azerbaijani-English Unigram Tokenizer (az-en-unigram-tokenizer-50k)

This repository contains a SentencePiece Unigram tokenizer trained on a bilingual corpus of Azerbaijani and English text. It is designed for tasks involving both languages, such as training bilingual sentence embeddings, machine translation, or cross-lingual information retrieval.

## Tokenizer Details

* **Type:** SentencePiece Unigram
* **Languages:** Azerbaijani (az), English (en)
* **Vocabulary Size:** Approximately 50,000 (the actual size may be slightly larger due to special tokens, e.g., 50,001).
* **Training Data:** A parallel corpus of ~4.14 million sentence pairs (~8.28 million sentences in total), balanced between Azerbaijani and English.
* **Normalization:** NFKC Unicode normalization (the SentencePiece default).
* **Character Coverage:** 0.9995, ensuring good coverage of Azerbaijani-specific characters (ç, ö, ə, ü, ğ, ş).

### Special Tokens

The tokenizer includes the following special tokens, which are standard for many transformer-based models:

* `[UNK]` (Unknown Token): ID `0`
* `[CLS]` (Classification Token / Start of Sequence): ID `1`
* `[SEP]` (Separator Token / End of Sequence): ID `2`
* `[MASK]` (Mask Token): ID `3`
* `[PAD]` (Padding Token): ID `50000`

Note: the `[PAD]` ID depends on how it was assigned during training (the training output reports it as, e.g., `PAD token: '[PAD]' (ID: XXXXX)`), so verify it against the released files if your use case relies on it.
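Since the `[PAD]` ID depends on the training output, a quick sanity check against the published files confirms the IDs listed above. This is a minimal sketch using the same `transformers` loading shown in the usage section below; the printed values reflect whatever is stored in the uploaded tokenizer:

```python
from transformers import AutoTokenizer

# Load the published tokenizer and print the ID stored for each special token.
tokenizer = AutoTokenizer.from_pretrained("LocalDoc/az-en-unigram-tokenizer-50k")

for token in ["[UNK]", "[CLS]", "[SEP]", "[MASK]", "[PAD]"]:
    print(f"{token}: {tokenizer.convert_tokens_to_ids(token)}")

# pad_token_id and the vocabulary size should be consistent with the list above.
print("pad_token_id:", tokenizer.pad_token_id)
print("vocab size:", len(tokenizer))
```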
## Intended Use

This tokenizer is intended to be used with transformer models for tasks that require processing both Azerbaijani and English text. It can be particularly useful for:

* Initializing the tokenizer for new bilingual (Azerbaijani-English) sentence transformer models.
* Fine-tuning multilingual models on Azerbaijani-English data.
* Pre-training new models from scratch on Azerbaijani and English text.

## How to Use

You can use this tokenizer directly with the `transformers` library:

```python
from transformers import AutoTokenizer

tokenizer_id = "LocalDoc/az-en-unigram-tokenizer-50k"

try:
    tokenizer = AutoTokenizer.from_pretrained(tokenizer_id)
    print(f"Tokenizer loaded successfully from {tokenizer_id}!")
except Exception as e:
    print("Failed to load tokenizer. Make sure 'sentencepiece_model_pb2.py' is available "
          "or that 'protobuf' is installed if the tokenizer loading mechanism requires it.")
    print(f"Error: {e}")

# As a fallback in minimal environments, you may need to install protobuf and fetch
# sentencepiece_model_pb2.py manually, then retry loading:
# !pip install protobuf
# !wget https://raw.githubusercontent.com/google/sentencepiece/master/python/src/sentencepiece/sentencepiece_model_pb2.py
# tokenizer = AutoTokenizer.from_pretrained(tokenizer_id)

# Example Azerbaijani text
az_text = "Bu, Azərbaycan dilində bir test cümləsidir."
encoded_az = tokenizer.encode(az_text)
tokens_az = tokenizer.convert_ids_to_tokens(encoded_az)
print(f"Azerbaijani Original: {az_text}")
print(f"Azerbaijani Encoded IDs: {encoded_az}")
print(f"Azerbaijani Tokens: {tokens_az}")

# Example English text
en_text = "This is a test sentence in English."
encoded_en = tokenizer.encode(en_text)
tokens_en = tokenizer.convert_ids_to_tokens(encoded_en)
print(f"\nEnglish Original: {en_text}")
print(f"English Encoded IDs: {encoded_en}")
print(f"English Tokens: {tokens_en}")

# Example with special tokens
special_text = "[CLS] Bu bir cümlədir. [SEP] This is a sentence. [MASK]"
encoded_special = tokenizer.encode(special_text)
tokens_special = tokenizer.convert_ids_to_tokens(encoded_special)
print(f"\nSpecial Text Original: {special_text}")
print(f"Special Text Encoded IDs: {encoded_special}")
print(f"Special Text Tokens: {tokens_special}")
```

Example output:

```plaintext
Tokenizer loaded successfully from LocalDoc/az-en-unigram-tokenizer-50k!
Azerbaijani Original: Bu, Azərbaycan dilində bir test cümləsidir.
Azerbaijani Encoded IDs: [90, 4, 66, 2940, 30, 2248, 34485, 116, 5]
Azerbaijani Tokens: ['▁Bu', ',', '▁Azərbaycan', '▁dilində', '▁bir', '▁test', '▁cümləsi', 'dir', '.']

English Original: This is a test sentence in English.
English Encoded IDs: [283, 18, 14, 2248, 3841, 10, 2784, 5]
English Tokens: ['▁This', '▁is', '▁a', '▁test', '▁sentence', '▁in', '▁English', '.']

Special Text Original: [CLS] Bu bir cümlədir. [SEP] This is a sentence. [MASK]
Special Text Encoded IDs: [1, 90, 30, 10798, 116, 5, 15, 2, 283, 18, 14, 3841, 5, 15, 3]
Special Text Tokens: ['[CLS]', '▁Bu', '▁bir', '▁cümlə', 'dir', '.', '▁', '[SEP]', '▁This', '▁is', '▁a', '▁sentence', '.', '▁', '[MASK]']
```

# Tokenizer Training Procedure

The tokenizer was trained using the **SentencePiece** library with the following key parameters:

## Input Data

- A text file containing approximately **8.28 million sentences**, consisting of concatenated **Azerbaijani and English** texts.

## Model Configuration

- **Model Type:** `unigram`
- **Vocabulary Size:** `50000`
- **Character Coverage:** `0.9995`

## Special Tokens

The following special tokens were defined and included during training:

| Token | Description |
|-----------|----------------------------|
| `[UNK]` | Unknown token |
| `[PAD]` | Padding token |
| `[CLS]` | Beginning of sentence |
| `[SEP]` | End of sentence |
| `[MASK]` | Mask token (user-defined) |

### Token Definitions in SentencePiece:

```plaintext
unk_piece: [UNK]
pad_piece: [PAD]
bos_piece: [CLS]
eos_piece: [SEP]
user_defined_symbols: [MASK]
```

## Contact

For more information, questions, or issues, please contact LocalDoc at [v.resad.89@gmail.com](mailto:v.resad.89@gmail.com).
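## Reproducing the Training Configuration (Sketch)

The configuration described in the training procedure section maps onto the SentencePiece Python API roughly as follows. This is an illustrative sketch, not the exact script used to train the released tokenizer: the corpus path and model prefix are placeholders, and the `[PAD]` token at ID `50000` suggests padding may have been attached by the `transformers` wrapper rather than assigned by SentencePiece itself.

```python
import sentencepiece as spm

# Illustrative sketch only: input and model_prefix are placeholder names.
spm.SentencePieceTrainer.train(
    input="az_en_corpus.txt",           # ~8.28M concatenated Azerbaijani + English sentences
    model_prefix="az_en_unigram_50k",   # writes az_en_unigram_50k.model / .vocab
    model_type="unigram",
    vocab_size=50000,
    character_coverage=0.9995,
    unk_piece="[UNK]",                  # default unk_id = 0
    bos_piece="[CLS]",                  # default bos_id = 1
    eos_piece="[SEP]",                  # default eos_id = 2
    user_defined_symbols=["[MASK]"],    # first user-defined symbol -> ID 3
    pad_piece="[PAD]",                  # only embedded in the model if pad_id is also set;
                                        # otherwise [PAD] is added when wrapping the model as a
                                        # transformers tokenizer (consistent with ID 50000)
)
```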