---
language:
- en
- az
license: apache-2.0
tags:
- sentencepiece
- unigram
- tokenizer
- azerbaijani
- english
- bilingual
---

# Bilingual Azerbaijani-English Unigram Tokenizer (az-en-unigram-tokenizer-50k)

This repository contains a SentencePiece Unigram tokenizer trained on a bilingual corpus of Azerbaijani and English text. It is designed for tasks involving both languages, such as training bilingual sentence embeddings, machine translation, or cross-lingual information retrieval.

## Tokenizer Details

* **Type:** SentencePiece Unigram
* **Languages:** Azerbaijani (az), English (en)
* **Vocabulary Size:** Approximately 50,000 (the actual size may be slightly larger due to special tokens, e.g., 50,001).
* **Training Data:** A parallel corpus of ~4.14 million sentence pairs (~8.28 million sentences in total), balanced between Azerbaijani and English.
* **Normalization:** NFKC Unicode normalization (the SentencePiece default).
* **Character Coverage:** 0.9995, ensuring good coverage of Azerbaijani-specific characters (ç, ö, ə, ü, ğ, ş).

### Special Tokens

The tokenizer includes the following special tokens, which are standard for many transformer-based models:

* `[UNK]` (Unknown Token): ID `0`
* `[CLS]` (Classification Token / Start of Sequence): ID `1`
* `[SEP]` (Separator Token / End of Sequence): ID `2`
* `[MASK]` (Mask Token): ID `3`
* `[PAD]` (Padding Token): ID `50000`

Note: the `[PAD]` ID depends on how it was assigned during training (the training output reports it as, e.g., `PAD token: '[PAD]' (ID: XXXXX)`), so verify it against the released files if your use case relies on it.
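Since the `[PAD]` ID depends on the training output, a quick sanity check against the published files confirms the IDs listed above. This is a minimal sketch using the same `transformers` loading shown in the usage section below; the printed values reflect whatever is stored in the uploaded tokenizer:

```python
from transformers import AutoTokenizer

# Load the published tokenizer and print the ID stored for each special token.
tokenizer = AutoTokenizer.from_pretrained("LocalDoc/az-en-unigram-tokenizer-50k")

for token in ["[UNK]", "[CLS]", "[SEP]", "[MASK]", "[PAD]"]:
    print(f"{token}: {tokenizer.convert_tokens_to_ids(token)}")

# pad_token_id and the vocabulary size should be consistent with the list above.
print("pad_token_id:", tokenizer.pad_token_id)
print("vocab size:", len(tokenizer))
```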
## Intended Use

This tokenizer is intended to be used with transformer models for tasks that require processing both Azerbaijani and English text. It can be particularly useful for:

* Initializing the tokenizer for new bilingual (Azerbaijani-English) sentence transformer models.
* Fine-tuning multilingual models on Azerbaijani-English data.
* Pre-training new models from scratch on Azerbaijani and English text.

## How to Use

You can use this tokenizer directly with the `transformers` library:

```python
from transformers import AutoTokenizer

tokenizer_id = "LocalDoc/az-en-unigram-tokenizer-50k"

try:
    tokenizer = AutoTokenizer.from_pretrained(tokenizer_id)
    print(f"Tokenizer loaded successfully from {tokenizer_id}!")
except Exception as e:
    print("Failed to load tokenizer. Make sure 'sentencepiece_model_pb2.py' is available "
          "or that 'protobuf' is installed if the tokenizer loading mechanism requires it.")
    print(f"Error: {e}")

# As a fallback in minimal environments, you may need to install protobuf and fetch
# sentencepiece_model_pb2.py manually, then retry loading:
# !pip install protobuf
# !wget https://raw.githubusercontent.com/google/sentencepiece/master/python/src/sentencepiece/sentencepiece_model_pb2.py
# tokenizer = AutoTokenizer.from_pretrained(tokenizer_id)

# Example Azerbaijani text
az_text = "Bu, Azərbaycan dilində bir test cümləsidir."
encoded_az = tokenizer.encode(az_text)
tokens_az = tokenizer.convert_ids_to_tokens(encoded_az)
print(f"Azerbaijani Original: {az_text}")
print(f"Azerbaijani Encoded IDs: {encoded_az}")
print(f"Azerbaijani Tokens: {tokens_az}")

# Example English text
en_text = "This is a test sentence in English."
encoded_en = tokenizer.encode(en_text)
tokens_en = tokenizer.convert_ids_to_tokens(encoded_en)
print(f"\nEnglish Original: {en_text}")
print(f"English Encoded IDs: {encoded_en}")
print(f"English Tokens: {tokens_en}")

# Example with special tokens
special_text = "[CLS] Bu bir cümlədir. [SEP] This is a sentence. [MASK]"
encoded_special = tokenizer.encode(special_text)
tokens_special = tokenizer.convert_ids_to_tokens(encoded_special)
print(f"\nSpecial Text Original: {special_text}")
print(f"Special Text Encoded IDs: {encoded_special}")
print(f"Special Text Tokens: {tokens_special}")
```

Example output:

```plaintext
Tokenizer loaded successfully from LocalDoc/az-en-unigram-tokenizer-50k!
Azerbaijani Original: Bu, Azərbaycan dilində bir test cümləsidir.
Azerbaijani Encoded IDs: [90, 4, 66, 2940, 30, 2248, 34485, 116, 5]
Azerbaijani Tokens: ['▁Bu', ',', '▁Azərbaycan', '▁dilində', '▁bir', '▁test', '▁cümləsi', 'dir', '.']

English Original: This is a test sentence in English.
English Encoded IDs: [283, 18, 14, 2248, 3841, 10, 2784, 5]
English Tokens: ['▁This', '▁is', '▁a', '▁test', '▁sentence', '▁in', '▁English', '.']

Special Text Original: [CLS] Bu bir cümlədir. [SEP] This is a sentence. [MASK]
Special Text Encoded IDs: [1, 90, 30, 10798, 116, 5, 15, 2, 283, 18, 14, 3841, 5, 15, 3]
Special Text Tokens: ['[CLS]', '▁Bu', '▁bir', '▁cümlə', 'dir', '.', '▁', '[SEP]', '▁This', '▁is', '▁a', '▁sentence', '.', '▁', '[MASK]']
```

# Tokenizer Training Procedure

The tokenizer was trained using the **SentencePiece** library with the following key parameters:

## Input Data

- A text file containing approximately **8.28 million sentences**, consisting of concatenated **Azerbaijani and English** texts.

## Model Configuration

- **Model Type:** `unigram`
- **Vocabulary Size:** `50000`
- **Character Coverage:** `0.9995`

## Special Tokens

The following special tokens were defined and included during training:

| Token | Description |
|-----------|----------------------------|
| `[UNK]` | Unknown token |
| `[PAD]` | Padding token |
| `[CLS]` | Beginning of sentence |
| `[SEP]` | End of sentence |
| `[MASK]` | Mask token (user-defined) |

### Token Definitions in SentencePiece:

```plaintext
unk_piece: [UNK]
pad_piece: [PAD]
bos_piece: [CLS]
eos_piece: [SEP]
user_defined_symbols: [MASK]
```

## Contact

For more information, questions, or issues, please contact LocalDoc at [v.resad.89@gmail.com](mailto:v.resad.89@gmail.com).
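## Reproducing the Training Configuration (Sketch)

The configuration described in the training procedure section maps onto the SentencePiece Python API roughly as follows. This is an illustrative sketch, not the exact script used to train the released tokenizer: the corpus path and model prefix are placeholders, and the `[PAD]` token at ID `50000` suggests padding may have been attached by the `transformers` wrapper rather than assigned by SentencePiece itself.

```python
import sentencepiece as spm

# Illustrative sketch only: input and model_prefix are placeholder names.
spm.SentencePieceTrainer.train(
    input="az_en_corpus.txt",           # ~8.28M concatenated Azerbaijani + English sentences
    model_prefix="az_en_unigram_50k",   # writes az_en_unigram_50k.model / .vocab
    model_type="unigram",
    vocab_size=50000,
    character_coverage=0.9995,
    unk_piece="[UNK]",                  # default unk_id = 0
    bos_piece="[CLS]",                  # default bos_id = 1
    eos_piece="[SEP]",                  # default eos_id = 2
    user_defined_symbols=["[MASK]"],    # first user-defined symbol -> ID 3
    pad_piece="[PAD]",                  # only embedded in the model if pad_id is also set;
                                        # otherwise [PAD] is added when wrapping the model as a
                                        # transformers tokenizer (consistent with ID 50000)
)
```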