Custom Bangla Tokenizer
A specialized Bangla/Bengali tokenizer extracted from a multilingual model and extended with characters missing from the base vocabulary.
Features
- Focused on Bangla text tokenization
- Adds support for characters such as ঢ় that the base vocabulary does not cover
- Reduced vocabulary size: 21,607 tokens
- Compatible with Hugging Face Transformers
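One reason a character like ঢ় can be mishandled is that it is a precomposed code point (U+09DD) whose canonical decomposition is ঢ (U+09A2) plus the nukta sign (U+09BC); a vocabulary that only contains one of the two forms will map the other to an unknown token. This is a general Unicode observation, not a statement about this tokenizer's internals; a minimal sketch using the standard library:

```python
import unicodedata

# ঢ় as a single precomposed code point (U+09DD)
ch = "\u09dd"

# NFD splits it into the base letter ঢ (U+09A2) + nukta (U+09BC)
decomposed = unicodedata.normalize("NFD", ch)

print(len(ch), len(decomposed))            # 1 2
print([hex(ord(c)) for c in decomposed])   # ['0x9a2', '0x9bc']
```

Whether text is normalized to NFC or NFD before tokenization therefore affects which vocabulary entries are needed to cover such characters.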
Usage
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("yasserius/bangla-tokenizer")
# Tokenize Bangla text
text = "আমি বাংলায় কথা বলি"
tokens = tokenizer.tokenize(text)
print(tokens)
Model Details
- Base model: Extracted from google/muril-base-cased
- Language: Bengali/Bangla
- Vocabulary size: 21,607
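The details above describe a vocabulary reduced from google/muril-base-cased. The actual extraction script is not published here, but the idea can be sketched as filtering a multilingual WordPiece vocabulary down to tokens written in the Bengali Unicode block (U+0980–U+09FF), while preserving special tokens. All names and the keep/drop policy below are illustrative assumptions, not the real extraction code:

```python
# Hypothetical sketch of reducing a multilingual WordPiece vocabulary
# to Bangla. Assumption: keep special tokens, Bengali-script tokens,
# and plain-ASCII tokens; drop everything else.

BENGALI = range(0x0980, 0x0A00)          # Bengali Unicode block
SPECIAL = {"[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"}

def keep(token: str) -> bool:
    """Decide whether a vocabulary entry survives the reduction."""
    if token in SPECIAL:
        return True
    # Strip the WordPiece continuation prefix before inspecting characters
    core = token[2:] if token.startswith("##") else token
    return all(ord(c) in BENGALI or c.isascii() for c in core)

# Toy stand-in for the full multilingual vocabulary
full_vocab = ["[CLS]", "আমি", "##লা", "the", "汉", "!"]
reduced = [t for t in full_vocab if keep(t)]
print(reduced)  # ['[CLS]', 'আমি', '##লা', 'the', '!']
```

A filter along these lines is how a ~197k-token multilingual vocabulary can shrink to the 21,607 tokens reported above; the exact keep/drop rules used for this model may differ.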