# Soro-TTS — Hausa 🇳🇬

Part of Soro-TTS, a multilingual text-to-speech system for Nigerian languages.
This checkpoint is a fine-tune of `facebook/mms-tts-hau` on the `hau_tts` subset of `google/WaxalNLP`.
## Languages in the Soro-TTS suite
| Language | Model |
|---|---|
| Hausa | Shinzmann/soro-tts-hau |
| Igbo | Shinzmann/soro-tts-ibo |
| Yoruba | Shinzmann/soro-tts-yor |
## Quick start

```python
from transformers import VitsModel, AutoTokenizer
import torch
import scipy.io.wavfile

model = VitsModel.from_pretrained("Shinzmann/soro-tts-hau")
tokenizer = AutoTokenizer.from_pretrained("Shinzmann/soro-tts-hau")

text = "Sannu da zuwa Najeriya, ƙasarmu mai albarka."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    waveform = model(**inputs).waveform[0].numpy()

scipy.io.wavfile.write("out.wav", rate=model.config.sampling_rate, data=waveform)
```
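The model outputs a float waveform; `scipy.io.wavfile.write` will store it as a float32 WAV, which some players handle inconsistently. A minimal, optional convenience (not part of the model API) is to clip and rescale to 16-bit integer PCM before writing:

```python
import numpy as np

def to_pcm16(waveform: np.ndarray) -> np.ndarray:
    """Clip a float waveform to [-1, 1] and rescale it to 16-bit integer PCM."""
    return (np.clip(waveform, -1.0, 1.0) * 32767).astype(np.int16)

# Demonstrated here on a short synthetic signal in place of model output:
pcm = to_pcm16(np.array([0.0, 0.5, -1.2], dtype=np.float32))
```

Pass `pcm` as the `data` argument to `scipy.io.wavfile.write` for a standard 16-bit WAV.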
## Training data
Trained on the hau_tts configuration of WAXAL — studio-quality, phonetically balanced single-speaker recordings collected by Media Trust under Google Research's WAXAL initiative.
| Statistic | Value |
|---|---|
| Total audio | 13.12 hours |
| Training audio | 10.45 hours (1572 clips) |
| Validation audio | 1.27 hours |
| Test audio | 1.39 hours |
| Speakers (train) | 8 |
| % words containing diacritics | 0.0% |
| Sample rate | 16 kHz |
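As a quick sanity check, the split durations in the table can be cross-checked against the reported total, and an average clip length derived from the training split (the only split with a stated clip count):

```python
# Figures taken directly from the statistics table above.
train_h, val_h, test_h = 10.45, 1.27, 1.39
total_h = 13.12

# The three splits should sum to (approximately) the reported total.
assert abs((train_h + val_h + test_h) - total_h) < 0.02

# Average training clip length, given 1,572 training clips.
avg_clip_s = train_h * 3600 / 1572
```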
## Architecture
VITS / MMS-TTS — a conditional VAE with adversarial training, a flow-based prior, and a HiFi-GAN-style decoder.
- Parameters: ~83M
- Sample rate: 16 kHz
- Base model: facebook/mms-tts-hau (Pratap et al., 2023)
## Training procedure
| Hyperparameter | Value |
|---|---|
| Epochs | 100 |
| Batch size | 128 |
| Learning rate | 2e-05 |
| Optimizer | AdamW (β₁=0.8, β₂=0.99) |
| Precision | bf16 |
| Loss weights | mel=35, kl=1.5, gen=1, fmaps=1, disc=3, duration=1 |
| Recipe | ylacombe/finetune-hf-vits |
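The loss weights above scale the individual VITS training objectives before they are summed. The sketch below illustrates that weighting with made-up per-step loss values; it is not the actual training code (see the `ylacombe/finetune-hf-vits` recipe for that), and in adversarial training the discriminator loss is typically optimised in a separate step:

```python
# Weights taken from the hyperparameter table above.
WEIGHTS = {"mel": 35.0, "kl": 1.5, "gen": 1.0, "fmaps": 1.0, "disc": 3.0, "duration": 1.0}

def weighted_loss(raw_losses: dict, weights: dict = WEIGHTS) -> float:
    """Combine individual loss terms into a single weighted objective."""
    return sum(weights[name] * value for name, value in raw_losses.items())

# Hypothetical generator-side loss values for one step
# (the discriminator loss is handled in its own optimisation step):
step_losses = {"mel": 0.4, "kl": 0.9, "gen": 1.1, "fmaps": 0.7, "duration": 1.3}
total = weighted_loss(step_losses)
```

The large `mel` weight (35) means spectrogram reconstruction dominates the objective, which is typical for VITS-style fine-tuning.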
## Evaluation

Character Error Rate (CER) measured by transcribing synthesised audio with `facebook/mms-1b-all` ASR (`target_lang=hau`):
| Metric | n | Value |
|---|---|---|
| CER (ASR-based) | 20 | 24.51% |
This proxy metric measures intelligibility, not naturalness. Human MOS evaluation by native speakers is recommended for the latter.
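ASR-based CER is the character-level edit distance between the reference text and the ASR transcript, divided by the reference length. A minimal self-contained sketch of that computation (not the exact evaluation script used for the table above):

```python
def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance via two-row dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,                # deletion
                curr[j - 1] + 1,            # insertion
                prev[j - 1] + (ca != cb),   # substitution (free if chars match)
            ))
        prev = curr
    return prev[-1]

def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate: edit operations per reference character."""
    return levenshtein(reference, hypothesis) / len(reference)

# Toy example: one deleted 'n' and one inserted 'a' give 2 edits over 13 chars.
score = cer("sannu da zuwa", "sanu da zuwaa")
```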
## Limitations and biases
- Single voice. WAXAL TTS is recorded by 1–2 professional voice actors per language. The model inherits that voice and accent.
- Domain. Training text covers news, narration, and read speech; conversational, code-switched, or highly informal text may be out of distribution.
- Tonal nuance. Hausa relies on tone marks for meaning. Inputs without proper diacritics will produce flat or incorrect prosody.
- Non-commercial. MMS-TTS base is CC BY-NC 4.0; this fine-tune inherits that license.
## License
CC BY-NC 4.0 (inherited from facebook/mms-tts-hau). The WAXAL data itself is CC-BY-4.0.
This model is for research only and may not be used commercially.
## Citation

```bibtex
@misc{soro_tts_hau_2026,
  title  = {{Soro-TTS: A Multilingual Text-to-Speech System for Nigerian Languages — Hausa}},
  author = {{Soro-TTS authors}},
  year   = {2026},
  url    = {https://huggingface.co/Shinzmann/soro-tts-hau},
}

@article{pratap2023mms,
  title   = {Scaling Speech Technology to 1{,}000+ Languages},
  author  = {Pratap, Vineel and Tjandra, Andros and Shi, Bowen and others},
  journal = {arXiv preprint arXiv:2305.13516},
  year    = {2023}
}
```
## Acknowledgements

- Google Research and Media Trust for releasing WAXAL
- Meta AI for the MMS base models
- Yoach Lacombe for finetune-hf-vits