Soro-TTS — Hausa 🇳🇬

Part of Soro-TTS, a multilingual text-to-speech system for Nigerian languages. This checkpoint is a fine-tune of facebook/mms-tts-hau on the google/WaxalNLP hau_tts subset.

Quick start

from transformers import VitsModel, AutoTokenizer
import torch
import scipy.io.wavfile

model = VitsModel.from_pretrained("Shinzmann/soro-tts-hau")
tokenizer = AutoTokenizer.from_pretrained("Shinzmann/soro-tts-hau")

text = "Sannu da zuwa Najeriya, ƙasarmu mai albarka."
inputs = tokenizer(text, return_tensors="pt")

# Inference only: no gradients needed
with torch.no_grad():
    waveform = model(**inputs).waveform[0].cpu().numpy()

# The model generates audio at model.config.sampling_rate (16 kHz)
scipy.io.wavfile.write("out.wav", rate=model.config.sampling_rate, data=waveform)
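The snippet above writes float32 samples directly, which some players handle poorly; a common alternative is to convert to 16-bit PCM first. A minimal sketch (plain NumPy, independent of the model, shown here with a synthetic tone):

```python
import numpy as np

def float_to_pcm16(waveform: np.ndarray) -> np.ndarray:
    """Clip a float waveform to [-1, 1] and scale to signed 16-bit PCM."""
    clipped = np.clip(waveform, -1.0, 1.0)
    return (clipped * 32767.0).astype(np.int16)

# Example input: a synthetic 1 kHz tone at 16 kHz (stands in for model output)
t = np.linspace(0.0, 0.1, 1600, endpoint=False)
tone = 0.5 * np.sin(2 * np.pi * 1000.0 * t)
pcm = float_to_pcm16(tone)
# pcm can now be passed to scipy.io.wavfile.write as int16 data
```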

Training data

Trained on the hau_tts configuration of WAXAL — studio-quality, phonetically balanced single-speaker recordings collected by Media Trust under Google Research's WAXAL initiative.

| Statistic | Value |
|---|---|
| Total audio | 13.12 hours |
| Training audio | 10.45 hours (1,572 clips) |
| Validation audio | 1.27 hours |
| Test audio | 1.39 hours |
| Speakers (train) | 8 |
| Words containing diacritics | 0.0% |
| Sample rate | 16 kHz |
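As a quick sanity check on the split sizes above, the training hours and clip count imply an average clip length of roughly 24 seconds:

```python
# Figures taken from the training-data table above
train_hours = 10.45
train_clips = 1572

avg_clip_seconds = train_hours * 3600 / train_clips
print(round(avg_clip_seconds, 1))  # ≈ 23.9 seconds per clip
```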

Architecture

VITS / MMS-TTS — a conditional VAE with adversarial training, a flow-based prior, and a HiFi-GAN-style decoder.

  • Parameters: ~83M
  • Sample rate: 16 kHz
  • Base model: facebook/mms-tts-hau (Pratap et al., 2023)

Training procedure

| Hyperparameter | Value |
|---|---|
| Epochs | 100 |
| Batch size | 128 |
| Learning rate | 2e-05 |
| Optimizer | AdamW (β₁=0.8, β₂=0.99) |
| Precision | bf16 |
| Loss weights | mel=35, kl=1.5, gen=1, fmaps=1, disc=3, duration=1 |
| Recipe | ylacombe/finetune-hf-vits |

Evaluation

Character Error Rate (CER) measured by transcribing synthesised audio with facebook/mms-1b-all ASR (target_lang=hau):

| Metric | n | Value |
|---|---|---|
| CER (ASR-based) | 20 | 24.51% |

This proxy metric measures intelligibility, not naturalness. Human MOS evaluation by native speakers is recommended for the latter.
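The CER above is a plain edit-distance computation between the reference text and the ASR transcript of the synthesised audio. A dependency-free sketch of that metric (the example strings are illustrative, not from the actual evaluation set):

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                  # deletion
                            curr[j - 1] + 1,              # insertion
                            prev[j - 1] + (ca != cb)))    # substitution
        prev = curr
    return prev[-1]

def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edit distance divided by reference length."""
    return levenshtein(reference, hypothesis) / max(len(reference), 1)

# Illustrative example: one deleted 'n', one substituted character
score = cer("sannu da zuwa", "sanu da zuma")  # 2 edits / 13 chars ≈ 0.154
```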

Limitations and biases

  • Single voice. WAXAL TTS is recorded by 1–2 professional voice actors per language. The model inherits that voice and accent.
  • Domain. Training text covers news, narration, and read speech; conversational, code-switched, or highly informal text may be out of distribution.
  • Tonal nuance. Hausa relies on tone marks for meaning. Inputs without proper diacritics will produce flat or incorrect prosody.
  • Non-commercial. MMS-TTS base is CC BY-NC 4.0; this fine-tune inherits that license.
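Given the tonal-nuance caveat above (and the 0.0% diacritics figure in the training-data table), it can be useful to check whether input text actually carries combining tone marks before synthesis. A stdlib-only sketch; note that Hausa hooked consonants (ɓ, ɗ, ƙ) are base letters, not combining marks, so this detects tone diacritics only:

```python
import unicodedata

def has_combining_marks(text: str) -> bool:
    """True if the text contains combining diacritics (e.g. tone marks)."""
    return any(unicodedata.combining(ch)
               for ch in unicodedata.normalize("NFD", text))

has_combining_marks("Sannu")   # False: no tone marks
has_combining_marks("Sànnu")   # True: grave tone mark on 'a'
```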

License

CC BY-NC 4.0 (inherited from facebook/mms-tts-hau). The WAXAL dataset itself is CC BY 4.0. This model is intended for research use only and may not be used commercially.

Citation

@misc{soro_tts_hau_2026,
  title  = {{Soro-TTS: A Multilingual Text-to-Speech System for Nigerian Languages — Hausa}},
  author = {Soro-TTS authors},
  year   = {2026},
  url    = {https://huggingface.co/Shinzmann/soro-tts-hau}
}
@article{pratap2023mms,
  title   = {Scaling Speech Technology to 1,000+ Languages},
  author  = {Pratap, Vineel and Tjandra, Andros and Shi, Bowen and others},
  journal = {arXiv preprint arXiv:2305.13516},
  year    = {2023}
}

Acknowledgements

  • Google Research and Media Trust for releasing WAXAL
  • Meta AI for the MMS base models
  • Yoach Lacombe for finetune-hf-vits
