---
language:
- de
license: mit
tags:
- text-to-speech
- tts
- german
- audio
- fastpitch
- hifigan
library_name: nemo
pipeline_tag: text-to-speech
---
# CaroTTS-60M-DE-Karlsson 🥕
A fast, lightweight German text-to-speech model based on the FastPitch + HiFi-GAN architecture.
The full training and export code is available in the [CaroTTS GitHub repository](https://github.com/TassiloHo/CaroTTS).
## Model Description
This model provides high-quality German text-to-speech synthesis using a non-autoregressive architecture optimized for fast inference on CPUs and mobile devices. It consists of two components and ships with a single voice:
- **FastPitch**: Duration predictor and mel-spectrogram generator (~46M parameters)
- **HiFi-GAN**: Neural vocoder that converts spectrograms to audio (~14M parameters)
- **Voice**: Single Speaker, Male Voice, 'Karlsson'
**Total Parameters**: ~60M
### Key Features
- 🚀 **Fast Inference**: Non-autoregressive architecture enables real-time synthesis
- 💻 **CPU-Friendly**: Optimized for deployment on resource-constrained devices
- 🎯 **High Quality**: Natural-sounding German speech
- 📦 **Multiple Formats**: Available in NeMo, ONNX and PT2 (PyTorch Inductor) formats
## Try It Out
👉 **[Interactive Demo on HuggingFace Spaces](https://huggingface.co/spaces/Warholt/CaroTTS-DE)** (uses PT2 format with Zero GPU)
## Model Files
- `Karlsson_fastpitch.nemo` - FastPitch model in NeMo format
- `Karlsson_hifigan.nemo` - HiFi-GAN vocoder in NeMo format
- `Karlsson_fastpitch.onnx` - FastPitch model in ONNX format
- `Karlsson_hifigan.onnx` - HiFi-GAN vocoder in ONNX format
- `Karlsson_fastpitch_encoder.pt2` - FastPitch-Encoder compiled with PyTorch Inductor (for CUDA/Zero GPU)
- `Karlsson_fastpitch_decoder.pt2` - FastPitch-Decoder compiled with PyTorch Inductor (for CUDA/Zero GPU)
- `Karlsson_hifigan.pt2` - HiFi-GAN compiled with PyTorch Inductor (for CUDA/Zero GPU)
**Note**: The PT2 files were exported and compiled on Zero GPU hardware and may need to be re-exported for use on other hardware. See the [CaroTTS GitHub repository](https://github.com/TassiloHo/CaroTTS) for the export code.
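The files listed above can be fetched programmatically. The following is a minimal sketch, assuming the `huggingface_hub` package is installed. `files_for_format` and `download_models` are hypothetical helper names, and `repo_id` is left as a required parameter because this README does not state the repository ID.

```python
from pathlib import Path

# File names as listed in the "Model Files" section.
MODEL_FILES = {
    "nemo": ["Karlsson_fastpitch.nemo", "Karlsson_hifigan.nemo"],
    "onnx": ["Karlsson_fastpitch.onnx", "Karlsson_hifigan.onnx"],
    "pt2": [
        "Karlsson_fastpitch_encoder.pt2",
        "Karlsson_fastpitch_decoder.pt2",
        "Karlsson_hifigan.pt2",
    ],
}


def files_for_format(fmt: str) -> list[str]:
    """Return the file names needed for one export format (hypothetical helper)."""
    key = fmt.lower()
    if key not in MODEL_FILES:
        raise ValueError(f"unknown format {fmt!r}; expected one of {sorted(MODEL_FILES)}")
    return list(MODEL_FILES[key])


def download_models(repo_id: str, fmt: str = "onnx") -> list[Path]:
    """Download the files for `fmt` from the Hub and return their local paths.

    `repo_id` is the Hub ID of this model in "<user>/<model-name>" form.
    """
    # Imported lazily so the helpers above work without huggingface_hub.
    from huggingface_hub import hf_hub_download

    return [
        Path(hf_hub_download(repo_id=repo_id, filename=name))
        for name in files_for_format(fmt)
    ]
```

Calling `download_models("<user>/<model-name>", "onnx")` with the real repository ID returns the cached local paths of the two ONNX files, which can be passed to `ort.InferenceSession`.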
## Usage
### ONNX Inference
ONNX provides the best compatibility and performance for CPU deployment:
```python
import unicodedata

import numpy as np
import onnxruntime as ort
import soundfile as sf


# Tokenization functions
def normalize_unicode_text(text: str) -> str:
    """Normalize text to Unicode NFC form."""
    if not unicodedata.is_normalized("NFC", text):
        text = unicodedata.normalize("NFC", text)
    return text


def any_locale_text_preprocessing(text: str) -> str:
    """NFC-normalize text and map the right single quotation mark to an apostrophe."""
    res = []
    for c in normalize_unicode_text(text):
        if c in ["\u2019"]:  # typographic apostrophe, as handled in NeMo's preprocessing
            res.append("'")
        else:
            res.append(c)
    return "".join(res)


def tokenize_german(text: str, punct: bool = True, apostrophe: bool = True,
                    pad_with_space: bool = True) -> list[int]:
    """Tokenize German text into a list of integer token IDs."""
    _CHARSET_STR = "ABCDEFGHIJKLMNOPQRSTUVWXYZÄÖÜẞabcdefghijklmnopqrstuvwxyzäöüß"
    # \u2018 and \u201c are the typographic opening quotes; the list is in
    # code-point order, and every entry must be unique to keep token IDs stable.
    _PUNCT_LIST = [
        "!", '"', "(", ")", ",", "-", ".", "/", ":", ";", "?", "[", "]",
        "{", "}", "«", "»", "‒", "–", "—", "\u2018", "‚", "\u201c", "„", "‹", "›",
    ]
    tokens = [" "]  # Space at index 0
    tokens.extend(_CHARSET_STR)
    if apostrophe:
        tokens.append("'")
    if punct:
        tokens.extend(_PUNCT_LIST)
    # Special tokens (padding, blank, out-of-vocabulary) take the last three IDs
    tokens.extend(["<pad>", "<blank>", "<oov>"])
    token2id = {token: i for i, token in enumerate(tokens)}
    space = " "
    text = any_locale_text_preprocessing(text)

    # Encode: keep single spaces, known letters and apostrophes, and punctuation
    cs = []
    tokens_set = set(tokens)
    for c in text:
        if ((c == space and len(cs) > 0 and cs[-1] != space) or
                ((c.isalnum() or c == "'") and c in tokens_set) or
                (c in _PUNCT_LIST and punct)):
            cs.append(c)

    # Remove trailing spaces
    while cs and cs[-1] == space:
        cs.pop()

    if pad_with_space:
        cs = [space] + cs + [space]
    return [token2id[p] for p in cs]


# Load ONNX models
fastpitch_session = ort.InferenceSession("Karlsson_fastpitch.onnx")
hifigan_session = ort.InferenceSession("Karlsson_hifigan.onnx")

# Prepare text
text = "Hallo, ich bin CaroTTS, ein deutsches Text-zu-Sprache-System."
tokens = tokenize_german(text)

# Prepare inputs: one pace and one pitch value per token
paces = np.ones(len(tokens), dtype=np.float32)
pitches = np.zeros(len(tokens), dtype=np.float32)
inputs = {
    "text": np.array([tokens], dtype=np.int64),
    "pace": np.array([paces], dtype=np.float32),
    "pitch": np.array([pitches], dtype=np.float32),
}

# Generate spectrogram
spec = fastpitch_session.run(None, inputs)[0]

# Generate audio
audio = hifigan_session.run(None, {"spec": spec})[0]

# Save audio (44.1kHz sample rate)
sf.write("output.wav", audio.squeeze(), 44100)
```
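The example above feeds a pace of 1.0 and a pitch offset of 0.0 for every token. The sketch below wraps that input construction in a hypothetical helper, `build_fastpitch_inputs`, so both values can be changed uniformly. In NeMo's FastPitch, a pace above 1.0 shortens the predicted durations (faster speech); that the exported ONNX graph behaves the same way, and the useful range of the pitch offset, are assumptions to verify by listening.

```python
import numpy as np


def build_fastpitch_inputs(tokens: list[int], pace: float = 1.0,
                           pitch_shift: float = 0.0) -> dict[str, np.ndarray]:
    """Build the FastPitch ONNX input dict with one pace/pitch value per token.

    Hypothetical helper; the input names match the ONNX example above.
    """
    if pace <= 0:
        raise ValueError("pace must be positive")
    n = len(tokens)
    return {
        "text": np.asarray([tokens], dtype=np.int64),
        # Same pace for every token; values other than 1.0 are an assumption.
        "pace": np.full((1, n), pace, dtype=np.float32),
        # Same pitch offset for every token; 0.0 leaves the pitch unchanged.
        "pitch": np.full((1, n), pitch_shift, dtype=np.float32),
    }
```

With the sessions from the example above, `spec = fastpitch_session.run(None, build_fastpitch_inputs(tokens, pace=0.9))[0]` would request slightly slower speech under these assumptions.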
### NeMo Inference
If you have NeMo installed (`pip install nemo-toolkit[tts]`) and want to work with the original `.nemo` checkpoints:
```python
import torch
import soundfile as sf
from nemo.collections.tts.models.fastpitch import FastPitchModel
from nemo.collections.tts.models.hifigan import HifiGanModel

# Load models
device = "cuda" if torch.cuda.is_available() else "cpu"
fastpitch = FastPitchModel.restore_from("Karlsson_fastpitch.nemo", map_location=device).eval()
hifigan = HifiGanModel.restore_from("Karlsson_hifigan.nemo", map_location=device).eval()

# Prepare text
text = "Guten Tag. Herzlich Willkommen zu dieser Demonstration."

with torch.inference_mode():
    # Parse and generate
    parsed_text = fastpitch.parse(text)
    spec = fastpitch.generate_spectrogram(tokens=parsed_text)
    audio = hifigan.convert_spectrogram_to_audio(spec=spec)

# Save audio (44.1kHz sample rate)
sf.write("output.wav", audio.squeeze().cpu().numpy(), 44100)
```
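The real-time claim under Key Features can be checked on your own hardware by measuring the real-time factor (RTF): synthesis time divided by the duration of the generated audio, where values below 1.0 mean faster than real time. This is a minimal sketch; `real_time_factor` is a hypothetical helper, and `synthesize` stands for any wrapper you write around the ONNX or NeMo pipeline that returns a flat sequence of audio samples.

```python
import time
from typing import Callable, Sequence


def real_time_factor(synthesize: Callable[[str], Sequence[float]], text: str,
                     sample_rate: int = 44100, runs: int = 3) -> float:
    """Return the best-of-`runs` synthesis time divided by the audio duration."""
    if runs < 1:
        raise ValueError("runs must be at least 1")
    best = float("inf")
    num_samples = 0
    for _ in range(runs):
        start = time.perf_counter()
        audio = synthesize(text)
        best = min(best, time.perf_counter() - start)
        num_samples = len(audio)
    if num_samples == 0:
        raise ValueError("synthesize() returned no audio")
    # Audio duration in seconds is the sample count over the sample rate.
    return best / (num_samples / sample_rate)
```

Taking the best of several runs reduces the influence of one-time costs such as model warm-up. The wrapper should return a one-dimensional array, for example `audio.squeeze()` in the ONNX example, so that `len()` counts samples.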
## Citation
```bibtex
@misc{carotts2024,
title={CaroTTS: Fast Lightweight German Text-to-Speech},
author={Holtzwart, Tassilo},
year={2024},
url={https://github.com/TassiloHo/CaroTTS}
}
```
## Acknowledgments
- Built with [NVIDIA NeMo](https://github.com/NVIDIA/NeMo)
- FastPitch architecture: [arXiv:2006.06873](https://arxiv.org/abs/2006.06873)
- HiFi-GAN: [arXiv:2010.05646](https://arxiv.org/abs/2010.05646)
## License
MIT License
---
For more information, visit the [CaroTTS GitHub repository](https://github.com/TassiloHo/CaroTTS).