---
language:
- de
license: mit
tags:
- text-to-speech
- tts
- german
- audio
- fastpitch
- hifigan
library_name: nemo
pipeline_tag: text-to-speech
---
# CaroTTS-60M-DE-Karlsson 🥕

<p align="center">
<img src="/static-proxy?url=https%3A%2F%2Fcdn-uploads.huggingface.co%2Fproduction%2Fuploads%2F652a6ea890e317f2436fb7e1%2FxqaeRYV93A-NwfMv5eEft.png" width="250" alt="image" />
  
</p>


A fast, lightweight German text-to-speech model based on the FastPitch + HiFi-GAN architecture.
The full training and export code is available in the [CaroTTS GitHub repository](https://github.com/TassiloHo/CaroTTS).

## Model Description

This model provides high-quality German text-to-speech synthesis using a non-autoregressive architecture optimized for fast inference on CPUs and mobile devices. The model consists of two components:

- **FastPitch**: Duration predictor and mel-spectrogram generator (~46M parameters)
- **HiFi-GAN**: Neural vocoder that converts spectrograms to audio (~14M parameters)

**Voice**: Single speaker, male voice ("Karlsson")

**Total Parameters**: ~60M

### Key Features

- 🚀 **Fast Inference**: Non-autoregressive architecture enables real-time synthesis
- 💻 **CPU-Friendly**: Optimized for deployment on resource-constrained devices
- 🎯 **High Quality**: Natural-sounding German speech
- 📦 **Multiple Formats**: Available in NeMo, ONNX and PT2 (PyTorch Inductor) formats

## Try It Out

👉 **[Interactive Demo on Hugging Face Spaces](https://huggingface.co/spaces/Warholt/CaroTTS-DE)** (uses the PT2 format with ZeroGPU)

## Model Files
- `Karlsson_fastpitch.nemo` - FastPitch model in NeMo format
- `Karlsson_hifigan.nemo` - HiFi-GAN vocoder in NeMo format
- `Karlsson_fastpitch.onnx` - FastPitch model in ONNX format
- `Karlsson_hifigan.onnx` - HiFi-GAN vocoder in ONNX format
- `Karlsson_fastpitch_encoder.pt2` - FastPitch encoder compiled with PyTorch Inductor (for CUDA/ZeroGPU)
- `Karlsson_fastpitch_decoder.pt2` - FastPitch decoder compiled with PyTorch Inductor (for CUDA/ZeroGPU)
- `Karlsson_hifigan.pt2` - HiFi-GAN compiled with PyTorch Inductor (for CUDA/ZeroGPU)

**Note**: The PT2 files were exported and compiled on ZeroGPU hardware and may need to be re-exported for use on other hardware. Visit the [CaroTTS GitHub repository](https://github.com/TassiloHo/CaroTTS) for the export code.
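
The files listed above can also be fetched programmatically. The following is a minimal sketch that assumes the `huggingface_hub` package is installed. `REPO_ID` is a placeholder rather than the confirmed repository ID, and `download_files` is a hypothetical helper written for this example.

```python
from pathlib import Path

# Placeholder (assumption): replace with the repository ID shown on this model page.
REPO_ID = "your-namespace/CaroTTS-60M-DE-Karlsson"

# File names as listed in the Model Files section.
ONNX_FILES = ["Karlsson_fastpitch.onnx", "Karlsson_hifigan.onnx"]


def download_files(repo_id: str, filenames: list[str]) -> dict[str, Path]:
    """Download each file into the local Hugging Face cache and return its path."""
    # Imported here so that defining the helper does not require huggingface_hub.
    from huggingface_hub import hf_hub_download

    return {
        name: Path(hf_hub_download(repo_id=repo_id, filename=name))
        for name in filenames
    }


# Usage (requires network access):
# paths = download_files(REPO_ID, ONNX_FILES)
# print(paths["Karlsson_fastpitch.onnx"])
```

The returned paths point into the local cache and can be passed directly to `ort.InferenceSession` in place of the bare file names.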

## Usage

### ONNX Inference

ONNX provides the best compatibility and performance for CPU deployment:

```python
import numpy as np
import onnxruntime as ort
import soundfile as sf

# Tokenization function
def normalize_unicode_text(text: str) -> str:
    import unicodedata
    if not unicodedata.is_normalized("NFC", text):
        text = unicodedata.normalize("NFC", text)
    return text

def any_locale_text_preprocessing(text: str) -> str:
    res = []
    for c in normalize_unicode_text(text):
        if c == "\u2019":  # map the typographic apostrophe (U+2019) to ASCII
            res.append("'")
        else:
            res.append(c)
    return "".join(res)

def tokenize_german(text: str, punct: bool = True, apostrophe: bool = True,
                   pad_with_space: bool = True) -> list[int]:
    """Tokenize German text into a list of integer token IDs."""

    _CHARSET_STR = "ABCDEFGHIJKLMNOPQRSTUVWXYZÄÖÜẞabcdefghijklmnopqrstuvwxyzäöüß"
    # "\u2018" and "\u201c" are the German closing quotation marks (U+2018, U+201C).
    _PUNCT_LIST = [
        "!", '"', "(", ")", ",", "-", ".", "/", ":", ";", "?", "[", "]",
        "{", "}", "«", "»", "‒", "–", "—", "\u2018", "‚", "\u201c", "„", "‹", "›",
    ]

    tokens = [" "]  # Space at index 0
    tokens.extend(_CHARSET_STR)
    if apostrophe:
        tokens.append("'")
    if punct:
        tokens.extend(_PUNCT_LIST)
    tokens.extend(["<pad>", "<blank>", "<oov>"])

    token2id = {token: i for i, token in enumerate(tokens)}
    space = " "

    text = any_locale_text_preprocessing(text)

    # Encode
    cs = []
    tokens_set = set(tokens)

    for c in text:
        if ((c == space and len(cs) > 0 and cs[-1] != space) or
            ((c.isalnum() or c == "'") and c in tokens_set) or
            (c in _PUNCT_LIST and punct)):
            cs.append(c)

    if cs:
        while cs and cs[-1] == space:
            cs.pop()

    if pad_with_space:
        cs = [space] + cs + [space]

    return [token2id[p] for p in cs]

# Load ONNX models
fastpitch_session = ort.InferenceSession("Karlsson_fastpitch.onnx")
hifigan_session = ort.InferenceSession("Karlsson_hifigan.onnx")

# Prepare text
text = "Hallo, ich bin CaroTTS, ein deutsches Text-zu-Sprache-System."
tokens = tokenize_german(text)

# Prepare inputs
paces = np.ones(len(tokens), dtype=np.float32)
pitches = np.zeros(len(tokens), dtype=np.float32)

inputs = {
    "text": np.array([tokens], dtype=np.int64),
    "pace": np.array([paces], dtype=np.float32),
    "pitch": np.array([pitches], dtype=np.float32),
}

# Generate spectrogram
spec = fastpitch_session.run(None, inputs)[0]

# Generate audio
audio = hifigan_session.run(None, {"spec": spec})[0]

# Save audio (44.1kHz sample rate)
sf.write("output.wav", audio.squeeze(), 44100)
```
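
To make the token IDs concrete, the sketch below rebuilds only the first part of the vocabulary used by `tokenize_german` (space, letters, apostrophe) and encodes a single word. `encode_word` is a hypothetical helper for illustration only. Punctuation and special tokens are left out because they come after the letters in the vocabulary and do not change the IDs shown here.

```python
# First part of the vocabulary, in the same order as in tokenize_german:
# space at index 0, then the German character set, then the apostrophe.
CHARSET = "ABCDEFGHIJKLMNOPQRSTUVWXYZÄÖÜẞabcdefghijklmnopqrstuvwxyzäöüß"

vocab = [" "] + list(CHARSET) + ["'"]
token2id = {token: i for i, token in enumerate(vocab)}


def encode_word(word: str) -> list[int]:
    """Encode a word made of letters only, padded with a space on each side."""
    space_id = token2id[" "]
    return [space_id] + [token2id[c] for c in word] + [space_id]


print(encode_word("Hallo"))  # [0, 8, 31, 42, 42, 45, 0]
```

Uppercase letters start at ID 1 and lowercase letters at ID 31, so the same word in a different case produces different IDs.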

### NeMo Inference

If you have NeMo installed (`pip install nemo-toolkit[tts]`) and want to work with the original `.nemo` checkpoints:

```python
import torch
import soundfile as sf
from nemo.collections.tts.models.fastpitch import FastPitchModel
from nemo.collections.tts.models.hifigan import HifiGanModel

# Load models
device = "cuda" if torch.cuda.is_available() else "cpu"
fastpitch = FastPitchModel.restore_from("Karlsson_fastpitch.nemo", map_location=device).eval()
hifigan = HifiGanModel.restore_from("Karlsson_hifigan.nemo", map_location=device).eval()

# Prepare text
text = "Guten Tag. Herzlich Willkommen zu dieser Demonstration."

with torch.inference_mode():
    # Parse and generate
    parsed_text = fastpitch.parse(text)
    spec = fastpitch.generate_spectrogram(tokens=parsed_text)
    audio = hifigan.convert_spectrogram_to_audio(spec=spec)

    # Save audio (44.1kHz sample rate)
    sf.write("output.wav", audio.squeeze().cpu().numpy(), 44100)
```

## Citation

```bibtex
@misc{carotts2024,
  title={CaroTTS: Fast Lightweight German Text-to-Speech},
  author={Holtzwart, Tassilo},
  year={2024},
  url={https://github.com/TassiloHo/CaroTTS}
}
```

## Acknowledgments

- Built with [NVIDIA NeMo](https://github.com/NVIDIA/NeMo)
- FastPitch architecture: [arXiv:2006.06873](https://arxiv.org/abs/2006.06873)
- HiFi-GAN: [arXiv:2010.05646](https://arxiv.org/abs/2010.05646)

## License

MIT License

---

For more information, visit the [CaroTTS GitHub repository](https://github.com/TassiloHo/CaroTTS).