---
language:
- de
license: mit
tags:
- text-to-speech
- tts
- german
- audio
- fastpitch
- hifigan
library_name: nemo
pipeline_tag: text-to-speech
---

# CaroTTS-60M-DE-Karlsson 🥕

Fast, lightweight German text-to-speech model based on the FastPitch + HiFi-GAN architecture. Full training and export code is available at the [CaroTTS GitHub repository](https://github.com/TassiloHo/CaroTTS).

## Model Description

This model provides high-quality German text-to-speech synthesis using a non-autoregressive architecture optimized for fast inference on CPUs and mobile devices.

The model consists of two components:

- **FastPitch**: Duration predictor and mel-spectrogram generator (~46M parameters)
- **HiFi-GAN**: Neural vocoder that converts spectrograms to audio (~14M parameters)

**Voice**: Single speaker, male voice, 'Karlsson'

**Total Parameters**: ~60M

### Key Features

- 🚀 **Fast Inference**: Non-autoregressive architecture enables real-time synthesis
- 💻 **CPU-Friendly**: Optimized for deployment on resource-constrained devices
- 🎯 **High Quality**: Natural-sounding German speech
- 📦 **Multiple Formats**: Available in NeMo, ONNX and PT2 (PyTorch Inductor) formats

## Try It Out

👉 **[Interactive Demo on HuggingFace Spaces](https://huggingface.co/spaces/Warholt/CaroTTS-DE)** (uses the PT2 format with ZeroGPU)

## Model Files

- `Karlsson_fastpitch.nemo` - FastPitch model in NeMo format
- `Karlsson_hifigan.nemo` - HiFi-GAN vocoder in NeMo format
- `Karlsson_fastpitch.onnx` - FastPitch model in ONNX format
- `Karlsson_hifigan.onnx` - HiFi-GAN vocoder in ONNX format
- `Karlsson_fastpitch_encoder.pt2` - FastPitch encoder compiled with PyTorch Inductor (for CUDA/ZeroGPU)
- `Karlsson_fastpitch_decoder.pt2` - FastPitch decoder compiled with PyTorch Inductor (for CUDA/ZeroGPU)
- `Karlsson_hifigan.pt2` - HiFi-GAN compiled with PyTorch Inductor (for CUDA/ZeroGPU)

**Note**: The PT2 files were exported and compiled on ZeroGPU and may have to be re-exported for use on other hardware. Visit the [CaroTTS GitHub repository](https://github.com/TassiloHo/CaroTTS) for the export code.
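Before loading a model, it can help to confirm that every file for the chosen format is present. The following is a minimal sketch of such a check. The file names come from the Model Files list, while the grouping by format, the `MODEL_FILES` dictionary, the `missing_model_files` helper, and the `models` directory are illustrative assumptions and not part of CaroTTS.

```python
from pathlib import Path

# File names copied from the "Model Files" list. The grouping by format and
# the helper below are illustrative assumptions, not part of CaroTTS.
MODEL_FILES = {
    "nemo": ["Karlsson_fastpitch.nemo", "Karlsson_hifigan.nemo"],
    "onnx": ["Karlsson_fastpitch.onnx", "Karlsson_hifigan.onnx"],
    "pt2": [
        "Karlsson_fastpitch_encoder.pt2",
        "Karlsson_fastpitch_decoder.pt2",
        "Karlsson_hifigan.pt2",
    ],
}


def missing_model_files(model_dir: str, fmt: str) -> list[str]:
    """Return the files required for `fmt` that are absent from `model_dir`."""
    if fmt not in MODEL_FILES:
        raise ValueError(
            f"Unknown format {fmt!r}; expected one of {sorted(MODEL_FILES)}"
        )
    root = Path(model_dir)
    return [name for name in MODEL_FILES[fmt] if not (root / name).is_file()]


if __name__ == "__main__":
    # "models" is a hypothetical local directory holding the downloaded files.
    missing = missing_model_files("models", "onnx")
    print("All ONNX files present" if not missing else f"Missing: {missing}")
```

Each format needs all of its files: the ONNX and NeMo formats use one FastPitch and one HiFi-GAN file, while the PT2 format splits FastPitch into an encoder and a decoder.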
## Usage

### ONNX Inference

ONNX provides the best compatibility and performance for CPU deployment:

```python
import unicodedata

import numpy as np
import onnxruntime as ort
import soundfile as sf


# Tokenization functions
def normalize_unicode_text(text: str) -> str:
    """Normalize text to Unicode NFC form."""
    if not unicodedata.is_normalized("NFC", text):
        text = unicodedata.normalize("NFC", text)
    return text


def any_locale_text_preprocessing(text: str) -> str:
    """Normalize to NFC and map the right single quotation mark to an apostrophe."""
    res = []
    for c in normalize_unicode_text(text):
        if c in ["\u2019"]:  # right single quotation mark used as an apostrophe
            res.append("'")
        else:
            res.append(c)
    return "".join(res)


def tokenize_german(
    text: str,
    punct: bool = True,
    apostrophe: bool = True,
    pad_with_space: bool = True,
) -> list[int]:
    """Tokenize German text into a list of integer token IDs."""
    _CHARSET_STR = "ABCDEFGHIJKLMNOPQRSTUVWXYZÄÖÜẞabcdefghijklmnopqrstuvwxyzäöüß"
    # "\u2018" is the left single and "\u201c" the left double quotation mark;
    # they must stay distinct from the ASCII "'" and '"' so token IDs are unique.
    _PUNCT_LIST = [
        "!", '"', "(", ")", ",", "-", ".", "/", ":", ";", "?", "[", "]", "{", "}",
        "«", "»", "‒", "–", "—", "\u2018", "‚", "\u201c", "„", "‹", "›",
    ]

    tokens = [" "]  # Space at index 0
    tokens.extend(_CHARSET_STR)
    if apostrophe:
        tokens.append("'")
    if punct:
        tokens.extend(_PUNCT_LIST)
    # Reserved special tokens (IDs only; never produced by the encoder below)
    tokens.extend(["<pad>", "<blank>", "<oov>"])

    token2id = {token: i for i, token in enumerate(tokens)}
    tokens_set = set(tokens)
    space = " "

    text = any_locale_text_preprocessing(text)

    # Encode: collapse repeated spaces, keep known letters, the apostrophe and
    # punctuation, and silently drop every other character
    cs = []
    for c in text:
        if (
            (c == space and len(cs) > 0 and cs[-1] != space)
            or ((c.isalnum() or c == "'") and c in tokens_set)
            or (c in _PUNCT_LIST and punct)
        ):
            cs.append(c)

    # Remove trailing spaces
    while cs and cs[-1] == space:
        cs.pop()

    if pad_with_space:
        cs = [space] + cs + [space]

    return [token2id[p] for p in cs]


# Load ONNX models
fastpitch_session = ort.InferenceSession("Karlsson_fastpitch.onnx")
hifigan_session = ort.InferenceSession("Karlsson_hifigan.onnx")

# Prepare text
text = "Hallo, ich bin CaroTTS, ein deutsches Text-zu-Sprache-System."
tokens = tokenize_german(text)

# Prepare inputs: one pace and one pitch value per token
paces = np.ones(len(tokens), dtype=np.float32)
pitches = np.zeros(len(tokens), dtype=np.float32)

inputs = {
    "text": np.array([tokens], dtype=np.int64),
    "pace": np.array([paces], dtype=np.float32),
    "pitch": np.array([pitches], dtype=np.float32),
}

# Generate spectrogram
spec = fastpitch_session.run(None, inputs)[0]

# Generate audio
audio = hifigan_session.run(None, {"spec": spec})[0]

# Save audio (44.1 kHz sample rate)
sf.write("output.wav", audio.squeeze(), 44100)
```

### NeMo Inference

If you have NeMo installed (`pip install nemo-toolkit[tts]`) and want to work with the original `.nemo` checkpoints:

```python
import torch
import soundfile as sf
from nemo.collections.tts.models.fastpitch import FastPitchModel
from nemo.collections.tts.models.hifigan import HifiGanModel

# Load models
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
fastpitch = FastPitchModel.restore_from("Karlsson_fastpitch.nemo", map_location=device).eval()
hifigan = HifiGanModel.restore_from("Karlsson_hifigan.nemo", map_location=device).eval()

# Prepare text
text = "Guten Tag. Herzlich Willkommen zu dieser Demonstration."

with torch.inference_mode():
    # Parse and generate
    parsed_text = fastpitch.parse(text)
    spec = fastpitch.generate_spectrogram(tokens=parsed_text)
    audio = hifigan.convert_spectrogram_to_audio(spec=spec)

# Save audio (44.1 kHz sample rate)
sf.write("output.wav", audio.squeeze().cpu().numpy(), 44100)
```

## Citation

```bibtex
@misc{carotts2024,
  title={CaroTTS: Fast Lightweight German Text-to-Speech},
  author={Holtzwart, Tassilo},
  year={2024},
  url={https://github.com/TassiloHo/CaroTTS}
}
```

## Acknowledgments

- Built with [NVIDIA NeMo](https://github.com/NVIDIA/NeMo)
- FastPitch architecture: [arXiv:2006.06873](https://arxiv.org/abs/2006.06873)
- HiFi-GAN: [arXiv:2010.05646](https://arxiv.org/abs/2010.05646)

## License

MIT License

---

For more information, visit the [CaroTTS GitHub repository](https://github.com/TassiloHo/CaroTTS).