| # CLAUDE.md | |
| This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository. | |
| ## Project Overview | |
| IndexTTS-Rust is a high-performance Text-to-Speech engine, a complete Rust rewrite of the Python IndexTTS system. It uses ONNX Runtime for neural network inference and provides zero-shot voice cloning with emotion control. | |
| ## Build and Development Commands | |
| ```bash | |
| # Build (always build release for performance testing) | |
| cargo build --release | |
| # Run linter (MANDATORY before commits - catches many issues) | |
| cargo clippy -- -D warnings | |
| # Run tests | |
| cargo test | |
| # Run specific test | |
| cargo test test_name | |
| # Run benchmarks (Criterion-based) | |
| cargo bench | |
| # Run specific benchmark | |
| cargo bench --bench mel_spectrogram | |
| cargo bench --bench inference | |
| # Check compilation without building | |
| cargo check | |
| # Format code | |
| cargo fmt | |
| # Full pre-commit workflow (BUILD -> CLIPPY -> BUILD) | |
| cargo build --release && cargo clippy -- -D warnings && cargo build --release | |
| ``` | |
| ## CLI Usage | |
| ```bash | |
| # Show help | |
| ./target/release/indextts --help | |
| # Synthesize speech | |
| ./target/release/indextts synthesize \ | |
| --text "Hello world" \ | |
| --voice examples/voice_01.wav \ | |
| --output output.wav | |
| # Generate default config | |
| ./target/release/indextts init-config -o config.yaml | |
| # Show system info | |
| ./target/release/indextts info | |
| # Run built-in benchmarks | |
| ./target/release/indextts benchmark --iterations 100 | |
| ``` | |
| ## Architecture | |
| The codebase follows a modular pipeline architecture where each stage processes data sequentially: | |
| ``` | |
| Text Input → Normalization → Tokenization → Model Inference → Vocoding → Audio Output | |
| ``` | |
| ### Core Modules (src/) | |
| - **audio/** - Audio DSP operations | |
| - `mel.rs` - Mel-spectrogram computation (STFT, filterbanks) | |
| - `io.rs` - WAV file I/O using hound | |
| - `dsp.rs` - Signal processing utilities | |
| - `resample.rs` - Audio resampling using rubato | |
| - **text/** - Text processing pipeline | |
| - `normalizer.rs` - Text normalization (Chinese/English/mixed) | |
| - `tokenizer.rs` - BPE tokenization via HuggingFace tokenizers | |
| - `phoneme.rs` - Grapheme-to-phoneme conversion | |
| - **model/** - Neural network inference | |
| - `session.rs` - ONNX Runtime wrapper (load-dynamic feature) | |
| - `gpt.rs` - GPT-based sequence generation | |
| - `embedding.rs` - Speaker and emotion encoders | |
| - **vocoder/** - Neural vocoding | |
| - `bigvgan.rs` - BigVGAN waveform synthesis | |
| - `activations.rs` - Snake/SnakeBeta activation functions | |
| - **pipeline/** - TTS orchestration | |
| - `synthesis.rs` - Main synthesis logic, coordinates all modules | |
| - **config/** - Configuration management (YAML-based via serde) | |
| - **error.rs** - Error types using thiserror | |
| - **lib.rs** - Library entry point, exposes public API | |
| - **main.rs** - CLI entry point using clap | |
| ### Key Constants (lib.rs) | |
| ```rust | |
| pub const SAMPLE_RATE: u32 = 22050; // Output audio sample rate | |
| pub const N_MELS: usize = 80; // Mel filterbank channels | |
| pub const N_FFT: usize = 1024; // FFT size | |
| pub const HOP_LENGTH: usize = 256; // STFT hop length | |
| ``` | |
| ### Dependencies Pattern | |
| - **Audio**: hound (WAV), rustfft/realfft (DSP), rubato (resampling), dasp (signal processing) | |
| - **ML Inference**: ort (ONNX Runtime with load-dynamic), ndarray, safetensors | |
| - **Text**: tokenizers (HuggingFace), jieba-rs (Chinese), regex, unicode-segmentation | |
| - **Parallelism**: rayon (data parallelism), tokio (async) | |
| - **CLI**: clap (derive), env_logger, indicatif | |
| ## Important Notes | |
| 1. **ONNX Runtime**: Uses `load-dynamic` feature - requires ONNX Runtime library installed on system | |
| 2. **Model Files**: ONNX models go in `models/` directory (not in git, download separately) | |
| 3. **Reference Implementation**: Python code in `indextts - REMOVING - REF ONLY/` is kept for reference only | |
| 4. **Performance**: Release builds use LTO and single codegen-unit for maximum optimization | |
| 5. **Audio Format**: All internal processing at 22050 Hz, 80-band mel spectrograms | |
| ## Testing Strategy | |
| - Unit tests inline in modules | |
| - Criterion benchmarks in `benches/` for performance regression testing | |
| - Python regression tests in `tests/` for end-to-end validation | |
| - Example audio files in `examples/` for testing voice cloning | |
| ## Missing Infrastructure (TODO) | |
| - No `scripts/manage.sh` yet (should include build, test, clean, docker controls) | |
| - No `context.md` yet for conversation continuity | |
| - No integration tests with actual ONNX models | |