# CLAUDE.md
This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
## Project Overview
IndexTTS-Rust is a high-performance Text-to-Speech engine, a complete Rust rewrite of the Python IndexTTS system. It uses ONNX Runtime for neural network inference and provides zero-shot voice cloning with emotion control.
## Build and Development Commands
```bash
# Build (always build release for performance testing)
cargo build --release
# Run linter (MANDATORY before commits - catches many issues)
cargo clippy -- -D warnings
# Run tests
cargo test
# Run specific test
cargo test test_name
# Run benchmarks (Criterion-based)
cargo bench
# Run specific benchmark
cargo bench --bench mel_spectrogram
cargo bench --bench inference
# Check compilation without building
cargo check
# Format code
cargo fmt
# Full pre-commit workflow (build, then lint)
cargo build --release && cargo clippy -- -D warnings
```
## CLI Usage
```bash
# Show help
./target/release/indextts --help
# Synthesize speech
./target/release/indextts synthesize \
  --text "Hello world" \
  --voice examples/voice_01.wav \
  --output output.wav
# Generate default config
./target/release/indextts init-config -o config.yaml
# Show system info
./target/release/indextts info
# Run built-in benchmarks
./target/release/indextts benchmark --iterations 100
```
## Architecture
The codebase follows a modular pipeline architecture where each stage processes data sequentially:
```
Text Input → Normalization → Tokenization → Model Inference → Vocoding → Audio Output
```
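As an illustration, the stages can be sketched as plain functions composed in order. The function bodies and signatures below are hypothetical stand-ins; the real stage APIs live in the modules listed under Core Modules.

```rust
// Hypothetical, simplified stages to show the data flow only;
// real implementations live in text/, model/, vocoder/, pipeline/.
fn normalize(text: &str) -> String {
    text.trim().to_string() // real normalizer handles Chinese/English/mixed
}

fn tokenize(text: &str) -> Vec<u32> {
    text.bytes().map(u32::from).collect() // real tokenizer is BPE-based
}

fn infer(tokens: &[u32]) -> Vec<f32> {
    tokens.iter().map(|&t| t as f32).collect() // real model runs via ONNX Runtime
}

fn vocode(mel: &[f32]) -> Vec<f32> {
    mel.to_vec() // real vocoder is BigVGAN
}

fn synthesize(text: &str) -> Vec<f32> {
    vocode(&infer(&tokenize(&normalize(text))))
}
```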
### Core Modules (src/)
- **audio/** - Audio DSP operations
- `mel.rs` - Mel-spectrogram computation (STFT, filterbanks)
- `io.rs` - WAV file I/O using hound
- `dsp.rs` - Signal processing utilities
- `resample.rs` - Audio resampling using rubato
- **text/** - Text processing pipeline
- `normalizer.rs` - Text normalization (Chinese/English/mixed)
- `tokenizer.rs` - BPE tokenization via HuggingFace tokenizers
- `phoneme.rs` - Grapheme-to-phoneme conversion
- **model/** - Neural network inference
- `session.rs` - ONNX Runtime wrapper (load-dynamic feature)
- `gpt.rs` - GPT-based sequence generation
- `embedding.rs` - Speaker and emotion encoders
- **vocoder/** - Neural vocoding
- `bigvgan.rs` - BigVGAN waveform synthesis
- `activations.rs` - Snake/SnakeBeta activation functions
- **pipeline/** - TTS orchestration
- `synthesis.rs` - Main synthesis logic, coordinates all modules
- **config/** - Configuration management (YAML-based via serde)
- **error.rs** - Error types using thiserror
- **lib.rs** - Library entry point, exposes public API
- **main.rs** - CLI entry point using clap
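The Snake activation used by BigVGAN (`activations.rs`) has a simple closed form, snake(x) = x + sin²(αx)/α. A minimal std-only version for reference (the in-repo implementation may differ in layout, and α is a learned per-channel parameter in BigVGAN rather than a fixed scalar):

```rust
/// Snake activation: x + sin^2(alpha * x) / alpha.
/// In BigVGAN, alpha is learned per channel; it is a plain argument here.
fn snake(x: f32, alpha: f32) -> f32 {
    x + (alpha * x).sin().powi(2) / alpha
}
```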
### Key Constants (lib.rs)
```rust
pub const SAMPLE_RATE: u32 = 22050; // Output audio sample rate
pub const N_MELS: usize = 80; // Mel filterbank channels
pub const N_FFT: usize = 1024; // FFT size
pub const HOP_LENGTH: usize = 256; // STFT hop length
```
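A quick sanity check of how these constants relate (derived arithmetic, not code from the repo): each STFT frame advances `HOP_LENGTH` samples, so one second of 22050 Hz audio yields roughly `SAMPLE_RATE / HOP_LENGTH` ≈ 86 mel frames, each about 11.6 ms apart.

```rust
const SAMPLE_RATE: u32 = 22_050;
const HOP_LENGTH: usize = 256;

// Approximate mel frames per second of audio
// (exact count depends on STFT padding conventions).
fn frames_per_second() -> usize {
    SAMPLE_RATE as usize / HOP_LENGTH
}

// Time step between consecutive mel frames, in milliseconds.
fn hop_ms() -> f64 {
    HOP_LENGTH as f64 * 1000.0 / SAMPLE_RATE as f64
}
```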
### Dependencies Pattern
- **Audio**: hound (WAV), rustfft/realfft (DSP), rubato (resampling), dasp (signal processing)
- **ML Inference**: ort (ONNX Runtime with load-dynamic), ndarray, safetensors
- **Text**: tokenizers (HuggingFace), jieba-rs (Chinese), regex, unicode-segmentation
- **Parallelism**: rayon (data parallelism), tokio (async)
- **CLI**: clap (derive), env_logger, indicatif
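For context on the parallelism pattern: rayon's parallel iterators replace hand-rolled thread management like the std-only sketch below (illustrative only; the repo uses rayon, which handles chunking and scheduling automatically).

```rust
// std-only equivalent of what rayon's par_chunks provides: split a
// buffer into chunks and process each chunk on its own scoped thread.
fn scale_in_parallel(input: &[f32], output: &mut [f32], chunk: usize) {
    std::thread::scope(|s| {
        for (inp, out) in input.chunks(chunk).zip(output.chunks_mut(chunk)) {
            s.spawn(move || {
                for (x, y) in inp.iter().zip(out.iter_mut()) {
                    *y = x * 2.0; // stand-in for real per-frame DSP work
                }
            });
        }
    });
}
```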
## Important Notes
1. **ONNX Runtime**: Uses `load-dynamic` feature - requires ONNX Runtime library installed on system
2. **Model Files**: ONNX models go in `models/` directory (not in git, download separately)
3. **Reference Implementation**: Python code in `indextts - REMOVING - REF ONLY/` is kept for reference only
4. **Performance**: Release builds use LTO and single codegen-unit for maximum optimization
5. **Audio Format**: All internal processing at 22050 Hz, 80-band mel spectrograms
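Regarding note 1: with ort's `load-dynamic` feature, the ONNX Runtime shared library is resolved at startup, and ort's documentation describes the `ORT_DYLIB_PATH` environment variable for pointing at it. A small sketch of that lookup (the fallback filename here is purely illustrative):

```rust
// Resolve the ONNX Runtime library path the way a load-dynamic build
// typically would: honor ORT_DYLIB_PATH, else fall back to a default
// name on the system library search path (illustrative fallback).
fn onnxruntime_path() -> String {
    std::env::var("ORT_DYLIB_PATH")
        .unwrap_or_else(|_| "libonnxruntime.so".to_string())
}
```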
## Testing Strategy
- Unit tests inline in modules
- Criterion benchmarks in `benches/` for performance regression testing
- Python regression tests in `tests/` for end-to-end validation
- Example audio files in `examples/` for testing voice cloning
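The inline unit tests follow the standard `#[cfg(test)]` module pattern. A representative sketch is below; the function is the standard HTK mel-scale formula, used here only as an example and not copied from `mel.rs`, which may use a different variant or precomputed filterbanks.

```rust
/// Standard HTK mel-scale conversion (illustrative example only).
pub fn hz_to_mel(hz: f32) -> f32 {
    2595.0 * (1.0 + hz / 700.0).log10()
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn zero_hz_maps_to_zero_mel() {
        assert_eq!(hz_to_mel(0.0), 0.0);
    }
}
```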
## Missing Infrastructure (TODO)
- No `scripts/manage.sh` yet (should include build, test, clean, docker controls)
- No `context.md` yet for conversation continuity
- No integration tests with actual ONNX models