# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## Project Overview

IndexTTS-Rust is a high-performance text-to-speech engine and a complete Rust rewrite of the Python IndexTTS system. It uses ONNX Runtime for neural-network inference and provides zero-shot voice cloning with emotion control.

## Build and Development Commands

```bash
# Build (always build release for performance testing)
cargo build --release

# Run linter (MANDATORY before commits - catches many issues)
cargo clippy -- -D warnings

# Run tests
cargo test

# Run specific test
cargo test test_name

# Run benchmarks (Criterion-based)
cargo bench

# Run specific benchmark
cargo bench --bench mel_spectrogram
cargo bench --bench inference

# Check compilation without building
cargo check

# Format code
cargo fmt

# Full pre-commit workflow (BUILD -> CLIPPY -> BUILD)
cargo build --release && cargo clippy -- -D warnings && cargo build --release
```

## CLI Usage

```bash
# Show help
./target/release/indextts --help

# Synthesize speech
./target/release/indextts synthesize \
  --text "Hello world" \
  --voice examples/voice_01.wav \
  --output output.wav

# Generate default config
./target/release/indextts init-config -o config.yaml

# Show system info
./target/release/indextts info

# Run built-in benchmarks
./target/release/indextts benchmark --iterations 100
```

## Architecture

The codebase follows a modular pipeline architecture where each stage processes data sequentially:

```
Text Input → Normalization → Tokenization → Model Inference → Vocoding → Audio Output
```
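
The stage boundaries map one-to-one onto the modules listed below. A purely illustrative, runnable sketch of the ordering (every function name here is a hypothetical placeholder; the real orchestration lives in `src/pipeline/synthesis.rs`):

```rust
// Placeholder stubs that only show the order of the stages; the real
// signatures live in text/, model/, vocoder/ and pipeline/synthesis.rs.
fn normalize(text: &str) -> String { text.trim().to_owned() }                  // text/normalizer.rs
fn tokenize(text: &str) -> Vec<u32> { text.bytes().map(u32::from).collect() }  // text/tokenizer.rs
fn infer(tokens: &[u32]) -> Vec<f32> { vec![0.0; tokens.len() * 256] }         // model/gpt.rs
fn vocode(latents: &[f32]) -> Vec<f32> { latents.to_vec() }                    // vocoder/bigvgan.rs

fn main() {
    let audio = vocode(&infer(&tokenize(&normalize("Hello world"))));
    println!("{} samples at 22050 Hz", audio.len()); // audio/io.rs handles the WAV write
}
```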

### Core Modules (src/)

- **audio/** - Audio DSP operations
  - `mel.rs` - Mel-spectrogram computation (STFT, filterbanks)
  - `io.rs` - WAV file I/O using hound
  - `dsp.rs` - Signal processing utilities
  - `resample.rs` - Audio resampling using rubato

- **text/** - Text processing pipeline
  - `normalizer.rs` - Text normalization (Chinese/English/mixed)
  - `tokenizer.rs` - BPE tokenization via HuggingFace tokenizers
  - `phoneme.rs` - Grapheme-to-phoneme conversion

- **model/** - Neural network inference
  - `session.rs` - ONNX Runtime wrapper (load-dynamic feature)
  - `gpt.rs` - GPT-based sequence generation
  - `embedding.rs` - Speaker and emotion encoders

- **vocoder/** - Neural vocoding
  - `bigvgan.rs` - BigVGAN waveform synthesis
  - `activations.rs` - Snake/SnakeBeta activation functions (see the sketch after this list)

- **pipeline/** - TTS orchestration
  - `synthesis.rs` - Main synthesis logic, coordinates all modules

- **config/** - Configuration management (YAML-based via serde)

- **error.rs** - Error types using thiserror

- **lib.rs** - Library entry point, exposes public API

- **main.rs** - CLI entry point using clap
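
Of these, `vocoder/activations.rs` is the most self-contained piece of math. A hedged sketch, assuming the standard BigVGAN formulation (Snake: x + sin²(αx)/α; SnakeBeta replaces the 1/α magnitude with a separately learned 1/β); the actual file may parameterize α and β in log space:

```rust
/// Snake activation: x + (1/α)·sin²(α·x), with α a learned per-channel parameter.
fn snake(x: f32, alpha: f32) -> f32 {
    x + (alpha * x).sin().powi(2) / alpha
}

/// SnakeBeta: frequency controlled by α, magnitude by a separate learned β.
/// A small ε guards against division by zero.
fn snake_beta(x: f32, alpha: f32, beta: f32) -> f32 {
    const EPS: f32 = 1e-9;
    x + (alpha * x).sin().powi(2) / (beta + EPS)
}

fn main() {
    assert!((snake(0.5, 1.0) - 0.7298).abs() < 1e-3); // 0.5 + sin²(0.5)
    println!("{}", snake_beta(0.5, 1.0, 2.0));
}
```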

### Key Constants (lib.rs)

```rust
pub const SAMPLE_RATE: u32 = 22050;  // Output audio sample rate
pub const N_MELS: usize = 80;        // Mel filterbank channels
pub const N_FFT: usize = 1024;       // FFT size
pub const HOP_LENGTH: usize = 256;   // STFT hop length
```
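
A few quantities derived from these constants come up constantly when working on `audio/mel.rs` and the vocoder; the arithmetic is shown as a checkable snippet:

```rust
const SAMPLE_RATE: u32 = 22050;
const N_FFT: usize = 1024;
const HOP_LENGTH: usize = 256;

fn main() {
    // One mel frame is emitted every HOP_LENGTH samples...
    let frames_per_second = SAMPLE_RATE as f64 / HOP_LENGTH as f64;  // ≈ 86.1
    let hop_ms = 1_000.0 * HOP_LENGTH as f64 / SAMPLE_RATE as f64;   // ≈ 11.6 ms
    // ...while each analysis window spans N_FFT samples.
    let window_ms = 1_000.0 * N_FFT as f64 / SAMPLE_RATE as f64;     // ≈ 46.4 ms
    // Conversely, the vocoder expands each mel frame back into HOP_LENGTH samples,
    // so T mel frames yield roughly T * 256 output samples.
    println!("{frames_per_second:.1} frames/s, {hop_ms:.1} ms hop, {window_ms:.1} ms window");
}
```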

### Dependencies Pattern

- **Audio**: hound (WAV; sketch after this list), rustfft/realfft (DSP), rubato (resampling), dasp (signal processing)
- **ML Inference**: ort (ONNX Runtime with load-dynamic), ndarray, safetensors
- **Text**: tokenizers (HuggingFace), jieba-rs (Chinese), regex, unicode-segmentation
- **Parallelism**: rayon (data parallelism), tokio (async)
- **CLI**: clap (derive), env_logger, indicatif
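
Since `audio/io.rs` sits on top of hound and all output audio is 22050 Hz mono (see Important Notes below), here is a minimal, hedged sketch of writing f32 samples as 16-bit PCM with hound; the helper actually exposed by `io.rs` may look different:

```rust
use hound::{SampleFormat, WavSpec, WavWriter};

fn write_wav(path: &str, samples: &[f32]) -> Result<(), hound::Error> {
    let spec = WavSpec {
        channels: 1,
        sample_rate: 22050,
        bits_per_sample: 16,
        sample_format: SampleFormat::Int,
    };
    let mut writer = WavWriter::create(path, spec)?;
    for &s in samples {
        // Clamp and convert f32 in [-1.0, 1.0] to i16 PCM.
        writer.write_sample((s.clamp(-1.0, 1.0) * i16::MAX as f32) as i16)?;
    }
    writer.finalize()
}
```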

## Important Notes

1. **ONNX Runtime**: Uses the `load-dynamic` feature, so the ONNX Runtime shared library must be installed on the system (see the sketch after this list)
2. **Model Files**: ONNX models go in the `models/` directory (not tracked in git; download separately)
3. **Reference Implementation**: Python code in `indextts - REMOVING - REF ONLY/` is kept for reference only
4. **Performance**: Release builds use LTO and single codegen-unit for maximum optimization
5. **Audio Format**: All internal processing at 22050 Hz, 80-band mel spectrograms
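
On note 1: with `load-dynamic`, ort links nothing at build time and loads the ONNX Runtime shared library at run time. Upstream ort conventionally resolves it via the `ORT_DYLIB_PATH` environment variable; treat that name as an assumption to verify against the ort version pinned in `Cargo.toml`. A tiny sketch of checking it before any session is built:

```rust
fn main() {
    // Assumed variable name (ORT_DYLIB_PATH); it should point at the ONNX Runtime
    // shared library (.so/.dylib/.dll), e.g.
    //   ORT_DYLIB_PATH=/usr/local/lib/libonnxruntime.so ./target/release/indextts info
    match std::env::var("ORT_DYLIB_PATH") {
        Ok(path) => println!("ONNX Runtime will be loaded from {path}"),
        Err(_) => eprintln!("ORT_DYLIB_PATH not set; relying on the default library search path"),
    }
}
```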

## Testing Strategy

- Unit tests inline in modules
- Criterion benchmarks in `benches/` for performance regression testing (skeleton after this list)
- Python regression tests in `tests/` for end-to-end validation
- Example audio files in `examples/` for testing voice cloning
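
New benchmarks in `benches/` follow the usual Criterion harness shape (each bench target also needs a `[[bench]]` entry with `harness = false` in `Cargo.toml`). A minimal skeleton; the closure body is a stand-in to replace with a call into the real mel or inference code:

```rust
use criterion::{black_box, criterion_group, criterion_main, Criterion};

fn bench_placeholder(c: &mut Criterion) {
    // Swap the closure body for a call into indextts (e.g. mel-spectrogram computation).
    c.bench_function("placeholder_sum", |b| {
        b.iter(|| (0..1024u32).map(black_box).sum::<u32>())
    });
}

criterion_group!(benches, bench_placeholder);
criterion_main!(benches);
```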

## Missing Infrastructure (TODO)

- No `scripts/manage.sh` yet (should cover build, test, clean, and Docker controls)
- No `context.md` yet for conversation continuity
- No integration tests with actual ONNX models