# VibeVoice Acoustic Tokenizer
VibeVoice is a novel framework designed for generating expressive, long-form, multi-speaker conversational audio, such as podcasts, from text. It addresses significant challenges in traditional Text-to-Speech (TTS) systems, particularly in scalability, speaker consistency, and natural turn-taking.
A core innovation of VibeVoice is its use of continuous speech tokenizers (Acoustic and Semantic) operating at an ultra-low frame rate of 7.5 Hz. These tokenizers efficiently preserve audio fidelity while significantly boosting computational efficiency for processing long sequences. VibeVoice employs a next-token diffusion framework, leveraging a Large Language Model (LLM) to understand textual context and dialogue flow, and a diffusion head to generate high-fidelity acoustic details.
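At a 24 kHz sampling rate, a 7.5 Hz latent frame rate means each latent frame covers 24000 / 7.5 = 3200 audio samples. The quick arithmetic below (plain Python, no VibeVoice dependency; the clip length is taken from the usage example further down) shows the latent sequence length to expect:

```python
# Back-of-the-envelope check of the 7.5 Hz tokenizer frame rate.
sampling_rate = 24000  # Hz, tokenizer input rate
frame_rate = 7.5       # Hz, latent frame rate

samples_per_frame = sampling_rate / frame_rate
print(samples_per_frame)  # 3200.0 -> matches `pad_to_multiple_of=3200` below

num_samples = 224000  # padded clip length from the example below
print(num_samples / samples_per_frame)  # 70.0 -> latent shape (1, 70, 64)
```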
The speech tokenizer is a key component for both VibeVoice TTS and ASR.
➡️ Technical Report: VibeVoice Technical Report
➡️ Project Page: microsoft/VibeVoice
## Models
| Model | Context Length | Generation Length | Weights |
|---|---|---|---|
| VibeVoice-Realtime-0.5B | 8K | ~10 min | HF link |
| VibeVoice-1.5B | 64K | ~90 min | HF link |
| VibeVoice-ASR | 64K | ~60 min | HF link |
| VibeVoice-AcousticTokenizer | - | - | This model |
## Usage

### Setup
The VibeVoice acoustic tokenizer is not yet available in an official Transformers release, but it can be used by installing from the following fork:

```bash
pip install git+https://github.com/ebezzam/transformers.git@vibevoice_acoustic_tokenizer
```
### Example

#### Encoding and decoding
```python
import torch
from scipy.io import wavfile

from transformers import AutoFeatureExtractor, VibeVoiceAcousticTokenizerModel
from transformers.audio_utils import load_audio_librosa

model_id = "bezzam/VibeVoice-AcousticTokenizer"
sampling_rate = 24000

# load audio
audio = load_audio_librosa(
    "https://hf.co/datasets/bezzam/vibevoice_samples/resolve/main/voices/en-Alice_woman.wav",
    sampling_rate=sampling_rate,
)

# load model
feature_extractor = AutoFeatureExtractor.from_pretrained(model_id)
model = VibeVoiceAcousticTokenizerModel.from_pretrained(model_id, device_map="auto")
print("Model loaded on device:", model.device)
print("Model dtype:", model.dtype)

# preprocess audio
inputs = feature_extractor(
    audio,
    sampling_rate=sampling_rate,
    padding=True,
    pad_to_multiple_of=3200,
).to(model.device, model.dtype)
print("Input audio shape:", inputs.input_values.shape)
# Input audio shape: torch.Size([1, 1, 224000])

# encode audio into continuous latents
with torch.no_grad():
    encoded_outputs = model.encode(inputs.input_values)
print("Latent shape:", encoded_outputs.latents.shape)
# Latent shape: torch.Size([1, 70, 64])

# VAE sampling (optional)
encoded_outputs = model.sample(encoded_outputs.latents)
print("Noisy latents shape:", encoded_outputs.latents.shape)
# Noisy latents shape: torch.Size([1, 70, 64])

# decode latents back to a waveform
decoded_outputs = model.decode(**encoded_outputs)
print("Reconstructed audio shape:", decoded_outputs.audio.shape)
# Reconstructed audio shape: torch.Size([1, 1, 224000])

# save audio
output_fp = "vibevoice_acoustic_tokenizer_reconstructed.wav"
wavfile.write(output_fp, sampling_rate, decoded_outputs.audio.squeeze().float().cpu().numpy())
print(f"Reconstructed audio saved to: {output_fp}")
```
**Original audio**

**Encoded/decoded audio**
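For a quick numerical comparison of the two clips, a signal-to-noise ratio can be computed over the waveforms. This is a minimal sketch that reuses `audio` and `decoded_outputs` from the example above and trims the reconstruction to the original (pre-padding) length; note that waveform SNR is only a rough proxy for perceptual quality:

```python
import numpy as np

# SNR between original and reconstructed waveforms.
original = np.asarray(audio, dtype=np.float32)
reconstructed = decoded_outputs.audio.squeeze().float().cpu().numpy()[: len(original)]

noise = original - reconstructed
snr_db = 10 * np.log10(np.sum(original**2) / np.sum(noise**2))
print(f"Reconstruction SNR: {snr_db:.1f} dB")
```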
#### Streaming

For streaming, pass `use_cache=True` when decoding; the decoder's padding cache is then carried across calls:
```python
# `padding_cache` is initialized to None for the first pass
padding_cache = None
decoded_outputs = model.decode(**encoded_outputs, padding_cache=padding_cache, use_cache=True)

# `padding_cache` can be extracted from `decoded_outputs` for subsequent passes
padding_cache = decoded_outputs.padding_cache
print("Number of cached layers:", len(padding_cache.per_layer_in_channels))
# Number of cached layers: 34
```
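Putting this together, a chunked decoding loop might look like the following. This is a minimal sketch: it reuses `model` and `encoded_outputs` from the example above, passes the latents by keyword (mirroring what `model.decode(**encoded_outputs)` does implicitly), and the chunk size is an arbitrary choice:

```python
import torch

chunk_size = 10  # latent frames per decode call (arbitrary choice)
latents = encoded_outputs.latents
padding_cache = None
chunks = []

with torch.no_grad():
    for start in range(0, latents.shape[1], chunk_size):
        out = model.decode(
            latents=latents[:, start : start + chunk_size],
            padding_cache=padding_cache,
            use_cache=True,
        )
        padding_cache = out.padding_cache  # carry the cache into the next pass
        chunks.append(out.audio)

streamed_audio = torch.cat(chunks, dim=-1)
print("Streamed audio shape:", streamed_audio.shape)
# should match the non-streaming reconstruction, torch.Size([1, 1, 224000])
```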