Convert text to natural speech with Zonos TTS

A cutting-edge open-weight text-to-speech model trained on over 200,000 hours of multilingual speech data.

About Zonos TTS

Zonos TTS generates highly natural speech from text, conditioned on either a speaker embedding or an audio prefix. A few seconds of reference audio are enough for accurate voice cloning.

The model offers fine control over speaking rate, pitch variation, audio quality, and emotions such as happiness, fear, sadness, and anger. Zonos TTS outputs audio natively at 44 kHz.

Core Features

Zero-shot TTS and voice cloning

Provide the desired text and a 10-30 second sample of the speaker to generate high-quality speech in that voice.
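
As a quick sanity check before cloning a voice, the sketch below tests whether a reference clip falls inside the 10-30 second range suggested above. The helper names are hypothetical and not part of Zonos; they only do the arithmetic on a frame count and a sample rate.

```python
def clip_duration_seconds(num_frames: int, sample_rate: int) -> float:
    """Length of a clip in seconds, given its frame count and sample rate."""
    if sample_rate <= 0:
        raise ValueError("sample_rate must be positive")
    return num_frames / sample_rate


def is_recommended_reference(
    num_frames: int,
    sample_rate: int,
    min_seconds: float = 10.0,  # lower bound of the suggested range
    max_seconds: float = 30.0,  # upper bound of the suggested range
) -> bool:
    """True if the clip length lies within the suggested 10-30 second range."""
    duration = clip_duration_seconds(num_frames, sample_rate)
    return min_seconds <= duration <= max_seconds
```

For a tensor returned by torchaudio.load, the frame count is wav.shape[-1], so the check would read is_recommended_reference(wav.shape[-1], sampling_rate).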

Audio prefix input

Add an audio prefix alongside the text for closer speaker matching. A prefix can also elicit behaviors such as whispering, which are hard to reproduce from a speaker embedding alone.

Multilingual Support

Supports English, Japanese, Chinese, French, and German with natural pronunciation.

Advanced Control

Precisely adjust speaking rate, pitch, audio quality, and emotional expression.

Quick Start Guide

Python implementation


import torch
import torchaudio
from zonos.model import Zonos
from zonos.conditioning import make_cond_dict

# Initialize model
model = Zonos.from_pretrained("Zyphra/Zonos-v0.1-transformer", device="cuda")

# Load audio sample
wav, sampling_rate = torchaudio.load("assets/exampleaudio.mp3")
speaker = model.make_speaker_embedding(wav, sampling_rate)

# Generate speech
cond_dict = make_cond_dict(
    text="Hello, world!",
    speaker=speaker,
    language="en-us"
)
conditioning = model.prepare_conditioning(cond_dict)
codes = model.generate(conditioning)

# Save output
wavs = model.autoencoder.decode(codes).cpu()
torchaudio.save("sample.wav", wavs[0], model.autoencoder.sampling_rate)
                

Gradio interface (recommended)


uv run gradio_interface.py
# Alternative without uv:
# python gradio_interface.py
                

Zonos TTS on GitHub

FAQ

What languages does Zonos TTS support?

Zonos TTS currently supports English, Japanese, Chinese, French, and German.

How do I control the emotional tone of generated speech?

Adjust the emotion parameters in the settings, such as happiness, anger, sadness, and fear, to shape the emotional tone of the output.
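
The exact parameter names and value ranges the model expects are not stated here, so the following is only a minimal sketch of one way to keep the four emotions named above in a consistent form. The helper emotion_weights is hypothetical, and normalizing the levels so they sum to 1 is an assumption made for illustration, not a documented requirement.

```python
# The four emotions named in this FAQ entry; the model may expose others.
EMOTIONS = ("happiness", "sadness", "fear", "anger")


def emotion_weights(**levels: float) -> dict[str, float]:
    """Turn relative emotion levels into weights that sum to 1 (assumed convention)."""
    unknown = set(levels) - set(EMOTIONS)
    if unknown:
        raise ValueError(f"unknown emotions: {sorted(unknown)}")
    if any(value < 0 for value in levels.values()):
        raise ValueError("emotion levels must be non-negative")

    # Emotions that were not mentioned default to zero.
    raw = {name: float(levels.get(name, 0.0)) for name in EMOTIONS}
    total = sum(raw.values())
    if total == 0:
        return raw  # nothing requested: leave every emotion at zero
    return {name: value / total for name, value in raw.items()}
```

For example, emotion_weights(happiness=3, sadness=1) gives happiness three times the weight of sadness and leaves fear and anger at zero. How such weights map onto the actual settings should be checked against the repository.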

What is the real-time factor of Zonos TTS?

The real-time factor for Zonos TTS is approximately 2x on an RTX 4090.
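
To make that figure concrete, the sketch below estimates how long generation might take for a clip of a given length. It assumes "2x" means audio is produced about twice as fast as it plays back; some sources define real-time factor the other way around, so treat the helper (a hypothetical name) as rough arithmetic rather than a benchmark.

```python
def estimated_generation_seconds(audio_seconds: float, realtime_factor: float = 2.0) -> float:
    """Rough compute time for a clip, assuming the factor is playback time / compute time."""
    if realtime_factor <= 0:
        raise ValueError("realtime_factor must be positive")
    return audio_seconds / realtime_factor
```

Under this reading, one minute of speech would take roughly 30 seconds to generate on the hardware named above. Actual timings depend on the GPU, text length, and settings.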

How do I install Zonos TTS?

Zonos TTS can be easily installed and deployed using the Docker files provided in our repository.

Can I use Zonos TTS for commercial purposes?

Please refer to our license terms for information on commercial use.