
Voice Cloning Experiments with Chatterbox-TTS

December 28, 2025·bill

It’s now possible to clone someone’s voice reliably, locally, and completely offline. This is something I’ve wanted in my toolkit for a few years now, but there were always barriers to entry: it required high-quality audio or imposed tricky limitations on the input, or it required hardware I didn’t have or was too slow to be practical. That’s all out the window now. After downloading a couple GB of models/weights, I’m able to clone an individual’s voice using a clip that’s only ~10 seconds long.

Now that these technical limitations are gone, I ran up against another set of barriers I didn’t expect: points where I had to stop and ask myself whether trying this, and making it more accessible, was a good idea.

  • Should people like me really have this technology so readily available?
  • When I create synthetic audio for the purposes of demonstration or otherwise (deceit?), should this be watermarked?

Resemble AI has a set of models now which they’ve demonstrated on their demo page. The input audio is only about 3 seconds in some examples, with the output running multiple sentences. I’m not sure how the people in those samples speak normally, so I can’t validate how close the output sounds to their real voices, but it does sound convincing. The systems I have available are a Windows 11 machine with WSL/Ubuntu and a MacBook Air M2, and I wanted to figure out how simple it was to get this set up and how easily I could run it.

Model Differences

The models share a few parameters, but I wanted to play around with the available knobs to tune each one, so I created a little web app to expose these knobs and dials. The differences are summarized below, with a generation sketch after the lists.

ChatterboxTurboTTS

  • Parameters: temperature (0.8), top_p (0.95), top_k (1000), repetition_penalty (1.2), norm_loudness (True)
  • Ignores: cfg_weight, exaggeration, min_p (logs warning)
  • Requirement: Audio prompt must be > 5 seconds
  • Performance: Faster inference

ChatterboxTTS (Non-Turbo)

  • Parameters: temperature (0.8), top_p (1.0), min_p (0.05), repetition_penalty (1.2), cfg_weight (0.5), exaggeration (0.5)
  • Does NOT have: top_k, norm_loudness
  • Requirement: No minimum audio length
  • Performance: Slower but supports CFG and emotion control

Shared Parameters

  • temperature, top_p, repetition_penalty (present in both)
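To make the knobs concrete, here’s a minimal sketch of driving the non-Turbo model from Python. I’m assuming the chatterbox-tts package’s ChatterboxTTS.from_pretrained / generate interface and torchaudio for saving the output; the filenames are placeholders, and the keyword names simply mirror the parameter lists above. For the Turbo model you’d swap cfg_weight/exaggeration/min_p for top_k/norm_loudness and make sure the prompt clip is longer than 5 seconds.

```python
# Minimal sketch of cloning a voice with the non-Turbo ChatterboxTTS model.
# Assumes the chatterbox-tts package API (ChatterboxTTS.from_pretrained / generate)
# and torchaudio for writing the output; filenames are placeholders.
import torchaudio as ta
from chatterbox.tts import ChatterboxTTS

model = ChatterboxTTS.from_pretrained(device="cuda")  # "mps" on the MacBook, "cpu" otherwise

wav = model.generate(
    "This is a test sentence rendered in the cloned voice.",
    audio_prompt_path="my_voice_clip.wav",  # the ~10 second reference clip
    temperature=0.8,
    top_p=1.0,
    min_p=0.05,
    repetition_penalty=1.2,
    cfg_weight=0.5,    # non-Turbo only
    exaggeration=0.5,  # non-Turbo only: emotion/intensity control
)

ta.save("cloned.wav", wav, model.sr)
```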

You actually do have to download stuff from Hugging Face

Initial setup meant becoming a user of the Hugging Face ecosystem, which is a barrier most people will have in front of them. Accounts are simple to set up, but I don’t see regular joes making a login over there; those people will probably be the sort of users who just use these services running on someone else’s systems. Step 1 was to create a read-only HF token on my account and run hf auth login on my local machine to get connected.
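If you’d rather do that step from Python than the CLI, huggingface_hub exposes the same login; this is just a sketch, and the token string below is a placeholder for your own read-only token.

```python
# Programmatic equivalent of running `hf auth login`: authenticate this machine
# with a read-only Hugging Face token so model downloads work.
from huggingface_hub import login

login(token="hf_xxxxxxxxxxxxxxxx")  # placeholder; paste your own read-only token
```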

On the first run of the chatterbox-tts module, it’s going to download a couple of GB of data.
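If you’d rather pull those weights ahead of time (or point the multi-GB download at a bigger disk), huggingface_hub can fetch the snapshot directly. This is only a sketch: I believe the weights live under the ResembleAI/chatterbox repo on the Hub, but treat that repo id as an assumption and check what the first run actually requests on your machine.

```python
# Pre-fetch the model weights instead of waiting for the first generation call.
# The repo id is an assumption (ResembleAI/chatterbox). By default the files land
# in the Hugging Face cache (~/.cache/huggingface), which you can relocate by
# exporting HF_HOME before running this.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(repo_id="ResembleAI/chatterbox")
print("weights cached at:", local_dir)
```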

Still on my list to write up:

  • the support matrix
  • hacking up the module to get things to work