ChatterBox for ComfyUI: Text-to-Speech, Voice Cloning & Conversion

June 6, 2025
Learn how to use ChatterBox in ComfyUI for text-to-speech (TTS), voice cloning, and voice conversion.

1. Introduction to ChatterBox in ComfyUI

ChatterBox is a custom node extension for ComfyUI that adds Text-to-Speech (TTS) and Voice Conversion (VC) capabilities. Built on the Chatterbox library, it lets you generate realistic speech from text and convert one voice into another.

If you haven’t installed ComfyUI yet, check out our detailed guide below on how to set it up locally:

👉 How to Install ComfyUI Locally on Windows?

One of ChatterBox’s most notable features is voice cloning: it can create a custom voice from a short reference sample, which makes it useful for developers, content creators, and anyone working with voice synthesis. There are a few limitations to keep in mind, however: speech generation is currently capped at 40 seconds, and the model performs best in English.

In this blog post, we’ll walk you through how to set up ChatterBox in ComfyUI, explore its key features, and help you understand both its capabilities and current limitations.

2. Loading the ChatterBox Workflow in ComfyUI

To get started with ChatterBox in ComfyUI, begin by loading the workflow into your interface. Download the required .json file from the following link: ChatterBox ComfyUI Workflow. Once downloaded, open ComfyUI and drag the file into the canvas. You may see a popup message like "Missing Node Types" — this is expected and will be addressed shortly.

The workflow is organized into two main sections: one for Text-to-Speech (TTS) and the other for Voice Conversion (VC). This layout helps you quickly navigate and utilize ChatterBox’s advanced voice capabilities.

As you explore the workflow, you may notice some red outlines around the nodes labeled "FL_ChatterboxVC" and "FL_ChatterboxTTS". This indicates that these nodes are not yet properly installed, which we will address in the next section.
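
The "Missing Node Types" popup can also be anticipated by inspecting the workflow file before loading it. The following is a minimal sketch, assuming the workflow .json follows the usual ComfyUI export layout with a top-level "nodes" list whose entries carry a "type" field. The sample workflow and the set of installed node types are hypothetical stand-ins, not the contents of the actual download.

```python
import json


def find_missing_node_types(workflow: dict, installed_types: set[str]) -> list[str]:
    """Return the node types used by the workflow that are not installed."""
    used = {node["type"] for node in workflow.get("nodes", []) if "type" in node}
    return sorted(used - installed_types)


# Trimmed-down, hypothetical workflow; a real export has many more fields per node.
workflow = json.loads("""
{
  "nodes": [
    {"id": 1, "type": "LoadAudio"},
    {"id": 2, "type": "FL_ChatterboxTTS"},
    {"id": 3, "type": "FL_ChatterboxVC"},
    {"id": 4, "type": "PreviewAudio"}
  ]
}
""")

# Assumption: the node types your ComfyUI install already provides.
installed = {"LoadAudio", "PreviewAudio"}

missing = find_missing_node_types(workflow, installed)
print(missing)
```

With the ChatterBox nodes not yet installed, the two types named above are the ones reported as missing.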

3. Installing Missing Nodes for ChatterBox

If you see red outlines around the ChatterBox nodes in ComfyUI, it means some custom nodes are missing. Follow these steps to install them and fix the issue:

  1. Open your ComfyUI interface and go to the top right corner.

  2. Click on the “Manager” option to open the Node Manager.

  3. In the Node Manager, find and select “Install missing custom nodes.”

  4. Locate the missing node labeled “ComfyUI_Fill-ChatterBox.”

  5. Click the install button next to this node to start the installation.

  6. After installation, you’ll be prompted to restart your server.

  7. Click the restart button in the bottom left corner and confirm.

  8. Wait for the server to reboot; a “Reconnecting” popup will appear in the top right corner.

  9. When prompted, click confirm to refresh your browser.

  10. After refreshing, the red outlines around the ChatterBox nodes should disappear, indicating a successful installation.

With the nodes installed, you’re ready to explore ChatterBox’s TTS, voice cloning, and voice conversion features in ComfyUI.
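
If you want to confirm the installation outside the browser, you can check that the node pack's folder exists on disk. This is a rough sketch under two assumptions: that custom nodes live in a custom_nodes folder inside your ComfyUI directory, and that the folder name matches the pack name shown in the Manager (the comparison ignores case, since the installed folder name may differ in capitalization). The "ComfyUI" path is a placeholder for your own install location.

```python
from pathlib import Path
from typing import Optional


def find_node_pack(comfy_root: str, pack_name: str) -> Optional[Path]:
    """Return the folder of a custom node pack, or None if it is not found."""
    custom_nodes = Path(comfy_root) / "custom_nodes"
    if not custom_nodes.is_dir():
        return None
    for entry in custom_nodes.iterdir():
        # Case-insensitive match on the folder name.
        if entry.is_dir() and entry.name.lower() == pack_name.lower():
            return entry
    return None


# "ComfyUI" is a placeholder; point this at your own ComfyUI directory.
pack = find_node_pack("ComfyUI", "ComfyUI_Fill-ChatterBox")
print(pack if pack else "Node pack folder not found")
```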

4. Exploring TTS with Voice Cloning Capabilities

Now that the necessary nodes are installed, let’s explore how to use ChatterBox’s voice cloning and TTS features step-by-step. We’ll also explain the key settings to help you get the best results.

How to Use ChatterBox TTS with Voice Cloning

  1. Upload a voice sample

  2. Enter your prompt (the text you want spoken)

  3. Click “Run”

Chatterbox TTS Settings and Explanation

  • Prompt (the text to be spoken): "Well hello there… big boy. You're hearing the voice of Salma Hayek. Soft… slow… and dripping with everything your ears desire. Fully AI… but oh, I sound real enough to make you lean in closer, don’t I? Do you like it? Mmm… I know you do. Let’s take our time… darling."

  • Exaggeration: 0.5 (range 0.25–2)
    Controls how expressive the voice sounds. Higher values produce more dramatic, intense speech; lower values keep it natural and calm.

  • CFG Weight: 0.5 (range 0.2–1.0)
    Determines how closely the speech follows your text prompt. Higher values mean the voice sticks more closely to the words you wrote; lower values allow for more variation and creativity.

  • Temperature: 0.8 (range 0.05–5)
    Controls randomness in generation. Higher values make the speech more varied and unpredictable, which can reduce clarity and may change its pacing; lower values make it steadier and clearer.
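
To keep experiments within the ranges listed above, the three settings can be captured in a small validated structure. This is only an illustrative sketch: the TTSSettings class and its field names are hypothetical and are not part of ChatterBox or ComfyUI; the defaults and ranges are the ones given in this section.

```python
from dataclasses import dataclass

# Ranges as listed in the settings above: (minimum, maximum).
RANGES = {
    "exaggeration": (0.25, 2.0),
    "cfg_weight": (0.2, 1.0),
    "temperature": (0.05, 5.0),
}


@dataclass
class TTSSettings:
    """Hypothetical container for the three TTS settings, with range checks."""

    exaggeration: float = 0.5
    cfg_weight: float = 0.5
    temperature: float = 0.8

    def __post_init__(self) -> None:
        # Reject any value outside the range documented for that setting.
        for name, (low, high) in RANGES.items():
            value = getattr(self, name)
            if not low <= value <= high:
                raise ValueError(f"{name}={value} is outside {low}-{high}")


defaults = TTSSettings()
# An illustrative variation, not a recommended preset.
expressive = TTSSettings(exaggeration=1.2, cfg_weight=0.3)
print(defaults)
print(expressive)
```

Writing a preset down this way makes it easy to record which combination produced a given result when you compare outputs.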

Chatterbox TTS Example (Salma Hayek)

Using just a 2-second clip of Salma Hayek’s voice and the prompt above, the generated speech sounds realistic and captures the intended tone. It shows how much ChatterBox’s TTS can do with minimal input.

In the next chapter, we’ll dive into the ChatterBox Voice Conversion (VC) nodes, where you’ll learn how to transform one voice into another using the intuitive workflow.

5. Exploring Chatterbox Voice Conversion (VC) with Example

In addition to TTS, ChatterBox offers voice conversion, which transforms one voice into another while keeping the spoken content. To use it, follow these steps:

  • Input an original audio sample: upload the voice you want to convert (referred to as input_audio).

  • Select a target voice: choose the voice you want the original audio converted into (referred to as target_audio). This is the voice that will be cloned and applied to the input.

  • Run the conversion: ChatterBox generates an output clip that keeps the original content but takes on the vocal characteristics of the target voice.

Original Audio (Input_Audio):

For example, start with a male speaker’s audio file as the original input.

Target Voice (Salma Hayek):

Then, you select a short audio clip of Salma Hayek’s voice as the target voice you want to convert to.

Final Output Audio:

The output will be the original male speech transformed to sound like it’s spoken by Salma Hayek, combining the content of the original with the unique vocal qualities of the target.

This feature is especially valuable for voice actors, game developers, and creators seeking to produce dynamic, high-impact voice transformations that stand out in their projects.

In the next section, we’ll go over a few important limitations to keep in mind when working with ChatterBox.

6. Limitations of ChatterBox

While ChatterBox delivers impressive results in both TTS and voice conversion, it’s important to be mindful of its current limitations. Most notably, the speech generation is capped at a maximum of 40 seconds. Exceeding this may result in reduced clarity, distorted audio, or inconsistent quality—common challenges in longer voice synthesis.

Another important point: ChatterBox performs best with English-language voices and text. While it may technically process other languages, the most natural, coherent, and expressive results come from English inputs and voice samples.

By staying within these boundaries—shorter durations and English content—you’ll get the most reliable and high-quality performance from the ChatterBox nodes within ComfyUI.
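
One way to work within the 40-second cap is to split a longer script into several shorter prompts and generate them one at a time. The sketch below is a rough illustration, not part of ChatterBox: it splits text at sentence boundaries and estimates duration from word count, assuming a speaking rate of about 150 words per minute. Actual speech speed depends on the voice and settings, so treat the estimate as approximate.

```python
import re

MAX_SECONDS = 40.0       # cap stated in this article
WORDS_PER_SECOND = 2.5   # assumption: roughly 150 words per minute


def estimate_seconds(text: str) -> float:
    """Rough duration estimate based on word count alone."""
    return len(text.split()) / WORDS_PER_SECOND


def split_for_tts(text: str, max_seconds: float = MAX_SECONDS) -> list[str]:
    """Group whole sentences into chunks whose estimated length fits the cap.

    A single sentence that is longer than the cap is kept whole.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        if not sentence:
            continue
        candidate = f"{current} {sentence}".strip()
        if current and estimate_seconds(candidate) > max_seconds:
            # Adding this sentence would exceed the cap, so start a new chunk.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


script = "This is a sentence with seven words. " * 30
chunks = split_for_tts(script)
for index, chunk in enumerate(chunks, start=1):
    print(index, round(estimate_seconds(chunk), 1), "seconds (estimated)")
```

Each chunk can then be entered as a separate prompt, and the resulting clips joined afterwards in an audio editor.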

7. Conclusion: Harnessing the Power of ChatterBox in ComfyUI

In conclusion, ChatterBox is a powerful and intuitive custom node extension for ComfyUI that unlocks advanced text-to-speech (TTS) and voice conversion (VC) capabilities. With support for custom voice cloning and seamless audio transformation, it opens up exciting new possibilities for creators working with voice-driven content.

By following the setup and workflow outlined in this guide, you can get up and running quickly—whether you're a developer, content creator, or simply a voice tech enthusiast. ChatterBox offers a user-friendly experience combined with robust functionality, making it an ideal tool for experimenting with synthetic speech.

Just remember:

  • Keep your speech under 40 seconds for the best quality output.

  • Stick to English for the most natural and consistent results.

With these tips in mind, you’re ready to start exploring the creative potential of ChatterBox. Push boundaries, personalize your audio, and bring your voice ideas to life.
