Multi-Speaker Audio Generation with Microsoft VibeVoice (ComfyUI)

1. Introduction
Welcome to this comprehensive guide on generating multi-speaker conversational audio using ComfyUI-VibeVoice. In this tutorial, we’ll explore Microsoft’s VibeVoice, a cutting-edge model designed for creating expressive, long-form audio conversations. Unlike traditional text-to-speech, VibeVoice performs voice conversion (zero-shot voice cloning), meaning it always requires a reference audio to replicate a speaker’s unique timbre. This makes it perfect for generating natural-sounding dialogue for podcasts, storytelling, and other audio applications.
VibeVoice supports up to four distinct speakers, enabling dynamic and engaging multi-voice conversations. Using the ComfyUI workflow, you can seamlessly convert your text scripts into realistic speech with each speaker’s voice faithfully cloned from reference audio files.
Throughout this tutorial, we’ll guide you through setting up ComfyUI-VibeVoice, connecting reference audio for each speaker, configuring the voice conversion node, and creating your first multi-speaker conversation. By the end, you’ll have all the tools needed to bring your text-based dialogues to life with cloned voices.
2. Requirements for Microsoft VibeVoice in ComfyUI
Before setting up ComfyUI-VibeVoice, make sure your system meets the requirements below. Because VibeVoice clones each voice from reference audio, you will also need a reference recording for every speaker in your script.
Requirement 1: ComfyUI Installed
To use VibeVoice in ComfyUI, you first need a working ComfyUI installation. If you haven’t set it up yet, follow the detailed instructions below for a local Windows installation or use a cloud-based solution like RunPod for faster performance.
Option 1: Local Installation
👉 How to Install ComfyUI Locally on Windows?
Option 2: Cloud-Based GPU (RunPod)
👉 How to Run ComfyUI on RunPod with Network Volume
Requirement 2: Update ComfyUI
To ensure full compatibility with VibeVoice, make sure your ComfyUI installation is up to date.
For Windows Portable Users:
- Open the folder: ...\ComfyUI_windows_portable\update
- Double-click update_comfyui.bat.
For RunPod Users:
In the RunPod terminal, run the following command:
cd /workspace/ComfyUI && git pull origin master && pip install -r requirements.txt && cd /workspace
Keeping ComfyUI updated guarantees you have the latest features and nodes, bug fixes and compatibility improvements.
Requirement 3: VibeVoice Models
Based on your selection, the correct VibeVoice model will be automatically downloaded the first time you run the node in ComfyUI. For reference, the table below lists the supported models.
Model | Context Length | Generation Length | Weight |
---|---|---|---|
VibeVoice-1.5B | 64K | ~90 min | HF link |
VibeVoice-Large | 32K | ~45 min | HF link |
These models are automatically managed by ComfyUI-VibeVoice, ensuring a smooth setup for generating multi-speaker conversational audio.
⚠️ Note: The table showcases available models, but Microsoft has deleted their Hugging Face repo for the VibeVoice-Large model. As of this writing, the 1.5B model is still available, so use that one for model selection in ComfyUI.
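To get a rough idea of whether a script fits within a model's generation length, the table can be turned into a small lookup. This is only a sketch: the dictionary layout and function names are illustrative and not part of ComfyUI-VibeVoice, the availability flags reflect the note above as of this writing, and the speaking rate of 150 words per minute is an assumption rather than a VibeVoice figure.

```python
# Figures copied from the table above; the layout is illustrative, not an API.
MODELS = {
    "VibeVoice-1.5B": {"context": "64K", "max_minutes": 90, "available": True},
    "VibeVoice-Large": {"context": "32K", "max_minutes": 45, "available": False},
}

WORDS_PER_MINUTE = 150  # assumed speaking rate, not a VibeVoice figure


def estimate_minutes(script: str, words_per_minute: int = WORDS_PER_MINUTE) -> float:
    """Roughly estimate spoken duration from the word count."""
    return len(script.split()) / words_per_minute


def models_that_fit(script: str) -> list[str]:
    """Return available models whose approximate limit covers the script."""
    minutes = estimate_minutes(script)
    return [
        name
        for name, info in MODELS.items()
        if info["available"] and minutes <= info["max_minutes"]
    ]
```

The word count includes the "Speaker N:" prefixes, so the estimate runs slightly high. Treat the result as a sanity check, not a guarantee.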
3. Downloading and Loading the VibeVoice Workflow for ComfyUI
With the requirements in place, the next step is to download and load the VibeVoice workflow into ComfyUI. This workflow is prepared to simplify setup and ensure the model operates correctly.
Step 1: Download the Workflow File
Begin by downloading the workflow file specifically designed for VibeVoice. This file contains all the necessary configurations and nodes required for multi-speaker voice cloning. You can find the download link below:
👉 Download Microsoft VibeVoice Multi-Speaker Voice Conversion Workflow JSON
Step 2: Load the Workflow in ComfyUI
Once you’ve downloaded the workflow file, launch ComfyUI and simply drag and drop the JSON file onto the canvas to load the full setup. This prepares the environment for generating multi-speaker conversations with cloned voices.
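If you'd like to see what a workflow file contains before dropping it onto the canvas, a few lines of Python can list its node types. This sketch assumes the file uses the layout ComfyUI writes when you save a workflow from the interface (a top-level "nodes" list whose entries have a "type" field). The function names are made up for this example, and files in another layout simply produce an empty result.

```python
import json
from collections import Counter
from pathlib import Path


def node_types(workflow: dict) -> Counter:
    """Count node types in a workflow dict.

    Assumes a top-level "nodes" list whose items carry a "type" string.
    Any other layout yields an empty Counter.
    """
    nodes = workflow.get("nodes", [])
    return Counter(n["type"] for n in nodes if isinstance(n, dict) and "type" in n)


def summarize(path: str) -> Counter:
    """Load a workflow JSON file and return its node-type counts."""
    workflow = json.loads(Path(path).read_text(encoding="utf-8"))
    return node_types(workflow)
```

Calling `summarize("vibevoice_workflow.json")` (a hypothetical file name) returns a count per node type. Any type you don't recognize is a likely candidate for a missing custom node.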
Step 3: Install Missing Custom Nodes
If you see red outlines around nodes in your ComfyUI VibeVoice workflow, it means some custom nodes are missing. Follow these steps to resolve the issue:
- Open ComfyUI
  Launch your ComfyUI interface in your browser.
- Access the Node Manager
  Click the "Manager" button in the top right corner.
- Install Missing Custom Nodes
  In the Node Manager window, click "Install missing custom nodes". This scans your setup for any missing components.
- Locate and Install "ComfyUI-VibeVoice"
  Find the node named "ComfyUI-VibeVoice" in the list, click the Install button, and choose version 1.2.0. At the time of writing, version 1.2.0 was the stable release.
- Restart the Server
  After installation, click the Restart button in the bottom left corner and confirm the action.
- Wait for Reconnection
  A "Reconnecting" popup will appear in the top right. Wait until the server has fully rebooted.
- Refresh Your Browser
  Once the server is back online, refresh the browser tab.
The red outlines should now be gone, which means every node in the workflow is installed and working. Your workflow is ready to generate multi-speaker audio with voice cloning using VibeVoice.
4. Configuring Microsoft VibeVoice TTS Settings
Before entering your voice script or adjusting parameters, the first step is to load the reference voices you want to clone. VibeVoice performs voice conversion (zero-shot voice cloning), so each speaker requires a reference audio file to replicate their voice accurately.
Step 1: Load Reference Voices
In this example, we’ll create a podcast-style conversation with two speakers. For demonstration, we’ll use the voices of Joe Rogan and Gordon Ramsay as reference audio.
Joe Rogan:
Gordon Ramsay:
- Use ComfyUI’s Load Audio node to load each reference file.
- Connect each Load Audio node’s output to the corresponding speaker input on the VibeVoice TTS node: speaker_1_voice for Speaker 1 and speaker_2_voice for Speaker 2.
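Before wiring a reference clip into the workflow, it can help to confirm that the file opens and to check its basic properties. The sketch below uses Python's standard `wave` module, so it only reads uncompressed WAV files. The function name is illustrative, and nothing here reflects a requirement stated by VibeVoice.

```python
import wave


def wav_info(path: str) -> dict:
    """Return channel count, sample rate, and duration of a PCM WAV file."""
    with wave.open(path, "rb") as wav:
        frames = wav.getnframes()
        rate = wav.getframerate()
        return {
            "channels": wav.getnchannels(),
            "sample_rate": rate,
            "seconds": frames / rate,
        }
```

A clip that fails to open here, or reports a duration of a fraction of a second, is worth replacing before you spend time on a full generation run.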
Step 2: Select the Model
For this example, we’ll select the VibeVoice-Large model. It will be automatically downloaded on the first run and requires approximately 18GB of disk space.
⚠️ Note: Microsoft has deleted their Hugging Face repo for the VibeVoice-Large model. As of this writing, the 1.5B model is still available, so use that one for model selection.
Step 3: Enter Your Script
Write your dialogue in the text input of the VibeVoice TTS node. Each line must begin with Speaker 1: or Speaker 2: to indicate which voice should speak. For example:
Speaker 1: Welcome to today’s podcast, everybody! We’re talking cooking, chaos, and probably a little bit of insanity today.
Speaker 2: Insanity?! Rogan, my kitchen looks like a crime scene! Who let this numpty touch my knives?!
Speaker 1: Haha, alright, calm down. Let’s take it one step at a time.
Speaker 2: ONE STEP AT A TIME?! The pasta is raw, the sauce is like wallpaper paste, and that chicken… I swear it’s mocking me, you donut!
Speaker 1: Okay, okay… so first up, we’re talking sauces. What’s your top tip?
Speaker 2: Tip? Listen, if you don’t taste it, season it, and STOP stirring like a plonker, it’s ruined! This sauce could sink a ship, you twat!
Speaker 1: Haha, fair enough. Let’s try something simple. How about an omelet?
Speaker 2: OMELET?! Rogan, look at these eggs—they’re more scrambled than your brain on a Monday morning! Sort it out, you absolute donut!
Speaker 1: Alright, maybe we should focus on flavor combinations…
Speaker 2: FLAVOR?! If flavor doesn’t hit you like a brick, it’s useless! Add salt, add spice, add common sense, you twat!
Speaker 1: Haha, okay… listeners, Gordon’s really passionate about cooking. Anything else before we dive in?
Speaker 2: If you ruin one more dish, I’ll shove this frying pan so far up your backside, you’ll taste it for a week, you freaking twat!
⚠️ Note: Avoid leaving blank lines between speakers in your script, as this can sometimes generate unintended sound effects or audio glitches.
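The two formatting rules above (every line starts with a speaker prefix, and no blank lines) are easy to check automatically before pasting a long script into the node. Here is a minimal sketch. The function name and messages are made up for this example, and the upper bound of four speakers comes from the limit mentioned in the introduction.

```python
import re

MAX_SPEAKERS = 4  # the guide states VibeVoice supports up to four speakers
LINE_PATTERN = re.compile(r"^Speaker (\d+):")


def check_script(script: str) -> list[str]:
    """Return a list of problems found; an empty list means none were found."""
    problems = []
    # Ignore a trailing newline so it is not reported as a blank line.
    lines = script.rstrip("\n").splitlines()
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            problems.append(f"line {number}: blank line")
            continue
        match = LINE_PATTERN.match(line)
        if match is None:
            problems.append(f"line {number}: missing 'Speaker N:' prefix")
        elif not 1 <= int(match.group(1)) <= MAX_SPEAKERS:
            problems.append(f"line {number}: speaker {match.group(1)} out of range")
    return problems
```

Running `check_script` on your dialogue and fixing anything it reports takes a second, which beats discovering a stray blank line after a long generation.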
Step 4: Configure Settings and Run
The table below shows the settings that work best for me when generating multi-speaker audio. Feel free to experiment with different values to find what suits your workflow and hardware.
Parameter | Value |
---|---|
quantize_llm_4bit | Full Precision |
attention_mode | sdpa |
cfg_scale | 2.00 |
inference_steps | 30 |
control_after_generate | randomize |
do_sample | Enabled |
temperature | 0.3 |
top_p | 0 |
force_offload | Keep in VRAM |
Once everything is connected and configured, let’s hit RUN. The node will generate a single audio file, combining the cloned voices of both speakers into a lively, podcast-style conversation.
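If you experiment with different values, it helps to keep track of what you changed relative to the baseline. The sketch below copies the table into a plain dictionary (values are written as they appear in the table, so UI labels stay as strings) and reports the differences. The helper is hypothetical bookkeeping for your own notes and does not talk to ComfyUI.

```python
# Baseline copied from the settings table; UI labels are kept as strings.
RECOMMENDED = {
    "quantize_llm_4bit": "Full Precision",
    "attention_mode": "sdpa",
    "cfg_scale": 2.0,
    "inference_steps": 30,
    "control_after_generate": "randomize",
    "do_sample": "Enabled",
    "temperature": 0.3,
    "top_p": 0,
    "force_offload": "Keep in VRAM",
}


def diff_settings(current: dict, baseline: dict = RECOMMENDED) -> dict:
    """Map each changed parameter to a (baseline, current) pair."""
    return {
        key: (baseline.get(key), value)
        for key, value in current.items()
        if baseline.get(key) != value
    }
```

For example, `diff_settings(dict(RECOMMENDED, cfg_scale=1.5))` reports only the `cfg_scale` change, which makes it easier to note which tweak produced which result.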
Final Result
Below you’ll find the generated audio file showcasing the full multi-speaker conversation with the cloned voices of Joe Rogan and Gordon Ramsay:
5. Conclusion
In conclusion, ComfyUI-VibeVoice offers a powerful and intuitive platform for generating multi-speaker conversations using voice cloning. By leveraging reference audio, you can replicate the unique timbre of each speaker, making it perfect for podcasts, storytelling, or any dialogue-driven audio project. This tutorial has guided you through loading reference voices, setting up the workflow, configuring the VibeVoice TTS node, and generating your first lively, multi-speaker conversation.
Remember, VibeVoice is built for voice conversion, so it always needs reference audio for each speaker. If there aren’t enough distinct speakers or if the script is sparse, the generated audio may sometimes include unexpected noises or sound effects that don’t match the dialogue. Even with this occasional quirk, VibeVoice is still a powerful and reliable tool for creating expressive, multi-speaker audio content.
Depending on the model you choose, you can generate different lengths of audio: the VibeVoice-1.5B model supports up to ~90 minutes of generated audio, while the VibeVoice-Large model (7B) supports up to ~45 minutes.
We encourage you to experiment with the various parameters and settings in VibeVoice to discover the full potential of this tool. Bring your text-based conversations to life, create dynamic multi-speaker audio, and have fun exploring expressive voice cloning with Microsoft VibeVoice in ComfyUI. Happy audio generating!