Multi-Speaker Audio Generation with Microsoft VibeVoice (ComfyUI)

September 3, 2025

Generate natural multi-speaker conversations with VibeVoice in ComfyUI. Step-by-step guide for setup, workflow, and voice cloning with Microsoft VibeVoice.

1. Introduction
2. Requirements for Microsoft VibeVoice in ComfyUI
3. Downloading and Loading the VibeVoice Workflow for ComfyUI
4. Configuring Microsoft VibeVoice TTS Settings
5. Conclusion

1. Introduction

This guide shows how to generate multi-speaker conversational audio with ComfyUI-VibeVoice. It covers Microsoft’s VibeVoice, a model designed for expressive, long-form audio conversations. Unlike text-to-speech systems with a fixed set of built-in voices, VibeVoice performs zero-shot voice cloning: each speaker’s timbre is taken from a reference audio clip you supply. That makes it well suited to natural-sounding dialogue for podcasts, storytelling, and other audio applications.

VibeVoice supports up to four distinct speakers, enabling dynamic and engaging multi-voice conversations. Using the ComfyUI workflow, you can seamlessly convert your text scripts into realistic speech with each speaker’s voice faithfully cloned from reference audio files.

Throughout this tutorial, we’ll guide you through setting up ComfyUI-VibeVoice, connecting reference audio for each speaker, configuring the voice conversion node, and creating your first multi-speaker conversation. By the end, you’ll have all the tools needed to bring your text-based dialogues to life with cloned voices.

2. Requirements for Microsoft VibeVoice in ComfyUI

Before setting up ComfyUI-VibeVoice, make sure your system meets the requirements below. Because VibeVoice clones each voice from reference audio, you will also need one reference audio file for every speaker in your conversation.

Requirement 1: ComfyUI Installed

To use VibeVoice in ComfyUI, you first need a working ComfyUI installation. If you haven’t set it up yet, follow the detailed instructions below for a local Windows installation or use a cloud-based solution like RunPod for faster performance.

Option 1: Local Installation

👉 How to Install ComfyUI Locally on Windows?

Option 2: Cloud-Based GPU (RunPod)

👉 How to Run ComfyUI on RunPod with Network Volume

Requirement 2: Update ComfyUI

To ensure full compatibility with VibeVoice, make sure your ComfyUI installation is up to date.

For Windows Portable Users:

Open the folder:
...\ComfyUI_windows_portable\update
Double-click update_comfyui.bat.

For RunPod Users:

From the RunPod terminal, run the following command:

cd /workspace/ComfyUI && git pull origin master && pip install -r requirements.txt && cd /workspace

Keeping ComfyUI updated gives you the latest features, nodes, bug fixes, and compatibility improvements.

Requirement 3: VibeVoice Models

Based on your selection, the correct VibeVoice model will be automatically downloaded the first time you run the node in ComfyUI. For reference, the table below lists the supported models.

Model	Context Length	Generation Length	Weight
VibeVoice-1.5B	64K	~90 min	HF link
VibeVoice-Large	32K	~45 min	HF link

These models are automatically managed by ComfyUI-VibeVoice, ensuring a smooth setup for generating multi-speaker conversational audio.
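To make the trade-off in the table concrete, here is a minimal Python sketch that lists which models cover a target audio length. The limits are copied from the table above; the `MODELS` dictionary and the `models_for_duration` helper are our own illustration and are not part of ComfyUI-VibeVoice.

```python
# Approximate limits copied from the table above.
# The dictionary layout and helper name are illustrative assumptions.
MODELS = {
    "VibeVoice-1.5B": {"context": "64K", "max_minutes": 90},
    "VibeVoice-Large": {"context": "32K", "max_minutes": 45},
}


def models_for_duration(minutes: float) -> list[str]:
    """Return the models whose approximate generation limit covers `minutes`."""
    return [name for name, spec in MODELS.items() if minutes <= spec["max_minutes"]]


print(models_for_duration(30))  # both models cover a 30-minute episode
print(models_for_duration(60))  # only the 1.5B model reaches 60 minutes
```

Since the limits are approximate ("~90 min", "~45 min"), treat the result as a rough guide rather than a hard guarantee.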

3. Downloading and Loading the VibeVoice Workflow for ComfyUI

With the requirements in place, the next step is to download and load the VibeVoice workflow into ComfyUI. This workflow is prepared to simplify setup and ensure the model operates correctly.

Step 1: Download the Workflow File

Begin by downloading the workflow file specifically designed for VibeVoice. This file contains all the necessary configurations and nodes required for multi-speaker voice cloning. You can find the download link below:

👉 Download Microsoft VibeVoice Multi-Speaker Voice Conversion Workflow JSON

Step 2: Load the Workflow in ComfyUI

Once you’ve downloaded the workflow file, launch ComfyUI and simply drag and drop the JSON file onto the canvas to load the full setup. This prepares the environment for generating multi-speaker conversations with cloned voices.

Step 3: Install Missing Custom Nodes

If you see red outlines around nodes in your ComfyUI VibeVoice workflow, it means some custom nodes are missing. Follow these steps to resolve the issue:

1. Open ComfyUI: Launch your ComfyUI interface in your browser.
2. Access the Node Manager: Navigate to the top right corner and click on the "Manager" button.
3. Install Missing Custom Nodes: In the Node Manager window, click on "Install missing custom nodes". This will scan your setup for any missing components.
4. Locate and Install “ComfyUI-VibeVoice”: Locate the node named "ComfyUI-VibeVoice" in the list, then click the Install button.
5. Restart the Server: After installation, click the Restart button in the bottom left corner and confirm the action.
6. Wait for Reconnection: A “Reconnecting” popup will appear in the top right. Wait until the server fully reboots.
7. Refresh Your Browser: Once the server is back online, refresh the browser tab.

The red outlines should now be gone, and your workflow is ready to generate multi-speaker audio with voice cloning using VibeVoice.

4. Configuring Microsoft VibeVoice TTS Settings

Before entering your script or adjusting parameters, load the reference voices you want to clone. VibeVoice clones voices zero-shot, so each speaker needs a reference audio file for the model to replicate their voice accurately.

Step 1: Load Reference Voices

In this example, we’ll create a podcast-style conversation with two speakers, using the voices of Joe Rogan and Gordon Ramsay as reference audio.

Use ComfyUI’s Load Audio node to load each reference file.
Connect each Load Audio node’s output to the corresponding speaker input in the VibeVoice TTS node—speaker_1_voice for Speaker 1 and speaker_2_voice for Speaker 2.
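Before wiring a clip into the workflow, it can help to confirm what the file actually contains. The sketch below uses Python’s standard `wave` module to report the channel count, sample rate, and duration of a WAV file. It is only a convenience check of our own: the `wav_info` helper is hypothetical, it handles uncompressed WAV only, and the demo writes two seconds of silence at an arbitrary sample rate just so the example runs on its own.

```python
import os
import tempfile
import wave


def wav_info(path: str) -> dict:
    """Read basic properties of a PCM WAV file using the standard library."""
    with wave.open(path, "rb") as wav:
        frames = wav.getnframes()
        rate = wav.getframerate()
        return {
            "channels": wav.getnchannels(),
            "sample_rate": rate,
            "seconds": frames / rate,
        }


# Demo: write a short silent clip so the example is self-contained.
# In practice you would pass the path of your own reference recording.
demo_path = os.path.join(tempfile.mkdtemp(), "reference_demo.wav")
with wave.open(demo_path, "wb") as out:
    out.setnchannels(1)        # mono
    out.setsampwidth(2)        # 16-bit samples
    out.setframerate(16000)    # arbitrary rate chosen for the demo
    out.writeframes(b"\x00\x00" * 16000 * 2)  # two seconds of silence

print(wav_info(demo_path))
```

Other formats accepted by the Load Audio node would need a different reader; this sketch does not cover them.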

Step 2: Select the Model

For this example, we’ll select the VibeVoice-Large model. It will be automatically downloaded on the first run and requires approximately 18GB of disk space.
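Because the download is large, a quick free-space check before the first run can save a failed download. This is a minimal sketch using the standard `shutil.disk_usage` call; the 18 GB figure is the approximate size quoted above, and the `"."` path is a placeholder assumption that you should point at the drive holding your ComfyUI installation.

```python
import shutil

REQUIRED_GB = 18  # approximate size quoted above for VibeVoice-Large


def has_room(free_bytes: int, required_gb: float = REQUIRED_GB) -> bool:
    """Return True if `free_bytes` is at least `required_gb` gibibytes."""
    return free_bytes >= required_gb * 1024**3


# "." is a placeholder; use the drive that holds your ComfyUI installation.
free = shutil.disk_usage(".").free
print(f"Free: {free / 1024**3:.1f} GiB, enough for the model: {has_room(free)}")
```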

Step 3: Enter Your Script

Write your dialogue in the text input of the VibeVoice TTS node. Each line must begin with Speaker 1: or Speaker 2: to indicate which voice should speak. For example:

Speaker 1: Welcome to today’s podcast, everybody! We’re talking cooking, chaos, and probably a little bit of insanity today.

Speaker 2: Insanity?! Rogan, my kitchen looks like a crime scene! Who let this numpty touch my knives?!

Speaker 1: Haha, alright, calm down. Let’s take it one step at a time.

Speaker 2: ONE STEP AT A TIME?! The pasta is raw, the sauce is like wallpaper paste, and that chicken… I swear it’s mocking me, you donut!

Speaker 1: Okay, okay… so first up, we’re talking sauces. What’s your top tip?

Speaker 2: Tip? Listen, if you don’t taste it, season it, and STOP stirring like a plonker, it’s ruined! This sauce could sink a ship, you twat!

Speaker 1: Haha, fair enough. Let’s try something simple. How about an omelet?

Speaker 2: OMELET?! Rogan, look at these eggs, they’re more scrambled than your brain on a Monday morning! Sort it out, you absolute donut!

Speaker 1: Alright, maybe we should focus on flavor combinations…

Speaker 2: FLAVOR?! If flavor doesn’t hit you like a brick, it’s useless! Add salt, add spice, add common sense, you twat!

Speaker 1: Haha, okay… listeners, Gordon’s really passionate about cooking. Anything else before we dive in?

Speaker 2: If you ruin one more dish, I’ll shove this frying pan so far up your backside, you’ll taste it for a week, you freaking twat!
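A mistyped prefix is easy to miss in a long script, so it can be worth checking the text before pasting it into the node. The sketch below is our own pre-flight check, not the node’s actual parser: it assumes every non-empty line must start with `Speaker N:` where N is 1 to 4 (the four-speaker limit mentioned in the introduction) and that blank lines between turns are allowed, as in the example above. The `parse_script` name is hypothetical.

```python
import re

# "Speaker N:" followed by some text, with N limited to the four supported speakers.
SPEAKER_LINE = re.compile(r"^Speaker ([1-4]):\s*(.+)$")


def parse_script(script: str) -> list[tuple[int, str]]:
    """Return (speaker_number, text) pairs; raise ValueError on a malformed line."""
    turns = []
    for number, line in enumerate(script.splitlines(), start=1):
        line = line.strip()
        if not line:
            continue  # blank lines between turns are skipped
        match = SPEAKER_LINE.match(line)
        if match is None:
            raise ValueError(f"line {number} does not start with 'Speaker N:': {line!r}")
        turns.append((int(match.group(1)), match.group(2)))
    return turns


example = """Speaker 1: Welcome to today's podcast, everybody!

Speaker 2: Insanity?! Rogan, my kitchen looks like a crime scene!"""

turns = parse_script(example)
print(len(turns), "turns from speakers", sorted({speaker for speaker, _ in turns}))
```

The set of speaker numbers it reports also tells you how many reference voices you need to connect.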

Step 4: Configure Settings and Run

The table below shows the settings that work best for me when generating multi-speaker audio. Feel free to experiment with different values to find what suits your workflow and hardware.

Parameter	Value
quantize_llm_4bit	Full Precision
attention_mode	sdpa
cfg_scale	2.00
inference_steps	30
control_after_generate	randomize
do_sample	Enabled
temperature	0.3
top_p	0
force_offload	Keep in VRAM
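If you like to keep your settings under version control or share them with collaborators, one option is to record them as plain data. The sketch below stores the values from the table in a Python dictionary and runs a few basic sanity checks. The parameter names and values come from the table; the range checks in the hypothetical `check_settings` helper are our own assumptions about what counts as a sensible value and are not rules enforced by the node.

```python
# Values copied from the table above; UI toggles are recorded by their labels.
SETTINGS = {
    "quantize_llm_4bit": "Full Precision",
    "attention_mode": "sdpa",
    "cfg_scale": 2.0,
    "inference_steps": 30,
    "control_after_generate": "randomize",
    "do_sample": "Enabled",
    "temperature": 0.3,
    "top_p": 0,
    "force_offload": "Keep in VRAM",
}


def check_settings(settings: dict) -> list[str]:
    """Return a list of readable problems; an empty list means nothing looks off."""
    problems = []
    if settings["cfg_scale"] <= 0:
        problems.append("cfg_scale should be positive")
    steps = settings["inference_steps"]
    if not isinstance(steps, int) or steps < 1:
        problems.append("inference_steps should be a positive integer")
    if settings["temperature"] <= 0:
        problems.append("temperature should be positive")
    if not 0 <= settings["top_p"] <= 1:
        problems.append("top_p should be between 0 and 1")
    return problems


print(check_settings(SETTINGS))
```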

Once everything is connected and configured, let’s hit RUN. The node will generate a single audio file, combining the cloned voices of both speakers into a lively, podcast-style conversation.

Final Result

Below you’ll find the generated audio file showcasing the full multi-speaker conversation with the cloned voices of Joe Rogan and Gordon Ramsay:


5. Conclusion

In conclusion, ComfyUI-VibeVoice offers a powerful and intuitive platform for generating multi-speaker conversations using voice cloning. By leveraging reference audio, you can replicate the unique timbre of each speaker, making it perfect for podcasts, storytelling, or any dialogue-driven audio project. This tutorial has guided you through loading reference voices, setting up the workflow, configuring the VibeVoice TTS node, and generating your first lively, multi-speaker conversation.

Remember that VibeVoice clones each voice from reference audio, so every speaker in this workflow needs a reference clip. If there aren’t enough distinct speakers or if the script is sparse, the generated audio may sometimes include unexpected noises or sound effects that don’t match the dialogue. Even with this occasional quirk, VibeVoice is still a powerful and reliable tool for creating expressive, multi-speaker audio content.

Depending on the model you choose, you can generate different lengths of audio: the VibeVoice-1.5B model supports up to ~90 minutes of generated audio, while the VibeVoice-Large model (7B) supports up to ~45 minutes.

We encourage you to experiment with the various parameters and settings in VibeVoice to discover the full potential of this tool. Bring your text-based conversations to life, create dynamic multi-speaker audio, and have fun exploring expressive voice cloning with Microsoft VibeVoice in ComfyUI. Happy audio generating!
