How to Create Video-to-Video Lip Sync with InfiniteTalk in ComfyUI

October 15, 2025
ComfyUI
Learn how to create realistic video-to-video lip-sync using InfiniteTalk in ComfyUI. Step-by-step guide for both high and low VRAM workflows, perfect for creators looking to bring their videos to life.

1. Introduction

In this tutorial, you’ll learn how to create realistic video-to-video lip-sync using InfiniteTalk in ComfyUI. This workflow preserves the original video’s motion while syncing the mouth movement to your talking audio, so the footage looks as if it was filmed with that speech. We’ll cover both high-VRAM FP8 and low-VRAM GGUF workflows, so you can produce expressive, talking videos on a wide range of hardware. By the end, you’ll be able to turn any video clip into a smooth, high-quality, lip-synced video, ideal for AI characters, tutorials, social media clips, or creative storytelling.

2. System Requirements for InfiniteTalk V2V Workflow

Before generating video-to-video lip-sync, ensure your system meets the hardware and software requirements to run the InfiniteTalk V2V workflow smoothly inside ComfyUI. This setup benefits from a strong GPU for faster processing — we recommend at least an RTX 4090 (24GB VRAM) or using a cloud GPU provider like RunPod for optimal performance.

Requirement 1: ComfyUI Installed & Updated

To get started, you’ll need ComfyUI installed either locally or through the cloud. For a local Windows setup, follow this guide:

👉 How to Install ComfyUI Locally on Windows

Once installed, open the Manager tab in ComfyUI and click “Update ComfyUI” to make sure you’re on the latest version. Keeping ComfyUI updated ensures compatibility with new workflows, models, and nodes used by InfiniteTalk.

While InfiniteTalk V2V can run locally, we highly recommend using our Next Diffusion - ComfyUI SageAttention template on RunPod for this workflow. Here’s why:

  • Sage Attention & Triton Acceleration — both are pre-installed and optimized in the RunPod template, significantly boosting generation speed and VRAM efficiency.

  • Plug-and-Play Setup — no manual CUDA or PyTorch dependencies to configure; everything is ready to run InfiniteTalk out-of-the-box.

  • Persistent Storage with Network Volume — saves your models and workflows so you don’t have to re-download or set up nodes each session.

You can spin up a ready-to-use ComfyUI instance in just a few minutes using the RunPod template below:

👉 How to Run ComfyUI on RunPod with Network Volume

Requirement 2: Download InfiniteTalk V2V Model Files

The InfiniteTalk Lip-Sync Workflow relies on a combination of WAN-based I2V diffusion models, audio feature encoders, and LoRA refinements to generate realistic lip-synced motion that follows your source footage.

Download each of the following models and place them in their respective ComfyUI model directories exactly as listed below.

| File Name | Hugging Face Download Page | File Directory |
| --- | --- | --- |
| Wan2_1-InfiniteTalk-Single_fp8_e4m3fn_scaled_KJ.safetensors | 🤗 Download | ..\ComfyUI\models\diffusion_models |
| Wan2_1-I2V-14B-480P_fp8_e4m3fn.safetensors | 🤗 Download | ..\ComfyUI\models\diffusion_models |
| MelBandRoformer_fp16.safetensors | 🤗 Download | ..\ComfyUI\models\diffusion_models |
| Wan2_1_VAE_bf16.safetensors | 🤗 Download | ..\ComfyUI\models\vae |
| clip_vision_h.safetensors | 🤗 Download | ..\ComfyUI\models\clip_vision |
| umt5-xxl-enc-bf16.safetensors | 🤗 Download | ..\ComfyUI\models\text_encoders |
| lightx2v_I2V_14B_480p_cfg_step_distill_rank64_bf16.safetensors | 🤗 Download | ..\ComfyUI\models\loras |

Once all models are downloaded and placed in the proper folders, ComfyUI will automatically recognize them at startup. This ensures InfiniteTalk’s V2V and audio-processing nodes load correctly for smooth lip-sync generation.
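If you prefer scripting the downloads instead of fetching each file by hand, the sketch below uses the huggingface_hub Python package to place every file in its target folder. The repo IDs are placeholders, not the real repository names; substitute the repositories linked in the download column of the table above.

```python
# Sketch: download the InfiniteTalk V2V model files into the ComfyUI folders.
# NOTE: the repo_id values are placeholders - replace them with the actual
# Hugging Face repositories linked in the table above.
from pathlib import Path
from huggingface_hub import hf_hub_download

COMFYUI_MODELS = Path("ComfyUI/models")

# (repo_id placeholder, filename, target subfolder)
FILES = [
    ("<repo-with-infinitetalk>", "Wan2_1-InfiniteTalk-Single_fp8_e4m3fn_scaled_KJ.safetensors", "diffusion_models"),
    ("<repo-with-wan-i2v>", "Wan2_1-I2V-14B-480P_fp8_e4m3fn.safetensors", "diffusion_models"),
    ("<repo-with-melband>", "MelBandRoformer_fp16.safetensors", "diffusion_models"),
    ("<repo-with-wan-vae>", "Wan2_1_VAE_bf16.safetensors", "vae"),
    ("<repo-with-clip-vision>", "clip_vision_h.safetensors", "clip_vision"),
    ("<repo-with-umt5>", "umt5-xxl-enc-bf16.safetensors", "text_encoders"),
    ("<repo-with-lightx2v-lora>", "lightx2v_I2V_14B_480p_cfg_step_distill_rank64_bf16.safetensors", "loras"),
]

for repo_id, filename, subfolder in FILES:
    target_dir = COMFYUI_MODELS / subfolder
    target_dir.mkdir(parents=True, exist_ok=True)
    # Downloads (or reuses a cached copy) and places the file in target_dir.
    hf_hub_download(repo_id=repo_id, filename=filename, local_dir=target_dir)
    print(f"{filename} -> {target_dir}")
```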

Requirement 3: Verify Folder Structure

Before running the InfiniteTalk Lip-Sync Workflow, make sure all downloaded model files are placed in the correct ComfyUI subfolders. Your folder structure should look exactly like this:

```
📁 ComfyUI/
└── 📁 models/
    ├── 📁 diffusion_models/
    │   ├── Wan2_1-InfiniteTalk-Single_fp8_e4m3fn_scaled_KJ.safetensors
    │   ├── Wan2_1-I2V-14B-480P_fp8_e4m3fn.safetensors
    │   └── MelBandRoformer_fp16.safetensors
    ├── 📁 vae/
    │   └── Wan2_1_VAE_bf16.safetensors
    ├── 📁 clip_vision/
    │   └── clip_vision_h.safetensors
    ├── 📁 text_encoders/
    │   └── umt5-xxl-enc-bf16.safetensors
    └── 📁 loras/
        └── lightx2v_I2V_14B_480p_cfg_step_distill_rank64_bf16.safetensors
```
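Before launching ComfyUI, you can also run a quick check that every file landed where the workflow expects it. This is a minimal sketch that assumes ComfyUI sits in the current working directory; adjust MODELS_DIR otherwise.

```python
# Sketch: verify that every model file sits in the expected ComfyUI subfolder.
from pathlib import Path

MODELS_DIR = Path("ComfyUI/models")  # adjust if ComfyUI lives elsewhere

EXPECTED = {
    "diffusion_models": [
        "Wan2_1-InfiniteTalk-Single_fp8_e4m3fn_scaled_KJ.safetensors",
        "Wan2_1-I2V-14B-480P_fp8_e4m3fn.safetensors",
        "MelBandRoformer_fp16.safetensors",
    ],
    "vae": ["Wan2_1_VAE_bf16.safetensors"],
    "clip_vision": ["clip_vision_h.safetensors"],
    "text_encoders": ["umt5-xxl-enc-bf16.safetensors"],
    "loras": ["lightx2v_I2V_14B_480p_cfg_step_distill_rank64_bf16.safetensors"],
}

missing = [
    f"{subfolder}/{name}"
    for subfolder, names in EXPECTED.items()
    for name in names
    if not (MODELS_DIR / subfolder / name).is_file()
]

print("All model files found." if not missing else "Missing files:\n" + "\n".join(missing))
```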

Once everything is installed and organized, you’re ready to download and load the InfiniteTalk V2V FP8 Lip-Sync Workflow in ComfyUI and start generating smooth, expressive talking videos from existing footage.

Thanks to the LightX2V LoRA, the entire diffusion process completes in just 4 sampling steps, delivering natural lip movement, tight audio synchronization, and high visual fidelity at impressive speed.

3. Download & Load the InfiniteTalk V2V FP8 Lip-Sync Workflow

Now that your environment and model files are set up, it’s time to load and configure the InfiniteTalk V2V FP8 Lip-Sync Workflow in ComfyUI. This setup ensures all model connections, encoders, and audio-processing nodes work together seamlessly for high-quality lip-synced video generation that follows your original footage and applies talking audio. Once configured, you’ll be ready to bring any video clip to life with natural, expressive speech movement powered by InfiniteTalk.

Load the InfiniteTalk V2V FP8 Workflow JSON File

👉 Download the InfiniteTalk V2V FP8 Lip-Sync Workflow JSON file and drag it directly into your ComfyUI canvas.

This workflow comes fully pre-arranged with all essential nodes, model references, and audio synchronization components required for realistic, frame-accurate lip-sync generation from your existing video footage.

Install Missing Nodes

If any nodes appear in red, that means certain custom nodes are missing.

To fix this:

  1. Open the Manager tab in ComfyUI.

  2. Click Install Missing Custom Nodes.

  3. After installation, restart ComfyUI to apply the changes.

This will ensure all InfiniteTalk V2V and MelBandRoFormer nodes are properly installed and ready to process your video and audio seamlessly.

Once all nodes load correctly and your workflow opens without errors, you’re ready to upload your source video clip and the corresponding audio file into the designated nodes.

InfiniteTalk will automatically analyze the audio, map phoneme timings, and drive the mouth movements to match the speech while preserving the original video’s motion, creating an impressively natural, lip-synced result that feels alive and expressive.

4. Running the Lip-Sync Video Generation Workflow

With the workflow loaded and all components in place, it’s time to generate your first lip-synced video using InfiniteTalk V2V in ComfyUI. This section will walk you through setting up your source video, audio, and output parameters — then running the workflow to produce smooth, expressive results.

Upload Your Video

Start by loading your input video clip into the Video Loader node. This video will serve as the base for motion and facial tracking, while InfiniteTalk automatically applies talking audio perfectly synced to the footage.

👉 Tip: Use a video with a clear, front-facing subject and consistent lighting for the most natural lip-sync performance. Higher resolution videos capture better detail for facial movements and expressions.
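If you are unsure about your clip’s properties, a quick probe helps you plan the resize and frame settings in the next steps. The sketch below uses OpenCV (one option among many; ffprobe works just as well) and assumes your clip is named input_video.mp4.

```python
# Sketch: inspect the source clip's resolution, fps, and duration with OpenCV.
import cv2

cap = cv2.VideoCapture("input_video.mp4")  # path to your source clip
if not cap.isOpened():
    raise SystemExit("Could not open input_video.mp4")

width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
height = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
fps = cap.get(cv2.CAP_PROP_FPS)
frame_count = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
cap.release()

print(f"Resolution: {width}x{height}")
print(f"FPS: {fps:.2f}, frames: {frame_count}")
print(f"Duration: {frame_count / fps:.2f} s")
```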

Set Video Dimensions and Aspect Ratio

Next, define your video’s output size and aspect ratio in the Resize Video node. Preserve the original aspect ratio of your input video to ensure the lip-sync mapping stays accurate without stretching or cropping. For vertical portrait footage, a 9:16 aspect ratio with 480p vertical resolution works well.

| Setting | Value | Notes |
| --- | --- | --- |
| Width | 480 | Matches vertical portrait input |
| Height | 832 | Preserves 9:16 aspect ratio |
| Aspect Ratio | 9:16 | Keeps the subject’s proportions intact |
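If your source footage is not exactly 9:16, it is worth checking how far its aspect ratio is from the preset above before resizing, since a large mismatch will stretch the face and hurt lip-sync accuracy. A minimal sketch:

```python
# Sketch: sanity-check that a preset output size matches the source aspect ratio
# closely enough that the Resize Video node won't noticeably stretch the subject.
def aspect_mismatch(src_w: int, src_h: int, out_w: int, out_h: int) -> float:
    """Return the relative difference between the source and output aspect ratios."""
    src_ratio = src_w / src_h
    out_ratio = out_w / out_h
    return abs(src_ratio - out_ratio) / src_ratio

# Example: 1080x1920 phone footage vs. the 480x832 preset from the table above.
mismatch = aspect_mismatch(1080, 1920, 480, 832)
print(f"Aspect ratio mismatch: {mismatch:.1%}")  # ~2.6%, small enough to avoid visible distortion
```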

Load Your Audio File

Import your audio clip into the Audio Loader node. InfiniteTalk will analyze the audio, extracting phoneme timings, amplitude, and rhythm patterns to drive accurate mouth and jaw motion frame-by-frame.


Set Frame Length and Number of Frames

In the Multi/InfiniteTalk Wav2vec2 Embeds Node, set the num_frames parameter. This controls how many frames the workflow will generate, directly determining the video length. Inside this node, you can also tweak audio_scale to strengthen the lip-sync effect — higher values make mouth movements more pronounced, while lower values make them subtler.

For testing, start with shorter clips and increase num_frames for longer sequences once you’re comfortable. At 25 fps, the following num_frames values correspond roughly to these audio durations:

| Audio Length | Num Frames |
| --- | --- |
| 4 seconds | 100 |
| 8 seconds | 200 |
| 10 seconds | 250 |

👉 Tip: If your audio is longer than the num_frames value, it will be trimmed at that limit. Adjust num_frames to match or slightly exceed your audio length to capture the full clip.
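To pick num_frames for a specific audio clip, you can read the clip’s duration and convert it at 25 fps, rounding up so nothing gets trimmed. A minimal sketch assuming a standard WAV file named talking_audio.wav:

```python
# Sketch: derive num_frames from the audio length at 25 fps.
import math
import wave

FPS = 25  # the frame rate used in the table above

with wave.open("talking_audio.wav", "rb") as wav:
    duration_s = wav.getnframes() / wav.getframerate()

# Round up so the full audio clip fits inside the generated video.
num_frames = math.ceil(duration_s * FPS)
print(f"Audio: {duration_s:.2f} s -> num_frames: {num_frames}")
# e.g. an 8-second clip gives num_frames = 200, matching the table.
```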

Set a Text Prompt

After loading your audio, you can use the Text Encoder node to provide a prompt describing the scene or character. This can help guide subtle facial expressions or mood in the generated video. Because the workflow uses the LightX2V distill LoRA, the CFG scale is set to 1, so the prompt has only a subtle influence on the result. For example:

“A woman talks calmly and sensually”

Run the Generation

Once your video, audio, and num_frames are set, run the InfiniteTalk workflow. The system will process your source video and audio, generating a fully lip-synced video automatically. The result is smooth, natural animation with facial movements accurately following the speech, ready for social media, presentations, or creative projects.

This render took around 70 seconds on an RTX 4090 (24GB VRAM). For faster or larger renders, consider using a cloud GPU provider like RunPod.

5. Bonus: InfiniteTalk V2V GGUF Workflow (Low VRAM)

For users running on low-VRAM hardware, the GGUF variants of the InfiniteTalk and WAN2.1 I2V models offer a viable alternative for generating lip-synced videos locally. Instead of the FP8 diffusion models, you’ll need GGUF quantizations that fit your available VRAM.

You can download them here:

That’s the only extra requirement for this workflow — everything else stays the same.

Verify Folder Structure

Once downloaded, place the GGUF models in the diffusion_models folder. Your folder structure should look like this:

```
📁 ComfyUI/
└── 📁 models/
    ├── 📁 diffusion_models/
    │   ├── Wan2_1-InfiniteTalk_Single_Q4_K_M.gguf # Or any other GGUF version
    │   ├── wan2.1-i2v-14b-480p-Q4_K_M.gguf # Or any other GGUF version
    │   └── MelBandRoformer_fp16.safetensors
    ├── 📁 vae/
    │   └── Wan2_1_VAE_bf16.safetensors
    ├── 📁 clip_vision/
    │   └── clip_vision_h.safetensors
    ├── 📁 text_encoders/
    │   └── umt5-xxl-enc-bf16.safetensors
    └── 📁 loras/
        └── lightx2v_I2V_14B_480p_cfg_step_distill_rank64_bf16.safetensors
```

Once in place, ComfyUI will recognize them at startup, and you can run the InfiniteTalk workflow just like normal, even on lower-VRAM hardware.

Load the InfiniteTalk V2V GGUF Lip-Sync Workflow

Start by loading the InfiniteTalk V2V GGUF Lip-Sync Workflow into ComfyUI:

👉 Download the InfiniteTalk V2V GGUF Lip-Sync workflow JSON file and drag it directly into your ComfyUI canvas.

This workflow comes fully configured with all nodes and references required for smooth lip-synced animation generation. With the GGUF workflow set up, you’re ready to start generating talking animations locally, even on lower-VRAM hardware.

6. Conclusion

Congratulations! You’ve now learned how to create video-to-video lip-sync videos using InfiniteTalk in ComfyUI. From setting up your environment and models to configuring the workflow and generating smooth, expressive results, you can bring your footage to life with perfectly synced audio.

Whether using the high-VRAM FP8 or VRAM-friendly GGUF workflow, you now know how to produce professional-quality lip-synced videos efficiently. Experiment with your own video clips, audio files, and prompts to create engaging, lifelike talking videos — perfect for AI characters, tutorials, social media, or creative storytelling.

