Create Lip-Sync Videos from Images with InfiniteTalk in ComfyUI

October 5, 2025
ComfyUI
Learn how to create realistic lip-sync videos from images using InfiniteTalk in ComfyUI. Step-by-step guide for both high and low VRAM workflows for creators.

1. Introduction

In this tutorial, you’ll learn how to create realistic lip-sync videos from static images using InfiniteTalk in ComfyUI. Instead of manually animating facial movements, this workflow uses InfiniteTalk’s audio-driven Image-to-Video (I2V) system to generate natural speech motion perfectly synced to your audio. We’ll cover both high-VRAM FP8 and low-VRAM GGUF workflows, so you can produce expressive, talking avatars on a wide range of hardware. By the end, you’ll be able to turn any image into a smooth, high-quality, lip-synced talking video — ideal for AI characters, tutorials, social media clips, or creative storytelling.

2. System Requirements for InfiniteTalk I2V Workflow

Before generating lip-sync videos, ensure your system meets the hardware and software requirements to run the InfiniteTalk I2V workflow smoothly inside ComfyUI. This setup benefits from a strong GPU for faster inference — we recommend at least an RTX 4090 (24GB VRAM) or using a cloud GPU provider like RunPod for optimal performance.

Requirement 1: ComfyUI Installed & Updated

To get started, you’ll need ComfyUI installed either locally or through the cloud. For a local Windows setup, follow this guide:

👉 How to Install ComfyUI Locally on Windows

Once installed, open the Manager tab in ComfyUI and click “Update ComfyUI” to make sure you’re on the latest version. Keeping ComfyUI updated ensures compatibility with new workflows, models, and nodes used by InfiniteTalk.

While InfiniteTalk I2V can run locally, we highly recommend using our Next Diffusion - ComfyUI SageAttention template on RunPod for this workflow. Here’s why:

  • Sage Attention & Triton Acceleration — both are pre-installed and optimized in the RunPod template, significantly boosting generation speed and VRAM efficiency.

  • Plug-and-Play Setup — no manual CUDA or PyTorch dependencies to configure; everything is ready to run InfiniteTalk out-of-the-box.

  • Persistent Storage with Network Volume — saves your models and workflows so you don’t have to re-download or set up nodes each session.

You can spin up a ready-to-use ComfyUI instance in just a few minutes using the RunPod template below:

👉 How to Run ComfyUI on RunPod with Network Volume

Requirement 2: Download InfiniteTalk I2V Model Files

The InfiniteTalk Lip-Sync Workflow relies on a combination of WAN-based I2V diffusion models, audio feature encoders, and LoRA refinements to generate realistic lip-synced motion from still images.

Download each of the following models and place them in their respective ComfyUI model directories exactly as listed below.

| File Name | Hugging Face Download Page | File Directory |
| --- | --- | --- |
| Wan2_1-InfiniteTalk-Single_fp8_e4m3fn_scaled_KJ.safetensors | 🤗 Download | ..\ComfyUI\models\diffusion_models |
| Wan2_1-I2V-14B-480P_fp8_e4m3fn.safetensors | 🤗 Download | ..\ComfyUI\models\diffusion_models |
| MelBandRoformer_fp16.safetensors | 🤗 Download | ..\ComfyUI\models\diffusion_models |
| Wan2_1_VAE_bf16.safetensors | 🤗 Download | ..\ComfyUI\models\vae |
| clip_vision_h.safetensors | 🤗 Download | ..\ComfyUI\models\clip_vision |
| umt5-xxl-enc-bf16.safetensors | 🤗 Download | ..\ComfyUI\models\text_encoders |
| lightx2v_I2V_14B_480p_cfg_step_distill_rank64_bf16.safetensors | 🤗 Download | ..\ComfyUI\models\loras |

Once all models are downloaded and placed in the proper folders, ComfyUI will automatically recognize them at startup. This ensures InfiniteTalk’s I2V and audio-processing nodes load correctly for smooth lip-sync generation.

Requirement 3: Verify Folder Structure

Before running the InfiniteTalk Lip-Sync Workflow, make sure all downloaded model files are placed in the correct ComfyUI subfolders. Your folder structure should look exactly like this:

```
📁 ComfyUI/
└── 📁 models/
    ├── 📁 diffusion_models/
    │   ├── Wan2_1-InfiniteTalk-Single_fp8_e4m3fn_scaled_KJ.safetensors
    │   ├── Wan2_1-I2V-14B-480P_fp8_e4m3fn.safetensors
    │   └── MelBandRoformer_fp16.safetensors
    ├── 📁 vae/
    │   └── Wan2_1_VAE_bf16.safetensors
    ├── 📁 clip_vision/
    │   └── clip_vision_h.safetensors
    ├── 📁 text_encoders/
    │   └── umt5-xxl-enc-bf16.safetensors
    └── 📁 loras/
        └── lightx2v_I2V_14B_480p_cfg_step_distill_rank64_bf16.safetensors
```
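
If you'd like to double-check the layout programmatically, here's a minimal Python sketch that walks the folders above and reports any missing files. The file names and subfolders come straight from the download table; `COMFYUI_ROOT` is an assumption you should adjust to your own installation path.

```python
from pathlib import Path

# Adjust this to your ComfyUI installation directory
COMFYUI_ROOT = Path("ComfyUI")

# Expected model files, grouped by ComfyUI subfolder (taken from the table above)
EXPECTED_FILES = {
    "models/diffusion_models": [
        "Wan2_1-InfiniteTalk-Single_fp8_e4m3fn_scaled_KJ.safetensors",
        "Wan2_1-I2V-14B-480P_fp8_e4m3fn.safetensors",
        "MelBandRoformer_fp16.safetensors",
    ],
    "models/vae": ["Wan2_1_VAE_bf16.safetensors"],
    "models/clip_vision": ["clip_vision_h.safetensors"],
    "models/text_encoders": ["umt5-xxl-enc-bf16.safetensors"],
    "models/loras": ["lightx2v_I2V_14B_480p_cfg_step_distill_rank64_bf16.safetensors"],
}

missing = []
for folder, files in EXPECTED_FILES.items():
    for name in files:
        path = COMFYUI_ROOT / folder / name
        if not path.is_file():
            missing.append(path)

if missing:
    print("Missing model files:")
    for path in missing:
        print(f"  - {path}")
else:
    print("All InfiniteTalk model files are in place.")
```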

Once everything is installed and organized, you’re ready to download and load the InfiniteTalk I2V FP8 Lip-Sync Workflow in ComfyUI and start generating smooth, expressive talking animations from still images.

Thanks to the LightX2V LoRA, the entire render can be completed in just 6 total steps — delivering natural lip movement, precise audio synchronization, and high visual fidelity, all with impressive speed.

3. Download & Load the InfiniteTalk I2V FP8 Lip-Sync Workflow

Now that your environment and model files are set up, it’s time to load and configure the InfiniteTalk I2V FP8 Lip-Sync Workflow in ComfyUI. This setup ensures all model connections, encoders, and audio-processing nodes work together seamlessly for high-quality lip-synced video generation from a single still image and audio file. Once configured, you’ll be ready to bring any portrait to life with natural, expressive speech movement powered by InfiniteTalk.

Load the InfiniteTalk I2V FP8 Workflow JSON File

👉 Download the InfiniteTalk I2V FP8 Lip-Sync Workflow JSON file and drag it directly into your ComfyUI canvas.

This workflow comes fully pre-arranged with all essential nodes, model references, and audio synchronization components required for realistic, frame-accurate lip-sync generation from static images.

Install Missing Nodes

If any nodes appear in red, it means some required custom nodes are missing.

To fix this:

  1. Open the Manager tab in ComfyUI.

  2. Click Install Missing Custom Nodes.

  3. After installation, restart ComfyUI to apply the changes.

This will ensure all InfiniteTalk I2V and MelBandRoFormer nodes are properly installed and ready to process your image and audio seamlessly.

Once all nodes load correctly and your workflow opens without errors, you’re ready to upload your portrait image and corresponding audio into the designated nodes.

InfiniteTalk will automatically analyze the audio, map phoneme timings, and synchronize mouth movements frame by frame — creating an impressively natural talking animation that feels alive and expressive.

4. Running the Lip-Sync Video Generation Workflow

With the workflow loaded and all components in place, it’s time to generate your first lip-synced talking video using InfiniteTalk in ComfyUI. This section will walk you through setting up your image, audio, and output parameters — then running the workflow to produce smooth, expressive results.

Upload Your Image

Start by loading your initial image into the Image Loader node. This still image will serve as the base for mouth and facial movement generation, while InfiniteTalk automatically drives the animation according to your audio input.

👉 Tip: Use a front-facing portrait with clear lighting and a neutral expression for the most natural lip-sync performance. Higher resolution images capture better detail for facial deformation and expression synthesis.

Set Video Dimensions and Aspect Ratio

Next, define your video’s output size and aspect ratio in the Resize Image node. Since our workflow begins with a vertical portrait image, we set it to a 9:16 aspect ratio with a 480p vertical resolution to maintain the original framing and facial proportions. This ensures that the lip-sync mapping stays accurate without stretching or cropping your subject.

| Setting | Value | Notes |
| --- | --- | --- |
| Width | 480 | Matches vertical portrait input |
| Height | 832 | Preserves 9:16 aspect ratio |
| Aspect Ratio | 9:16 | Keeps the subject’s proportions intact |

This 480×832 setup keeps generation fast while preserving facial details.
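
If your source portrait has different dimensions, the sketch below shows one way to derive matching output values: scale to a 480-pixel width and floor both sides to a multiple of 32, which reproduces the 480×832 used here for a 9:16 source. The exact rounding rule is an assumption on our part, so keep the table values above if you're unsure.

```python
def fit_to_width(src_w: int, src_h: int, target_w: int = 480, multiple: int = 32):
    """Scale a portrait to a target width, flooring both sides to a multiple.

    Flooring to a multiple of 32 reproduces the 480x832 used in this guide
    for a 9:16 source; treat the rounding rule as an assumption, not a rule
    imposed by the workflow itself.
    """
    scale = target_w / src_w
    w = (target_w // multiple) * multiple
    h = int(src_h * scale) // multiple * multiple
    return w, h

# Example: a 1080x1920 (9:16) portrait photo
print(fit_to_width(1080, 1920))  # -> (480, 832)
```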

Load Your Audio File

Import your audio clip into the Audio Loader node. InfiniteTalk will analyze the audio, extracting phoneme timings, amplitude, and rhythm patterns to drive accurate mouth and jaw motion frame-by-frame.


👉 Tip: Ensure your audio is clear and free of background noise. Clear speech improves lip-sync precision.
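
If your clip needs cleanup or conversion, one optional preprocessing step is to turn it into a mono 16 kHz WAV with ffmpeg before loading it. Wav2Vec2-style audio encoders are generally trained on 16 kHz mono speech, though the audio node typically resamples on its own, so treat this as a convenience rather than a requirement. The file names below are placeholders.

```python
import subprocess

# Optional preprocessing: convert any input clip to a mono 16 kHz WAV with ffmpeg.
# The ComfyUI audio node usually handles resampling itself, so this is a
# convenience step, not a hard requirement of the workflow.
subprocess.run(
    [
        "ffmpeg", "-y",
        "-i", "speech_input.mp3",  # your source clip (placeholder filename)
        "-ac", "1",                # downmix to mono
        "-ar", "16000",            # resample to 16 kHz
        "speech_clean.wav",
    ],
    check=True,
)
```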

Set Frame Length and Num Frames

In the Multi/InfiniteTalk Wav2vec2 Embeds Node, set the num_frames parameter. This controls how many frames the workflow will generate, directly determining the video length. Inside this node, you can also tweak audio_scale to strengthen the lip-sync effect — higher values make mouth movements more pronounced, while lower values make them subtler.

For testing, start with shorter audio clips and increase num_frames for longer sequences once you’re comfortable. At 25 fps, the following num_frames values correspond roughly to these audio durations:

| Audio Length | Num Frames |
| --- | --- |
| 4 seconds | 100 |
| 8 seconds | 200 |
| 10 seconds | 250 |

👉 Tip: If your audio is longer than the num_frames value, it will be trimmed at that limit. Adjust num_frames to match or slightly exceed your audio length to capture the full clip.
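
The mapping is simply seconds × 25, plus a small safety margin so the clip isn't trimmed. Here's a minimal Python sketch that reads a WAV file's duration and suggests a num_frames value; the 25 fps figure matches this workflow, while the margin is our own convention.

```python
import math
import wave

FPS = 25  # frame rate used by this workflow


def num_frames_for(audio_path: str, margin: int = 5) -> int:
    """Return a num_frames value that covers the whole WAV clip at 25 fps."""
    with wave.open(audio_path, "rb") as wav:
        duration_s = wav.getnframes() / wav.getframerate()
    return math.ceil(duration_s * FPS) + margin


# e.g. an 8-second clip -> roughly 205 frames (200 plus the margin)
print(num_frames_for("speech_clean.wav"))
```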

Set a Text Prompt

After loading your audio, you can use the Text Encoder node to provide a prompt describing the scene or the character. This can help guide facial expressions, mood, or style in the generated video. Since we’re using a Lightning LoRA, the CFG scale is set to 1, meaning the prompt has only a subtle effect and may not be followed closely. You can still add descriptive cues like: “A woman talks slowly. She is breathing heavily.”

Run the Generation

Once your image, audio, and num_frames are set, run the InfiniteTalk workflow. The system will process your vertical portrait image and audio, generating a fully lip-synced video automatically. The result is a smooth, natural animation with facial movements accurately following the speech, ready for presentations, social media clips, or any creative project.

This render took around 200 seconds on an RTX 4090 (24GB VRAM), so we highly recommend renting a GPU on RunPod if your local hardware is slower.

5. Bonus: InfiniteTalk I2V GGUF Workflow (Low VRAM)

For users operating in low-VRAM environments, the GGUF variants of the InfiniteTalk and WAN2.1 I2V models offer a viable alternative for generating lip-synced animations locally.

The only change from the FP8 setup is the diffusion models: instead of the FP8 files, download GGUF versions of both models in a quantization that fits your VRAM.

That’s the only extra requirement for this workflow; everything else stays the same.

Verify Folder Structure

Once downloaded, place the GGUF models in the diffusion_models folder. Your folder structure should look like this:

```
📁 ComfyUI/
└── 📁 models/
    ├── 📁 diffusion_models/
    │   ├── Wan2_1-InfiniteTalk_Single_Q4_K_M.gguf   # Or any other GGUF version
    │   ├── wan2.1-i2v-14b-480p-Q4_K_M.gguf          # Or any other GGUF version
    │   └── MelBandRoformer_fp16.safetensors
    ├── 📁 vae/
    │   └── Wan2_1_VAE_bf16.safetensors
    ├── 📁 clip_vision/
    │   └── clip_vision_h.safetensors
    ├── 📁 text_encoders/
    │   └── umt5-xxl-enc-bf16.safetensors
    └── 📁 loras/
        └── lightx2v_I2V_14B_480p_cfg_step_distill_rank64_bf16.safetensors
```

Once in place, ComfyUI will recognize them at startup, and you can run the InfiniteTalk workflow just like normal, even on lower-VRAM hardware.

Load the InfiniteTalk I2V GGUF Lip-Sync Workflow

Start by loading the InfiniteTalk I2V GGUF Lip-Sync Workflow into ComfyUI:

👉 Download the InfiniteTalk I2V GGUF Lip-Sync workflow JSON file and drag it directly into your ComfyUI canvas.

This workflow comes fully configured with all nodes and references required for smooth lip-synced animation generation. With the GGUF workflow set up, you’re ready to start generating talking animations locally, even on lower-VRAM hardware.

6. Conclusion

Congratulations! You’ve now explored the full capabilities of InfiniteTalk Lip-Sync Videos in ComfyUI — from the high-VRAM FP8 workflow to the VRAM-friendly GGUF variant. You’ve learned how to set up your system, organize and load the necessary models, configure the workflow, feed in your images and audio, craft optional prompts, and generate smooth, expressive talking animations with precise lip-sync and facial motion.

With InfiniteTalk, creating realistic, high-quality lip-synced videos is no longer limited to advanced software or top-tier GPUs. Whether you’re producing AI avatars, tutorials, social media clips, or creative storytelling content, the FP8 and GGUF workflows provide the flexibility to generate professional results efficiently on a wide range of hardware.

Now it’s your turn: experiment with your own images, audio clips, and descriptive prompts to bring characters to life. With the workflows and tools covered here, you’re fully equipped to create engaging, natural-looking talking videos — frame by frame, with perfect synchronization.
