Create Lip-Sync Videos from Images with InfiniteTalk in ComfyUI

1. Introduction
In this tutorial, you’ll learn how to create realistic lip-sync videos from static images using InfiniteTalk in ComfyUI. Instead of manually animating facial movements, this workflow uses InfiniteTalk’s audio-driven Image-to-Video (I2V) system to generate natural speech motion perfectly synced to your audio. We’ll cover both high-VRAM FP8 and low-VRAM GGUF workflows, so you can produce expressive, talking avatars on a wide range of hardware. By the end, you’ll be able to turn any image into a smooth, high-quality, lip-synced talking video — ideal for AI characters, tutorials, social media clips, or creative storytelling.
2. System Requirements for InfiniteTalk I2V Workflow
Before generating lip-sync videos, ensure your system meets the hardware and software requirements to run the InfiniteTalk I2V workflow smoothly inside ComfyUI. This setup benefits from a strong GPU for faster inference — we recommend at least an RTX 4090 (24GB VRAM) or using a cloud GPU provider like RunPod for optimal performance.
Requirement 1: ComfyUI Installed & Updated
To get started, you’ll need ComfyUI installed either locally or through the cloud. For a local Windows setup, follow this guide:
👉 How to Install ComfyUI Locally on Windows
Once installed, open the Manager tab in ComfyUI and click “Update ComfyUI” to make sure you’re on the latest version. Keeping ComfyUI updated ensures compatibility with new workflows, models, and nodes used by InfiniteTalk.
While InfiniteTalk I2V can run locally, we highly recommend using our Next Diffusion - ComfyUI SageAttention template on RunPod for this workflow. Here’s why:
- Sage Attention & Triton Acceleration — both are pre-installed and optimized in the RunPod template, significantly boosting generation speed and VRAM efficiency.
- Plug-and-Play Setup — no manual CUDA or PyTorch dependencies to configure; everything is ready to run InfiniteTalk out-of-the-box.
- Persistent Storage with Network Volume — saves your models and workflows so you don’t have to re-download or set up nodes each session.
You can spin up a ready-to-use ComfyUI instance in just a few minutes using the RunPod template below:
👉 How to Run ComfyUI on RunPod with Network Volume
Requirement 2: Download InfiniteTalk I2V Model Files
The InfiniteTalk Lip-Sync Workflow relies on a combination of WAN-based I2V diffusion models, audio feature encoders, and LoRA refinements to generate realistic lip-synced motion from still images.
Download each of the following models and place them in their respective ComfyUI model directories exactly as listed below.
| File Name | Hugging Face Download Page | File Directory |
|---|---|---|
| Wan2_1-InfiniteTalk-Single_fp8_e4m3fn_scaled_KJ.safetensors | 🤗 Download | ..\ComfyUI\models\diffusion_models |
| Wan2_1-I2V-14B-480P_fp8_e4m3fn.safetensors | 🤗 Download | ..\ComfyUI\models\diffusion_models |
| MelBandRoformer_fp16.safetensors | 🤗 Download | ..\ComfyUI\models\diffusion_models |
| Wan2_1_VAE_bf16.safetensors | 🤗 Download | ..\ComfyUI\models\vae |
| clip_vision_h.safetensors | 🤗 Download | ..\ComfyUI\models\clip_vision |
| umt5-xxl-enc-bf16.safetensors | 🤗 Download | ..\ComfyUI\models\text_encoders |
| lightx2v_I2V_14B_480p_cfg_step_distill_rank64_bf16.safetensors | 🤗 Download | ..\ComfyUI\models\loras |
Once all models are downloaded and placed in the proper folders, ComfyUI will automatically recognize them at startup. This ensures InfiniteTalk’s I2V and audio-processing nodes load correctly for smooth lip-sync generation.
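If you prefer scripting the downloads, here’s a minimal Python sketch using the huggingface_hub package. The repo_id below is a placeholder (the actual repository isn’t listed here); use the repository behind each 🤗 Download link above and the folder from the File Directory column:

```python
# Minimal sketch (not part of the official workflow): fetch a model file
# straight into the matching ComfyUI folder with huggingface_hub.
# The repo_id is a placeholder -- substitute the repository behind the
# "Download" link for each file in the table above.
from huggingface_hub import hf_hub_download

hf_hub_download(
    repo_id="<author>/<repo>",  # placeholder, see the table's download links
    filename="Wan2_1-InfiniteTalk-Single_fp8_e4m3fn_scaled_KJ.safetensors",
    local_dir="ComfyUI/models/diffusion_models",  # target folder from the table
)
```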
Requirement 3: Verify Folder Structure
Before running the InfiniteTalk Lip-Sync Workflow, make sure all downloaded model files are placed in the correct ComfyUI subfolders. Your folder structure should look exactly like this:
```
📁 ComfyUI/
└── 📁 models/
    ├── 📁 diffusion_models/
    │   ├── Wan2_1-InfiniteTalk-Single_fp8_e4m3fn_scaled_KJ.safetensors
    │   ├── Wan2_1-I2V-14B-480P_fp8_e4m3fn.safetensors
    │   └── MelBandRoformer_fp16.safetensors
    ├── 📁 vae/
    │   └── Wan2_1_VAE_bf16.safetensors
    ├── 📁 clip_vision/
    │   └── clip_vision_h.safetensors
    ├── 📁 text_encoders/
    │   └── umt5-xxl-enc-bf16.safetensors
    └── 📁 loras/
        └── lightx2v_I2V_14B_480p_cfg_step_distill_rank64_bf16.safetensors
```
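As an optional sanity check, this small Python sketch (not part of the workflow; the file names are taken from the table above) verifies that every file sits where ComfyUI expects it:

```python
# Optional check: confirm every required model file is in the expected folder.
from pathlib import Path

COMFYUI = Path("ComfyUI")  # adjust to your install location
REQUIRED = {
    "models/diffusion_models": [
        "Wan2_1-InfiniteTalk-Single_fp8_e4m3fn_scaled_KJ.safetensors",
        "Wan2_1-I2V-14B-480P_fp8_e4m3fn.safetensors",
        "MelBandRoformer_fp16.safetensors",
    ],
    "models/vae": ["Wan2_1_VAE_bf16.safetensors"],
    "models/clip_vision": ["clip_vision_h.safetensors"],
    "models/text_encoders": ["umt5-xxl-enc-bf16.safetensors"],
    "models/loras": ["lightx2v_I2V_14B_480p_cfg_step_distill_rank64_bf16.safetensors"],
}

for folder, files in REQUIRED.items():
    for name in files:
        path = COMFYUI / folder / name
        print(("OK      " if path.is_file() else "MISSING ") + str(path))
```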
Once everything is installed and organized, you’re ready to download and load the InfiniteTalk I2V FP8 Lip-Sync Workflow in ComfyUI and start generating smooth, expressive talking animations from still images.
Thanks to the LightX2V LoRA, the entire render can be completed in just 6 total steps — delivering natural lip movement, precise audio synchronization, and high visual fidelity, all with impressive speed.
3. Download & Load the InfiniteTalk I2V FP8 Lip-Sync Workflow
Now that your environment and model files are set up, it’s time to load and configure the InfiniteTalk I2V FP8 Lip-Sync Workflow in ComfyUI. This setup ensures all model connections, encoders, and audio-processing nodes work together seamlessly for high-quality lip-synced video generation from a single still image and audio file. Once configured, you’ll be ready to bring any portrait to life with natural, expressive speech movement powered by InfiniteTalk.
Load the InfiniteTalk I2V FP8 Workflow JSON File
👉 Download the InfiniteTalk I2V FP8 Lip-Sync Workflow JSON file and drag it directly into your ComfyUI canvas.
This workflow comes fully pre-arranged with all essential nodes, model references, and audio synchronization components required for realistic, frame-accurate lip-sync generation from static images.
Install Missing Nodes
If any nodes appear in red, that means certain custom nodes are missing. To fix this:
- Open the Manager tab in ComfyUI.
- Click Install Missing Custom Nodes.
- After installation, restart ComfyUI to apply the changes.
This will ensure all InfiniteTalk I2V and MelBandRoFormer nodes are properly installed and ready to process your image and audio seamlessly.
Once all nodes load correctly and your workflow opens without errors, you’re ready to upload your portrait image and corresponding audio into the designated nodes. InfiniteTalk will automatically analyze the audio, map phoneme timings, and synchronize mouth movements frame by frame — creating an impressively natural talking animation that feels alive and expressive.
4. Running the Lip-Sync Video Generation Workflow
With the workflow loaded and all components in place, it’s time to generate your first lip-synced talking video using InfiniteTalk in ComfyUI. This section will walk you through setting up your image, audio, and output parameters — then running the workflow to produce smooth, expressive results.
Upload Your Image
Start by loading your initial image into the Image Loader node. This still image will serve as the base for mouth and facial movement generation, while InfiniteTalk automatically drives the animation according to your audio input.
👉 Tip: Use a front-facing portrait with clear lighting and a neutral expression for the most natural lip-sync performance. Higher resolution images capture better detail for facial deformation and expression synthesis.
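If you want to screen candidate images before loading them, here is a small, optional Python helper (the orientation and resolution thresholds are my own suggestions, not workflow requirements) that flags landscape or low-resolution inputs:

```python
# Optional sanity check on the source portrait (my own helper, not a workflow node):
# warn if the image is landscape or too small to keep facial detail.
from PIL import Image

def check_portrait(path: str, min_short_side: int = 480) -> None:
    with Image.open(path) as img:
        w, h = img.size
    if w >= h:
        print(f"Warning: {w}x{h} is not a vertical portrait; expect cropping or stretching.")
    if min(w, h) < min_short_side:
        print(f"Warning: short side {min(w, h)}px is below {min_short_side}px; facial detail may suffer.")

# check_portrait("portrait.png")
```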
Set Video Dimensions and Aspect Ratio
Next, define your video’s output size and aspect ratio in the Resize Image node. Since our workflow begins with a vertical portrait image, we set it to a 9:16 aspect ratio with a 480p vertical resolution to maintain the original framing and facial proportions. This ensures that the lip-sync mapping stays accurate without stretching or cropping your subject.
| Setting | Value | Notes |
|---|---|---|
| Width | 480 | Matches vertical portrait input |
| Height | 832 | Preserves 9:16 aspect ratio |
| Aspect Ratio | 9:16 | Keeps the subject’s proportions intact |
This 480×832 setup keeps generation fast while preserving facial details.
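In case you’re curious how 480×832 relates to 9:16: an exact 9:16 height at 480 px wide would be about 853 px, but WAN-style models typically want dimensions divisible by 16 and a pixel count close to their 480p training size. The sketch below (the /16 snapping and the pixel-budget rule are assumptions on my part, not documented workflow logic) reproduces the value used here:

```python
# Minimal sketch: pick an output size for a vertical 9:16 portrait.
# Assumption: dimensions should be multiples of 16 and the pixel count
# should stay close to the model's 480x832 budget.

def snap_down(value: int, multiple: int = 16) -> int:
    """Round down to the nearest multiple (keeps dimensions model-friendly)."""
    return (value // multiple) * multiple

def vertical_916_size(width: int = 480, max_pixels: int = 480 * 832) -> tuple[int, int]:
    height = snap_down(int(width * 16 / 9))  # exact 9:16 height, snapped to /16
    while width * height > max_pixels:       # stay within the ~480p pixel budget
        height -= 16
    return width, height

print(vertical_916_size())  # -> (480, 832)
```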
Load Your Audio File
Import your audio clip into the Audio Loader node. InfiniteTalk will analyze the audio, extracting phoneme timings, amplitude, and rhythm patterns to drive accurate mouth and jaw motion frame-by-frame.
👉 Tip: Ensure your audio is clear and free of background noise. Clear speech improves lip-sync precision.
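If your recording needs cleanup, one easy way to standardize it is converting to a 16 kHz mono WAV, which is the typical input format for Wav2Vec2-style encoders (an assumption on my part; the workflow’s audio node may handle other formats on its own). A minimal torchaudio sketch:

```python
# Optional pre-processing sketch: downmix to mono and resample to 16 kHz,
# a safe baseline input for Wav2Vec2-style audio encoders.
import torchaudio
import torchaudio.functional as F

waveform, sr = torchaudio.load("speech_raw.wav")           # load the original clip
waveform = waveform.mean(dim=0, keepdim=True)              # downmix to mono
waveform = F.resample(waveform, orig_freq=sr, new_freq=16_000)
torchaudio.save("speech_clean.wav", waveform, 16_000)      # write the cleaned WAV
```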
Set Frame Length and Num Frames
In the Multi/InfiniteTalk Wav2vec2 Embeds Node, set the num_frames parameter. This controls how many frames the workflow will generate, directly determining the video length. Inside this node, you can also tweak audio_scale to strengthen the lip-sync effect — higher values make mouth movements more pronounced, while lower values make them subtler.
For testing, start with shorter audio clips and increase num_frames for longer sequences once you’re comfortable. At 25 fps, the following num_frames values correspond roughly to these audio durations:
| Audio Length | Num Frames |
|---|---|
| 4 seconds | 100 |
| 8 seconds | 200 |
| 10 seconds | 250 |
👉 Tip: If your audio is longer than the num_frames value, it will be trimmed at that limit. Adjust num_frames to match or slightly exceed your audio length to capture the full clip.
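To take the guesswork out of this, here is a small helper (my own sketch, not a workflow node) that derives num_frames from a WAV file’s duration at 25 fps, plus a small safety margin so the full clip is captured:

```python
# Minimal helper: convert an audio clip's duration into a num_frames value
# at the workflow's 25 fps, with a few extra frames of headroom.
import math
import wave

def num_frames_for_audio(path: str, fps: int = 25, padding_frames: int = 5) -> int:
    """Return a num_frames value that matches or slightly exceeds the audio length."""
    with wave.open(path, "rb") as wav:
        duration_s = wav.getnframes() / wav.getframerate()
    return math.ceil(duration_s * fps) + padding_frames

# Example: an 8-second WAV -> 200 frames of speech plus the safety margin.
# print(num_frames_for_audio("speech_clean.wav"))
```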
Set a Text Prompt
After loading your audio, you can use the Text Encoder node to provide a prompt describing the scene or the character. This can help guide facial expressions, mood, or style in the generated video. Since we’re using a Lightning LoRA, the CFG scale is set to 1, meaning the prompt will have a subtle effect and may not be fully followed. You can still add descriptive cues like: “A woman talks slowly. She is breathing heavily.”
Run the Generation
Once your image, audio, and num_frames are set, run the InfiniteTalk workflow. The system will process your vertical portrait image and audio, generating a fully lip-synced video automatically. The result is a smooth, natural animation with facial movements accurately following the speech, ready for presentations, social media clips, or any creative project.
This render took around 200 seconds on an RTX 4090 (24GB VRAM), so we highly recommend renting a GPU on RunPod if your local card is slower.
5. Bonus: InfiniteTalk I2V GGUF Workflow (Low VRAM)
For users operating in low-VRAM environments, the GGUF variants of the InfiniteTalk and WAN2.1 I2V models offer a viable alternative for generating lip-synced animations locally. The only difference from the FP8 setup is the two diffusion models: instead of the FP8 versions, you’ll need GGUF quantizations that fit your VRAM.
You can download them here:
That’s the only extra requirement for this workflow — everything else stays the same.
Verify Folder Structure
Once downloaded, place the GGUF models in the diffusion_models folder. Your folder structure should look like this:
```
📁 ComfyUI/
└── 📁 models/
    ├── 📁 diffusion_models/
    │   ├── Wan2_1-InfiniteTalk_Single_Q4_K_M.gguf   # Or any other GGUF version
    │   ├── wan2.1-i2v-14b-480p-Q4_K_M.gguf          # Or any other GGUF version
    │   └── MelBandRoformer_fp16.safetensors
    ├── 📁 vae/
    │   └── Wan2_1_VAE_bf16.safetensors
    ├── 📁 clip_vision/
    │   └── clip_vision_h.safetensors
    ├── 📁 text_encoders/
    │   └── umt5-xxl-enc-bf16.safetensors
    └── 📁 loras/
        └── lightx2v_I2V_14B_480p_cfg_step_distill_rank64_bf16.safetensors
```
Once in place, ComfyUI will recognize them at startup, and you can run the InfiniteTalk workflow just like normal, even on lower-VRAM hardware.
Load the InfiniteTalk I2V GGUF Lip-Sync Workflow
Start by loading the InfiniteTalk I2V GGUF Lip-Sync Workflow into ComfyUI:
👉 Download the InfiniteTalk I2V GGUF Lip-Sync workflow JSON file and drag it directly into your ComfyUI canvas.
This workflow comes fully configured with all nodes and references required for smooth lip-synced animation generation. With the GGUF workflow set up, you’re ready to start generating talking animations locally, even on lower-VRAM hardware.
6. Conclusion
Congratulations! You’ve now explored the full capabilities of InfiniteTalk Lip-Sync Videos in ComfyUI — from the high-VRAM FP8 workflow to the VRAM-friendly GGUF variant. You’ve learned how to set up your system, organize and load the necessary models, configure the workflow, feed in your images and audio, craft optional prompts, and generate smooth, expressive talking animations with precise lip-sync and facial motion.
With InfiniteTalk, creating realistic, high-quality lip-synced videos is no longer limited to advanced software or top-tier GPUs. Whether you’re producing AI avatars, tutorials, social media clips, or creative storytelling content, the FP8 and GGUF workflows provide the flexibility to generate professional results efficiently on a wide range of hardware.
Now it’s your turn: experiment with your own images, audio clips, and descriptive prompts to bring characters to life. With the workflows and tools covered here, you’re fully equipped to create engaging, natural-looking talking videos — frame by frame, with perfect synchronization.