LTX 2.3 Image-to-Video with Custom Audio in ComfyUI

March 16, 2026
ComfyUI
Learn to create lip-synced talking videos from a single still image and a custom voice recording using LTX 2.3 in ComfyUI, with both FP8 and low-VRAM GGUF workflows.

1. Introduction

In this tutorial, you'll learn how to create a lip-synced talking video from a single still image using LTX 2.3 in ComfyUI. The workflow is straightforward: you provide an image and a speech audio file, and LTX 2.3 generates a video where the subject's mouth moves in sync with the spoken audio. The result is a natural, believable talking animation, driven entirely by the voice you supply.

This makes LTX 2.3 a powerful tool for bringing portraits, characters, avatars, and illustrations to life with spoken dialogue. Whether you're animating a product spokesperson, a historical figure, an AI avatar, or any other subject: if it has a face and you have a voice, LTX 2.3 can make it talk.

We cover the full-quality FP8 workflow for high-VRAM GPUs and a GGUF variant for lower-spec systems, so you can run this locally on a wide range of hardware.

2. System Requirements for LTX 2.3 I2V + Audio (FP8 Workflow)

Before generating your first talking portrait, make sure your system meets the hardware and software requirements. LTX 2.3 is a large 22B-parameter model; we recommend at least an RTX 4090 (24 GB VRAM) for the FP8 workflow, or a cloud GPU service like RunPod.

Requirement 1: ComfyUI Installed & Updated

You need ComfyUI installed locally or via cloud. For a local Windows setup:
👉 How to Install ComfyUI Locally on Windows

Once installed, open the Manager tab and click Update ComfyUI to ensure compatibility with the LTX 2.3 nodes this workflow requires.

If you don't have a high-end GPU locally, consider running ComfyUI on RunPod with a network volume for persistent storage:
👉 How to Run ComfyUI on RunPod with Network Volume

Requirement 2: Download LTX 2.3 FP8 Model Files

Download each model file below and place it in the correct ComfyUI folder.

| File Name | Hugging Face Download | ComfyUI Folder |
| --- | --- | --- |
| ltx-2.3-22b-distilled_transformer_only_fp8_scaled.safetensors | 🤗 Download | ..\ComfyUI\models\diffusion_models |
| MelBandRoformer_fp16.safetensors | 🤗 Download | ..\ComfyUI\models\diffusion_models |
| gemma_3_12B_it_fpmixed.safetensors | 🤗 Download | ..\ComfyUI\models\text_encoders |
| ltx-2.3_text_projection_bf16.safetensors | 🤗 Download | ..\ComfyUI\models\text_encoders |
| LTX23_audio_vae_bf16.safetensors | 🤗 Download | ..\ComfyUI\models\vae |
| LTX23_video_vae_bf16.safetensors | 🤗 Download | ..\ComfyUI\models\vae |
| taeltx2_3.safetensors | 🤗 Download | ..\ComfyUI\models\vae |
| ltx-2.3-spatial-upscaler-x2-1.0.safetensors | 🤗 Download | ..\ComfyUI\models\latent_upscale_models |

Requirement 3: Verify Folder Structure

Confirm your files are organized exactly like this before loading the workflow:

📁 ComfyUI/
└── 📁 models/
    ├── 📁 diffusion_models/
    │   ├── ltx-2.3-22b-distilled_transformer_only_fp8_scaled.safetensors
    │   └── MelBandRoformer_fp16.safetensors
    ├── 📁 text_encoders/
    │   ├── gemma_3_12B_it_fpmixed.safetensors
    │   └── ltx-2.3_text_projection_bf16.safetensors
    ├── 📁 vae/
    │   ├── LTX23_audio_vae_bf16.safetensors
    │   ├── LTX23_video_vae_bf16.safetensors
    │   └── taeltx2_3.safetensors
    └── 📁 latent_upscale_models/
        └── ltx-2.3-spatial-upscaler-x2-1.0.safetensors
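Rather than eyeballing the tree, you can verify the layout with a short Python script. This is just a convenience sketch, not part of the workflow; the `ComfyUI` root path at the bottom is an assumption, so adjust it to your install location.

```python
from pathlib import Path

# Expected FP8 workflow files, relative to the ComfyUI root
# (paths taken from the download table above).
EXPECTED = [
    "models/diffusion_models/ltx-2.3-22b-distilled_transformer_only_fp8_scaled.safetensors",
    "models/diffusion_models/MelBandRoformer_fp16.safetensors",
    "models/text_encoders/gemma_3_12B_it_fpmixed.safetensors",
    "models/text_encoders/ltx-2.3_text_projection_bf16.safetensors",
    "models/vae/LTX23_audio_vae_bf16.safetensors",
    "models/vae/LTX23_video_vae_bf16.safetensors",
    "models/vae/taeltx2_3.safetensors",
    "models/latent_upscale_models/ltx-2.3-spatial-upscaler-x2-1.0.safetensors",
]

def check_models(comfyui_root: str) -> list[str]:
    """Return the expected model files that are missing under the given root."""
    root = Path(comfyui_root)
    return [rel for rel in EXPECTED if not (root / rel).is_file()]

if __name__ == "__main__":
    missing = check_models("ComfyUI")  # assumed path; adjust to your install
    if missing:
        print("Missing files:")
        for rel in missing:
            print("  -", rel)
    else:
        print("All LTX 2.3 model files found.")
```

If the script reports missing files, re-check the download table before loading the workflow.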

3. Download & Load the LTX 2.3 I2V Audio Workflow

With your environment and model files ready, it's time to load the workflow in ComfyUI.

Load the Workflow

👉 Download the LTX 2.3 I2V + Custom Audio workflow JSON file and drag it directly onto your ComfyUI canvas.

The workflow arrives fully pre-wired with all required nodes: audio loader, MelBand Roformer vocal separator, audio VAE encoder, video VAE, LTX 2.3 distilled transformer, image-to-video conditioning, dual samplers, spatial upscaler, and the custom audio switch.

Install Missing Nodes

If any nodes appear in red after loading, open the Manager tab, click Install Missing Custom Nodes, and restart ComfyUI. Once everything loads cleanly, you're ready to configure your inputs.

4. Running the Image-to-Video + Audio Generation

With the workflow loaded and all nodes green, here is how to configure and run your first lip-synced animation.

Step 1: Generate or Prepare Your Portrait Image

Start with a high-quality portrait image as your source frame. For this tutorial we generated our starting image using Z-Image Turbo, a fast, high-quality image generation workflow, with the following prompt:

Close-up naughty woman 20-25, porcelain skin dense freckles illuminated by bright window sun, fiery copper-red long wavy hair thick straight bangs radiant, intense emerald eyes lashes liner sparkling, prominent round silver cow septum nose ring catching light, delicate necklace, deep teal low-cut sweater neckline huge cleavage, head cocked to side with gentle inviting smile direct gaze, strong natural sunlight glowing on freckles and soft skin shine, elegant white room with subtle tropical palm and white-on-white art, dominant close-up, ultra-detailed realistic gloss 8k

For the best lip-sync results, choose a front-facing or mild three-quarter portrait where the subject's mouth and lower face are clearly visible. Sharper, higher-resolution portraits produce noticeably better output.

💡 Tip: Avoid images where the mouth is covered, heavily shadowed, or shot at a steep side angle; lip-sync accuracy drops significantly when the mouth area isn't clearly visible.

Step 2: Upload Your Voice Audio

Load your voice recording into the audio input node. This is the speech the portrait will appear to say.


💡 Tip: A clean vocal recording without heavy background noise or music gives the model the clearest phoneme signal and produces the most accurate lip movements. If your audio has a music bed, the MelBand Roformer node will automatically isolate the vocal stem before it reaches the model.
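If you want to sanity-check your recording before queuing a generation, Python's built-in `wave` module can report the basics of a PCM WAV file. This is an optional helper sketch, not part of the workflow, and it assumes WAV input (for MP3 or other formats you'd need an external tool or library).

```python
import wave

def describe_wav(path: str) -> dict:
    """Report basic properties of a PCM WAV voice recording."""
    with wave.open(path, "rb") as w:
        rate = w.getframerate()
        return {
            "channels": w.getnchannels(),
            "sample_rate_hz": rate,
            "duration_s": round(w.getnframes() / rate, 2),
        }

# Example (hypothetical file name):
# info = describe_wav("my_voice_line.wav")
```

The reported `duration_s` tells you how long to set LENGTH in the Video Settings so the clip covers the full speech.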

Step 3: Set Your Animation Prompt

Your text prompt controls the animation style, motion, and overall feel of the video. For lip-sync generation, keep the camera locked and focus the prompt on facial expressions, eye behaviour, and subtle natural movement. Here is the prompt we used for this example:

Static tight close-up portrait, fixed immobile camera locked on face only (no movement, no zoom, no push-in, no change whatsoever). She immediately talks with perfect lip-sync to custom audio, explicit lip movements breathy tone parted glossy lips. Ultra-sensual eyes: slow heavy blinks prolonged eye contact sultry half-lidded gazes eyebrow raises. Bites/licks lip slowly, head tilts leans in. Sensually plays with long hair: tucking twirling running fingers through. Gentle upper body movement. Soft window glow on freckles lips hair, intimate white room. Hyper-realistic skin pores freckles hair movement lip-sync, cinematic 24fps smooth motion, 8k, face-centered static shot only.

When writing your own prompts, keep the camera static and describe only gradual, continuous motion: subtle eye movement, natural head tilts, hair play, and facial expressions. Avoid prompting large camera moves or sudden action, as these compete with the lip-sync signal.

Step 4: Configure Video Settings

Set your output dimensions and video length in the Video Settings nodes: WIDTH, HEIGHT, LENGTH (in seconds), and FPS. For this tutorial we are using 736 × 1280 at 24 fps for a 9:16 portrait video (ideal for Reels and TikTok).

⚠️ Important (Valid Parameter Rules): Width and height must each be a multiple of 32 plus 1 (e.g. 737 rather than 736), and the frame count must be a multiple of 8 plus 1 (e.g. 121). Invalid values will not throw an error; the workflow silently rounds them to the nearest valid value, which is why entering 736 × 1280 still works.
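The rounding is easy to reproduce with a small helper that snaps any requested size or clip length to the nearest valid value. This is a sketch of the rule as described above, not the workflow's actual code.

```python
def snap_dimension(pixels: int) -> int:
    """Nearest valid dimension: a multiple of 32, plus 1."""
    return 32 * max(round((pixels - 1) / 32), 1) + 1

def snap_frame_count(seconds: float, fps: int) -> int:
    """Nearest valid frame count: a multiple of 8, plus 1."""
    return 8 * max(round((seconds * fps - 1) / 8), 1) + 1

# The tutorial's 736 x 1280 settings snap to 737 x 1281;
# a hypothetical 5-second clip at 24 fps becomes 121 frames.
print(snap_dimension(736), snap_dimension(1280))  # 737 1281
print(snap_frame_count(5, 24))                    # 121
```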

Below is a reference table of common resolutions you can use. We default to 720p (736 × 1280) for 9:16. If you have a powerful GPU (RTX 5090 or better), try 1088 × 1920 for full 1080p quality.

| Aspect Ratio | Width | Height | Quality | VRAM |
| --- | --- | --- | --- | --- |
| 9:16 (portrait) | 480 | 864 | Low / fast preview | Low |
| 9:16 (portrait) | 736 | 1280 | 720p (recommended default) | Medium |
| 9:16 (portrait) | 1088 | 1920 | 1080p (high quality) | High (RTX 5090+) |
| 16:9 (landscape) | 864 | 480 | Low / fast preview | Low |
| 16:9 (landscape) | 1280 | 736 | 720p (recommended) | Medium |
| 16:9 (landscape) | 1920 | 1088 | 1080p (high quality) | High (RTX 5090+) |
| 1:1 (square) | 768 | 768 | Social media square | Medium |

Step 5: Run the Generation

Once your image, audio, prompt, and video settings are configured, click RUN. LTX 2.3 runs two sampling passes (a fast distilled pass followed by a refinement pass), then decodes both the audio and video latents and combines them into the final output.

5. Bonus: GGUF I2V + Audio Workflow for Low-VRAM Systems

For users without a high-end GPU, the GGUF variant delivers the same audio-driven image-to-video generation at significantly reduced VRAM usage.

Load the GGUF Workflow

👉 Download the LTX 2.3 I2V + Custom Audio GGUF workflow JSON file and drag it onto your ComfyUI canvas.

The GGUF workflow uses GGUF loader nodes in place of the standard safetensors loaders; everything else in the pipeline (audio input, MelBand Roformer, VAE) works identically to the FP8 version.

GGUF Model Download

Replace the FP8 diffusion model with a GGUF-quantized LTX 2.3 model. Pick a quantization level that fits your VRAM (Q4 for tighter budgets, Q5/Q8 for higher quality):
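As a rough guide to which quant fits your card, you can estimate weight size from the parameter count and the quant's approximate bits per weight. The bits-per-weight figures below are ballpark assumptions for common GGUF quant types, and real usage adds activation, latent, and buffer overhead on top of the weights.

```python
def weight_size_gb(params_billion: float, bits_per_weight: float) -> float:
    """Rough size of the quantized weights alone, in GB.
    Ignores activations, latents, and other runtime overhead."""
    return round(params_billion * bits_per_weight / 8, 1)

# Ballpark bits-per-weight for common GGUF quants (approximate!)
for name, bpw in [("Q4_K_M", 4.8), ("Q5_K_M", 5.7), ("Q8_0", 8.5)]:
    print(f"{name}: ~{weight_size_gb(22, bpw)} GB of weights for a 22B model")
```

By this estimate, a Q4 quant of the 22B model keeps the weights in the low teens of GB, which is why it suits tighter VRAM budgets.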

⚠️ Important: Unlike the FP8 model which goes in diffusion_models, the GGUF file must be placed in the unet folder.

For the GGUF text encoder (Gemma 3 12B):
🤗 Unsloth / gemma-3-12b-it-GGUF

All other model files remain identical to the FP8 workflow: the video VAE, audio VAE, taeltx2_3, MelBand Roformer, text projection, and spatial upscaler are unchanged.

GGUF Folder Structure

📁 ComfyUI/
└── 📁 models/
    ├── 📁 unet/
    │   └── LTX-2.3-distilled-Q4_K_M.gguf  ← GGUF model goes here (NOT diffusion_models)
    ├── 📁 diffusion_models/
    │   └── MelBandRoformer_fp16.safetensors
    ├── 📁 text_encoders/
    │   ├── gemma-3-12b-it-Q4_K_M.gguf
    │   └── ltx-2.3_text_projection_bf16.safetensors
    ├── 📁 vae/
    │   ├── LTX23_audio_vae_bf16.safetensors
    │   ├── LTX23_video_vae_bf16.safetensors
    │   └── taeltx2_3.safetensors
    └── 📁 latent_upscale_models/
        └── ltx-2.3-spatial-upscaler-x2-1.0.safetensors

And that's it, you're all set! You may need a few GGUF-specific custom nodes, so open the Manager and install any missing nodes before you start. After that, just load the workflow, upload your portrait image and voice audio, adjust your video settings, and hit RUN.

6. Conclusion

Congratulations: you now have everything you need to create realistic lip-synced talking portrait videos using LTX 2.3 in ComfyUI. From generating your source image to uploading a voice recording and running the full LTX 2.3 image-to-video workflow, you've seen how straightforward the process is once everything is in place.

What makes this workflow genuinely exciting is that LTX 2.3's native audio VAE handles the lip-sync as part of the generation itself rather than as a post-process overlay. That means the mouth movement, facial expressions, and natural head motion are all produced together in a single pass, giving you results that feel cohesive and believable rather than pasted on.

Whether you're animating a photorealistic portrait, a digital illustration, an AI-generated character, or any face you can imagine: if you have a voice recording and a good source image, this workflow will bring it to life. And with the GGUF variant available for lower-VRAM systems, you don't need a top-tier GPU to get started. Now it's your turn: pick your portrait, record or find your audio, and run the workflow. The results might surprise you. Happy generating!
