
HunyuanCustom Revolutionizes Deepfake Video Creation with Single-Image Input and Audio Lip Sync

Tencent's HunyuanCustom enables deepfake-style video creation from a single image with synchronized audio lip sync, surpassing many existing models in identity consistency and video editing capabilities.

Overview of HunyuanCustom

Tencent has unveiled HunyuanCustom, a new version of its Hunyuan Video model that enables deepfake-style video generation from just a single image. This multimodal system supports video customization with synchronized audio and lip movements, aiming to remove the need for multiple reference images or extensive fine-tuning such as training LoRA models.

Single-Image Video Generation

HunyuanCustom can generate videos from a single reference image and a text prompt. The system interprets the prompt while preserving the identity of the subject in the source image. However, because it relies on a single frontal image, the model struggles when the subject rotates beyond roughly 20-25 degrees or displays facial expressions that differ markedly from the reference.
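For a sense of how such an interface might look, here is a minimal sketch of single-image, prompt-driven generation. Everything here is hypothetical for illustration: `HunyuanCustomPipeline`, its arguments, and the output object are placeholders, not the entry points of the released repository.

```python
# Hypothetical sketch only: HunyuanCustomPipeline, from_pretrained(), and
# save() are illustrative placeholders, not the repository's actual API.
from PIL import Image

pipe = HunyuanCustomPipeline.from_pretrained("tencent/HunyuanCustom")  # hypothetical
reference = Image.open("subject.png")  # the single frontal reference image

video = pipe(
    image=reference,
    prompt="The person walks through a rainy street at night, smiling",
    num_frames=129,        # ~5 seconds at 25 fps (illustrative values)
    guidance_scale=7.5,
)
video.save("output.mp4")
```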

Audio and Lip Sync Integration

The model integrates the LatentSync system so that lip movements match the supplied audio and text. Characters in generated videos can therefore speak in sync with the soundtrack, enhancing realism. Published examples show promising lip-sync quality, although English-language demos are not yet widely available.
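The article does not detail LatentSync's internals, but audio-driven lip sync generally begins by aligning audio features with video frames. Below is a minimal, runnable sketch of that alignment step using torchaudio; the mel-spectrogram parameters and the 25 fps frame rate are assumptions for illustration, not HunyuanCustom's actual preprocessing.

```python
import torch
import torchaudio

# Load speech and compute mel-spectrogram features (illustrative parameters,
# not HunyuanCustom's actual audio frontend).
waveform, sr = torchaudio.load("speech.wav")
mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=sr, n_fft=1024, hop_length=256, n_mels=80
)(waveform)  # shape: (channels, n_mels, audio_frames)

# Align audio frames to video frames so each generated frame receives a
# matching slice of audio features (25 fps assumed for illustration).
fps = 25
hops_per_video_frame = sr / (fps * 256)  # mel hops covered by one video frame

def audio_slice_for_frame(i: int) -> torch.Tensor:
    """Return the mel features that temporally overlap video frame i."""
    start = int(i * hops_per_video_frame)
    end = int((i + 1) * hops_per_video_frame)
    return mel[..., start:end]
```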

Video-to-Video Editing

HunyuanCustom supports video-to-video (vid2vid) editing, in which a segment of an existing video is masked and the masked region is replaced using a single reference image. This enables targeted subject replacement without regenerating the entire video, offering flexible editing capabilities.
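As a rough illustration of the masked-replacement idea (not HunyuanCustom's internal mechanism), the final compositing step can be thought of as keeping original pixels outside the mask and generated pixels inside it:

```python
import numpy as np

# Generic masked-replacement compositing: pixels inside the mask come from
# the newly generated frames, pixels outside are kept from the original
# video. A conceptual sketch, not HunyuanCustom's actual implementation.

def composite(original: np.ndarray, generated: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """original, generated: (frames, H, W, 3) uint8; mask: (frames, H, W) in {0, 1}."""
    m = mask[..., None].astype(np.float32)  # broadcast over color channels
    return (m * generated + (1.0 - m) * original).astype(np.uint8)
```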

Technical Innovations and Training

The system fine-tunes the December 2024 HunyuanVideo foundation model with new architectural insertions rather than retraining from scratch. It relies on a complex, GDPR-compliant data pipeline that combines multiple datasets with advanced tools for video segmentation, annotation, and quality filtering. Training uses Flow Matching, and image and text embeddings are fused through LLaVA for coherent multimodal generation.
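Flow Matching itself is a general training objective: the network learns a velocity field that transports noise to data along a simple interpolation path. Below is a minimal, self-contained sketch of one training step with a toy stand-in network; it illustrates the technique named in the article, not HunyuanCustom's actual training code.

```python
import torch
import torch.nn as nn

# Toy flow-matching step: learn a velocity field v(x_t, t) along the linear
# path x_t = (1 - t) * x0 + t * noise. The network below is a stand-in for
# the video diffusion transformer, purely for illustration.

class TinyVelocityNet(nn.Module):
    def __init__(self, dim: int = 64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim + 1, 256), nn.SiLU(), nn.Linear(256, dim))

    def forward(self, x: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([x, t], dim=-1))

model = TinyVelocityNet()
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)

x0 = torch.randn(32, 64)        # stand-in for latent video patches
noise = torch.randn_like(x0)
t = torch.rand(32, 1)           # timesteps sampled uniformly in [0, 1]

xt = (1 - t) * x0 + t * noise   # point on the straight-line path
target = noise - x0             # constant velocity along that path
loss = torch.mean((model(xt, t) - target) ** 2)

opt.zero_grad()
loss.backward()
opt.step()
```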

Performance and Comparisons

In extensive tests against other commercial and open-source video customization methods, HunyuanCustom achieved the best identity and subject consistency, comparable text-to-video alignment, and strong temporal stability. It also outperformed competitors in multi-subject and audio-driven video customization tasks.

Accessibility and Future Prospects

The code and weights are available on GitHub; the release is currently Linux-focused and requires substantial GPU memory (a minimum of 24 GB, with 80 GB recommended). An API demo exists but requires WeChat access. The community is expected to adapt the model for more consumer-friendly use and Windows compatibility soon.

Conclusion

HunyuanCustom introduces a significant step forward in video deepfake technology by enabling single-image input, synchronized audio lip-sync, and flexible video editing. Its strong performance against leading competitors hints at a promising future for accessible, high-quality customized video generation.
