SyncTalk

SyncTalk: The Devil😈 is in the Synchronization for Talking Head Synthesis

¹Renmin University of China, ²Beijing University of Posts and Telecommunications, ³Psyche AI Inc., ⁴Chinese Academy of Sciences, ⁵Tsinghua University

Abstract

Achieving high synchronization in the synthesis of realistic, speech-driven talking head videos presents a significant challenge. Traditional Generative Adversarial Networks (GANs) struggle to maintain consistent facial identity, while Neural Radiance Fields (NeRF) methods, although they can address this issue, often produce mismatched lip movements, inadequate facial expressions, and unstable head poses. A lifelike talking head requires synchronized coordination of subject identity, lip movements, facial expressions, and head poses. The absence of these synchronizations is a fundamental flaw, leading to unrealistic and artificial outcomes.

To address the critical issue of synchronization, identified as the "devil" in creating realistic talking heads, we introduce SyncTalk. This NeRF-based method effectively maintains subject identity, enhancing synchronization and realism in talking head synthesis. SyncTalk employs a Face-Sync Controller to align lip movements with speech and innovatively uses a 3D facial blendshape model to capture accurate facial expressions. Our Head-Sync Stabilizer optimizes head poses, achieving more natural head movements. The Portrait-Sync Generator restores hair details and blends the generated head with the torso for a seamless visual experience. Extensive experiments and user studies demonstrate that SyncTalk outperforms state-of-the-art methods in synchronization and realism. We recommend watching the supplementary video.

Proposed Method

Overview of SyncTalk. Given a cropped reference video of a talking head and the corresponding speech, SyncTalk extracts the Lip Feature, Expression Feature, and Head Pose through two synchronization modules, the Face-Sync Controller and the Head-Sync Stabilizer. The Tri-Plane Hash Representation then models the head, outputting a rough speech-driven video. The Portrait-Sync Generator further restores details such as hair and background, ultimately producing a high-resolution talking head video.
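To make the tri-plane idea more concrete, here is a minimal sketch of how a tri-plane lookup can work in general. It is not the SyncTalk implementation: it assumes three small dense feature grids instead of hash grids, nearest-neighbor lookup instead of interpolation, and concatenation of the three per-plane features. The names `make_planes`, `lookup`, and `triplane_features` are hypothetical and chosen only for illustration.

```python
import numpy as np

def make_planes(resolution=16, channels=4, seed=0):
    """Create three random 2D feature grids, one per axis-aligned plane.

    Assumption: dense grids stand in for the hash grids a real model would learn.
    """
    rng = np.random.default_rng(seed)
    return {
        name: rng.standard_normal((resolution, resolution, channels))
        for name in ("xy", "yz", "xz")
    }

def lookup(plane, u, v):
    """Nearest-neighbor feature lookup for coordinates u, v in [0, 1]."""
    res = plane.shape[0]
    i = np.clip(np.rint(u * (res - 1)).astype(int), 0, res - 1)
    j = np.clip(np.rint(v * (res - 1)).astype(int), 0, res - 1)
    return plane[i, j]

def triplane_features(planes, points):
    """Project 3D points onto the XY, YZ, and XZ planes and concatenate features.

    points: array of shape (N, 3) with coordinates in [0, 1].
    returns: array of shape (N, 3 * channels).
    """
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    return np.concatenate(
        [
            lookup(planes["xy"], x, y),
            lookup(planes["yz"], y, z),
            lookup(planes["xz"], x, z),
        ],
        axis=-1,
    )

if __name__ == "__main__":
    planes = make_planes()
    points = np.array([[0.1, 0.5, 0.9], [0.3, 0.3, 0.3]])
    print(triplane_features(planes, points).shape)
```

In a full NeRF-style pipeline, a feature vector like this would typically be passed, together with conditioning signals such as the lip and expression features, to a small network that predicts color and density. That part is omitted here.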

BibTeX

@article{peng2023synctalk,
  title={SyncTalk: The Devil is in the Synchronization for Talking Head Synthesis},
  author={Ziqiao Peng and Wentao Hu and Yue Shi and Xiangyu Zhu and Xiaomei Zhang and Jun He and Hongyan Liu and Zhaoxin Fan},
  journal={arXiv preprint arXiv:2311.17590},
  year={2023}
}

SyncTalk: The Devil😈 is in the Synchronization for Talking Head Synthesis

SyncTalk synthesizes synchronized talking head videos, employing tri-plane hash representations to maintain subject identity. It can generate synchronized lip movements, facial expressions, and stable head poses, and restores hair details to create high-resolution videos.
