SyncTalk: The Devil 😈 is in the Synchronization for Talking Head Synthesis

Ziqiao Peng¹, Wentao Hu², Yue Shi³, Xiangyu Zhu⁴, Xiaomei Zhang⁴, Hao Zhao⁵,
Jun He¹, Hongyan Liu⁵*, Zhaoxin Fan¹*
¹Renmin University of China, ²Beijing University of Posts and Telecommunications, ³Psyche AI Inc., ⁴Chinese Academy of Sciences, ⁵Tsinghua University

SyncTalk synthesizes synchronized talking head videos, employing a tri-plane hash representation to maintain subject identity. It generates synchronized lip movements, facial expressions, and stable head poses, and restores hair details to create high-resolution videos.

Abstract

Achieving high synchronization in the synthesis of realistic, speech-driven talking head videos presents a significant challenge. Traditional Generative Adversarial Networks (GANs) struggle to maintain consistent facial identity, while Neural Radiance Fields (NeRF) methods, although they can address this issue, often produce mismatched lip movements, inadequate facial expressions, and unstable head poses. A lifelike talking head requires synchronized coordination of subject identity, lip movements, facial expressions, and head poses. The absence of these synchronizations is a fundamental flaw, leading to unrealistic and artificial outcomes.

To address the critical issue of synchronization, identified as the "devil" in creating realistic talking heads, we introduce SyncTalk. This NeRF-based method effectively maintains subject identity, enhancing synchronization and realism in talking head synthesis. SyncTalk employs a Face-Sync Controller to align lip movements with speech and innovatively uses a 3D facial blendshape model to capture accurate facial expressions. Our Head-Sync Stabilizer optimizes head poses, achieving more natural head movements. The Portrait-Sync Generator restores hair details and blends the generated head with the torso for a seamless visual experience. Extensive experiments and user studies demonstrate that SyncTalk outperforms state-of-the-art methods in synchronization and realism. We recommend watching the supplementary video.
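
The 3D facial blendshape model mentioned above can be grounded with the standard linear blendshape formulation: an expressive face is the neutral mesh plus a weighted sum of per-blendshape vertex offsets. The NumPy sketch below is a generic illustration of that formula, not code from the paper; all names, shapes, and the assumed 52-blendshape rig size are illustrative.

  import numpy as np

  def apply_blendshapes(neutral, deltas, weights):
      """Linear blendshape model: neutral face plus a weighted sum of
      per-blendshape vertex offsets (a generic formulation, not
      SyncTalk's code).

      neutral: (V, 3) neutral-expression vertex positions
      deltas:  (B, V, 3) vertex offsets for each of B blendshapes
      weights: (B,) expression coefficients, typically in [0, 1]
      """
      # Contract the blendshape axis: sum_i weights[i] * deltas[i]
      return neutral + np.tensordot(weights, deltas, axes=1)

  # Toy example with an assumed 52-blendshape rig on a 5,000-vertex mesh
  V, B = 5000, 52
  neutral = np.zeros((V, 3))
  deltas = 0.01 * np.random.randn(B, V, 3)
  weights = np.zeros(B)
  weights[0] = 0.8  # drive a single coefficient, e.g. a jaw-open shape
  expressive = apply_blendshapes(neutral, deltas, weights)  # (V, 3)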



Proposed Method

Overview of SyncTalk. Given a cropped reference video of a talking head and the corresponding speech, SyncTalk extracts lip features, expression features, and the head pose through two synchronization modules, the Face-Sync Controller and the Head-Sync Stabilizer. The Tri-Plane Hash Representation then models the head, outputting a rough speech-driven video. The Portrait-Sync Generator further restores details such as hair and background, ultimately producing a high-resolution talking head video.
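
As a reading aid, the data flow in this overview can be summarized in a short PyTorch-style sketch. Every class and argument name here is an assumption for illustration, not the authors' released API; the four submodules stand in for the components named in the caption.

  import torch.nn as nn

  class SyncTalkPipeline(nn.Module):
      """Minimal sketch of the data flow described in the overview.
      All names are illustrative assumptions, not the authors' API."""

      def __init__(self, face_sync, head_sync, triplane_renderer, portrait_sync):
          super().__init__()
          self.face_sync = face_sync          # Face-Sync Controller
          self.head_sync = head_sync          # Head-Sync Stabilizer
          self.renderer = triplane_renderer   # Tri-Plane Hash Representation
          self.portrait_sync = portrait_sync  # Portrait-Sync Generator

      def forward(self, audio, frames):
          # Speech-aligned lip features and blendshape expression features
          lip_feat, expr_feat = self.face_sync(audio, frames)
          # Stabilized per-frame head pose
          head_pose = self.head_sync(frames)
          # Rough speech-driven render of the head
          coarse = self.renderer(lip_feat, expr_feat, head_pose)
          # Restore hair and background details at full resolution
          return self.portrait_sync(coarse, frames)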

BibTeX


  @article{peng2023synctalk,
    title={SyncTalk: The Devil is in the Synchronization for Talking Head Synthesis},
    author={Ziqiao Peng and Wentao Hu and Yue Shi and Xiangyu Zhu and Xiaomei Zhang and Hao Zhao and Jun He and Hongyan Liu and Zhaoxin Fan},
    journal={arXiv preprint arXiv:2311.17590},
    year={2023}
  }