ICCV 2023

EmoTalk: Speech-Driven Emotional Disentanglement for 3D Face Animation

Ziqiao Peng1, Haoyu Wu1, Zhenbo Song2, Hao Xu3,6, Xiangyu Zhu4
Hongyan Liu5, Jun He1, Zhaoxin Fan1,6
1Renmin University of China, 2Nanjing University of Science and Technology, 3The Hong Kong University of Science and Technology, 4Chinese Academy of Sciences, 5Tsinghua University, 6Psyche AI Inc.

EmoTalk is an end-to-end neural network for generating speech-driven emotion-enhanced 3D facial animation.

Abstract

Speech-driven 3D face animation aims to generate realistic facial expressions that match the speech content and emotion. However, existing methods often neglect emotional facial expressions or fail to disentangle them from speech content.

To address this issue, this paper proposes an end-to-end neural network that disentangles different emotions in speech so as to generate rich 3D facial expressions. Specifically, we introduce the emotion disentangling encoder (EDE), which separates emotion and content in the speech by cross-reconstructing speech signals with different emotion labels. An emotion-guided feature fusion decoder then generates a 3D talking face with enhanced emotion. The decoder is driven by the disentangled identity, emotion, and content embeddings, enabling controllable personal and emotional styles.
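As a rough illustration of the cross-reconstruction idea, the sketch below swaps the emotion embeddings of two clips that are assumed to share the same content but carry different emotions; the module choices, feature sizes, and loss are illustrative placeholders, not the released EmoTalk architecture.

import torch
import torch.nn as nn
import torch.nn.functional as F

class DisentangleSketch(nn.Module):
    """Toy emotion/content disentanglement via cross-reconstruction."""
    def __init__(self, feat_dim=256):
        super().__init__()
        # Two encoders project frame-level speech features into two latent spaces.
        self.content_enc = nn.GRU(feat_dim, feat_dim, batch_first=True)
        self.emotion_enc = nn.GRU(feat_dim, feat_dim, batch_first=True)
        self.decoder = nn.Sequential(
            nn.Linear(2 * feat_dim, feat_dim), nn.ReLU(),
            nn.Linear(feat_dim, feat_dim))

    def encode(self, x):
        content, _ = self.content_enc(x)
        emotion, _ = self.emotion_enc(x)
        return content, emotion

    def forward(self, x_a, x_b):
        # x_a, x_b: (B, T, feat_dim) clips with the same content, different emotions.
        c_a, e_a = self.encode(x_a)
        c_b, e_b = self.encode(x_b)
        # Content from one clip plus emotion from the other should reproduce
        # the clip whose emotion was borrowed, which pushes the two spaces apart.
        recon_b = self.decoder(torch.cat([c_a, e_b], dim=-1))
        recon_a = self.decoder(torch.cat([c_b, e_a], dim=-1))
        return recon_a, recon_b

def cross_reconstruction_loss(model, x_a, x_b):
    recon_a, recon_b = model(x_a, x_b)
    return F.mse_loss(recon_a, x_a) + F.mse_loss(recon_b, x_b)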

Finally, considering the scarcity of 3D emotional talking face data, we resort to supervision by facial blendshapes, which enables the reconstruction of plausible 3D faces from 2D emotional data, and we contribute a large-scale 3D emotional talking face dataset (3D-ETF) to train the network. Our experiments and user studies demonstrate that our approach outperforms state-of-the-art methods and exhibits more diverse facial movements. We recommend watching the supplementary video.
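To make the blendshape supervision concrete, here is a generic linear blendshape model in which a face mesh is the neutral shape plus a weighted sum of per-blendshape offsets; the vertex count, the 52-coefficient convention, and the random basis are assumptions for illustration only.

import numpy as np

NUM_VERTICES = 5023       # illustrative; e.g. the FLAME head mesh
NUM_BLENDSHAPES = 52      # assumed ARKit-style coefficient count

neutral = np.zeros((NUM_VERTICES, 3))                       # neutral face vertices
basis = np.random.randn(NUM_BLENDSHAPES, NUM_VERTICES, 3)   # placeholder offsets

def apply_blendshapes(coeffs):
    # V = V_neutral + sum_i w_i * B_i, with coeffs of shape (NUM_BLENDSHAPES,)
    return neutral + np.einsum('i,ivc->vc', coeffs, basis)

# Supervision can then be an L2 loss between predicted coefficients (or the
# resulting vertices) and pseudo-ground-truth recovered from 2D emotional videos.
pred = np.clip(np.random.rand(NUM_BLENDSHAPES), 0.0, 1.0)
mesh = apply_blendshapes(pred)    # (NUM_VERTICES, 3)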



Proposed Method



Overview of EmoTalk. Given a speech input, an emotion level, and a personal style, our model disentangles the emotion and content in the speech using two latent spaces. The features extracted from these latent spaces are combined and fed into the emotion-guided feature fusion decoder, which outputs emotion-enhanced blendshape coefficients. These coefficients can be used to animate a FLAME model or be rendered as an image sequence.
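The snippet below is a schematic of this inference flow, assuming frame-level speech features (e.g. from a pretrained speech encoder), an integer emotion-level id, and a personal-style id; all layer choices and dimensions are placeholders rather than the released implementation.

import torch
import torch.nn as nn

class EmoTalkSketch(nn.Module):
    def __init__(self, audio_dim=1024, feat_dim=256, num_blendshapes=52,
                 num_levels=2, num_styles=24):
        super().__init__()
        self.content_enc = nn.Linear(audio_dim, feat_dim)     # stand-in content branch
        self.emotion_enc = nn.Linear(audio_dim, feat_dim)     # stand-in emotion branch
        self.level_emb = nn.Embedding(num_levels, feat_dim)   # emotion level
        self.style_emb = nn.Embedding(num_styles, feat_dim)   # personal style
        self.decoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=feat_dim, nhead=4, batch_first=True),
            num_layers=2)
        self.head = nn.Linear(feat_dim, num_blendshapes)

    def forward(self, audio_feats, level_id, style_id):
        # audio_feats: (B, T, audio_dim) frame-level speech features
        fused = (self.content_enc(audio_feats) + self.emotion_enc(audio_feats)
                 + self.level_emb(level_id)[:, None]
                 + self.style_emb(style_id)[:, None])
        # Per-frame blendshape coefficients in [0, 1], ready to drive a FLAME
        # model or a renderer.
        return torch.sigmoid(self.head(self.decoder(fused)))

model = EmoTalkSketch()
audio = torch.randn(1, 100, 1024)   # 100 frames of placeholder speech features
coeffs = model(audio, torch.tensor([1]), torch.tensor([0]))   # (1, 100, 52)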

BibTeX


  @inproceedings{peng2023emotalk,
    title={EmoTalk: Speech-Driven Emotional Disentanglement for 3D Face Animation}, 
    author={Ziqiao Peng and Haoyu Wu and Zhenbo Song and Hao Xu and Xiangyu Zhu and Hongyan Liu and Jun He and Zhaoxin Fan},
    booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision},
    year={2023}
  }