ActAvatar

Temporally-Aware Precise Action Control for Talking Avatars

  • Ziqiao Peng1*
  • Yi Chen2
  • Yifeng Ma2
  • Guozhen Zhang2
  • Zhiyao Sun2
  • Zixiang Zhou2
  • Youliang Zhang2
  • Zhengguang Zhou2
  • Zhaoxin Fan3
  • Hongyan Liu4✉
  • Yuan Zhou2†
  • Qinglin Lu2✉
  • Jun He1✉
  • 1 Renmin University of China
  • 2 Tencent Hunyuan
  • 3 Beihang University
  • 4 Tsinghua University

† Project Leader. ✉ Corresponding Author. * Work done during an internship at Tencent Hunyuan.

We present ActAvatar, a novel framework that achieves precise temporal control over talking avatar actions at the phase level. Whereas existing methods generate avatars from holistic text descriptions without temporal grounding, ActAvatar introduces structured prompts with explicit temporal boundaries, enabling users to specify exactly what actions should occur, and when, during speech.
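For illustration, a phase-level structured prompt for the chess scenario shown later on this page might look like the following. This is a hypothetical example of the prompt style; the exact schema ActAvatar uses is not given on this page, and the timestamps are invented for illustration:

  Phase 1 [0.0s-2.5s]: Lift the head upward and to the left.
  Phase 2 [2.5s-5.0s]: Raise the left hand to chest level in an open questioning gesture.
  Phase 3 [5.0s-7.0s]: Lower the hand and return the gaze to the chessboard.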

ActAvatar Results Gallery
Below we showcase diverse talking avatars generated by ActAvatar with phase-level temporal precision. Each video demonstrates our model's ability to execute fine-grained, temporally-grounded actions synchronized with speech, following structured prompts that specify exact timing and semantic content of gestures, head movements, and expressions. Notice how actions occur at precise moments aligned with speech phases, achieving natural coordination between audio and motion that prior methods struggle to capture.

Examples

Examples demonstrating ActAvatar's performance across various action types and temporal complexities.

[Gallery of 19 example videos demonstrating phase-level action control.]
Qualitative Comparisons
We show qualitative comparisons on four representative prompts. Each row uses the same reference image, audio, and action description; each method then generates the avatar motion. The method name is shown below each video.
Scenario 1: Thoughtful Chess Analysis
A middle-aged man sits at a chessboard, reflecting on his next move. The action description requires him to first lift his head upward and to the left, then raise his left hand to chest level in an open questioning gesture before returning to look at the board.
ActAvatar (ours)
FantasyTalking
OmniAvatar
StableAvatar
Scenario 2: Artist Explaining Her Process
A woman stands in an art gallery, explaining her creative process. The prompt asks her to first lift her head from a downward gaze to look forward while holding a sketchbook, then release the sketchbook with her left hand and gesture outward and upward to chest level.
ActAvatar (ours)
EchoMimicV3
HunyuanVideo-Avatar
WanS2V
Scenario 3: Elderly Woman in a Garden
An elderly woman crouches in her sunny garden, warmly talking about her flowers. The two phases involve first gesturing broadly to indicate the whole garden, then bringing her hand back to point down toward the flowers in front of her.
ActAvatar (ours)
EchoMimicV3
FantasyTalking
WanS2V
Scenario 4: Director Framing a Shot
A male director in a studio uses his hands to frame a shot while giving instructions. The action sequence requires him to first narrow the hand frame by moving his hands inward, then widen it by moving both hands out toward shoulder width in a pull-back motion.
ActAvatar (ours)
Hallo3
MultiTalk
StableAvatar
Method Overview
ActAvatar introduces a framework for phase-level temporal action control in talking avatar generation. Our approach consists of three key technical components: (1) Phase-Aware Cross-Attention (PACA), which enables fine-grained temporal grounding by injecting phase-specific action features at precise moments; (2) Progressive Audio-Visual Alignment, which achieves tight synchronization between speech and motion through hierarchical feature matching; and (3) Two-Stage Training with Structured Prompts, in which we leverage an MLLM to generate action descriptions with explicit temporal boundaries. The architecture builds on a pre-trained video diffusion model, augmented with our temporal control modules to achieve precise action timing and faithful action semantics.
[Figure: ActAvatar pipeline diagram]
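To make the PACA idea concrete, below is a minimal sketch of phase-masked cross-attention. Everything here is our own assumption for illustration (single-head attention, per-frame binary phase masks, the function name phase_aware_cross_attention); it is not the authors' implementation.

  # Minimal, hypothetical sketch of the phase-masking idea behind
  # Phase-Aware Cross-Attention (PACA). Shapes, names, and the
  # single-head formulation are assumptions, not the authors' code.
  import torch
  import torch.nn.functional as F

  def phase_aware_cross_attention(frame_feats, phase_feats, phase_mask):
      """Attend each video frame only to the text tokens of its active phase(s).

      frame_feats: (T, d)    per-frame video latent features (queries)
      phase_feats: (P, L, d) text features for P action phases, L tokens each
      phase_mask:  (T, P)    1 if phase p is active at frame t, else 0;
                             assumes every frame has at least one active phase
      """
      T, d = frame_feats.shape
      P, L, _ = phase_feats.shape
      kv = phase_feats.reshape(P * L, d)              # flatten phases into one token bank
      logits = frame_feats @ kv.T / d ** 0.5          # (T, P*L) scaled dot-product scores
      # Expand the per-phase mask to token level and block inactive phases.
      token_mask = phase_mask.repeat_interleave(L, dim=1).bool()  # (T, P*L)
      logits = logits.masked_fill(~token_mask, float("-inf"))
      weights = F.softmax(logits, dim=-1)             # attention only over active tokens
      return weights @ kv                             # (T, d) phase-conditioned features

Restricting each frame's attention to the tokens of its currently active phase is one plausible way to inject phase-specific action features at precise moments; the actual module in the paper may differ.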
Citation
If you find ActAvatar useful for your research, please consider citing our paper:
@article{peng2025actavatar,
  title={ActAvatar: Temporally-Aware Precise Action Control for Talking Avatars},
  author={Peng, Ziqiao and Chen, Yi and Ma, Yifeng and Zhang, Guozhen and Sun, Zhiyao and Zhou, Zixiang and Zhang, Youliang and Zhou, Zhengguang and Fan, Zhaoxin and Liu, Hongyan and Zhou, Yuan and Lu, Qinglin and He, Jun},
  journal={arXiv preprint arXiv:2512.19546},
  year={2025}
}