ActAvatar

Temporally-Aware Precise Action Control for Talking Avatars

Ziqiao Peng¹*
Yi Chen²
Yifeng Ma²
Guozhen Zhang²
Zhiyao Sun²
Zixiang Zhou²
Youliang Zhang²
Zhengguang Zhou²
Zhaoxin Fan³
Hongyan Liu⁴✉
Yuan Zhou²†
Qinglin Lu²✉
Jun He¹✉

¹ Renmin University of China
² Tencent Hunyuan
³ Beihang University
⁴ Tsinghua University

† Project Leader · ✉ Corresponding Author · * Work done during an internship at Tencent Hunyuan

We present ActAvatar, a novel framework that achieves precise temporal control over talking avatar actions at the phase level. While existing methods generate avatars from holistic text descriptions without temporal grounding, ActAvatar introduces structured prompts with explicit temporal boundaries, enabling users to specify exactly when and what actions should occur during speech.
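This page does not give the prompt format, so the following is only a minimal Python sketch of what a structured prompt with explicit temporal boundaries could look like. The `ActionPhase` class, its field names, and both helper functions are hypothetical, and the timings in the example are invented for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class ActionPhase:
    """One phase of a structured prompt (hypothetical schema)."""
    start_s: float    # phase start, in seconds from the start of the audio
    end_s: float      # phase end (exclusive), in seconds
    description: str  # what the avatar should do during this phase


def validate_phases(phases: List[ActionPhase]) -> None:
    """Raise ValueError if a phase has an invalid window or overlaps the previous one."""
    prev_end = 0.0
    for p in phases:
        if p.start_s < 0 or p.end_s <= p.start_s:
            raise ValueError(f"invalid time window: {p}")
        if p.start_s < prev_end:
            raise ValueError(f"phase overlaps its predecessor: {p}")
        prev_end = p.end_s


def active_phase(phases: List[ActionPhase], t: float) -> Optional[ActionPhase]:
    """Return the phase whose window contains time t, or None if t is in a gap."""
    for p in phases:
        if p.start_s <= t < p.end_s:
            return p
    return None


# Example prompt; the action text is illustrative and the timings are made up.
prompt = [
    ActionPhase(0.0, 2.0, "lifts head upward and to the left"),
    ActionPhase(2.0, 4.5, "raises left hand to chest level in an open questioning gesture"),
]
validate_phases(prompt)
```

With this layout, `active_phase(prompt, t)` answers the "when and what" question for any timestamp: it returns the action that should be under way at time `t`, or `None` when no action is scheduled.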

ActAvatar Results Gallery

Below we showcase diverse talking avatars generated by ActAvatar with phase-level temporal precision. Each video demonstrates our model's ability to execute fine-grained, temporally-grounded actions synchronized with speech, following structured prompts that specify exact timing and semantic content of gestures, head movements, and expressions. Notice how actions occur at precise moments aligned with speech phases, achieving natural coordination between audio and motion that prior methods struggle to capture.

Examples

Examples demonstrating ActAvatar's performance across various action types and temporal complexities.

Videos: 26.mp4, 19.mp4, 58.mp4, 28.mp4, 30.mp4, 11.mp4, 14.mp4, 8.mp4, 45.mp4, 24.mp4, 54.mp4, 2.mp4, 42.mp4, 35.mp4, 38.mp4, 52.mp4, 34.mp4, 1.mp4, 61.mp4

Complete Gallery

Additional results demonstrating ActAvatar's consistency and robustness across diverse scenarios.

Videos: 9.mp4, 17.mp4, 32.mp4, 36.mp4, 43.mp4, 49.mp4, 22.mp4, 3.mp4, 4.mp4, 5.mp4, 6.mp4, 10.mp4, 12.mp4, 13.mp4, 27.mp4, 29.mp4, 31.mp4, 15.mp4, 16.mp4, 60.mp4, 18.mp4, 20.mp4, 21.mp4, 23.mp4, 25.mp4, 33.mp4, 37.mp4, 39.mp4, 40.mp4, 41.mp4, 44.mp4, 46.mp4, 47.mp4, 48.mp4, 56.mp4, 57.mp4, 50.mp4, 51.mp4, 53.mp4, 55.mp4, 59.mp4

Qualitative Comparisons

We show qualitative comparisons on four representative prompts. Each row uses the same reference image, audio, and action description; only the generation method differs. Each method's name is shown under its video.

Scenario 1: Thoughtful Chess Analysis
A middle-aged man sits at a chessboard, reflecting on his next move. The action description requires him to first lift his head upward and to the left, then raise his left hand to chest level in an open questioning gesture before returning to look at the board.

ActAvatar (ours)

FantasyTalking

OmniAvatar

StableAvatar

Scenario 2: Artist Explaining Her Process
A woman stands in an art gallery, explaining her creative process. The prompt asks her to first lift her head from a downward gaze to look forward while holding a sketchbook, then release the sketchbook with her left hand and gesture outward and upward to chest level.

ActAvatar (ours)

EchoMimicV3

HunyuanVideo-Avatar

WanS2V

Scenario 3: Elderly Woman in a Garden
An elderly woman crouches in her sunny garden, warmly talking about her flowers. The two phases involve first gesturing broadly to indicate the whole garden, then bringing her hand back to point down toward the flowers in front of her.

ActAvatar (ours)

EchoMimicV3

FantasyTalking

WanS2V

Scenario 4: Director Framing a Shot
A male director in a studio uses his hands to frame a shot while giving instructions. The action sequence requires him to first narrow the hand frame by moving his hands inward, then widen it by moving both hands out toward shoulder width in a pull-back motion.

ActAvatar (ours)

Hallo3

MultiTalk

StableAvatar

Method Overview

ActAvatar is a framework for phase-level temporal action control in talking avatar generation. It has three key components: (1) Phase-Aware Cross-Attention (PACA), which grounds actions in time by injecting phase-specific action features at the moments each phase specifies; (2) Progressive Audio-Visual Alignment, which synchronizes speech and motion through hierarchical feature matching; and (3) Two-Stage Training with Structured Prompts, in which a multimodal large language model (MLLM) generates action descriptions with explicit temporal boundaries. The architecture builds on a pre-trained video diffusion model and adds these temporal control modules to improve the precision of action timing and semantic fidelity.
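This overview does not specify how PACA is implemented. One generic way to restrict text features to particular moments is to mask cross-attention so that each video frame only attends to the text tokens of the phase covering its timestamp. The sketch below illustrates that generic idea in plain Python and should not be read as the authors' module; the `PhaseSpan` layout, the function names, and the assumption that each phase owns a contiguous slice of tokens are all hypothetical.

```python
import math
from typing import List, Tuple

# (start_s, end_s, first_token, last_token_exclusive): a hypothetical layout in
# which each phase owns a contiguous slice of the text-token sequence.
PhaseSpan = Tuple[float, float, int, int]


def phase_mask(num_frames: int, fps: float, num_tokens: int,
               spans: List[PhaseSpan]) -> List[List[bool]]:
    """Build mask[f][k], True when frame f may attend to text token k."""
    mask = [[False] * num_tokens for _ in range(num_frames)]
    for f in range(num_frames):
        t = f / fps  # timestamp of frame f in seconds
        for start_s, end_s, tok_lo, tok_hi in spans:
            if start_s <= t < end_s:
                for k in range(tok_lo, tok_hi):
                    mask[f][k] = True
    return mask


def masked_softmax(scores: List[float], allowed: List[bool]) -> List[float]:
    """Softmax over allowed positions; disallowed positions get weight 0."""
    kept = [s for s, a in zip(scores, allowed) if a]
    if not kept:
        # Frame lies outside every phase: no text token contributes.
        return [0.0] * len(scores)
    m = max(kept)  # subtract the max for numerical stability
    exps = [math.exp(s - m) if a else 0.0 for s, a in zip(scores, allowed)]
    z = sum(exps)
    return [e / z for e in exps]


# Two one-second phases, two tokens each, sampled at 2 frames per second.
spans = [(0.0, 1.0, 0, 2), (1.0, 2.0, 2, 4)]
mask = phase_mask(num_frames=4, fps=2.0, num_tokens=4, spans=spans)
weights = masked_softmax([0.0, 0.0, 5.0, 5.0], mask[0])
```

In this example the first frame falls inside the first phase, so its attention weights are spread only over that phase's two tokens, even though the second phase's tokens have higher raw scores. A real system would apply the same kind of mask to latent frames inside the diffusion model and would likely keep any global, non-phase tokens visible to every frame.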

Citation

If you find ActAvatar useful for your research, please consider citing our paper:

@article{peng2025actavatar,
  title={ActAvatar: Temporally-Aware Precise Action Control for Talking Avatars},
  author={Peng, Ziqiao and Chen, Yi and Ma, Yifeng and Zhang, Guozhen and Sun, Zhiyao and Zhou, Zixiang and Zhang, Youliang and Zhou, Zhengguang and Fan, Zhaoxin and Liu, Hongyan and Zhou, Yuan and Lu, Qinglin and He, Jun},
  journal={arXiv preprint arXiv:2512.19546},
  year={2025}
}