Introduction video of EgoMAN with qualitative results.
Prior works on 3D hand trajectory prediction are constrained by datasets that decouple motion from semantic supervision and by models that only weakly link reasoning and action. To address these limitations, we first present the EgoMAN dataset, a large-scale egocentric dataset for interaction stage–aware 3D hand trajectory prediction, with 219K 6DoF trajectories and 3M structured QA pairs covering semantic, spatial, and motion reasoning.
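As an illustration of how such a sample might be organized (the field names and shapes below are hypothetical, not the released schema), one record could pair a 6DoF hand trajectory with interaction-stage labels and structured QA, as in this minimal Python sketch:

from dataclasses import dataclass
from typing import List
import numpy as np

@dataclass
class QAPair:
    """One structured question-answer pair (semantic, spatial, or motion reasoning)."""
    category: str   # e.g. "semantic" | "spatial" | "motion"
    question: str
    answer: str

@dataclass
class HandTrajectorySample:
    """Hypothetical layout of one interaction stage-aware trajectory sample."""
    rgb_frame: np.ndarray        # (H, W, 3) egocentric input frame
    intention_text: str          # natural-language intention for the interaction
    stage_labels: List[str]      # per-step interaction stage, e.g. "reach", "grasp"
    trajectory_6dof: np.ndarray  # (T, 7): translation (3) + unit quaternion (4) per step
    qa_pairs: List[QAPair]       # structured QA used as reasoning supervision

def trajectory_path_length(sample: HandTrajectorySample) -> float:
    """Cumulative translation distance along the trajectory (meters, if positions are metric)."""
    xyz = sample.trajectory_6dof[:, :3]
    return float(np.linalg.norm(np.diff(xyz, axis=0), axis=1).sum())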
We then introduce the EgoMAN model, a reasoning-to-motion framework that links vision–language reasoning and motion generation through a trajectory-token interface. Trained progressively to align reasoning with motion dynamics, our approach yields accurate, stage-aware trajectories that generalize across real-world scenes.
Given an input frame and diverse intention texts, our model generates multiple, correspondingly diverse trajectory predictions.
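As a usage sketch (the interface below is hypothetical and may differ from the released API), prediction could amount to conditioning on one frame and several intention texts and sampling a few trajectories per intention:

import numpy as np
from typing import Dict, List

def predict_trajectories(model, frame: np.ndarray, intentions: List[str],
                         samples_per_intention: int = 3) -> Dict[str, List[np.ndarray]]:
    """Hypothetical wrapper: sample diverse 6DoF hand trajectories per intention text.

    `model.generate` is assumed to return an array of shape (T, 7)
    (translation + quaternion per step) given a frame and an intention string.
    """
    predictions: Dict[str, List[np.ndarray]] = {}
    for text in intentions:
        predictions[text] = [
            model.generate(frame=frame, intention=text)  # stochastic sampling assumed
            for _ in range(samples_per_intention)
        ]
    return predictions

# Example (assuming `egoman_model` and `frame` are loaded elsewhere):
# trajs = predict_trajectories(egoman_model, frame,
#                              ["pick up the mug", "open the drawer"])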
@misc{chen2025flowingreasoningmotionlearning,
  title={Flowing from Reasoning to Motion: Learning 3D Hand Trajectory Prediction from Egocentric Human Interaction Videos},
  author={Mingfei Chen and Yifan Wang and Zhengqin Li and Homanga Bharadhwaj and Yujin Chen and Chuan Qin and Ziyi Kou and Yuan Tian and Eric Whitmire and Rajinder Sodhi and Hrvoje Benko and Eli Shlizerman and Yue Liu},
  year={2025},
  eprint={2512.16907},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2512.16907},
}