Introduction video of EgoMAN with qualitative results.
Prior works on 3D hand trajectory prediction are constrained by datasets that decouple motion from semantic supervision and by models that only weakly link reasoning and action. To address these limitations, we first present the EgoMAN dataset, a large-scale egocentric dataset for interaction stage–aware 3D hand trajectory prediction, with 219K 6DoF trajectories and 3M structured QA pairs covering semantic, spatial, and motion reasoning.
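As an illustration of how such a sample might be organized (the field names and shapes below are hypothetical, not the released schema), one record could pair a 6DoF hand trajectory with interaction-stage labels and structured QA, as in this minimal Python sketch:

from dataclasses import dataclass
from typing import List
import numpy as np

@dataclass
class QAPair:
    """One structured question-answer pair (semantic, spatial, or motion reasoning)."""
    category: str   # e.g. "semantic" | "spatial" | "motion"
    question: str
    answer: str

@dataclass
class HandTrajectorySample:
    """Hypothetical layout of one interaction stage-aware trajectory sample."""
    rgb_frame: np.ndarray        # (H, W, 3) egocentric input frame
    intention_text: str          # natural-language intention for the interaction
    stage_labels: List[str]      # per-step interaction stage, e.g. "reach", "grasp"
    trajectory_6dof: np.ndarray  # (T, 7): translation (3) + unit quaternion (4) per step
    qa_pairs: List[QAPair]       # structured QA used as reasoning supervision

def trajectory_path_length(sample: HandTrajectorySample) -> float:
    """Cumulative translation distance along the trajectory (meters, if positions are metric)."""
    xyz = sample.trajectory_6dof[:, :3]
    return float(np.linalg.norm(np.diff(xyz, axis=0), axis=1).sum())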
We then introduce the EgoMAN model, a reasoning-to-motion framework that links vision–language reasoning and motion generation through a trajectory-token interface. Trained progressively to align reasoning with motion dynamics, our approach yields accurate, stage-aware trajectories that generalize across real-world scenes.
Given an input frame and diverse intention texts, our model generates multiple, correspondingly diverse trajectory predictions.
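As a usage sketch (the interface below is hypothetical and may differ from the released API), prediction could amount to conditioning on one frame and several intention texts and sampling a few trajectories per intention:

import numpy as np
from typing import Dict, List

def predict_trajectories(model, frame: np.ndarray, intentions: List[str],
                         samples_per_intention: int = 3) -> Dict[str, List[np.ndarray]]:
    """Hypothetical wrapper: sample diverse 6DoF hand trajectories per intention text.

    `model.generate` is assumed to return an array of shape (T, 7)
    (translation + quaternion per step) given a frame and an intention string.
    """
    predictions: Dict[str, List[np.ndarray]] = {}
    for text in intentions:
        predictions[text] = [
            model.generate(frame=frame, intention=text)  # stochastic sampling assumed
            for _ in range(samples_per_intention)
        ]
    return predictions

# Example (assuming `egoman_model` and `frame` are loaded elsewhere):
# trajs = predict_trajectories(egoman_model, frame,
#                              ["pick up the mug", "open the drawer"])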
@misc{chen2025flowingreasoningmotionlearning,
  title={Flowing from Reasoning to Motion: Learning 3D Hand Trajectory Prediction from Egocentric Human Interaction Videos},
  author={Mingfei Chen and Yifan Wang and Zhengqin Li and Homanga Bharadhwaj and Yujin Chen and Chuan Qin and Ziyi Kou and Yuan Tian and Eric Whitmire and Rajinder Sodhi and Hrvoje Benko and Eli Shlizerman and Yue Liu},
  year={2025},
  eprint={2512.16907},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2512.16907},
}