Introduction video of EgoMAN with qualitative results.
Our work addresses 3D hand trajectory prediction in egocentric interaction, where future hand motion is inferred from visual observations, past motion, spatial context, and intent. Real-world actions follow stage-aware interaction structures (e.g., approach, manipulate) that describe how the hand interacts with objects over time. However, prior work typically treats trajectory prediction as continuous signal regression, decoupling motion from semantic supervision and ignoring interaction structure. Without stage-aware cues to infer intent, models struggle to separate purposeful motion from egocentric noise and to generalize across diverse interactions. We therefore present EgoMAN, a unified framework for interaction-structured 3D hand trajectory prediction that models hand motion as stage-aware interactions between the hand and surrounding objects. EgoMAN introduces a novel Trajectory-Token Interface in which a small set of tokens encodes interaction stages, temporal progression, and 6DoF pose, so that stage-aware reasoning can guide efficient long-horizon 3D trajectory generation while preserving physical interpretability. To support this formulation, we construct the EgoMAN dataset with 219K 6DoF trajectories, stage-aware annotations, and 3M semantic, spatial, and motion QA pairs. Experiments show that EgoMAN improves trajectory accuracy, smoothness, and generalization, enabling interaction-structured reasoning for egocentric hand motion prediction, with applications in robotics and assistive systems.
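To make the idea of a sparse, stage-aware token set concrete, the following is a minimal illustrative sketch and not the EgoMAN implementation. The `TrajectoryToken` class, its field names, the quaternion convention, and the `densify_positions` helper are all hypothetical. Plain linear interpolation stands in for the learned trajectory generator purely to show how a few tokens carrying stage, time, and pose could summarize a longer trajectory; orientation is stored but not interpolated here.

```python
from dataclasses import dataclass
from typing import List, Tuple

Vec3 = Tuple[float, float, float]
Quat = Tuple[float, float, float, float]  # (w, x, y, z); convention is an assumption


@dataclass(frozen=True)
class TrajectoryToken:
    """Hypothetical token marking the end of one interaction stage."""
    stage: str         # e.g. "approach" or "manipulate"
    time_s: float      # time at which the stage ends, in seconds from the observation
    position: Vec3     # hand position at that time (units assumed to be metres)
    orientation: Quat  # hand orientation at that time (carried along, not interpolated)


def densify_positions(
    start: Vec3, tokens: List[TrajectoryToken], step_s: float
) -> List[Tuple[float, str, Vec3]]:
    """Expand sparse tokens into (time, stage, position) samples.

    Linear interpolation is a stand-in for a learned generator.
    """
    samples = []
    prev_t, prev_p = 0.0, start
    for tok in sorted(tokens, key=lambda t: t.time_s):
        duration = tok.time_s - prev_t
        n = max(1, round(duration / step_s))  # number of samples in this stage
        for i in range(1, n + 1):
            a = i / n  # fraction of the stage completed
            p = tuple(u + a * (v - u) for u, v in zip(prev_p, tok.position))
            samples.append((prev_t + a * duration, tok.stage, p))
        prev_t, prev_p = tok.time_s, tok.position
    return samples
```

In this sketch, two tokens (one ending an approach stage, one ending a manipulate stage) are enough to recover a dense, stage-labelled path, which is the sense in which a small token set can remain physically interpretable.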
Given an input frame and diverse intention texts, our model generates correspondingly diverse trajectory predictions.