EgoMAN: Interaction-Structured Reasoning for Egocentric 3D Hand Trajectory Prediction

ECCV 2026

¹Meta, ²University of Washington
EgoMAN Teaser

EgoMAN predicts egocentric 3D hand trajectories through interaction-structured reasoning. Given visual observations, past hand motion, and an intent query, the model infers key interaction stages (start, contact, end) and generates intent-consistent 6DoF trajectories. To enable this formulation, we introduce the EgoMAN dataset, a large-scale egocentric benchmark with stage-aware trajectory annotations and 3M structured QA pairs for semantic, spatial, and motion reasoning.
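To make the stage-based formulation concrete, here is a minimal sketch of how stage waypoints could be stored and expanded into a dense path. It is an illustration only, not the EgoMAN implementation: the `StageWaypoint` record, its field names, the quaternion convention, the units, and the piecewise-linear interpolation are all assumptions. Only the stage names (start, contact, end) and the use of 6DoF poses come from the description above.

```python
from dataclasses import dataclass
from typing import List, Tuple

Vec3 = Tuple[float, float, float]
Quat = Tuple[float, float, float, float]


@dataclass(frozen=True)
class StageWaypoint:
    """Hypothetical record: one interaction stage with a time and a 6DoF pose."""
    stage: str           # "start", "contact", or "end"
    time_s: float        # seconds from the start of the prediction window (assumed)
    position: Vec3       # xyz in metres (assumed unit)
    rotation_wxyz: Quat  # unit quaternion, w-x-y-z order (assumed convention)


def interpolate_position(waypoints: List[StageWaypoint], t: float) -> Vec3:
    """Piecewise-linear xyz at time t, clamped to the first and last waypoint.

    Assumes waypoints are sorted by strictly increasing time_s.
    Orientation is left out here; it would need proper rotation
    interpolation (for example slerp) rather than linear blending.
    """
    if t <= waypoints[0].time_s:
        return waypoints[0].position
    for a, b in zip(waypoints, waypoints[1:]):
        if t <= b.time_s:
            u = (t - a.time_s) / (b.time_s - a.time_s)
            return tuple(pa + u * (pb - pa) for pa, pb in zip(a.position, b.position))
    return waypoints[-1].position


# Made-up example values, purely for illustration.
identity = (1.0, 0.0, 0.0, 0.0)
waypoints = [
    StageWaypoint("start", 0.0, (0.0, 0.0, 0.3), identity),
    StageWaypoint("contact", 1.0, (0.1, -0.1, 0.5), identity),
    StageWaypoint("end", 2.0, (0.3, 0.0, 0.4), identity),
]

# Sample five evenly spaced positions across the 2-second window.
dense = [interpolate_position(waypoints, 2.0 * i / 4) for i in range(5)]
```

The linear interpolation stands in for the trajectory generator only to show how a few stage-level poses can anchor a longer path; a learned model would replace it.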

Introduction video of EgoMAN with qualitative results.

Abstract

Our work addresses 3D hand trajectory prediction in egocentric interaction, where future hand motion is inferred from visual observations, past motion, spatial context, and intent. Real-world actions follow stage-aware interaction structures (e.g., approach, manipulate) that describe how the hand interacts with objects over time. However, prior work typically treats trajectory prediction as continuous signal regression, decoupling motion from semantic supervision and ignoring interaction structure. Without stage-aware cues to infer intent, models struggle to separate purposeful motion from egocentric noise and to generalize across diverse interactions. We therefore present EgoMAN, a unified framework for interaction-structured 3D hand trajectory prediction that models hand motion as stage-aware interactions between the hand and surrounding objects. EgoMAN introduces a novel Trajectory-Token Interface in which a small set of tokens encodes interaction stages, temporal progression, and 6DoF pose, so that stage-aware reasoning guides efficient long-horizon 3D trajectory generation while preserving physical interpretability. To support this formulation, we construct the EgoMAN dataset with 219K 6DoF trajectories, stage-aware annotations, and 3M semantic, spatial, and motion QA pairs. Experiments show that EgoMAN improves trajectory accuracy, smoothness, and generalization, enabling interaction-structured reasoning about egocentric hand motion, with applications in robotics and assistive systems.
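This page does not name the metrics behind "trajectory accuracy". As an assumption, the sketch below uses two measures that are common in trajectory forecasting: average displacement error (ADE), the mean Euclidean distance over all time steps, and final displacement error (FDE), the distance at the last step. The function name and the sample values are hypothetical.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float, float]


def displacement_errors(pred: Sequence[Point], gt: Sequence[Point]) -> Tuple[float, float]:
    """Return (ADE, FDE) for two equally long sequences of 3D positions."""
    if len(pred) != len(gt) or len(pred) == 0:
        raise ValueError("pred and gt must be non-empty and of equal length")
    # Per-step Euclidean distance between predicted and ground-truth positions.
    dists = [math.dist(p, g) for p, g in zip(pred, gt)]
    ade = sum(dists) / len(dists)  # mean over all time steps
    fde = dists[-1]                # error at the final time step only
    return ade, fde


# Made-up positions in metres, purely for illustration.
pred = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
gt = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.3), (2.0, 0.4, 0.0)]
ade, fde = displacement_errors(pred, gt)
```

These cover positional error only; smoothness and orientation error would need separate measures.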

Trajectory Forecasting on EgoMAN Unseen
(Dynamic Ego-Video Overlay)

Zero-Shot Evaluation on HOT3D (Out-of-Domain)
(Dynamic Ego-Video Overlay)

Trajectory Prediction with Diverse Intention Text

Given an input frame and diverse intention texts, our model generates multiple diverse trajectory predictions.

BibTeX