HandsOnVLM

Vision-Language Models for Hand-Object Interaction Prediction

Carnegie Mellon University, UC San Diego
†Equal Advising

HandsOnVLM is a personal in-context action prediction assistant for daily activities. We evaluate HandsOnVLM across hundreds of diverse scenarios in homes, kitchens, and outdoor settings.


We develop a framework and propose two benchmarks for predicting the future interaction trajectories of human hands in a scene, given high-level colloquial task specifications in the form of natural language. This requires an extensive understanding of human daily activities and the ability to reason about what happens next from cues in the current scene. Our proposed benchmarks, Vanilla Hand Prediction (VHP) and Reasoning-Based Hand Prediction (RBHP), enable significant strides towards this goal. We build a model that integrates the high-level world knowledge and reasoning capabilities of Vision-Language Models (VLMs) with the auto-regressive nature of low-level ego-centric hand trajectories. Our model, HandsOnVLM, is a novel VLM that generates textual responses and produces future hand trajectories through natural-language conversations.

HandsOnVLM Architecture

HandsOnVLM is a video-based VLM capable of predicting future hand trajectories given a video context and language instructions. There are three key components of HandsOnVLM’s architecture: (1) SlowFast tokens that capture temporal information at fine temporal resolution, (2) a hand representation based on a vocabulary augmented with a <HAND> token, and (3) iterative position encodings that enable auto-regressive trajectory training and inference. During training, we fine-tune a pre-trained VLM with a combination of the next-token prediction loss and a trajectory loss, as sketched below.
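A minimal sketch of this combined objective, assuming PyTorch-style tensors: the function name, the trajectory-loss weight, and the simple L1 term here are illustrative and may differ from the exact loss used in the paper.

    import torch
    import torch.nn.functional as F

    def combined_loss(lm_logits, lm_targets, pred_traj, gt_traj, traj_weight=1.0):
        """Next-token prediction loss plus a trajectory regression loss.

        lm_logits: [B, T, V] language-model logits; lm_targets: [B, T] token ids
        pred_traj / gt_traj: [B, N, 2] future hand positions (normalized coords)
        """
        # Standard next-token cross-entropy over the (augmented) vocabulary.
        ce = F.cross_entropy(lm_logits.reshape(-1, lm_logits.size(-1)),
                             lm_targets.reshape(-1), ignore_index=-100)
        # Regression on the decoded hand waypoints (L1 shown for simplicity).
        traj = F.l1_loss(pred_traj, gt_traj)
        return ce + traj_weight * traj

    # Example with dummy tensors:
    loss = combined_loss(torch.randn(2, 8, 32000), torch.randint(0, 32000, (2, 8)),
                         torch.rand(2, 4, 2), torch.rand(2, 4, 2))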
  • SlowFast Token Compression: HandsOnVLM needs to interpret temporal information at a fine resolution while also capturing spatial relationships. We adapt SlowFast tokens that encode both slow and fast features from the egocentric video (see the first sketch after this list).
  • Hand as Embedding: HandsOnVLM extends the existing vocabulary with a new <HAND> token to represent the hand in the language space. Our experiments find that this representation is crucial for the model to generate accurate hand trajectories (see the second sketch further below).
  • Iterative Position Encoding: During inference, whenever <HAND> is predicted as the next token, we decode it into a hand position immediately. The decoded position is then encoded into the corresponding embedding for the following prediction rounds, ensuring that each subsequent prediction is conditioned on all previously predicted hand positions, as illustrated in the second sketch further below.
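A minimal sketch of the SlowFast token construction referenced in the first bullet above, assuming per-frame patch features from the visual encoder; the stride, the mean pooling, and the function name are illustrative and may differ from the released implementation.

    import torch

    def slowfast_tokens(frame_feats, slow_stride=4):
        """frame_feats: [T, N, D] patch features for T frames, N patches, dim D."""
        T, N, D = frame_feats.shape
        # Slow pathway: full spatial detail, but only every `slow_stride`-th frame.
        slow = frame_feats[::slow_stride].reshape(-1, D)   # [(T // slow_stride) * N, D]
        # Fast pathway: every frame, with the spatial dimension pooled away,
        # keeping fine temporal resolution at a low per-frame token cost.
        fast = frame_feats.mean(dim=1)                      # [T, D]
        # Concatenate into the visual token sequence fed to the language model.
        return torch.cat([slow, fast], dim=0)

    # Example: 32 frames, 256 patches per frame, 1024-dim features.
    tokens = slowfast_tokens(torch.randn(32, 256, 1024))    # -> [8 * 256 + 32, 1024]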
Overview of the HandsOnVLM architecture.
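The second sketch referenced in the list above illustrates the <HAND> token and the iterative position encoding at inference time. It assumes a HuggingFace-style tokenizer and language model; the module and function names are illustrative rather than the released API.

    import torch
    import torch.nn as nn

    HAND_TOKEN = "<HAND>"

    def register_hand_token(tokenizer, model):
        # Extend the vocabulary with the special <HAND> token and resize the
        # model's token embeddings accordingly (HuggingFace-style interface).
        tokenizer.add_special_tokens({"additional_special_tokens": [HAND_TOKEN]})
        model.resize_token_embeddings(len(tokenizer))
        return tokenizer.convert_tokens_to_ids(HAND_TOKEN)

    class IterativeHandDecoder(nn.Module):
        """Decode a hand position from a <HAND> hidden state, then re-embed it."""

        def __init__(self, hidden_dim):
            super().__init__()
            self.to_xy = nn.Linear(hidden_dim, 2)    # hidden state -> (x, y)
            self.from_xy = nn.Linear(2, hidden_dim)  # (x, y) -> embedding for next round

        def decode(self, hand_hidden):
            # When <HAND> is predicted as the next token, its hidden state is
            # decoded immediately into a normalized image-plane position.
            return self.to_xy(hand_hidden).sigmoid()

        def encode(self, xy):
            # Iterative position encoding: the decoded position is embedded and
            # appended to the input sequence, so each subsequent prediction is
            # conditioned on all previously predicted hand positions.
            return self.from_xy(xy)

    # Example rollout step with a dummy hidden state (hidden size is illustrative):
    decoder = IterativeHandDecoder(hidden_dim=4096)
    xy = decoder.decode(torch.randn(1, 4096))   # -> [1, 2] predicted hand position
    next_embed = decoder.encode(xy)             # fed back for the next prediction round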

RBHP: Reasoning-based Hand Prediction Task

We introduce the Reasoning-based Hand Prediction (RBHP) task. Instead of relying on explicit instructions that directly specify the hand motion, the system is required to reason about it from implicit instructions. We define implicit instructions as colloquial language instructions that provide enough information to infer the intended human hand action through reasoning, without explicitly naming the target object or action (for example, "I'm thirsty, can you grab me something to drink?" rather than "pick up the water bottle").

Illustration of the annotation pipeline for the RBHP task.

Qualitative Results

We evaluate HandsOnVLM across unseen clips from different human video datasets. The qualitative results here show the context RGB video and the language description, followed by predictions of future hand locations. For quantitative evaluations, please refer to the main paper.

EPIC-KITCHENS Dataset
FPGA Dataset
Ego4D Dataset

Acknowledgements

We thank Gaurav Parmar and Jinkun Cao for feedback on the paper, and thank Yufei Ye, Ruihan Yang, Unnat Jain, Mohan Kumar Srirama, Shubham Tulsiani, and many others from CMU and UCSD for helpful discussions. This work used Bridges-2 at Pittsburgh Supercomputing Center from the Advanced Cyberinfrastructure Coordination Ecosystem: Services & Support (ACCESS) program, which is supported by National Science Foundation grants 2138259, 2138286, 2138307, 2137603, and 2138296.