HandsOnVLM

Vision-Language Models for Hand-Object Interaction Prediction

Carnegie Mellon University, UC San Diego
†Equal Advising

HandsOnVLM is a personal in-context action prediction assistant for daily activities. We evaluate HandsOnVLM across hundreds of diverse scenarios in homes, kitchens, and outdoor settings.


We develop a framework and propose two benchmarks for predicting the future interaction trajectories of human hands in a scene, given high-level colloquial task specifications in the form of natural language. This requires an extensive understanding of human daily activities and the ability to reason about what happens next from cues in the current scene. Our proposed benchmarks, Vanilla Hand Prediction (VHP) and Reasoning-Based Hand Prediction (RBHP), enable significant strides towards this goal. We build a model that integrates the high-level world knowledge and reasoning capabilities of Vision-Language Models (VLMs) with the auto-regressive nature of low-level ego-centric hand trajectories. Our model, HandsOnVLM, is a novel VLM that generates textual responses and produces future hand trajectories through natural-language conversations.

HandsOnVLM Architecture

HandsOnVLM is a video-based VLM capable of predicting future hand trajectories given a video context and language instructions. There are three key components of HandsOnVLM’s architecture: (1) SlowFast tokens that capture temporal information at fine temporal resolution, (2) a hand representation based on a vocabulary augmented with a <HAND> token, and (3) iterative position encodings that enable auto-regressive trajectory training and inference. During training, we fine-tune a pre-trained VLM with a combination of the next-token prediction loss and a trajectory loss, as sketched below.
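A minimal sketch of this combined objective, assuming PyTorch-style tensors: the function name, the trajectory-loss weight, and the simple L1 term here are illustrative and may differ from the exact loss used in the paper.

    import torch
    import torch.nn.functional as F

    def combined_loss(lm_logits, lm_targets, pred_traj, gt_traj, traj_weight=1.0):
        """Next-token prediction loss plus a trajectory regression loss.

        lm_logits: [B, T, V] language-model logits; lm_targets: [B, T] token ids
        pred_traj / gt_traj: [B, N, 2] future hand positions (normalized coords)
        """
        # Standard next-token cross-entropy over the (augmented) vocabulary.
        ce = F.cross_entropy(lm_logits.reshape(-1, lm_logits.size(-1)),
                             lm_targets.reshape(-1), ignore_index=-100)
        # Regression on the decoded hand waypoints (L1 shown for simplicity).
        traj = F.l1_loss(pred_traj, gt_traj)
        return ce + traj_weight * traj

    # Example with dummy tensors:
    loss = combined_loss(torch.randn(2, 8, 32000), torch.randint(0, 32000, (2, 8)),
                         torch.rand(2, 4, 2), torch.rand(2, 4, 2))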
  • SlowFast Token Compression: HandsOnVLM needs to interpret temporal information at a fine resolution while also capturing spatial relationships. We adapt SlowFast tokens that encode both slow and fast features from the egocentric video (see the first sketch after this list).
  • Hand as Embedding: HandsOnVLM extends the existing vocabulary with a new <HAND> token to represent the hand in the language space. Our experiments find that this representation is crucial for the model to generate accurate hand trajectories (see the second sketch further below).
  • Iterative Position Encoding: During inference, whenever <HAND> is predicted as the next token, we decode it into a hand position immediately. The decoded position is then encoded into the corresponding embedding for the following prediction rounds, ensuring that each subsequent prediction is conditioned on all previously predicted hand positions, as illustrated in the second sketch further below.
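A minimal sketch of the SlowFast token construction referenced in the first bullet above, assuming per-frame patch features from the visual encoder; the stride, the mean pooling, and the function name are illustrative and may differ from the released implementation.

    import torch

    def slowfast_tokens(frame_feats, slow_stride=4):
        """frame_feats: [T, N, D] patch features for T frames, N patches, dim D."""
        T, N, D = frame_feats.shape
        # Slow pathway: full spatial detail, but only every `slow_stride`-th frame.
        slow = frame_feats[::slow_stride].reshape(-1, D)   # [(T // slow_stride) * N, D]
        # Fast pathway: every frame, with the spatial dimension pooled away,
        # keeping fine temporal resolution at a low per-frame token cost.
        fast = frame_feats.mean(dim=1)                      # [T, D]
        # Concatenate into the visual token sequence fed to the language model.
        return torch.cat([slow, fast], dim=0)

    # Example: 32 frames, 256 patches per frame, 1024-dim features.
    tokens = slowfast_tokens(torch.randn(32, 256, 1024))    # -> [8 * 256 + 32, 1024]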
Overview of the HandsOnVLM architecture.
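The second sketch referenced in the list above illustrates the <HAND> token and the iterative position encoding at inference time. It assumes a HuggingFace-style tokenizer and language model; the module and function names are illustrative rather than the released API.

    import torch
    import torch.nn as nn

    HAND_TOKEN = "<HAND>"

    def register_hand_token(tokenizer, model):
        # Extend the vocabulary with the special <HAND> token and resize the
        # model's token embeddings accordingly (HuggingFace-style interface).
        tokenizer.add_special_tokens({"additional_special_tokens": [HAND_TOKEN]})
        model.resize_token_embeddings(len(tokenizer))
        return tokenizer.convert_tokens_to_ids(HAND_TOKEN)

    class IterativeHandDecoder(nn.Module):
        """Decode a hand position from a <HAND> hidden state, then re-embed it."""

        def __init__(self, hidden_dim):
            super().__init__()
            self.to_xy = nn.Linear(hidden_dim, 2)    # hidden state -> (x, y)
            self.from_xy = nn.Linear(2, hidden_dim)  # (x, y) -> embedding for next round

        def decode(self, hand_hidden):
            # When <HAND> is predicted as the next token, its hidden state is
            # decoded immediately into a normalized image-plane position.
            return self.to_xy(hand_hidden).sigmoid()

        def encode(self, xy):
            # Iterative position encoding: the decoded position is embedded and
            # appended to the input sequence, so each subsequent prediction is
            # conditioned on all previously predicted hand positions.
            return self.from_xy(xy)

    # Example rollout step with a dummy hidden state (hidden size is illustrative):
    decoder = IterativeHandDecoder(hidden_dim=4096)
    xy = decoder.decode(torch.randn(1, 4096))   # -> [1, 2] predicted hand position
    next_embed = decoder.encode(xy)             # fed back for the next prediction round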

RBHP: Reasoning-based Hand Prediction Task

We introduce the Reasoning-based Hand Prediction (RBHP) task. Instead of relying on explicit instructions that directly specify the hand motion, the system is required to reason about it from implicit instructions. We define implicit instructions as colloquial language instructions that provide enough information to infer the intended human hand action through reasoning, without explicitly naming the target object or action (for example, "I'm thirsty, can you grab me something to drink?" rather than "pick up the water bottle").

Illustration of the annotation pipeline for the RBHP task.

Qualitative Results

We evaluate HandsOnVLM across unseen clips from different human video datasets. The qualitative results here show the context RGB video and the language description, followed by predictions of future hand locations. For quantitative evaluations, please refer to the main paper.

EPIC-KITCHENS Dataset
FPGA Dataset
Ego4D Dataset

Acknowledgements

We thank Gaurav Parmar and Jinkun Cao for feedback on the paper, and thank Yufei Ye, Ruihan Yang, Unnat Jain, Mohan Kumar Srirama, Shubham Tulsiani, and many others from CMU and UCSD for helpful discussions. This work used Bridges-2 at Pittsburgh Supercomputing Center from the Advanced Cyberinfrastructure Coordination Ecosystem: Services & Support (ACCESS) program, which is supported by National Science Foundation grants 2138259, 2138286, 2138307, 2137603, and 2138296.