HandsOnVLM is a personal, in-context action-prediction assistant for daily activities. We evaluate HandsOnVLM across hundreds of diverse scenarios in homes, kitchens, and outdoor environments.
We develop a framework and propose two benchmarks for predicting the future interaction trajectories of human hands in a scene, given high-level, colloquial task specifications in natural language. This requires an extensive understanding of human daily activities and the ability to reason about what will happen next given cues from the current scene. Our proposed benchmarks, Vanilla Hand Prediction (VHP) and Reasoning-Based Hand Prediction (RBHP), enable significant strides towards this goal. We build a model that integrates the high-level world knowledge and reasoning capabilities of Vision-Language Models (VLMs) with the auto-regressive nature of low-level ego-centric hand trajectories. Our model, HandsOnVLM, is a novel VLM that can generate textual responses and produce future hand trajectories through natural-language conversations.
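To make the idea of coupling a VLM with auto-regressive hand-trajectory prediction concrete, here is a minimal sketch, assuming a pooled VLM embedding and a simple recurrent waypoint decoder; this is not the released HandsOnVLM code, and the module names, dimensions, and waypoint parameterization are illustrative assumptions.

# Minimal, hypothetical sketch (not the HandsOnVLM implementation) of how a VLM's
# hidden state could condition an auto-regressive decoder over 2D hand waypoints.
import torch
import torch.nn as nn


class HandTrajectoryDecoder(nn.Module):
    """Auto-regressively predicts `horizon` future (x, y) hand waypoints from a context embedding."""

    def __init__(self, context_dim: int = 4096, hidden_dim: int = 256, horizon: int = 4):
        super().__init__()
        self.horizon = horizon
        self.context_proj = nn.Linear(context_dim, hidden_dim)   # VLM hidden state -> decoder space
        self.waypoint_embed = nn.Linear(2, hidden_dim)           # embed the previously predicted waypoint
        self.rnn = nn.GRUCell(hidden_dim, hidden_dim)            # simple recurrent decoder
        self.head = nn.Linear(hidden_dim, 2)                     # regress normalized (x, y) in [0, 1]

    def forward(self, vlm_hidden: torch.Tensor, last_hand_xy: torch.Tensor) -> torch.Tensor:
        # vlm_hidden: (B, context_dim) pooled embedding of the video + language prompt
        # last_hand_xy: (B, 2) last observed hand position, used to seed the rollout
        h = torch.tanh(self.context_proj(vlm_hidden))
        xy = last_hand_xy
        waypoints = []
        for _ in range(self.horizon):
            h = self.rnn(self.waypoint_embed(xy), h)             # condition on the previous waypoint
            xy = torch.sigmoid(self.head(h))                     # next waypoint in normalized image coords
            waypoints.append(xy)
        return torch.stack(waypoints, dim=1)                     # (B, horizon, 2)


if __name__ == "__main__":
    decoder = HandTrajectoryDecoder()
    vlm_hidden = torch.randn(1, 4096)          # stand-in for a pooled VLM embedding
    last_xy = torch.tensor([[0.45, 0.62]])     # last observed hand location (normalized)
    print(decoder(vlm_hidden, last_xy).shape)  # torch.Size([1, 4, 2])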
We introduce the Reasoning-Based Hand Prediction (RBHP) task. Instead of being given explicit instructions that directly specify the hand motion to predict, the system must reason about it from implicit instructions. We define implicit instructions as colloquial language instructions that provide enough information to infer the intended human hand action through reasoning, without explicitly naming the target object or action.
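The contrast between the two query styles can be illustrated as follows; the exact phrasing and data fields below are assumptions made for illustration, not samples from the released benchmarks.

# Hypothetical examples of explicit (VHP-style) vs. implicit (RBHP-style) queries.
vhp_query = {
    "instruction": "Pick up the kettle on the stove.",  # explicit: names the object and action
    "video": "egocentric_kitchen_clip.mp4",             # placeholder context clip
}

rbhp_query = {
    "instruction": "I want to make some tea. What should my hands do next?",  # implicit: requires reasoning
    "video": "egocentric_kitchen_clip.mp4",
}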
We evaluate HandsOnVLM across unseen clips from different human video datasets. The qualitative results here show the context RGB video and the language description, followed by predictions of future hand locations. For quantitative evaluations, please refer to the main paper.
We thank Gaurav Parmar and Jinkun Cao for feedback on the paper, and thank Yufei Ye, Ruihan Yang, Unnat Jain, Mohan Kumar Srirama, Shubham Tulsiani, and many others from CMU and UCSD for helpful discussions. This work used Bridges-2 at Pittsburgh Supercomputing Center from the Advanced Cyberinfrastructure Coordination Ecosystem: Services & Support (ACCESS) program, which is supported by National Science Foundation grants 2138259, 2138286, 2138307, 2137603, and 2138296.
@misc{bao2024handsonvlmvisionlanguagemodelshandobject,
      title={HandsOnVLM: Vision-Language Models for Hand-Object Interaction Prediction},
      author={Chen Bao and Jiarui Xu and Xiaolong Wang and Abhinav Gupta and Homanga Bharadhwaj},
      year={2024},
      eprint={2412.13187},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}