Miao Liu
2026
Unleashing Spatial Reasoning in Multimodal Large Language Models via Textual Representation Guided Reasoning
Jiacheng Hua | Yishu Yin | Yuhang Wu | Tai Wang | Yifei Huang | Miao Liu
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Existing Multimodal Large Language Models (MLLMs) struggle with 3D spatial reasoning, as they fail to construct structured abstractions of the 3D environment depicted in video inputs. To bridge this gap, drawing inspiration from cognitive theories of allocentric spatial reasoning, we investigate how to enable MLLMs to model and reason over text-based spatial representations of video. Specifically, we introduce Textual Representation of Allocentric Context from Egocentric Video (TRACE), a prompting method that induces MLLMs to generate text-based representations of 3D environments as intermediate reasoning traces for more accurate spatial question answering. TRACE encodes meta-context, camera trajectories, and detailed object entities to support structured spatial reasoning over egocentric videos. Extensive experiments on VSI-Bench and OST-Bench demonstrate that TRACE yields notable and consistent improvements over prior prompting strategies across a diverse range of MLLM backbones, spanning different parameter scales and training schemas. We further present ablation studies to validate our design choices, along with detailed analyses that probe the bottlenecks of 3D spatial reasoning in MLLMs.
2025
Evaluating the Prompt Steerability of Large Language Models
Erik Miehling | Michael Desmond | Karthikeyan Natesan Ramamurthy | Elizabeth M. Daly | Kush R. Varshney | Eitan Farchi | Pierre Dognin | Jesus Rios | Djallel Bouneffouf | Miao Liu | Prasanna Sattigeri
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)
Building pluralistic AI requires designing models that are able to be shaped to represent a wide range of value systems and cultures. Achieving this requires first being able to evaluate the degree to which a given model is capable of reflecting various personas. To this end, we propose a benchmark for evaluating the steerability of model personas as a function of prompting. Our design is based on a formal definition of prompt steerability, which analyzes the degree to which a model’s joint behavioral distribution can be shifted from its baseline. By defining steerability indices and inspecting how these indices change as a function of steering effort, we can estimate the steerability of a model across various persona dimensions and directions. Our benchmark reveals that the steerability of many current models is limited — due to both a skew in their baseline behavior and an asymmetry in their steerability across many persona dimensions. We release an implementation of our benchmark at https://github.com/IBM/prompt-steering.
X-LeBench: A Benchmark for Extremely Long Egocentric Video Understanding
Wenqi Zhou | Kai Cao | Hao Zheng | Yunze Liu | Xinyi Zheng | Miao Liu | Per Ola Kristensson | Walterio W. Mayol-Cuevas | Fan Zhang | Weizhe Lin | Junxiao Shen
Findings of the Association for Computational Linguistics: EMNLP 2025
Long-form egocentric video understanding provides rich contextual information and unique insights into long-term human behaviors, holding significant potential for applications in embodied intelligence, long-term activity analysis, and personalized assistive technologies. However, existing benchmark datasets primarily focus on single, short (e.g., minutes to tens of minutes) to moderately long videos, leaving a substantial gap in evaluating extensive, ultra-long egocentric video recordings. To address this, we introduce X-LeBench, a novel benchmark dataset meticulously designed to fill this gap by focusing on tasks requiring a comprehensive understanding of extremely long egocentric video recordings. Our X-LeBench develops a life-logging simulation pipeline that produces realistic, coherent daily plans aligned with real-world video data. This approach enables the flexible integration of synthetic daily plans with real-world footage from Ego4D, a massive-scale egocentric video dataset that covers a wide range of daily life scenarios, resulting in 432 simulated video life logs spanning from 23 minutes to 16.4 hours. Evaluations of several baseline systems and multimodal large language models (MLLMs) reveal poor performance across the board, highlighting the inherent challenges of long-form egocentric video understanding, such as temporal localization and reasoning, context aggregation, and memory retention, and underscoring the need for more advanced models.
2023
Werewolf Among Us: Multimodal Resources for Modeling Persuasion Behaviors in Social Deduction Games
Bolin Lai | Hongxin Zhang | Miao Liu | Aryan Pariani | Fiona Ryan | Wenqi Jia | Shirley Anugrah Hayati | James Rehg | Diyi Yang
Findings of the Association for Computational Linguistics: ACL 2023
Persuasion modeling is a key building block for conversational agents. Existing works in this direction are limited to analyzing textual dialogue corpora. We argue that visual signals also play an important role in understanding human persuasive behaviors. In this paper, we introduce the first multimodal dataset for modeling persuasion behaviors. Our dataset includes 199 dialogue transcriptions and videos captured in a multi-player social deduction game setting, 26,647 utterance-level annotations of persuasion strategy, and game-level annotations of deduction game outcomes. We provide extensive experiments to show how dialogue context and visual signals benefit persuasion strategy prediction. We also explore the generalization ability of language models for persuasion modeling and the role of persuasion strategies in predicting social deduction game outcomes. Our dataset can be found at https://persuasion-deductiongame.socialai-data.org. The code and models are available at https://github.com/SALT-NLP/PersuationGames.
Co-authors
- Djallel Bouneffouf 1
- Kai Cao 1
- Elizabeth M. Daly 1
- Michael Desmond 1
- Pierre Dognin 1
- Eitan Farchi 1
- Shirley Anugrah Hayati 1
- Jiacheng Hua 1
- Yifei Huang 1
- Wenqi Jia 1
- Per Ola Kristensson 1
- Bolin Lai 1
- Weizhe Lin 1
- Yunze Liu 1
- Walterio W. Mayol-Cuevas 1
- Erik Miehling 1
- Karthikeyan Natesan Ramamurthy 1
- Aryan Pariani 1
- James Rehg 1
- Jesus Rios 1
- Fiona Ryan 1
- Prasanna Sattigeri 1
- Junxiao Shen 1
- Kush R. Varshney 1
- Tai Wang 1
- Yuhang Wu 1
- Diyi Yang 1
- Yishu Yin 1
- Fan Zhang 1
- Hongxin Zhang 1
- Hao Zheng 1
- Xinyi Zheng 1
- Wenqi Zhou 1