Yauwai Yim
2026
DIXITWORLD: Evaluating Multimodal Abductive Reasoning in Vision-Language Models with Multi-Agent Dixit Gameplay
Yunxiang MO | Tianshi Zheng | Qing Zong | Jiayu Liu | Baixuan Xu | Yauwai Yim | Chunkit Chan | Jiaxin Bai | Yangqiu Song
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)
Multimodal abductive reasoning — the generation and selection of explanatory hypotheses from partial observations — is a cornerstone of intelligence. Current evaluations of such ability in vision–language models (VLMs) are largely confined to static, single-agent tasks. Inspired by Dixit, we introduce DixitWorld, a comprehensive evaluation suite designed to deconstruct this challenge. DixitWorld features two core components: DixitArena, a dynamic, multi-agent environment that evaluates both hypothesis generation (a "storyteller" crafting cryptic clues) and hypothesis selection ("listeners" choosing the target image from decoys) under imperfect information; and DixitBench, a static QA benchmark that isolates the listener’s task for efficient, controlled evaluation. Results from DixitArena reveal distinct, role-dependent behaviors: smaller open-source models often excel as creative storytellers, producing imaginative yet less discriminative clues, whereas larger proprietary models demonstrate superior overall performance, particularly as listeners. Performance on DixitBench strongly correlates with listener results in DixitArena, validating it as a reliable proxy for hypothesis selection. Our findings reveal a key trade-off between generative creativity and discriminative understanding in multimodal abductive reasoning, a central challenge for developing more balanced and capable vision-language agents.
AutoSchemaKG: Autonomous Knowledge Graph Construction through Dynamic Schema Induction from Web-Scale Corpora
Jiaxin Bai | Wei Fan | Qi Hu | Qing Zong | Chunyang Li | Hong Ting Tsang | Hongyu Luo | Yauwai Yim | Haoyu Huang | Xiao Zhou | Feng Qin | Tianshi Zheng | Xi Peng | Xin Yao | Huiwen Yang | Leijie Wu | JI Yi | Gong Zhang | Renhai Chen | Yangqiu Song
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
We present AutoSchemaKG, a framework for fully autonomous knowledge graph construction that eliminates the need for predefined schemas. Our system leverages large language models to simultaneously extract knowledge triples and induce comprehensive schemas directly from text, modeling both entities and events while employing conceptualization to organize instances into semantic categories. Processing over 50 million documents, we construct ATLAS (Automated Triple Linking And Schema induction), a family of knowledge graphs with 900+ million nodes and 5.9 billion edges. This approach outperforms state-of-the-art baselines on multi-hop QA tasks and enhances LLM factuality. Notably, our schema induction achieves 92% semantic alignment with human-crafted schemas with zero manual intervention, demonstrating that billion-scale knowledge graphs with dynamically induced schemas can effectively complement parametric knowledge in large language models.
XToM: Exploring the Multilingual Theory of Mind for Large Language Models
Chunkit Chan | Yauwai Yim | Hongchuan Zeng | Zhiying Zou | Xinyuan Cheng | Zhifan Sun | Zheye Deng | Kawai Chung | Yuzhuo Ao | Fan Yixiang | Cheng Jiayang | Ercong Nie | Ginny Wong | Helmut Schmid | Hinrich Schuetze | Simon See | Yangqiu Song
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Theory of Mind (ToM)—the ability to infer mental states in others—is pivotal for human social cognition. Existing evaluations of ToM in LLMs are largely limited to English, neglecting the linguistic diversity that shapes human cognition. This limitation raises a critical question: can LLMs exhibit Multilingual Theory of Mind—the capacity to reason about mental states across diverse linguistic contexts? To address this gap, we present XToM, a rigorously validated multilingual benchmark that evaluates ToM across five languages and incorporates diverse, contextually rich task scenarios. Using XToM, we systematically evaluate LLMs (e.g., DeepSeek R1), revealing a pronounced dissonance: while models excel in multilingual language understanding, their ToM performance varies across languages. Our findings expose limitations in LLMs’ ability to replicate human-like mentalizing across linguistic contexts.
2024
NegotiationToM: A Benchmark for Stress-testing Machine Theory of Mind on Negotiation Surrounding
Chunkit Chan | Cheng Jiayang | Yauwai Yim | Zheye Deng | Wei Fan | Haoran Li | Xin Liu | Hongming Zhang | Weiqi Wang | Yangqiu Song
Findings of the Association for Computational Linguistics: EMNLP 2024
Large Language Models (LLMs) have sparked substantial interest and debate concerning the potential emergence of Theory of Mind (ToM) ability. Current ToM evaluations focus on testing models using machine-generated data or game settings prone to shortcuts and spurious correlations, and they lack evaluation of machine ToM ability in real-world human interaction scenarios. This creates a pressing demand for new real-world scenario benchmarks. We introduce NegotiationToM, a new benchmark designed to stress-test machine ToM in real-world negotiation surroundings, covering multi-dimensional mental states (i.e., desires, beliefs, and intentions). Our benchmark builds upon the Belief-Desire-Intention (BDI) agent modeling theory and conducts the necessary empirical experiments to evaluate large language models. Our findings demonstrate that NegotiationToM is challenging for state-of-the-art LLMs, as they consistently perform significantly worse than humans, even when employing the chain-of-thought (CoT) method.
ActPlan-1K: Benchmarking the Procedural Planning Ability of Visual Language Models in Household Activities
Ying Su | Zhan Ling | Haochen Shi | Cheng Jiayang | Yauwai Yim | Yangqiu Song
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Large language models (LLMs) have been adopted to process textual task descriptions and accomplish procedural planning in embodied AI tasks because of their powerful reasoning ability. However, there is still a lack of study on how vision language models (VLMs) behave when multi-modal task inputs are considered. Counterfactual planning, which evaluates the model’s reasoning ability over alternative task situations, is also underexplored. To evaluate planning ability in both multi-modal and counterfactual aspects, we propose ActPlan-1K. ActPlan-1K is a multi-modal planning benchmark constructed with ChatGPT and the household activity simulator iGibson2. The benchmark consists of 153 activities and 1,187 instances. Each instance describes one activity and has a natural language task description and multiple environment images from the simulator. The gold plan of each instance is an action sequence over the objects in the provided scenes. Both correctness and commonsense satisfaction are evaluated on typical VLMs. It turns out that current VLMs still struggle to generate human-level procedural plans for both normal activities and counterfactual activities. We further provide automatic evaluation metrics by fine-tuning a BLEURT model to facilitate future research on our benchmark.
Text-Tuple-Table: Towards Information Integration in Text-to-Table Generation via Global Tuple Extraction
Zheye Deng | Chunkit Chan | Weiqi Wang | Yuxi Sun | Wei Fan | Tianshi Zheng | Yauwai Yim | Yangqiu Song
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
The task of condensing large chunks of textual information into concise and structured tables has gained attention recently due to the emergence of Large Language Models (LLMs) and their potential benefit for downstream tasks, such as text summarization and text mining. Previous approaches often generate tables that directly replicate information from the text, limiting their applicability in broader contexts, as text-to-table generation in real-life scenarios necessitates information extraction, reasoning, and integration. However, there is a lack of both datasets and methodologies for this task. In this paper, we introduce LiveSum, a new benchmark dataset created for generating summary tables of competitions based on real-time commentary texts. We evaluate the performance of state-of-the-art LLMs on this task in both fine-tuning and zero-shot settings, and additionally propose a novel pipeline called T3 (Text-Tuple-Table) to improve their performance. Extensive experimental results demonstrate that LLMs still struggle with this task even after fine-tuning, while our approach can offer substantial performance gains without explicit training. Further analyses demonstrate that our method exhibits strong generalization abilities, surpassing previous approaches on several other text-to-table datasets. Our code and data can be found at https://github.com/HKUST-KnowComp/LiveSum.
Co-authors
- Yangqiu Song 6
- Chunkit Chan 4
- Zheye Deng 3
- Wei Fan 3
- Cheng Jiayang 3
- Tianshi Zheng 3
- Jiaxin Bai 2
- Weiqi Wang 2
- Qing Zong 2
- Yuzhuo Ao 1
- Renhai Chen 1
- Xinyuan Cheng 1
- Kawai Chung 1
- Qi Hu 1
- Haoyu Huang 1
- Chunyang Li 1
- Haoran Li 1
- Zhan Ling 1
- Jiayu Liu 1
- Xin Liu 1
- Hongyu Luo 1
- Yunxiang MO 1
- Ercong Nie 1
- Xi Peng 1
- Feng Qin 1
- Helmut Schmid 1
- Hinrich Schuetze 1
- Simon See 1
- Haochen Shi 1
- Ying Su 1
- Yuxi Sun 1
- Zhifan Sun 1
- Hong Ting Tsang 1
- Ginny Wong 1
- Leijie Wu 1
- Baixuan Xu 1
- Huiwen Yang 1
- Xin Yao 1
- JI Yi 1
- Fan Yixiang 1
- Hongchuan Zeng 1
- Gong Zhang 1
- Hongming Zhang 1
- Xiao Zhou 1
- Zhiying Zou 1