Fan Gao
2026
See2Refine: Vision-Language Feedback Improves LLM-Based eHMI Action Designers
Ding Xia | Xinyue Gui | Mark Colley | Fan Gao | Zhongyi Zhou | Dongyuan Li | Renhe Jiang | Takeo Igarashi
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Automated vehicles lack natural communication channels with other road users, making external Human-Machine Interfaces (eHMIs) essential for conveying intent and maintaining trust in shared environments. However, most eHMI studies rely on developer-crafted message-action pairs, which are difficult to adapt to diverse and dynamic traffic contexts. A promising alternative is to use Large Language Models (LLMs) as action designers that generate context-conditioned eHMI actions, yet such designers lack perceptual verification and typically depend on fixed prompts or costly human-annotated feedback for improvement. We present See2Refine, a human-free, closed-loop framework that uses vision-language models (VLMs) for perceptual evaluation as automated visual feedback to improve an LLM-based eHMI action designer. Given a driving context and a candidate eHMI action, the VLM evaluates the perceived appropriateness of the action, and this feedback is used to iteratively revise the designer’s outputs, enabling systematic refinement without human supervision. We evaluate our framework across three eHMI modalities (lightbar, eyes, and arm) and multiple LLM model sizes. Across settings, our framework consistently outperforms prompt-only LLM designers and manually specified baselines in both VLM-based metrics and human-subject evaluations. The results further indicate that the improvements are generalized across modalities and that VLM evaluations are reasonably aligned with human preferences in our controlled settings, supporting the robustness and effectiveness of See2Refine for scalable action design.
MED-COREASONER: Reducing Language Disparities in Medical Reasoning via Language-Informed Co-Reasoning
Fan Gao | Sherry T. Tong | Jiwoong Sohn | Jiahao Huang | Junfeng Jiang | Ding Xia | Piyalitt Ittichaiwong | Kanyakorn Veerakanjana | Hyunjae Kim | Qingyu Chen | Edison Marrese-Taylor | Kazuma Kobayashi | Akiko Aizawa | Irene Li
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
While reasoning-enhanced large language models perform strongly on English medical tasks, a persistent multilingual gap remains, with substantially weaker reasoning in local languages, limiting equitable global medical deployment. To bridge this gap, we introduce Med-CoReasoner, a language-informed co-reasoning framework that elicits parallel English and local-language reasoning, abstracts them into structured concepts, and integrates local clinical knowledge into an English logical scaffold via concept-level alignment and retrieval. This design combines the structural robustness of English reasoning with the practice-grounded expertise encoded in local languages. To evaluate multilingual medical reasoning beyond multiple-choice settings, we construct MultiMed-X, a benchmark covering seven languages with expert-annotated long-form question answering and natural language inference tasks, comprising 350 instances per language. Experiments across three benchmarks show that Med-CoReasoner improves multilingual reasoning performance by an average of 5%, with particularly substantial gains in low-resource languages. Moreover, model distillation and expert evaluation analysis further confirm that Med-CoReasoner produces clinically sound and culturally grounded reasoning traces.
2025
ReAgent: Reversible Multi-Agent Reasoning for Knowledge-Enhanced Multi-Hop QA
Zhao Xinjie | Fan Gao | Xingyu Song | Yingjian Chen | Rui Yang | Yanran Fu | Yuyang Wang | Yusuke Iwasawa | Yutaka Matsuo | Irene Li
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Multi-hop question answering (QA) remains challenging, as solutions must reliably integrate and reconcile evidence from multiple sources without succumbing to error propagation. While large language models (LLMs) have achieved substantial improvements via chain-of-thought (CoT) prompting and retrieval-augmented generation, these methods typically adopt a forward-only workflow—early mistakes persist throughout inference, and contradictions discovered later cannot systematically trigger re-evaluation. To address this limitation, we present ReAgent, a reversible multi-agent reasoning framework. Specifically, ReAgent enables agents to backtrack to earlier valid states when conflicts arise, thereby isolating and rectifying flawed assumptions before they undermine subsequent reasoning. Our approach combines explicit local and global rollback protocols with modular role specialization, resulting in a flexible and error-tolerant pipeline. Empirical evaluation on three multi-hop QA benchmarks demonstrates consistent performance gains of approximately 6% over forward-only baselines, in addition to enhanced interpretability. These findings highlight the value of non-monotonic, backtracking-driven inference in complex QA scenarios and point to broader implications for multi-agent collaboration in knowledge-intensive tasks.
TDCSA: LLM-Guided Top-Down Approach for Robust Citation Sentiment Analysis
Fan Gao | Jieyang Peng | Xiaoming Tao | Wang Youzheng
Findings of the Association for Computational Linguistics: ACL 2025
Citation Sentiment Analysis (CSA) plays a crucial role in understanding academic influence and knowledge diffusion. While pre-trained language models (PLMs) and large language models (LLMs) have shown remarkable success in general sentiment analysis, they encounter specialized challenges in CSA due to the less significant and implicit sentiment expressions in academic writing, as well as complex sentiment transitions. To address these challenges, we propose TDCSA, a Top-Down framework that leverages LLMs’ semantic understanding capabilities to enhance PLM-based CSA, which transforms the traditional bottom-up feature engineering paradigm into a top-down architecture. Our framework consists of three key components: (1) a Dual LLM Feature Generation module for robust quadruple extraction, (2) a Multi-view Feature Representation mechanism for neutral citation processing, and (3) a Quad Feature Enhanced PLM. Experiments demonstrate that TDCSA significantly outperforms existing methods, achieving state-of-the-art performance while maintaining robustness to quadruple quality variations.
MMLU-ProX: A Multilingual Benchmark for Advanced Large Language Model Evaluation
Weihao Xuan | Rui Yang | Heli Qi | Qingcheng Zeng | Yunze Xiao | Aosong Feng | Dairui Liu | Yun Xing | Junjue Wang | Fan Gao | Jinghui Lu | Yuang Jiang | Huitao Li | Xin Li | Kunyu Yu | Ruihai Dong | Shangding Gu | Yuekang Li | Xiaofei Xie | Felix Juefei-Xu | Foutse Khomh | Osamu Yoshie | Qingyu Chen | Douglas Teodoro | Nan Liu | Randy Goebel | Lei Ma | Edison Marrese-Taylor | Shijian Lu | Yusuke Iwasawa | Yutaka Matsuo | Irene Li
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Existing large language model (LLM) evaluation benchmarks primarily focus on English, while current multilingual tasks lack parallel questions that specifically assess cross-lingual reasoning abilities. This dual limitation makes it challenging to assess LLMs’ performance in the multilingual setting comprehensively. To fill this gap, we introduce MMLU-ProX, a comprehensive benchmark covering 29 languages, built on an English benchmark. Each language version consists of 11,829 identical questions, enabling direct cross-lingual comparisons. Additionally, to meet efficient evaluation needs, we provide a lite version containing 658 questions per language. To ensure the high quality of MMLU-ProX, we employ a rigorous development process that involves multiple powerful LLMs for translation, followed by expert review to ensure accurate expression, consistent terminology, and cultural relevance. Building on this, we systematically evaluate 36 state-of-the-art LLMs, including reasoning-enhanced and multilingual-optimized LLMs. The results reveal significant disparities in the multilingual capabilities of LLMs: While they perform well in high-resource languages, their performance declines markedly in low-resource languages, particularly for African languages. Through MMLU-ProX, we aim to advance the development of more inclusive AI systems and promote equitable access to technology across global contexts.
Automating eHMI Action Design with LLMs for Automated Vehicle Communication
Ding Xia | Xinyue Gui | Fan Gao | Dongyuan Li | Mark Colley | Takeo Igarashi
Findings of the Association for Computational Linguistics: EMNLP 2025
The absence of explicit communication channels between automated vehicles (AVs) and other road users requires the use of external Human-Machine Interfaces (eHMIs) to convey messages effectively in uncertain scenarios. Currently, most eHMI studies employ predefined text messages and manually designed actions to perform these messages, which limits the real-world deployment of eHMIs, where adaptability in dynamic scenarios is essential. Given the generalizability and versatility of large language models (LLMs), they could potentially serve as automated action designers for the message-action design task. To validate this idea, we make three contributions: (1) We propose a pipeline that integrates LLMs and 3D renderers, using LLMs as action designers to generate executable actions for controlling eHMIs and rendering action clips. (2) We collect a user-rated Action-Design Scoring dataset comprising a total of 320 action sequences for eight intended messages and four representative eHMI modalities. The dataset validates that LLMs can translate intended messages into actions close to a human level, particularly for reasoning-enabled LLMs. (3) We introduce two automated raters, Action Reference Score (ARS) and Vision-Language Models (VLMs), to benchmark 18 LLMs, finding that the VLM aligns with human preferences yet varies across eHMI modalities. The source code, prompts, Blender scenarios, and rendered clips are available at https://github.com/ApisXia/AutoActionDesign.
TLUE: A Tibetan Language Understanding Evaluation Benchmark
Fan Gao | Cheng Huang | Yutong Liu | Nyima Tashi | Xiangxiang Wang | Thupten Tsering | Ban Ma-bao | Renzeng Duojie | Gadeng Luosang | Rinchen Dongrub | Dorje Tashi | Xiao Feng Cd | Yongbin Yu | Hao Wang
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Large language models have made tremendous progress in recent years, but low-resource languages, like Tibetan, remain significantly underrepresented in their evaluation. Despite Tibetan being spoken by over seven million people, it has largely been neglected in the development and assessment of LLMs. To address this gap, we present a Tibetan Language Understanding Evaluation Benchmark, TLUE, which is also the first large-scale benchmark for measuring the proficiency of large language models in the Tibetan language. TLUE comprises two major components: a comprehensive multi-task understanding benchmark spanning 5 domains and 67 subdomains, and a safety benchmark encompassing 7 subdomains. Finally, we evaluate a diverse set of state-of-the-art LLMs. Experimental results demonstrate that most large language models perform below the random baseline, especially highlighting the considerable challenges they face in Tibetan language processing. TLUE provides a crucial foundation for advancing future research in Tibetan language understanding and highlights the importance of promoting greater inclusivity in the development of large language models.
2024
Evaluating Large Language Models on Wikipedia-Style Survey Generation
Fan Gao | Hang Jiang | Rui Yang | Qingcheng Zeng | Jinghui Lu | Moritz Blum | Tianwei She | Yuang Jiang | Irene Li
Findings of the Association for Computational Linguistics: ACL 2024
Educational materials such as survey articles in specialized fields like computer science traditionally require tremendous expert inputs and are therefore expensive to create and update. Recently, Large Language Models (LLMs) have achieved significant success across various general tasks. However, their effectiveness and limitations in the education domain are yet to be fully explored. In this work, we examine the proficiency of LLMs in generating succinct survey articles specific to the niche field of NLP in computer science, focusing on a curated list of 99 topics. Automated benchmarks reveal that GPT-4 surpasses its predecessors, including GPT-3.5, PaLM2, and LLaMa2, by margins ranging from 2% to 20% in comparison to the established ground truth. We compare both human and GPT-based evaluation scores and provide in-depth analysis. While our findings suggest that GPT-created surveys are more contemporary and accessible than human-authored ones, certain limitations were observed. Notably, GPT-4, despite often delivering outstanding content, occasionally exhibited lapses like missing details or factual errors. Finally, we compared the rating behavior between humans and GPT-4 and found systematic bias in using GPT evaluation.
Co-authors
- Irene Li 4
- Ding Xia 3
- Rui Yang 3
- Qingyu Chen 2
- Mark Colley 2
- Xinyue Gui 2
- Takeo Igarashi 2
- Yusuke Iwasawa 2
- Yuang Jiang 2
- Dongyuan Li 2
- Jinghui Lu 2
- Edison Marrese-Taylor 2
- Yutaka Matsuo 2
- Qingcheng Zeng 2
- Akiko Aizawa 1
- Moritz Blum 1
- Xiao Feng Cd 1
- Yingjian Chen 1
- Ruihai Dong 1
- Rinchen Dongrub 1
- Renzeng Duojie 1
- Aosong Feng 1
- Yanran Fu 1
- Randy Goebel 1
- Shangding Gu 1
- Jiahao Huang 1
- Cheng Huang 1
- Piyalitt Ittichaiwong 1
- Renhe Jiang 1
- Hang Jiang 1
- Junfeng Jiang 1
- Felix Juefei-Xu 1
- Foutse Khomh 1
- Hyunjae Kim 1
- Kazuma Kobayashi 1
- Huitao Li 1
- Xin Li 1
- Yuekang Li 1
- Dairui Liu 1
- Nan Liu 1
- Yutong Liu 1
- Shijian Lu 1
- Gadeng Luosang 1
- Lei Ma 1
- Ban Ma-bao 1
- Jieyang Peng 1
- Heli Qi 1
- Tianwei She 1
- Jiwoong Sohn 1
- Xingyu Song 1
- Xiaoming Tao 1
- Nyima Tashi 1
- Dorje Tashi 1
- Douglas Teodoro 1
- Sherry T. Tong 1
- Thupten Tsering 1
- Kanyakorn Veerakanjana 1
- Yuyang Wang 1
- Junjue Wang 1
- Xiangxiang Wang 1
- Hao Wang 1
- Yunze Xiao 1
- Xiaofei Xie 1
- Yun Xing 1
- Zhao Xinjie 1
- Weihao Xuan 1
- Osamu Yoshie 1
- Wang Youzheng 1
- Kunyu Yu 1
- Yongbin Yu 1
- Zhongyi Zhou 1