Yunfei Lu
2026
MPR-GUI: Benchmarking and Enhancing Multilingual Perception and Reasoning in GUI Agents
Ruihan Chen | Qiming Li | Xiaocheng Feng | Weihong Zhong | Xiaoliang Yang | Yuxuan Gu | Zekun Zhou | Yunfei Lu | Haoyu Ren | Kun Chen | Dandan Tu | Bing Qin
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Large Vision–Language Models (LVLMs) have shown strong potential as multilingual Graphical User Interface (GUI) agents, as evidenced by existing GUI benchmarks. However, these benchmarks exhibit two primary limitations: (1) although Perception and Reasoning (P&R) capabilities are fundamental for GUI agents, current benchmarks lack fine-grained diagnostics to identify which specific capabilities lead to task failures, hindering targeted improvements; (2) existing benchmarks fail to provide a strictly aligned cross-lingual evaluation environment, introducing confounding factors that prevent isolating the language impact on GUI agent performance. To address these issues, we propose the Multilingual P&R GUI Benchmark (MPR-GUI-Bench), featuring strictly aligned environments across six languages and eight fine-grained P&R tasks. Our benchmark reveals consistent P&R gaps between English and non-English settings, particularly on reasoning-intensive tasks. To leverage the superior English P&R capabilities for bridging cross-lingual gaps, we identify layers sensitive to language and propose GUI-XLI, a GUI Cross-Lingual Intervention method that aligns non-English hidden states with their English counterparts at these layers during inference. Experiments show that GUI-XLI effectively reduces the cross-lingual gaps, with an average gain of 6.5% in non-English settings.
Unlocking Multilingual Reasoning Capability of LLMs and LVLMs through Representation Engineering
Qiming Li | Xiaocheng Feng | Yixuan Ma | Ruihan Chen | Zihe Tong | Zekai Ye | Xiachong Feng | Libo Qin | Haoyu Ren | Kun Chen | Yunfei Lu | Dandan Tu | Bing Qin
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Large Language Models (LLMs) and Large Vision-Language Models (LVLMs) demonstrate strong reasoning capabilities, yet their performance in English significantly outperforms that in low-resource languages, raising fairness concerns in multilingual applications. Existing approaches either rely on costly multilingual training or employ prompting with external translation tools, both of which are resource-intensive and sensitive to translation quality. To address these limitations, we propose a training-free inference-time method to enhance Multilingual Reasoning capabilities via Representation Engineering (MRRE) without using any additional training data or tools. MRRE sequentially injects two precomputed vectors at specific layers during inference processing: cross-lingual reasoning enhancement vectors, which steer non-English reasoning representations toward English space to unlock multilingual reasoning, and target-language output anchoring vectors, which restore the distribution of the target language to preserve input–output language consistency. Comprehensive experiments across six advanced LLMs and LVLMs on four reasoning benchmarks demonstrate that MRRE consistently enhances non-English reasoning by an average gain of 5.48% and up to 7.54% in low-resource languages (e.g., Thai and Swahili), while improving input-output language consistency by 3.78%.
x1: Learning to Think Adaptively Across Languages and Cultures
Yangfan Ye | Xiaocheng Feng | Xiachong Feng | Yichong Huang | Zekun Yuan | Lei Huang | Weitao Ma | Qichen Hong | Yunfei Lu | Dandan Tu | Bing Qin
Findings of the Association for Computational Linguistics: ACL 2026
Languages encode distinct abstractions and inductive priors, yet most large language models (LLMs) overlook this diversity by reasoning in a single dominant language. In this work, we introduce x1, a family of reasoning models that can adaptively reason in an advantageous language on a per-instance basis. To isolate the effect of reasoning-language choice, x1 is constructed without expanding the model’s knowledge boundaries and is trained by contrasting linguistically distinct reasoning trajectories for the same input. Our extensive experiments demonstrate the benefits of adaptive multilingual reasoning across multilingual mathematical reasoning and culturally grounded tasks. Moreover, our results challenge a simplistic view of scaling laws: while scaling reduces cross-lingual disparities in procedural domains such as math reasoning, it does not eliminate the advantages of culture-associated languages in culturally grounded tasks, as we empirically show that such reasoning enables more efficient and accurate cultural knowledge recall. Overall, our findings establish language choice as a functional component of reasoning, with implications for building more generalist and globally competent reasoning models.
Culture-Aware Machine Translation in Large Language Models: Benchmarking and Investigation
Zekun Yuan | Yangfan Ye | Xiaocheng Feng | Baohang Li | Qichen Hong | Yunfei Lu | Dandan Tu | Bing Qin
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Large language models (LLMs) have achieved strong performance in general machine translation, yet their ability in culture-aware scenarios remains poorly understood. To bridge this gap, we introduce CanMT, a Culture-Aware Novel-Driven Parallel Dataset for Machine Translation, together with a theoretically grounded, multi-dimensional evaluation framework for assessing cultural translation quality. Leveraging CanMT, we systematically evaluate a wide range of LLMs and translation systems under different translation strategy constraints. Our findings reveal substantial performance disparities across models and demonstrate that translation strategies exert a systematic influence on model behavior. Further analysis shows that translation difficulty varies across types of culture-specific items, and that a persistent gap remains between models’ recognition of culture-specific knowledge and their ability to correctly operationalize it in translation outputs. In addition, incorporating reference translations is shown to substantially improve evaluation reliability in LLM-as-a-judge, underscoring their essential role in assessing culture-aware translation quality. The corpus and code are available at CanMT.
2025
CC-Tuning: A Cross-Lingual Connection Mechanism for Improving Joint Multilingual Supervised Fine-Tuning
Yangfan Ye | Xiaocheng Feng | Zekun Yuan | Xiachong Feng | Libo Qin | Lei Huang | Weitao Ma | Yichong Huang | Zhirui Zhang | Yunfei Lu | Xiaohui Yan | Duyu Tang | Dandan Tu | Bing Qin
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Current large language models (LLMs) often exhibit imbalanced multilingual capabilities due to their English-centric training corpora. To address this, existing fine-tuning approaches operating at the data level (e.g., through data augmentation or distillation) typically introduce implicit cross-lingual alignment, overlooking the potential for more profound, latent-level cross-lingual interactions. In this work, we propose CC-Tuning, a novel multilingual fine-tuning paradigm that explicitly establishes a cross-lingual connection mechanism at the latent level. During training, CC-Tuning fuses the feed-forward activations from both English and non-English inputs, enabling the model to benefit from both linguistic resources. This process is facilitated with a trainable Decision Maker that identifies beneficial activations. Furthermore, during inference, a Transform Matrix is utilized to simulate the cross-lingual connection under a monolingual setting through representation transformation. Our experiments on six benchmarks covering 22 languages show that CC-Tuning outperforms vanilla SFT and offers a strong latent-level alternative to data-level augmentation methods. Further analysis also highlights the practicality of CC-Tuning and the potential of latent-level cross-lingual interactions in advancing the multilingual performance of LLMs.
CLAIM: Mitigating Multilingual Object Hallucination in Large Vision-Language Models with Cross-Lingual Attention Intervention
Zekai Ye | Qiming Li | Xiaocheng Feng | Libo Qin | Yichong Huang | Baohang Li | Kui Jiang | Yang Xiang | Zhirui Zhang | Yunfei Lu | Duyu Tang | Dandan Tu | Bing Qin
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Large Vision-Language Models (LVLMs) have demonstrated impressive multimodal abilities but remain prone to multilingual object hallucination, with a higher likelihood of generating responses inconsistent with the visual input when utilizing queries in non-English languages compared to English. Most existing approaches to addressing this issue rely on pretraining or fine-tuning, which are resource-intensive. In this paper, inspired by observing the disparities in cross-modal attention patterns across languages, we propose Cross-Lingual Attention Intervention for Mitigating multilingual object hallucination (CLAIM) in LVLMs, a novel near training-free method that aligns attention patterns. CLAIM first identifies language-specific cross-modal attention heads, then estimates language shift vectors from English to the target language, and finally intervenes in the attention outputs during inference to facilitate cross-lingual visual perception capability alignment. Extensive experiments demonstrate that CLAIM achieves an average improvement of 13.56% (up to 30% in Spanish) on POPE and 21.75% on the hallucination subsets of the MME benchmark across various languages. Further analysis reveals that multilingual attention divergence is most prominent in intermediate layers, highlighting their critical role in multilingual scenarios.
DoCIA: An Online Document-Level Context Incorporation Agent for Speech Translation
Xinglin Lyu | Wei Tang | Yuang Li | Xiaofeng Zhao | Ming Zhu | Junhui Li | Yunfei Lu | Min Zhang | Daimeng Wei | Hao Yang | Min Zhang
Findings of the Association for Computational Linguistics: ACL 2025
Document-level context is crucial for handling discourse challenges in text-to-text document-level machine translation (MT). Despite the increased discourse challenges introduced by noise from automatic speech recognition (ASR), the integration of document-level context in speech translation (ST) remains insufficiently explored. In this paper, we develop DoCIA, an online framework that enhances ST performance by incorporating document-level context. DoCIA decomposes the ST pipeline into four stages. Document-level context is integrated into the ASR refinement, MT, and MT refinement stages through auxiliary LLM (large language model)-based modules. Furthermore, DoCIA leverages document-level information in a multi-level manner while minimizing computational overhead. Additionally, a simple yet effective determination mechanism is introduced to prevent hallucinations from excessive refinement, ensuring the reliability of the final results. Experimental results show that DoCIA significantly outperforms traditional ST baselines in both sentence and discourse metrics across four LLMs, demonstrating its effectiveness in improving ST performance.
Enhancing Speech Large Language Models with Prompt-Aware Mixture of Audio Encoders
Weiqiao Shan | Yuang Li | Yuhao Zhang | Yingfeng Luo | Chen Xu | Xiaofeng Zhao | Long Meng | Yunfei Lu | Min Zhang | Hao Yang | Tong Xiao | JingBo Zhu
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Connecting audio encoders with large language models (LLMs) allows the LLM to perform various audio understanding tasks, such as automatic speech recognition (ASR) and audio captioning (AC). Most research focuses on training an adapter layer to generate a unified audio feature for the LLM. However, different tasks may require distinct features that emphasize either semantic or acoustic aspects, making task-specific audio features more desirable. In this paper, we propose Prompt-aware Mixture (PaM) to enhance the Speech LLM that uses multiple audio encoders. Our approach involves using different experts to extract different features based on the prompt that indicates different tasks. Experiments demonstrate that with PaM, only one Speech LLM surpasses the best performances achieved by all single-encoder Speech LLMs on ASR, speaker number verification, and AC tasks. PaM also outperforms other feature fusion baselines, such as concatenation and averaging.
Co-authors
- Xiaocheng Feng (冯骁骋) 6
- Bing Qin (秦兵) 6
- Dandan Tu 6
- Xiachong Feng 3
- Yichong Huang 3
- Qiming Li 3
- Libo Qin 3
- Yangfan Ye 3
- Zekun Yuan 3
- Ruihan Chen 2
- Kun Chen 2
- Qichen Hong 2
- Lei Huang (黄磊) 2
- Baohang Li 2
- Yuang Li 2
- Weitao Ma (马伟涛) 2
- Haoyu Ren 2
- Duyu Tang 2
- Hao Yang 2
- Zekai Ye 2
- Zhirui Zhang 2
- Min Zhang 2
- Xiaofeng Zhao 2
- Yuxuan Gu 1
- Kui Jiang 1
- Junhui Li (李军辉) 1
- Yingfeng Luo 1
- Xinglin Lyu 1
- Yixuan Ma (马翊轩) 1
- Long Meng 1
- Weiqiao Shan 1
- Wei Tang 1
- Zihe Tong 1
- Daimeng Wei 1
- Yang Xiang 1
- Tong Xiao (肖桐) 1
- Chen Xu 1
- Xiaohui Yan 1
- Xiaoliang Yang 1
- Min Zhang 1
- Yuhao Zhang 1
- Weihong Zhong 1
- Zekun Zhou 1
- Ming Zhu 1
- JingBo Zhu (朱靖波) 1