2025
pdf
bib
abs
iTool: Reinforced Fine-Tuning with Dynamic Deficiency Calibration for Advanced Tool Use
Yirong Zeng
|
Xiao Ding
|
Yuxian Wang
|
Weiwen Liu
|
Yutai Hou
|
Wu Ning
|
Xu Huang
|
Duyu Tang
|
Dandan Tu
|
Bing Qin
|
Ting Liu
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Augmenting large language models (LLMs) with external tools is a promising approach to enhance their capabilities, especially for complex tasks. Synthesizing tool-use data through real-world simulations is an effective way to achieve this. However, our investigation reveals that training gains significantly decay as synthetic data increases. The model struggles to benefit from more synthetic data, and it can not equip the model with advanced tool-use capabilities in complex scenarios. Moreover, we discovered that the above limitation usually manifests as a fragment deficiency (i.e., parameter errors) in response. To this end, we propose an iterative reinforced fine-tuning strategy designed to alleviate this limitation. This strategy involves: (1) enhancing the diversity of response for synthetic data through path exploration of Monte Carlo Tree Search. (2) iteratively pinpointing the model’s deficiency by constructing fine-grained preference pairs, and then improving it by preference optimization algorithms for targeted improvement. The experiments show that our method achieves 13.11% better performance than the same-size base model. It achieves an improvement of 6.5% in complex scenarios compared to the baseline, and it also outperforms larger open-source and closed-source models.
pdf
bib
abs
GlotEval: A Test Suite for Massively Multilingual Evaluation of Large Language Models
Hengyu Luo
|
Zihao Li
|
Joseph Attieh
|
Sawal Devkota
|
Ona de Gibert
|
Xu Huang
|
Shaoxiong Ji
|
Peiqin Lin
|
Bhavani Sai Praneeth Varma Mantina
|
Ananda Sreenidhi
|
Raúl Vázquez
|
Mengjie Wang
|
Samea Yusofi
|
Fei Yuan
|
Jörg Tiedemann
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: System Demonstrations
Large language models (LLMs) are advancing at an unprecedented pace globally, with regions increasingly adopting these models for applications in their primary languages. Evaluating these models in diverse linguistic environments, especially in low-resource languages, has become a major challenge for academia and industry. Existing evaluation frameworks suffer from inconsistency across different benchmarks, being disproportionately focused on English and a handful of high-resource languages, thereby overlooking the realistic performance of LLMs in multilingual and lower-resource scenarios. To address this critical challenge of fragmented and inconsistent multilingual evaluation, we introduce GlotEval, a unified and lightweight framework that systematically integrates 27 benchmarks under a standardized ISO 639-3 language identifier system, allowing for seamless incorporation of new benchmarks. Supporting nine key tasks (machine translation, text classification, summarization, open-ended generation, reading comprehension, sequence labeling, intrinsic evaluation, instruction following and reasoning), spanning over dozens to hundreds of languages, GlotEval uniquely enables language-specific, cross-benchmark analysis and non-English-centric evaluations at a scale previously less practical for many researchers. This enables a precise diagnosis of model strengths and weaknesses in diverse linguistic contexts. A multilingual translation case study demonstrates GlotEval’s applicability for multilingual and language-specific evaluations.
pdf
bib
abs
QUITO-X: A New Perspective on Context Compression from the Information Bottleneck Theory
Yihang Wang
|
Xu Huang
|
Bowen Tian
|
Yueyang Su
|
Lei Yu
|
Huaming Liao
|
Yixing Fan
|
Jiafeng Guo
|
Xueqi Cheng
Findings of the Association for Computational Linguistics: EMNLP 2025
Generative large language models ( LLMs) have achieved remarkable success in various industrial applications, owing to their promising In-Context Learning capabilities. However, the issue of long context in complex tasks poses a significant barrier to their wider adoption, manifested in two main aspects: (i) The excessively long context leads to high costs and inference delays. (ii) A substantial amount of task-irrelevant information introduced by long contexts exacerbates the “lost in the middle” problem. Existing methods compress context by removing redundant tokens using metrics such as self-information or perplexity ( PPL ), which is inconsistent with the objective of retaining the most important tokens when conditioning on a given query. In this study, we introduce information bottleneck theory (IB) to model the problem, offering a novel perspective that thoroughly addresses the essential properties required for context compression. Additionally, we propose a cross-attention-based approach to approximate mutual information in IB, which can be flexibly replaced with suitable alternatives in different scenarios. Extensive experiments on four datasets demonstrate that our method achieves a 25% increase in compression rate compared to the state-of-the-art, while maintaining question answering performance. In particular, the context compressed by our method even outperform the full context in some cases.
pdf
bib
abs
ACEBench: A Comprehensive Evaluation of LLM Tool Usage
Chen Chen
|
Xinlong Hao
|
Weiwen Liu
|
Xu Huang
|
Xingshan Zeng
|
Shuai Yu
|
Dexun Li
|
Yuefeng Huang
|
Xiangcheng Liu
|
Wang Xinzhi
|
Wu Liu
Findings of the Association for Computational Linguistics: EMNLP 2025
Large Language Models (LLMs) have demonstrated significant potential in decision-making and reasoning, particularly when integrated with various tools to effectively solve complex problems. However, existing benchmarks for evaluating LLMs’ tool usage face several limitations: (1) limited evaluation scenarios, often lacking assessments in real multi-turn dialogue contexts; (2) narrow evaluation dimensions, with insufficient detailed assessments of how LLMs use tools; and (3) reliance on LLMs or real API executions for evaluation, which introduces significant overhead. To address these challenges, we introduce ACEBench, a comprehensive benchmark for assessing tool usage in LLMs. ACEBench categorizes data into three primary types based on evaluation methodology: Normal, Special, and Agent. “Normal” evaluates tool usage in basic scenarios; “Special” evaluates tool usage in situations with ambiguous or incomplete instructions; “Agent” evaluates tool usage through multi-agent interactions to simulate real-world, multi-turn dialogues. We conducted extensive experiments using ACEBench, analyzing various LLMs in-depth and providing a more granular examination of error causes across different data types.
pdf
bib
abs
BenchMAX: A Comprehensive Multilingual Evaluation Suite for Large Language Models
Xu Huang
|
Wenhao Zhu
|
Hanxu Hu
|
Conghui He
|
Lei Li
|
Shujian Huang
|
Fei Yuan
Findings of the Association for Computational Linguistics: EMNLP 2025
Existing multilingual benchmarks focus primarily on language understanding tasks. There is a lack of benchmarks to measure comprehensive critical capabilities of large language models (LLMs) across diverse languages, including instruction following, reasoning, code generation, and long context understanding. To bridge this gap, we develop BenchMAX, a multi-way multilingual benchmark that covers 10 diverse tasks, to evaluate LLMs’ general abilities across many languages. To ensure high data quality, each sample is post-edited by three native annotators after machine-translating from English into 16 languages. Extensive experiments on BenchMAX reveal uneven utilization of core capabilities across languages, emphasizing the performance gaps that scaling model size alone does not resolve. BenchMAX serves as a comprehensive multilingual evaluation platform, providing a promising test bed to promote the development of multilingual language models. The dataset and code are publicly accessible.
pdf
bib
abs
Findings of the WMT25 Terminology Translation Task: Terminology is Useful Especially for Good MTs
Kirill Semenov
|
Xu Huang
|
Vilém Zouhar
|
Nathaniel Berger
|
Dawei Zhu
|
Arturo Oncevay
|
Pinzhen Chen
Proceedings of the Tenth Conference on Machine Translation
The WMT25 Terminology Translation Task releases new resources in high-stakes domains and investigates the capabilities of translation systems to accurately and consistently translate specialized terms. This year, we feature new domain and language coverage over previous editions, introducing two distinct tracks: (1) sentence-level translation in the information technology domain for English→German, English→Russian, and English→Spanish, and (2) document-level translation in the finance domain for English↔Traditional Chinese with a document-level one-to-many dictionary. Participants are challenged to translate texts under three modes: no terminology, proper terminology, and random terminology, allowing for a causal analysis of terminology utility. Evaluation combines overall quality, terminology accuracy, and terminology consistency. This shared task attracted broad participation, with 13 teams submitting 20 systems in Track 1 and 4 teams participating in Track 2. The results show that providing proper terminology consistently boosts both overall translation quality and term accuracy, whereas reliance on random terminology yields smaller gains. Despite the near-saturation of sentence-level benchmarks, document-level finance translation still fallsshort, indicating an urgent need for long-form evaluation and more robust metrics tailored to professional domains.
2024
pdf
bib
abs
EDDA: An Encoder-Decoder Data Augmentation Framework for Zero-Shot Stance Detection
Daijun Ding
|
Li Dong
|
Zhichao Huang
|
Guangning Xu
|
Xu Huang
|
Bo Liu
|
Liwen Jing
|
Bowen Zhang
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
Stance detection aims to determine the attitude expressed in text towards a given target. Zero-shot stance detection (ZSSD) has emerged to classify stances towards unseen targets during inference. Recent data augmentation techniques for ZSSD increase transferable knowledge between targets through text or target augmentation. However, these methods exhibit limitations. Target augmentation lacks logical connections between generated targets and source text, while text augmentation relies solely on training data, resulting in insufficient generalization. To address these issues, we propose an encoder-decoder data augmentation (EDDA) framework. The encoder leverages large language models and chain-of-thought prompting to summarize texts into target-specific if-then rationales, establishing logical relationships. The decoder generates new samples based on these expressions using a semantic correlation word replacement strategy to increase syntactic diversity. We also analyze the generated expressions to develop a rationale-enhanced network that fully utilizes the augmented data. Experiments on benchmark datasets demonstrate our approach substantially improves over state-of-the-art ZSSD techniques. The proposed EDDA framework increases semantic relevance and syntactic variety in augmented texts while enabling interpretable rationale-based learning.
2022
pdf
bib
abs
Sentiment Interpretable Logic Tensor Network for Aspect-Term Sentiment Analysis
Bowen Zhang
|
Xu Huang
|
Zhichao Huang
|
Hu Huang
|
Baoquan Zhang
|
Xianghua Fu
|
Liwen Jing
Proceedings of the 29th International Conference on Computational Linguistics
Aspect-term sentiment analysis (ATSA) is an important task that aims to infer the sentiment towards the given aspect-terms. It is often required in the industry that ATSA should be performed with interpretability, computational efficiency and high accuracy. However, such an ATSA method has not yet been developed. This study aims to develop an ATSA method that fulfills all these requirements. To achieve the goal, we propose a novel Sentiment Interpretable Logic Tensor Network (SILTN). SILTN is interpretable because it is a neurosymbolic formalism and a computational model that supports learning and reasoning about data with a differentiable first-order logic language (FOL). To realize SILTN with high inferring accuracy, we propose a novel learning strategy called the two-stage syntax knowledge distillation (TSynKD). Using widely used datasets, we experimentally demonstrate that the proposed TSynKD is effective for improving the accuracy of SILTN, and the SILTN has both high interpretability and computational efficiency.