Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Wanxiang Che | Joyce Nabende | Ekaterina Shutova | Mohammad Taher Pilehvar
EcomScriptBench: A Multi-task Benchmark for E-commerce Script Planning via Step-wise Intention-Driven Product Association
Weiqi Wang | Limeng Cui | Xin Liu | Sreyashi Nag | Wenju Xu | Chen Luo | Sheikh Muhammad Sarwar | Yang Li | Hansu Gu | Hui Liu | Changlong Yu | Jiaxin Bai | Yifan Gao | Haiyang Zhang | Qi He | Shuiwang Ji | Yangqiu Song
Goal-oriented script planning, or the ability to devise coherent sequences of actions toward specific goals, is commonly employed by humans to plan for typical activities. In e-commerce, customers increasingly seek LLM-based assistants to generate scripts and recommend products at each step, thereby facilitating convenient and efficient shopping experiences. However, this capability remains underexplored due to several challenges, including the inability of LLMs to simultaneously conduct script planning and product retrieval, difficulties in matching products caused by semantic discrepancies between planned actions and search queries, and a lack of methods and benchmark data for evaluation. In this paper, we step forward by formally defining the task of E-commerce Script Planning (EcomScript) as three sequential subtasks. We propose a novel framework that enables the scalable generation of product-enriched scripts by associating products with each step based on the semantic similarity between the actions and their purchase intentions. By applying our framework to real-world e-commerce data, we construct the very first large-scale EcomScript dataset, EcomScriptBench, which includes 605,229 scripts sourced from 2.4 million products. Human annotations are then conducted to provide gold labels for a sampled subset, forming an evaluation benchmark. Extensive experiments reveal that current (L)LMs face significant challenges with EcomScript tasks, even after fine-tuning, while injecting product purchase intentions improves their performance.
GraphNarrator: Generating Textual Explanations for Graph Neural Networks
Bo Pan | Zhen Xiong | Guanchen Wu | Zheng Zhang | Yifei Zhang | Yuntong Hu | Liang Zhao
Graph representation learning has garnered significant attention due to its broad applications in various domains, such as recommendation systems and social network analysis. Despite advancements in graph learning methods, challenges still remain in explainability when graphs are associated with semantic features. In this paper, we present GraphNarrator, the first method designed to generate natural language explanations for Graph Neural Networks. GraphNarrator employs a generative language model that maps input-output pairs to explanations reflecting the model’s decision-making process. To address the lack of ground truth explanations to train the model, we propose first generating pseudo-labels that capture the model’s decisions from saliency-based explanations, then using Expert Iteration to iteratively train the pseudo-label generator based on training objectives on explanation quality. The high-quality pseudo-labels are finally utilized to train an end-to-end explanation generator model. Extensive experiments are conducted to demonstrate the effectiveness of GraphNarrator in producing faithful, concise, and human-preferred natural language explanations.
M-RewardBench: Evaluating Reward Models in Multilingual Settings
Srishti Gureja | Lester James Validad Miranda | Shayekh Bin Islam | Rishabh Maheshwary | Drishti Sharma | Gusti Triandi Winata | Nathan Lambert | Sebastian Ruder | Sara Hooker | Marzieh Fadaee
Reward models (RMs) have driven the state-of-the-art performance of LLMs today by enabling the integration of human feedback into the language modeling process. However, RMs are primarily trained and evaluated in English, and their capabilities in multilingual settings remain largely understudied. In this work, we conduct a systematic evaluation of several reward models in multilingual settings. We first construct the first-of-its-kind multilingual RM evaluation benchmark, M-RewardBench, consisting of 2.87k preference instances for 23 typologically diverse languages, that tests the chat, safety, reasoning, and translation capabilities of RMs. We then rigorously evaluate a wide range of reward models on M-RewardBench, offering fresh insights into their performance across diverse languages. We identify a significant gap in RMs’ performance between English and non-English languages and show that RM preferences can change substantially from one language to another. We also present several findings on how different multilingual aspects impact RM performance. Specifically, we show that the performance of RMs improves with better translation quality. Similarly, we demonstrate that the models exhibit better performance for high-resource languages. We release the M-RewardBench dataset and the codebase from this study to facilitate a better understanding of RM evaluation in multilingual settings.
ELABORATION: A Comprehensive Benchmark on Human-LLM Competitive Programming
Xinwei Yang | Zhaofeng Liu | Chen Huang | Jiashuai Zhang | Tong Zhang | Yifan Zhang | Wenqiang Lei
While recent research increasingly emphasizes the value of human-LLM collaboration in competitive programming and proposes numerous empirical methods, a comprehensive understanding remains elusive due to the fragmented nature of existing studies and their use of diverse, application-specific human feedback. Thus, our work serves a three-fold purpose: First, we present the first taxonomy of human feedback consolidating the entire programming process, which promotes fine-grained evaluation. Second, we introduce ELABORATIONSET, a novel programming dataset specifically designed for human-LLM collaboration, meticulously annotated to enable large-scale simulated human feedback and facilitate cost-effective real human interaction studies. Third, we introduce ELABORATION, a novel benchmark to facilitate a thorough assessment of human-LLM competitive programming. With ELABORATION, we pinpoint strengths and weaknesses of existing methods, thereby setting the foundation for future improvement. Our dataset and code will be openly released.
The Impossibility of Fair LLMs
Jacy Reese Anthis | Kristian Lum | Michael Ekstrand | Avi Feller | Chenhao Tan
The rise of general-purpose artificial intelligence (AI) systems, particularly large language models (LLMs), has raised pressing moral questions about how to reduce bias and ensure fairness at scale. Researchers have documented a sort of “bias” in the significant correlations between demographics (e.g., race, gender) in LLM prompts and responses, but it remains unclear how LLM fairness could be evaluated with more rigorous definitions, such as group fairness or fair representations. We analyze a variety of technical fairness frameworks and find inherent challenges in each that make the development of a fair LLM intractable. We show that each framework either does not logically extend to the general-purpose AI context or is infeasible in practice, primarily due to the large amounts of unstructured training data and the many potential combinations of human populations, use cases, and sensitive attributes. These inherent challenges would persist for general-purpose AI, including LLMs, even if empirical challenges, such as limited participatory input and limited measurement methods, were overcome. Nonetheless, fairness will remain an important type of model evaluation, and there are still promising research directions, particularly the development of standards for the responsibility of LLM developers, context-specific evaluations, and methods of iterative, participatory, and AI-assisted evaluation that could scale fairness across the diverse contexts of modern human-AI interaction.
Intuitive Fine-Tuning: Towards Simplifying Alignment into a Single Process
Ermo Hua | Biqing Qi | Kaiyan Zhang | Kai Tian | Xingtai Lv | Ning Ding | Bowen Zhou
Supervised Fine-Tuning (SFT) and Preference Optimization (PO) are key processes for aligning Language Models (LMs) with human preferences after pre-training. While SFT excels in efficiency and PO in effectiveness, they are often combined sequentially without integrating their optimization objectives. This approach ignores the opportunity to bridge their paradigm gap and take the strengths of both. In this paper, we interpret SFT and PO through two sub-processes, *Preference Estimation* and *Transition Optimization*, defined at the token level within the Markov Decision Process (MDP). This modeling shows that SFT is only a special case of PO with inferior estimation and optimization. PO estimates the model’s preference from its entire generation, while SFT only scores the model’s subsequent predicted tokens based on prior tokens from the ground-truth answer. These priors deviate from the model’s distribution, hindering preference estimation and transition optimization. Building on this view, we introduce **Intuitive Fine-Tuning (IFT)** to integrate SFT and PO into a single process. Through a temporal residual connection, IFT brings better estimation and optimization by capturing LMs’ intuitive sense of their entire answers. Yet it relies solely on a single policy and the same volume of non-preference-labeled data as SFT. Our experiments show that IFT performs comparably or even superiorly to SFT and some typical PO methods across several tasks, particularly those requiring generation, reasoning, and fact-following abilities. An explainable Frozen Lake game further validates the effectiveness of IFT for obtaining a competitive policy.
Bias in Language Models: Beyond Trick Tests and Towards RUTEd Evaluation
Kristian Lum | Jacy Reese Anthis | Kevin Robinson | Chirag Nagpal | Alexander Nicholas D’Amour
Standard bias benchmarks used for large language models (LLMs) measure the association between social attributes in model inputs and single-word model outputs. We test whether these benchmarks are robust to lengthening the model outputs via a more realistic user prompt, in the commonly studied domain of gender-occupation bias, as a step towards measuring Realistic Use and Tangible Effects (i.e., RUTEd evaluations). From the current literature, we adapt three standard metrics of next-word prediction (neutrality, skew, and stereotype), and we develop analogous RUTEd evaluations in three contexts of real-world LLM use: children’s bedtime stories, user personas, and English language learning exercises. We find that standard bias metrics have no significant correlation with long-form output metrics. For example, selecting the least biased model based on the standard “trick tests” coincides with selecting the least biased model based on longer output no more than random chance. There may not yet be evidence to justify standard benchmarks as reliable proxies of real-world biases, and we encourage further development of context-specific RUTEd evaluations.
Sliding Windows Are Not the End: Exploring Full Ranking with Long-Context Large Language Models
Wenhan Liu | Xinyu Ma | Yutao Zhu | Ziliang Zhao | Shuaiqiang Wang | Dawei Yin | Zhicheng Dou
Large Language Models (LLMs) have shown exciting performance in listwise passage ranking. Due to the limited input length, existing methods often adopt the sliding window strategy. Such a strategy, though effective, is inefficient as it involves repetitive and serialized processing, which usually re-evaluates relevant passages multiple times. As a result, it incurs redundant API costs, which are proportional to the number of inference tokens. The development of long-context LLMs enables the full ranking of all passages within a single inference, avoiding redundant API costs. In this paper, we conduct a comprehensive study of long-context LLMs for ranking tasks in terms of efficiency and effectiveness. Surprisingly, our experiments reveal that full ranking with long-context LLMs can deliver superior performance in the supervised fine-tuning setting with a huge efficiency improvement. Furthermore, we identify two limitations of fine-tuning the full ranking model based on existing methods: (1) sliding window strategy fails to produce a full ranking list as a training label, and (2) the language modeling loss cannot emphasize top-ranked passage IDs in the label. To alleviate these issues, we propose a new complete listwise label construction approach and a novel importance-aware learning objective for full ranking. Experiments show the superior performance of our method over baselines.
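The redundancy of the sliding window strategy can be made concrete with a small sketch. The code below is an illustration only, not the authors' implementation: the `rank` callable stands in for an LLM listwise ranker, and the window size of 20 and step of 10 are assumed values chosen for the example.

```python
from typing import Callable, List, Tuple

# Hypothetical stand-in for an LLM listwise ranker:
# takes a list of passages and returns the same passages, best first.
Ranker = Callable[[List[str]], List[str]]


def sliding_window_rerank(passages: List[str], rank: Ranker,
                          window: int = 20, step: int = 10) -> Tuple[List[str], int]:
    """Rerank back to front with overlapping windows.

    Returns the final ordering and the total number of passages sent to the
    ranker (a rough proxy for inference tokens).
    """
    ranked = list(passages)
    scored = 0
    end = len(ranked)
    while True:
        start = max(0, end - window)
        # Each call re-sorts one window; overlapping passages are scored again.
        ranked[start:end] = rank(ranked[start:end])
        scored += end - start
        if start == 0:
            break
        end -= step
    return ranked, scored


def full_rerank(passages: List[str], rank: Ranker) -> Tuple[List[str], int]:
    """Rank all passages in a single call, as a long-context model could."""
    return rank(list(passages)), len(passages)


if __name__ == "__main__":
    passages = [f"p{i}" for i in range(100)]
    # Toy relevance scores, distinct for every passage.
    relevance = {p: (i * 37) % 101 for i, p in enumerate(passages)}
    oracle = lambda ps: sorted(ps, key=relevance.get, reverse=True)

    _, sliding_cost = sliding_window_rerank(passages, oracle)
    _, full_cost = full_rerank(passages, oracle)
    print(sliding_cost, full_cost)
```

With 100 passages, a window of 20, and a step of 10, the sliding pass needs 9 sequential ranker calls covering 180 passages in total, while full ranking covers 100 passages in one call.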
The Impact of Auxiliary Patient Data on Automated Chest X-Ray Report Generation and How to Incorporate It
Aaron Nicolson | Shengyao Zhuang | Jason Dowling | Bevan Koopman
This study investigates the integration of diverse patient data sources into multimodal language models for automated chest X-ray (CXR) report generation. Traditionally, CXR report generation relies solely on data from a patient’s CXR exam, overlooking valuable information from patient electronic health records. Utilising the MIMIC-CXR and MIMIC-IV-ED datasets, we investigate the use of patient data from emergency department (ED) records — such as vital signs measured and medicines reconciled during an ED stay — for CXR report generation, with the aim of enhancing diagnostic accuracy. We also investigate conditioning CXR report generation on the clinical history section of radiology reports, which has been overlooked in the literature. We introduce a novel approach to transform these heterogeneous data sources into patient data embeddings that prompt a multimodal language model (CXRMate-ED). Our comprehensive evaluation indicates that using a broader set of patient data significantly enhances diagnostic accuracy. The model, training code, and dataset are publicly available.
CLEME2.0: Towards Interpretable Evaluation by Disentangling Edits for Grammatical Error Correction
Jingheng Ye | Zishan Xu | Yinghui Li | Linlin Song | Qingyu Zhou | Hai-Tao Zheng | Ying Shen | Wenhao Jiang | Hong-Gee Kim | Ruitong Liu | Xin Su | Zifei Shan
The paper focuses on the interpretability of Grammatical Error Correction (GEC) evaluation metrics, which has received little attention in previous studies. To bridge the gap, we introduce **CLEME2.0**, a reference-based metric describing four fundamental aspects of GEC systems: hit-correction, wrong-correction, under-correction, and over-correction. They collectively contribute to exposing critical qualities and locating drawbacks of GEC systems. Evaluating systems by combining these aspects also leads to superior human consistency over other reference-based and reference-less metrics. Extensive experiments on two human judgment datasets and six reference datasets demonstrate the effectiveness and robustness of our method, achieving a new state-of-the-art result. Our code is released at https://github.com/THUKElab/CLEME.
StrucText-Eval: Evaluating Large Language Model’s Reasoning Ability in Structure-Rich Text
Zhouhong Gu | Haoning Ye | Xingzhou Chen | Zeyang Zhou | Hongwei Feng | Yanghua Xiao
The effective utilization of structured data, integral to corporate data strategies, has been challenged by the rise of large language models (LLMs) capable of processing unstructured information. This shift prompts the question: can LLMs interpret structured data directly in its unstructured form? We propose an automatic evaluation data generation method for assessing LLMs’ reasoning capabilities on structure-rich text to explore this. Our approach supports 8 structured languages and 29 tasks, generating data with adjustable complexity through controllable nesting and structural width. We introduce StrucText-Eval, a benchmark containing 5,800 pre-generated and annotated samples designed to evaluate how well LLMs understand and reason through structured text. StrucText-Eval is divided into two suites: a regular Test suite (3,712 samples) and a Test-Hard suite (2,088 samples), the latter emphasizing the gap between human and model performance on more complex tasks. Experimental results show that while open-source LLMs achieve a maximum accuracy of 74.9% on the standard dataset, their performance drops significantly to 45.8% on the harder dataset. In contrast, human participants reach an accuracy of 92.6% on StrucText-Eval-Hard, highlighting LLMs’ current limitations in handling intricate structural information.
Literature Meets Data: A Synergistic Approach to Hypothesis Generation
Haokun Liu | Yangqiaoyu Zhou | Mingxuan Li | Chenfei Yuan | Chenhao Tan
AI holds promise for transforming scientific processes, including hypothesis generation. Prior work on hypothesis generation can be broadly categorized into theory-driven and data-driven approaches. While both have proven effective in generating novel and plausible hypotheses, it remains an open question whether they can complement each other. To address this, we develop the first method that combines literature-based insights with data to perform LLM-powered hypothesis generation. We apply our method on five different datasets and demonstrate that integrating literature and data outperforms other baselines (8.97% over few-shot, 15.75% over literature-based alone, and 3.37% over data-driven alone). Additionally, we conduct the first human evaluation to assess the utility of LLM-generated hypotheses in assisting human decision-making on two challenging tasks: deception detection and AI generated content detection. Our results show that human accuracy improves significantly by 7.44% and 14.19% on these tasks, respectively. These findings suggest that integrating literature-based and data-driven approaches provides a comprehensive and nuanced framework for hypothesis generation and could open new avenues for scientific inquiry.
GAPO: Learning Preferential Prompt through Generative Adversarial Policy Optimization
Zhouhong Gu | Xingzhou Chen | Xiaoran Shi | Tao Wang | Suhang Zheng | Tianyu Li | Hongwei Feng | Yanghua Xiao
Recent advances in large language models have highlighted the critical need for precise control over model outputs through predefined constraints. While existing methods attempt to achieve this through either direct instruction-response synthesis or preferential response optimization, they often struggle with constraint understanding and adaptation. This limitation becomes particularly evident when handling fine-grained constraints, leading to either hallucination or brittle performance. We introduce Generative Adversarial Policy Optimization (GAPO), a novel framework that combines GAN-based training dynamics with an encoder-only reward model to progressively learn and adapt to increasingly complex constraints. GAPO leverages adversarial training to automatically generate training samples of varying difficulty while utilizing the encoder-only architecture to better capture prompt-response relationships. Extensive experiments demonstrate GAPO’s superior performance across multiple benchmarks, particularly in scenarios requiring fine-grained constraint handling, where it significantly outperforms existing methods like PPO, DPO, and KTO. Our results suggest that GAPO’s unique approach to preferential prompt learning offers a more robust and effective solution for controlling LLM outputs.
Tree-of-Evolution: Tree-Structured Instruction Evolution for Code Generation in Large Language Models
Ziyang Luo | Kaixin Li | Hongzhan Lin | Yuchen Tian | Mohan Kankanhalli | Jing Ma
Data synthesis has become a crucial research area in large language models (LLMs), especially for generating high-quality instruction fine-tuning data to enhance downstream performance. In code generation, a key application of LLMs, manual annotation of code instruction data is costly. Recent methods, such as Code Evol-Instruct and OSS-Instruct, leverage LLMs to synthesize large-scale code instruction data, significantly improving LLM coding capabilities. However, these approaches face limitations due to unidirectional synthesis and randomness-driven generation, which restrict data quality and diversity. To overcome these challenges, we introduce Tree-of-Evolution (ToE), a novel framework that models code instruction synthesis process with a tree structure, exploring multiple evolutionary paths to alleviate the constraints of unidirectional generation. Additionally, we propose optimization-driven evolution, which refines each generation step based on the quality of the previous iteration. Experimental results across five widely-used coding benchmarks—HumanEval, MBPP, EvalPlus, LiveCodeBench, and BigCodeBench—demonstrate that base models fine-tuned on just 75k data synthesized by our method achieve comparable or superior performance to the state-of-the-art open-weight Code LLM, Qwen2.5-Coder-Instruct, which was fine-tuned on millions of samples.
Delving into Multilingual Ethical Bias: The MSQAD with Statistical Hypothesis Tests for Large Language Models
Seunguk Yu | Juhwan Choi | YoungBin Kim
Despite the recent strides in large language models, studies have underscored the existence of social biases within these systems. In this paper, we delve into the validation and comparison of the ethical biases of LLMs concerning globally discussed and potentially sensitive topics, hypothesizing that these biases may arise from language-specific distinctions. Introducing the Multilingual Sensitive Questions & Answers Dataset (**MSQAD**), we collected news articles from Human Rights Watch covering 17 topics, and generated socially sensitive questions along with corresponding responses in multiple languages. We scrutinized the biases of these responses across languages and topics, employing two statistical hypothesis tests. The results showed that the null hypotheses were rejected in most cases, indicating biases arising from cross-language differences. It demonstrates that ethical biases in responses are widespread across various languages, and notably, these biases were prevalent even among different LLMs. By making the proposed MSQAD openly available, we aim to facilitate future research endeavors focused on examining cross-language biases in LLMs and their variant models.
ReSCORE: Label-free Iterative Retriever Training for Multi-hop Question Answering with Relevance-Consistency Supervision
Dosung Lee | Wonjun Oh | Boyoung Kim | Minyoung Kim | Joonsuk Park | Paul Hongsuck Seo
Multi-hop question answering (MHQA) involves reasoning across multiple documents to answer complex questions. Dense retrievers typically outperform sparse methods like BM25 by leveraging semantic embeddings in many tasks; however, they require labeled query-document pairs for fine-tuning, which poses a significant challenge in MHQA due to the complexity of the reasoning steps. To overcome this limitation, we introduce Retriever Supervision with Consistency and Relevance (ReSCORE), a novel method for training dense retrievers for MHQA without the need for labeled documents. ReSCORE leverages large language models to measure document-question relevance with answer consistency and utilizes this information to train a retriever within an iterative question-answering framework. Evaluated on three MHQA benchmarks, our extensive experiments demonstrate the effectiveness of ReSCORE, with significant improvements in retrieval performance that consequently lead to state-of-the-art Exact Match and F1 scores for MHQA.
FACT-AUDIT: An Adaptive Multi-Agent Framework for Dynamic Fact-Checking Evaluation of Large Language Models
Hongzhan Lin | Yang Deng | Yuxuan Gu | Wenxuan Zhang | Jing Ma | See-Kiong Ng | Tat-Seng Chua
Large Language Models (LLMs) have significantly advanced fact-checking studies. However, existing automated fact-checking evaluation methods rely on static datasets and classification metrics, which fail to automatically evaluate justification production and uncover the nuanced limitations of LLMs in fact-checking. In this work, we introduce FACT-AUDIT, an agent-driven framework that adaptively and dynamically assesses LLMs’ fact-checking capabilities. Leveraging importance sampling principles and multi-agent collaboration, FACT-AUDIT generates adaptive and scalable datasets, performs iterative model-centric evaluations, and updates assessments based on model-specific responses. By incorporating justification production alongside verdict prediction, this framework provides a comprehensive and evolving audit of LLMs’ factual reasoning capabilities in order to investigate their trustworthiness. Extensive experiments demonstrate that FACT-AUDIT effectively differentiates among state-of-the-art LLMs, providing valuable insights into model strengths and limitations in model-centric fact-checking analysis.
Statistical Deficiency for Task Inclusion Estimation
Loïc Fosse | Frederic Bechet | Benoit Favre | Géraldine Damnati | Gwénolé Lecorvé | Maxime Darrin | Philippe Formont | Pablo Piantanida
Tasks are central in machine learning, as they are the most natural objects to assess the capabilities of current models. The trend is to build general models able to address any task. Even though transfer learning and multitask learning try to leverage the underlying task space, no well-founded tools are available to study its structure. This study proposes a theoretically grounded setup to define the notion of task and to compute the inclusion between two tasks from a statistical deficiency point of view. We propose a tractable proxy as information sufficiency to estimate the degree of inclusion between tasks, show its soundness on synthetic data, and use it to reconstruct empirically the classic NLP pipeline.
Towards Robust and Efficient Federated Low-Rank Adaptation with Heterogeneous Clients
Jabin Koo | Minwoo Jang | Jungseul Ok
Federated fine-tuning for Large Language Models (LLMs) has recently gained attention due to the heavy communication overhead of transmitting large model updates. Low Rank Adaptation (LoRA) has been proposed as a solution, yet its application in federated learning is complicated by discordance in aggregation. Existing methods addressing this discordance often suffer from performance degradation at low ranks in heterogeneous data settings. In response, we introduce LoRA-A^2 (Low Rank Adaptation with Alternating freeze and Adaptive rank selection), which demonstrates robustness in challenging settings with low ranks and high data heterogeneity. Our experimental findings reveal that LoRA-A^2 maintains performance even under extreme heterogeneity and low rank conditions, achieving up to a 99.8% reduction in uploaded parameters compared to full fine-tuning without compromising performance. This adaptive mechanism boosts robustness and communication efficiency in federated fine-tuning, enabling the practical deployment of LLMs in resource-constrained environments.
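The abstract does not spell out what the aggregation discordance is. One common reading, stated here as an assumption, is that averaging the LoRA factors separately differs from averaging the client updates themselves. With $N$ clients, where client $i$ holds low-rank factors $B_i$ and $A_i$ (notation introduced for this sketch):

```latex
\frac{1}{N}\sum_{i=1}^{N} B_i A_i
\;\neq\;
\Bigl(\frac{1}{N}\sum_{i=1}^{N} B_i\Bigr)\Bigl(\frac{1}{N}\sum_{i=1}^{N} A_i\Bigr)
\quad\text{in general, whereas}\quad
\frac{1}{N}\sum_{i=1}^{N} B_i A
=
\Bigl(\frac{1}{N}\sum_{i=1}^{N} B_i\Bigr) A
\quad\text{when } A_i = A \text{ for all } i .
```

Under this reading, keeping one factor frozen during a round makes factor-wise averaging exact, which is consistent with the "alternating freeze" in the method's name.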
LLM-Powered Test Case Generation for Detecting Bugs in Plausible Programs
Kaibo Liu | Zhenpeng Chen | Yiyang Liu | Jie M. Zhang | Mark Harman | Yudong Han | Yun Ma | Yihong Dong | Ge Li | Gang Huang
Detecting tricky bugs in plausible programs, those that pass existing test suites yet still contain bugs, remains a significant challenge in software testing. To address this problem, we propose TrickCatcher, an LLM-powered approach to generating test cases for uncovering bugs in plausible programs. TrickCatcher operates in three stages: First, it uses an LLM to generate program variants based on the program under test (PUT) and its specification. Second, it employs an LLM to construct an input generator from the specification for producing test inputs. Finally, these inputs are executed on both the PUT and its program variants to detect inconsistencies in their outputs. We evaluate TrickCatcher on two datasets, TrickyBugs and EvalPlus, which include 366 human-written and 151 AI-generated plausible programs with tricky bugs. TrickCatcher achieves recall, precision, and F1 scores that are 1.80×, 2.65×, and 1.66× those of the state-of-the-art baselines, respectively. Code and data used are available at https://github.com/RinCloud/TrickCatcher/.
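The third stage described above, running the same inputs on the program under test and its variants and flagging disagreements, can be sketched as follows. This is a minimal illustration under stated assumptions, not TrickCatcher's code: the variants and inputs are hand-written here, whereas the paper obtains them from an LLM, and all function names are hypothetical.

```python
from typing import Callable, Iterable, List, Tuple

Program = Callable[[int], int]


def find_inconsistencies(put: Program, variants: List[Program],
                         inputs: Iterable[int]) -> List[Tuple[int, int, List[int]]]:
    """Return (input, PUT output, variant outputs) wherever any variant disagrees."""
    suspicious = []
    for x in inputs:
        put_out = put(x)
        variant_outs = [variant(x) for variant in variants]
        if any(out != put_out for out in variant_outs):
            suspicious.append((x, put_out, variant_outs))
    return suspicious


# Toy specification: return the number of decimal digits of a non-negative integer.
def put_digits(n: int) -> int:
    """Plausible program under test: correct except for n == 0."""
    count = 0
    while n > 0:
        n //= 10
        count += 1
    return count


def variant_a(n: int) -> int:
    """Hand-written stand-in for an LLM-generated variant."""
    return len(str(n))


def variant_b(n: int) -> int:
    """Second hand-written stand-in variant."""
    return 1 if n < 10 else 1 + variant_b(n // 10)


if __name__ == "__main__":
    # Stand-in for the output of a generated input generator.
    test_inputs = range(0, 1000, 111)
    for case in find_inconsistencies(put_digits, [variant_a, variant_b], test_inputs):
        print(case)
```

A disagreement only signals a candidate bug; deciding which side is wrong still requires the specification or a further check.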
Capture the Key in Reasoning to Enhance CoT Distillation Generalization
Chengwei Dai | Kun Li | Wei Zhou | Songlin Hu
As Large Language Models (LLMs) scale up and gain powerful Chain-of-Thoughts (CoTs) reasoning abilities, practical resource constraints drive efforts to distill these capabilities into more compact Smaller Language Models (SLMs). We find that CoTs consist mainly of simple reasoning forms, with a small proportion (4.7%) of key reasoning steps that truly impact conclusions. However, previous distillation methods typically involve supervised fine-tuning of student SLMs only on correct CoTs data produced by teacher LLMs, resulting in students struggling to learn the key reasoning steps, instead imitating the teacher’s reasoning forms and making errors or omissions in reasoning. To address these issues, drawing an analogy to human learning, where analyzing mistakes according to correct solutions often reveals the crucial steps leading to successes or failures, we propose mistakE-Driven key reasonIng step distillaTion (EDIT), a novel method that helps SLMs learn key reasoning steps rather than relying on simple fine-tuning alone. First, to expose the crucial steps in CoTs, we carefully design specific prompts to generate dual CoTs data with similar reasoning paths but divergent conclusions. Then, we apply the minimum edit distance algorithm on the dual CoTs data to locate these key steps and optimize the likelihood on these tokens. Extensive experiments and analysis validate the effectiveness of EDIT across both in-domain (IND) and out-of-domain (OOD) benchmark reasoning datasets.
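The step of locating key tokens with minimum edit distance can be illustrated with a small sketch. It is a generic Levenshtein alignment over whitespace tokens, offered as an assumption about how such a comparison might look rather than as the authors' implementation; the function name and the toy pair of reasoning chains are hypothetical.

```python
from typing import List, Tuple


def edit_alignment(a: List[str], b: List[str]) -> Tuple[int, List[int], List[int]]:
    """Levenshtein distance between two token lists, plus the indices of the
    tokens in each list that take part in an edit (substitute, delete, insert)."""
    n, m = len(a), len(b)
    # dp[i][j] = minimum edits to turn a[:i] into b[:j]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dp[i][0] = i
    for j in range(m + 1):
        dp[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # delete a[i-1]
                           dp[i][j - 1] + 1,         # insert b[j-1]
                           dp[i - 1][j - 1] + cost)  # match or substitute

    # Walk back from the end to recover which tokens were edited.
    edited_a: List[int] = []
    edited_b: List[int] = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and a[i - 1] == b[j - 1] and dp[i][j] == dp[i - 1][j - 1]:
            i, j = i - 1, j - 1                      # match, nothing to record
        elif i > 0 and j > 0 and dp[i][j] == dp[i - 1][j - 1] + 1:
            edited_a.append(i - 1)                   # substitution
            edited_b.append(j - 1)
            i, j = i - 1, j - 1
        elif i > 0 and dp[i][j] == dp[i - 1][j] + 1:
            edited_a.append(i - 1)                   # deletion from a
            i -= 1
        else:
            edited_b.append(j - 1)                   # insertion from b
            j -= 1
    return dp[n][m], edited_a[::-1], edited_b[::-1]


if __name__ == "__main__":
    # Toy dual chains: same reasoning path, divergent conclusion.
    correct = "3 apples plus 4 apples is 7 so the answer is 7".split()
    wrong = "3 apples plus 4 apples is 8 so the answer is 8".split()
    distance, key_in_correct, key_in_wrong = edit_alignment(correct, wrong)
    print(distance, [correct[k] for k in key_in_correct], [wrong[k] for k in key_in_wrong])
```

In this toy pair the alignment isolates the two positions where the chains diverge, which is the kind of token set a training objective could then emphasize.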
How to Enable Effective Cooperation Between Humans and NLP Models: A Survey of Principles, Formalizations, and Beyond
Chen Huang | Yang Deng | Wenqiang Lei | Jiancheng Lv | Tat-Seng Chua | Jimmy Huang
With the advancement of large language models (LLMs), intelligent models have evolved from mere tools to autonomous agents with their own goals and strategies for cooperating with humans. This evolution has birthed a novel paradigm in NLP, i.e., human-model cooperation, that has yielded remarkable progress in numerous NLP tasks in recent years. In this paper, we take the first step to present a thorough review of human-model cooperation, exploring its principles, formalizations, and open challenges. In particular, we introduce a new taxonomy that provides a unified perspective to summarize existing approaches. Also, we discuss potential frontier areas and their corresponding challenges. We regard our work as an entry point, paving the way for more breakthrough research in this regard.
Enhancing Hyperbole and Metaphor Detection with Their Bidirectional Dynamic Interaction and Emotion Knowledge
Li Zheng | Sihang Wang | Hao Fei | Zuquan Peng | Fei Li | Jianming Fu | Chong Teng | Donghong Ji
Text-based hyperbole and metaphor detection are of great significance for natural language processing (NLP) tasks. However, due to their semantic obscurity and expressive diversity, it is rather challenging to identify them. Existing methods mostly focus on superficial text features, ignoring the associations of hyperbole and metaphor as well as the effect of implicit emotion on perceiving these rhetorical devices. To implement these hypotheses, we propose an emotion-guided hyperbole and metaphor detection framework based on bidirectional dynamic interaction (EmoBi). Firstly, the emotion analysis module deeply mines the emotion connotations behind hyperbole and metaphor. Next, the emotion-based domain mapping module identifies the target and source domains to gain a deeper understanding of the implicit meanings of hyperbole and metaphor. Finally, the bidirectional dynamic interaction module enables the mutual promotion between hyperbole and metaphor. Meanwhile, a verification mechanism is designed to ensure detection accuracy and reliability. Experiments show that EmoBi outperforms all baseline methods on four datasets. Specifically, compared to the current SoTA, the F1 score increased by 28.1% for hyperbole detection on the TroFi dataset and 23.1% for metaphor detection on the HYPO-L dataset. These results, underpinned by in-depth analyses, underscore the effectiveness and potential of our approach for advancing hyperbole and metaphor detection.
pdf
bib
abs
UniICL: An Efficient ICL Framework Unifying Compression, Selection, and Generation
Jun Gao
|
Qi Lv
|
Zili Wang
|
Tianxiang Wu
|
Ziqiang Cao
|
Wenjie Li
In-context learning (ICL) enhances the reasoning abilities of Large Language Models (LLMs) by prepending a few demonstrations. It motivates researchers to introduce more examples to provide additional contextual information for the generation. However, existing methods show a significant limitation due to the problem of excessive growth in context length, which causes a large hardware burden. Additionally, shallow-relevant examples selected by off-the-shelf tools hinder LLMs from capturing useful contextual information for generation. In this paper, to address these limitations, we propose UniICL, a novel Unified ICL framework that unifies demonstration compression, demonstration selection, and final response generation. Furthermore, to avoid repeated compression of the same demonstration and boost inference efficiency, we design a tailored compression strategy that allows UniICL to cache compression results in a Demonstration Bank (DB). Extensive out-of-domain evaluations prove the advantages of UniICL in both effectiveness and efficiency.
pdf
bib
abs
BelarusianGLUE: Towards a Natural Language Understanding Benchmark for Belarusian
Maksim Aparovich
|
Volha Harytskaya
|
Vladislav Poritski
|
Oksana Volchek
|
Pavel Smrz
In the epoch of multilingual large language models (LLMs), it is still challenging to evaluate the models’ understanding of lower-resourced languages, which motivates further development of expert-crafted natural language understanding benchmarks. We introduce BelarusianGLUE — a natural language understanding benchmark for Belarusian, an East Slavic language, with ≈15K instances in five tasks: sentiment analysis, linguistic acceptability, word in context, Winograd schema challenge, textual entailment. A systematic evaluation of BERT models and LLMs against this novel benchmark reveals that both types of models approach human-level performance on easier tasks, such as sentiment analysis, but there is a significant gap in performance between machine and human on a harder task — Winograd schema challenge. We find the optimal choice of model type to be task-specific: e.g. BERT models underperform on textual entailment task but are competitive for linguistic acceptability. We release the datasets (https://hf.co/datasets/maaxap/BelarusianGLUE) and evaluation code (https://github.com/maaxap/BelarusianGLUE).
pdf
bib
abs
A Survey on Foundation Language Models for Single-cell Biology
Fan Zhang
|
Hao Chen
|
Zhihong Zhu
|
Ziheng Zhang
|
Zhenxi Lin
|
Ziyue Qiao
|
Yefeng Zheng
|
Xian Wu
The recent advancements in language models have significantly catalyzed progress in computational biology. A growing body of research strives to construct unified foundation models for single-cell biology, with language models serving as the cornerstone. In this paper, we systematically review the developments in foundation language models designed specifically for single-cell biology. Our survey offers a thorough analysis of various incarnations of single-cell foundation language models, viewed through the lens of both pre-trained language models (PLMs) and large language models (LLMs). This includes an exploration of data tokenization strategies, pre-training/tuning paradigms, and downstream single-cell data analysis tasks. Additionally, we discuss the current challenges faced by these pioneering works and speculate on future research directions. Overall, this survey provides a comprehensive overview of the existing single-cell foundation language models, paving the way for future research endeavors.
pdf
bib
abs
RuleArena: A Benchmark for Rule-Guided Reasoning with LLMs in Real-World Scenarios
Ruiwen Zhou
|
Wenyue Hua
|
Liangming Pan
|
Sitao Cheng
|
Xiaobao Wu
|
En Yu
|
William Yang Wang
This paper introduces RuleArena, a novel and challenging benchmark designed to evaluate the ability of large language models (LLMs) to follow complex, real-world rules in reasoning. Covering three practical domains – airline baggage fees, NBA transactions, and tax regulations – RuleArena assesses LLMs’ proficiency in handling intricate natural language instructions that demand long-context understanding, logical reasoning, and accurate mathematical computation. Two key attributes distinguish RuleArena from traditional rule-based reasoning benchmarks: (1) it extends beyond standard first-order logic representations, and (2) it is grounded in authentic, practical scenarios, providing insights into the suitability and reliability of LLMs for real-world applications. Our findings reveal several notable limitations in LLMs: (1) they struggle to identify and apply the appropriate rules, frequently becoming confused by similar but distinct regulations, (2) they cannot consistently perform accurate mathematical computations, even when they correctly identify the relevant rules, and (3) in general, they perform poorly in the benchmark. We also observe a significant performance boost when LLMs are provided with external tools for oracle math and logic operations. These results highlight significant challenges and promising research directions in advancing LLMs’ rule-guided reasoning capabilities in real-life applications. Our code and data are publicly available at https://github.com/skyriver-2000/rulearena.
pdf
bib
abs
Extending LLM Context Window with Adaptive Grouped Positional Encoding: A Training-Free Method
Xinhao Xu
|
Jiaxin Li
|
Hui Chen
|
Zijia Lin
|
Jungong Han
|
Guiguang Ding
Processing long input remains a significant challenge for large language models (LLMs) due to the scarcity of large-scale long-context training data and the high computational cost of training models for extended context windows. In this paper, we propose **Ada**ptive **Gro**uped **P**ositional **E**ncoding (AdaGroPE), a training-free, plug-and-play method to enhance long-context understanding in existing LLMs. AdaGroPE progressively increases the reuse count of relative positions as the distance grows and dynamically adapts the positional encoding mapping to sequence length, thereby fully exploiting the range of pre-trained position embeddings. Its design is consistent with the principles of rotary position embedding (RoPE) and aligns with human perception of relative distance, enabling robust performance in real-world settings with variable-length inputs. Extensive experiments across various benchmarks demonstrate that our AdaGroPE consistently achieves state-of-the-art performance, surpassing baseline methods and even outperforming LLMs inherently designed for long-context processing on certain tasks.
pdf
bib
abs
Semantic Exploration with Adaptive Gating for Efficient Problem Solving with Language Models
Sungjae Lee
|
Hyejin Park
|
Jaechang Kim
|
Jungseul Ok
Recent advancements in large language models (LLMs) have shown remarkable potential in various complex tasks requiring multi-step reasoning methods like tree search to explore diverse reasoning paths. However, existing methods often suffer from computational inefficiency and redundancy. First, they overlook the diversity of task difficulties, leading to unnecessarily extensive searches even for easy tasks. Second, they neglect the semantics of reasoning paths, resulting in redundant exploration of semantically identical paths. To address these limitations, we propose Semantic Exploration with Adaptive Gating (SEAG), a computationally efficient method. SEAG employs an adaptive gating mechanism that dynamically decides whether to conduct a tree search, based on the confidence level of answers from a preceding simple reasoning method. Furthermore, its tree-based exploration consolidates semantically identical reasoning steps, reducing redundant explorations while maintaining or even improving accuracy. Our extensive experiments demonstrate that SEAG significantly improves accuracy by 4.3% on average while requiring only 31% of computational costs compared to existing tree search-based methods on complex reasoning benchmarks including GSM8K and ARC with diverse language models such as Llama2, Llama3, and Mistral. Our code is available at https://github.com/ml-postech/SEAG-semantic-exploration-with-adaptive-gating.
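The adaptive gating idea in the abstract above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how confidence is measured, so the sketch assumes it is the agreement rate among a few sampled answers, and `sample_answer`, `tree_search`, `k`, and `threshold` are hypothetical names and values chosen for illustration.

```python
from collections import Counter
from typing import Callable, List, Tuple


def answer_confidence(answers: List[str]) -> Tuple[str, float]:
    """Return the most frequent answer and the fraction of samples agreeing with it."""
    counts = Counter(answers)
    top_answer, top_count = counts.most_common(1)[0]
    return top_answer, top_count / len(answers)


def gated_solve(
    question: str,
    sample_answer: Callable[[str], str],  # hypothetical cheap reasoner (e.g. one sampled chain)
    tree_search: Callable[[str], str],    # hypothetical expensive tree-search solver
    k: int = 5,                           # assumed number of cheap samples
    threshold: float = 0.8,               # assumed confidence cut-off
) -> str:
    """Run the cheap reasoner k times; fall back to tree search only when agreement is low."""
    answers = [sample_answer(question) for _ in range(k)]
    top_answer, confidence = answer_confidence(answers)
    if confidence >= threshold:
        return top_answer          # easy question: skip the expensive search
    return tree_search(question)   # hard question: spend the extra compute
```

With a gate of this kind, questions on which the cheap samples agree never trigger the search, which is one way the compute saving described in the abstract could arise.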
pdf
bib
abs
HotelMatch-LLM: Joint Multi-Task Training of Small and Large Language Models for Efficient Multimodal Hotel Retrieval
Arian Askari
|
Emmanouil Stergiadis
|
Ilya Gusev
|
Moran Beladev
We present HotelMatch-LLM, a multimodal dense retrieval model for the travel domain that enables natural language property search, addressing the limitations of traditional travel search engines, which require users to start with a destination and edit search parameters. HotelMatch-LLM features three key innovations: (1) Domain-specific multi-task optimization with three novel retrieval, visual, and language modeling objectives; (2) Asymmetrical dense retrieval architecture combining a small language model (SLM) for efficient online query processing and a large language model (LLM) for embedding hotel data; and (3) Extensive image processing to handle all property image galleries. Experiments on four diverse test sets show HotelMatch-LLM significantly outperforms state-of-the-art models, including VISTA and MARVEL. Specifically, on the test set (main query type), we achieve 0.681 for HotelMatch-LLM compared to 0.603 for the most effective baseline, MARVEL. Our analysis highlights the impact of our multi-task optimization, the generalizability of HotelMatch-LLM across LLM architectures, and its scalability for processing large image galleries.
pdf
bib
abs
Can Multimodal Large Language Models Understand Spatial Relations?
Jingping Liu
|
Ziyan Liu
|
Zhedong Cen
|
Yan Zhou
|
Yinan Zou
|
Weiyan Zhang
|
Haiyun Jiang
|
Tong Ruan
Spatial relation reasoning is a crucial task for multimodal large language models (MLLMs) to understand the objective world. However, current benchmarks have issues like relying on bounding boxes, ignoring perspective substitutions, or allowing questions to be answered using only the model’s prior knowledge without image understanding. To address these issues, we introduce SpatialMQA, a human-annotated spatial relation reasoning benchmark based on COCO2017, which enables MLLMs to focus more on understanding images in the objective world. To ensure data quality, we design a well-tailored annotation procedure, resulting in SpatialMQA consisting of 5,392 samples. Based on this benchmark, a series of closed- and open-source MLLMs are implemented and the results indicate that the current state-of-the-art MLLM achieves only 48.14% accuracy, far below the human-level accuracy of 98.40%. Extensive experimental analyses are also conducted, suggesting future research directions. The benchmark and code are available at https://huggingface.co/datasets/liuziyan/SpatialMQA.
pdf
bib
abs
S3 - Semantic Signal Separation
Márton Kardos
|
Jan Kostkan
|
Kenneth Enevoldsen
|
Arnault-Quentin Vermillet
|
Kristoffer Nielbo
|
Roberta Rocca
Topic models are useful tools for discovering latent semantic structures in large textual corpora. Recent efforts have been oriented at incorporating contextual representations in topic modeling and have been shown to outperform classical topic models. These approaches are typically slow, volatile, and require heavy preprocessing for optimal results. We present Semantic Signal Separation (S3), a theory-driven topic modeling approach in neural embedding spaces. S3 conceptualizes topics as independent axes of semantic space and uncovers these by decomposing contextualized document embeddings using Independent Component Analysis. Our approach provides diverse and highly coherent topics, requires no preprocessing, and is demonstrated to be the fastest contextual topic model, being, on average, 4.5x faster than the runner-up BERTopic. We offer an implementation of S3, and all contextual baselines, in the Turftopic Python package.
pdf
bib
abs
TrimLLM: Progressive Layer Dropping for Domain-Specific LLMs
Lanxiang Hu
|
Tajana Rosing
|
Hao Zhang
Specializing large language models (LLMs) for local deployment in domain-specific use cases is necessary for strong performance while meeting latency and privacy constraints. However, conventional task-specific adaptation approaches do not show simultaneous memory saving and inference speedup at deployment time. Practical compression techniques like quantization and pruning require dedicated hardware or kernel support to achieve measured inference speedup. We develop TrimLLM based on the layer-wise specialization phenomenon we empirically observed and verified on contemporary LLMs. TrimLLM reduces the depth of LLMs via progressive layer dropping. We show it retains LLMs’ capacity in specific domains and achieves inference speedup irrespective of hardware and deep learning frameworks. We evaluated TrimLLM on LLMs of various sizes for inference; models adapted on medical, legal, and financial datasets all demonstrate 2.1–5.7× inference speedup on consumer GPUs and up to 3.1× speedup on A100 when compared to state-of-the-art model compression algorithms, with no loss in accuracy at 50–60% model compression ratio.
pdf
bib
abs
JuStRank: Benchmarking LLM Judges for System Ranking
Ariel Gera
|
Odellia Boni
|
Yotam Perlitz
|
Roy Bar-Haim
|
Lilach Eden
|
Asaf Yehudai
Given the rapid progress of generative AI, there is a pressing need to systematically compare and choose between the numerous models and configurations available. The scale and versatility of such evaluations make the use of LLM-based judges a compelling solution for this challenge. Crucially, this approach requires first validating the quality of the LLM judge itself. Previous work has focused on instance-based assessment of LLM judges, where a judge is evaluated over a set of responses, or response pairs, while being agnostic to their source systems. We argue that this setting overlooks critical factors affecting system-level ranking, such as a judge’s positive or negative bias towards certain systems. To address this gap, we conduct the first large-scale study of LLM judges as system rankers. System scores are generated by aggregating judgment scores over multiple system outputs, and the judge’s quality is assessed by comparing the resulting system ranking to a human-based ranking. Beyond overall judge assessment, our analysis provides a fine-grained characterization of judge behavior, including their decisiveness and bias.
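The system-level evaluation described in the abstract above (aggregate a judge's scores per system, then compare the induced ranking with a human ranking) might be sketched as follows. The abstract specifies neither the aggregation nor the agreement measure, so this sketch assumes mean aggregation and Kendall's tau (the tau-a variant); the function names and the toy data are hypothetical.

```python
from itertools import combinations
from statistics import mean
from typing import Dict, List


def system_scores(judgments: Dict[str, List[float]]) -> Dict[str, float]:
    """Aggregate per-response judge scores into one score per system (assumed: mean)."""
    return {system: mean(scores) for system, scores in judgments.items()}


def kendall_tau(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Kendall's tau-a between two score assignments over the same systems.

    Counts system pairs ordered the same way (concordant) versus oppositely
    (discordant); tied pairs count as neither. Needs at least two systems.
    """
    systems = sorted(a)
    concordant = discordant = 0
    for x, y in combinations(systems, 2):
        direction = (a[x] - a[y]) * (b[x] - b[y])
        if direction > 0:
            concordant += 1
        elif direction < 0:
            discordant += 1
    total_pairs = len(systems) * (len(systems) - 1) // 2
    return (concordant - discordant) / total_pairs


# Toy data: judge scores per system output, and a human-based system score.
judge = {"A": [0.9, 0.8], "B": [0.5, 0.6], "C": [0.7, 0.2]}
human = {"A": 3.0, "B": 1.0, "C": 2.0}
agreement = kendall_tau(system_scores(judge), human)
```

A value near 1 means the judge orders systems as humans do; a value near -1 means it reverses the human ordering.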
pdf
bib
abs
Generating Diverse Training Samples for Relation Extraction with Large Language Models
Zexuan Li
|
Hongliang Dai
|
Piji Li
Using Large Language Models (LLMs) to generate training data can potentially be a preferable way to improve zero- or few-shot NLP tasks. However, many problems remain to be investigated for this direction. For the task of Relation Extraction (RE), we find that samples generated by directly prompting LLMs may easily have high structural similarities with each other. They tend to use a limited variety of phrasing while expressing the relation between a pair of entities. Therefore, in this paper, we study how to effectively improve the diversity of the training samples generated with LLMs for RE, while also maintaining their correctness. We first try to make the LLMs produce dissimilar samples by directly giving instructions in In-Context Learning (ICL) prompts. Then, we propose an approach to fine-tune LLMs for diverse training sample generation through Direct Preference Optimization (DPO). Our experiments on commonly used RE datasets show that both attempts can improve the quality of the generated training data. We also find that, compared with directly performing RE with an LLM, training a non-LLM RE model with its generated samples may lead to better performance.
pdf
bib
abs
MultiSocial: Multilingual Benchmark of Machine-Generated Text Detection of Social-Media Texts
Dominik Macko
|
Jakub Kopál
|
Robert Moro
|
Ivan Srba
Recent LLMs are able to generate high-quality multilingual texts that humans cannot distinguish from authentic human-written ones. Research in machine-generated text detection is, however, mostly focused on the English language and longer texts, such as news articles, scientific papers or student essays. Social-media texts are usually much shorter and often feature informal language, grammatical errors, or distinct linguistic items (e.g., emoticons, hashtags). There is a gap in studying the ability of existing methods to detect such texts, reflected also in the lack of existing multilingual benchmark datasets. To fill this gap, we propose the first multilingual (22 languages) and multi-platform (5 social media platforms) dataset for benchmarking machine-generated text detection in the social-media domain, called MultiSocial. It contains 472,097 texts, of which about 58k are human-written and approximately the same amount is generated by each of 7 multilingual LLMs. We use this benchmark to compare existing detection methods in zero-shot as well as fine-tuned form. Our results indicate that the fine-tuned detectors have no difficulty being trained on social-media texts and that the platform selection for training matters.
pdf
bib
abs
Efficient and Accurate Prompt Optimization: the Benefit of Memory in Exemplar-Guided Reflection
Cilin Yan
|
Jingyun Wang
|
Lin Zhang
|
Ruihui Zhao
|
Xiaopu Wu
|
Kai Xiong
|
Qingsong Liu
|
Guoliang Kang
|
Yangyang Kang
Automatic prompt engineering aims to enhance the generation quality of large language models (LLMs). Recent works utilize feedback generated from erroneous cases to guide the prompt optimization. During inference, they may further retrieve several semantically related exemplars and concatenate them to the optimized prompts to improve the performance. However, those works only utilize the feedback at the current step, ignoring historical and unselected feedback that is potentially beneficial. Moreover, the selection of exemplars only considers the general semantic relationship and may not be optimal in terms of task performance and matching with the optimized prompt. In this work, we propose an Exemplar-Guided Reflection with Memory mechanism (ERM) to realize more efficient and accurate prompt optimization. Specifically, we design an exemplar-guided reflection mechanism where the feedback generation is additionally guided by the generated exemplars. We further build two kinds of memory to fully utilize the historical feedback information and support more effective exemplar retrieval. Empirical evaluations show our method surpasses previous state-of-the-art methods with fewer optimization steps, i.e., improving F1 score by 10.1 on LIAR dataset, and reducing half of the optimization steps on ProTeGi.
pdf
bib
abs
Evaluation of LLM Vulnerabilities to Being Misused for Personalized Disinformation Generation
Aneta Zugecova
|
Dominik Macko
|
Ivan Srba
|
Robert Moro
|
Jakub Kopál
|
Katarína Marcinčinová
|
Matúš Mesarčík
The capabilities of recent large language models (LLMs) to generate high-quality content that humans cannot distinguish from human-written texts raise many concerns regarding their misuse. Previous research has shown that LLMs can be effectively misused for generating disinformation news articles following predefined narratives. Their capabilities to generate personalized (in various aspects) content have also been evaluated and mostly found usable. However, a combination of personalization and disinformation abilities of LLMs has not been comprehensively studied yet. Such a dangerous combination should trigger the integrated safety filters of the LLMs, if any are present. This study fills this gap by evaluating vulnerabilities of recent open and closed LLMs, and their willingness to generate personalized disinformation news articles in English. We further explore whether the LLMs can reliably meta-evaluate the personalization quality and whether the personalization affects the detectability of the generated texts. Our results demonstrate the need for stronger safety filters and disclaimers, as those are not properly functioning in most of the evaluated LLMs. Additionally, our study revealed that the personalization actually reduces the safety-filter activations, thus effectively functioning as a jailbreak. Such behavior must be urgently addressed by LLM developers and service providers.
pdf
bib
abs
EscapeBench: Towards Advancing Creative Intelligence of Language Model Agents
Cheng Qian
|
Peixuan Han
|
Qinyu Luo
|
Bingxiang He
|
Xiusi Chen
|
Yuji Zhang
|
Hongyi Du
|
Jiarui Yao
|
Xiaocheng Yang
|
Denghui Zhang
|
Yunzhu Li
|
Heng Ji
Language model agents excel in long-session planning and reasoning, but existing benchmarks primarily focus on goal-oriented tasks with explicit objectives, neglecting creative adaptation in unfamiliar environments. To address this, we introduce EscapeBench, a benchmark suite of room escape game environments designed to challenge agents with creative reasoning, unconventional tool use, and iterative problem-solving to uncover implicit goals. Our results show that current language models, despite employing working memory and Chain-of-Thought reasoning, achieve only 15% average progress without hints, highlighting their limitations in creativity. To bridge this gap, we propose EscapeAgent, a framework designed to enhance creative reasoning through Foresight (innovative tool use) and Reflection (identifying unsolved tasks). Experiments show that EscapeAgent can execute action chains over 1,000 steps while maintaining logical coherence. It navigates and completes games with up to 40% fewer steps and hints, performs robustly across difficulty levels, and achieves higher action success rates with more efficient and innovative puzzle-solving strategies.
pdf
bib
abs
BPP-Search: Enhancing Tree of Thought Reasoning for Mathematical Modeling Problem Solving
Teng Wang
|
Wing Yin Yu
|
Zhenqi He
|
Zehua Liu
|
Hailei Gong
|
Han Wu
|
Xiongwei Han
|
Wei Shi
|
Ruifeng She
|
Fangzhou Zhu
|
Tao Zhong
LLMs exhibit advanced reasoning capabilities, offering the potential to transform natural language questions into mathematical models. However, existing open-source datasets in operations research domain lack detailed annotations of the modeling process, such as variable definitions, focusing solely on objective values, which hinders reinforcement learning applications. To address this, we release the StructuredOR dataset, annotated with comprehensive labels that capture the complete mathematical modeling process. We further propose BPP-Search, an algorithm that integrates reinforcement learning into a tree-of-thought structure using Beam search, a Process reward model, and a pairwise Preference algorithm. This approach enables efficient exploration of tree structures, avoiding exhaustive search while improving accuracy. Extensive experiments on StructuredOR, NL4OPT, and MAMO-ComplexLP datasets show that BPP-Search significantly outperforms state-of-the-art methods. In tree-based reasoning, BPP-Search excels in accuracy and efficiency, enabling faster retrieval of correct solutions. The StructuredOR dataset is available on Huggingface https://huggingface.co/datasets/LLM4OR/StructuredOR and GitHub https://github.com/LLM4OR/StructuredOR.
pdf
bib
abs
LACA: Improving Cross-lingual Aspect-Based Sentiment Analysis with LLM Data Augmentation
Jakub Šmíd
|
Pavel Priban
|
Pavel Kral
Cross-lingual aspect-based sentiment analysis (ABSA) involves detailed sentiment analysis in a target language by transferring knowledge from a source language with available annotated data. Most existing methods depend heavily on often unreliable translation tools to bridge the language gap. In this paper, we propose a new approach that leverages a large language model (LLM) to generate high-quality pseudo-labelled data in the target language without the need for translation tools. First, the framework trains an ABSA model to obtain predictions for unlabelled target language data. Next, an LLM is prompted to generate natural sentences that better represent these noisy predictions than the original text. The ABSA model is then further fine-tuned on the resulting pseudo-labelled dataset. We demonstrate the effectiveness of this method across six languages and five backbone models, surpassing previous state-of-the-art translation-based approaches. The proposed framework also supports generative models, and we show that fine-tuned LLMs outperform smaller multilingual models.
pdf
bib
abs
Fusing Highly Specialized Language Models for Comprehensive Expertise
Ning Ding
|
Yulin Chen
|
Ganqu Cui
|
Xingtai Lv
|
Weilin Zhao
|
Kaiyan Zhang
|
Ruobing Xie
|
Bowen Zhou
|
Zhiyuan Liu
|
Maosong Sun
Underlying data distributions of natural language, programming code, and mathematical symbols vary vastly, presenting a complex challenge for large language models (LLMs) that strive to achieve high performance across all three domains simultaneously. Achieving a very high level of proficiency for an LLM within a specific domain often requires extensive training with relevant corpora, which is typically accompanied by a sacrifice in performance in other domains. In this paper, we aim to “play the dealt cards well” and propose to fuse models that are already highly-specialized directly. The proposed fusing framework consists of distinct specialists that are already sufficiently trained on different domains (we mainly focus on language, coding, and mathematics in this paper). A token-level gating mechanism is introduced to blend the specialists’ outputs. A two-stage training strategy accompanied by balanced sampling is designed to ensure stability. To effectively train the fused model, we further construct a high-quality supervised instruction tuning dataset that includes text, code, and mathematical content. This dataset comprises approximately 300,000 instructions and covers a wide range of topics in each domain. Experiments show that our model could simultaneously achieve mastery of the three crucial domains.
pdf
bib
abs
HybGRAG: Hybrid Retrieval-Augmented Generation on Textual and Relational Knowledge Bases
Meng-Chieh Lee
|
Qi Zhu
|
Costas Mavromatis
|
Zhen Han
|
Soji Adeshina
|
Vassilis N. Ioannidis
|
Huzefa Rangwala
|
Christos Faloutsos
Given a semi-structured knowledge base (SKB), where text documents are interconnected by relations, how can we effectively retrieve relevant information to answer user questions? Retrieval-Augmented Generation (RAG) retrieves documents to assist large language models (LLMs) in question answering, while Graph RAG (GRAG) uses structured knowledge bases as its knowledge source. However, many questions require both textual and relational information from the SKB (referred to as “hybrid” questions), which complicates the retrieval process and underscores the need for a hybrid retrieval method that leverages both types of information. In this paper, through our empirical analysis, we identify key insights that show why existing methods may struggle with hybrid question answering (HQA) over SKB. Based on these insights, we propose HybGRAG for HQA, consisting of a retriever bank and a critic module, with the following advantages: 1. Agentic, it automatically refines the output by incorporating feedback from the critic module, 2. Adaptive, it solves hybrid questions requiring both textual and relational information with the retriever bank, 3. Interpretable, it justifies decision making with an intuitive refinement path, and 4. Effective, it surpasses all baselines on HQA benchmarks. In experiments on the STaRK benchmark, HybGRAG achieves significant performance gains, with an average relative improvement in Hit@1 of 51%.
pdf
bib
abs
Re-ranking Using Large Language Models for Mitigating Exposure to Harmful Content on Social Media Platforms
Rajvardhan Oak
|
Muhammad Haroon
|
Claire Wonjeong Jo
|
Magdalena Wojcieszak
|
Anshuman Chhabra
Social media platforms utilize Machine Learning (ML) and Artificial Intelligence (AI) powered recommendation algorithms to maximize user engagement, which can result in inadvertent exposure to harmful content. Current moderation efforts, reliant on classifiers trained with extensive human-annotated data, struggle with scalability and adapting to new forms of harm. To address these challenges, we propose a novel re-ranking approach using Large Language Models (LLMs) in zero-shot and few-shot settings. Our method dynamically assesses and re-ranks content sequences, effectively mitigating harmful content exposure without requiring extensive labeled data. Alongside traditional ranking metrics, we also introduce two new metrics to evaluate the effectiveness of re-ranking in reducing exposure to harmful content. Through experiments on three datasets, three models and across three configurations, we demonstrate that our LLM-based approach significantly outperforms existing proprietary moderation approaches, offering a scalable and adaptable solution for harm mitigation.
pdf
bib
abs
Aligning AI Research with the Needs of Clinical Coding Workflows: Eight Recommendations Based on US Data Analysis and Critical Review
Yidong Gan
|
Maciej Rybinski
|
Ben Hachey
|
Jonathan K. Kummerfeld
Clinical coding is crucial for healthcare billing and data analysis. Manual clinical coding is labour-intensive and error-prone, which has motivated research towards full automation of the process. However, our analysis, based on US English electronic health records and automated coding research using these records, shows that widely used evaluation methods are not aligned with real clinical contexts. For example, evaluations that focus on the top 50 most common codes are an oversimplification, as there are thousands of codes used in practice. This position paper aims to align AI coding research more closely with practical challenges of clinical coding. Based on our analysis, we offer eight specific recommendations, suggesting ways to improve current evaluation methods. Additionally, we propose new AI-based methods beyond automated coding, suggesting alternative approaches to assist clinical coders in their workflows.
pdf
bib
abs
MIND: A Multi-agent Framework for Zero-shot Harmful Meme Detection
Ziyan Liu
|
Chunxiao Fan
|
Haoran Lou
|
Yuexin Wu
|
Kaiwei Deng
The rapid expansion of memes on social media has highlighted the urgent need for effective approaches to detect harmful content. However, traditional data-driven approaches struggle to detect new memes due to their evolving nature and the lack of up-to-date annotated data. To address this issue, we propose MIND, a multi-agent framework for zero-shot harmful meme detection that does not rely on annotated data. MIND implements three key strategies: 1) We retrieve similar memes from an unannotated reference set to provide contextual information. 2) We propose a bi-directional insight derivation mechanism to extract a comprehensive understanding of similar memes. 3) We then employ a multi-agent debate mechanism to ensure robust decision-making through reasoned arbitration. Extensive experiments on three meme datasets demonstrate that our proposed framework not only outperforms existing zero-shot approaches but also shows strong generalization across different model architectures and parameter scales, providing a scalable solution for harmful meme detection.
pdf
bib
abs
EvoWiki: Evaluating LLMs on Evolving Knowledge
Wei Tang
|
Yixin Cao
|
Yang Deng
|
Jiahao Ying
|
Bo Wang
|
Yizhe Yang
|
Yuyue Zhao
|
Qi Zhang
|
Xuanjing Huang
|
Yu-Gang Jiang
|
Yong Liao
Knowledge utilization is a critical aspect of LLMs, and understanding how they adapt to evolving knowledge is essential for their effective deployment. However, existing benchmarks are predominantly static, failing to capture the evolving nature of LLMs and knowledge, leading to inaccuracies and vulnerabilities such as contamination. In this paper, we introduce EvoWiki, an evolving dataset designed to reflect knowledge evolution by categorizing information into stable, evolved, and uncharted states. EvoWiki is fully auto-updatable, enabling precise evaluation of continuously changing knowledge and newly released LLMs. Through experiments with Retrieval-Augmented Generation (RAG) and Continual Learning (CL), we evaluate how effectively LLMs adapt to evolving knowledge. Our results indicate that current models often struggle with evolved knowledge, frequently providing outdated or incorrect responses. Moreover, the dataset highlights a synergistic effect between RAG and CL, demonstrating their potential to better adapt to evolving knowledge. EvoWiki provides a robust benchmark for advancing future research on the knowledge evolution capabilities of large language models.
pdf
bib
abs
Rethinking Repetition Problems of LLMs in Code Generation
Yihong Dong
|
Yuchen Liu
|
Xue Jiang
|
Bin Gu
|
Zhi Jin
|
Ge Li
With the advent of neural language models, the performance of code generation has been significantly boosted. However, the problem of repetitions during the generation process continues to linger. Previous work has primarily focused on content repetition, which is merely a fraction of the broader repetition problem in code generation. A more prevalent and challenging problem is structural repetition. In structural repetition, the repeated code appears in various patterns but possesses a fixed structure, which can be inherently reflected in grammar. In this paper, we formally define structural repetition and propose an efficient decoding approach called RPG, which stands for Repetition Penalization based on Grammar, to alleviate the repetition problems in code generation for LLMs. Specifically, RPG first leverages grammar rules to identify repetition problems during code generation, and then strategically decays the likelihood of critical tokens that contribute to repetitions, thereby mitigating them in code generation. To facilitate this study, we construct a new dataset CodeRepetEval to comprehensively evaluate approaches for mitigating the repetition problems in code generation. Extensive experimental results demonstrate that RPG substantially outperforms the best-performing baselines on CodeRepetEval dataset as well as HumanEval and MBPP benchmarks, effectively reducing repetitions and enhancing the quality of generated code.
pdf
bib
abs
PunchBench: Benchmarking MLLMs in Multimodal Punchline Comprehension
Kun Ouyang
|
Yuanxin Liu
|
Shicheng Li
|
Yi Liu
|
Hao Zhou
|
Fandong Meng
|
Jie Zhou
|
Xu Sun
Multimodal punchlines, which involve humor or sarcasm conveyed in image-caption pairs, are a popular way of communication on online multimedia platforms. With the rapid development of multimodal large language models (MLLMs), it is essential to assess their ability to effectively comprehend these punchlines. However, existing benchmarks on punchline comprehension suffer from three major limitations: 1) language shortcuts that allow models to solely rely on text, 2) lack of question diversity, and 3) narrow focus on a specific domain of multimodal content (e.g., cartoon). To address these limitations, we introduce a multimodal **Punch**line comprehension **Bench**mark, named **PunchBench**, which is tailored for accurate and comprehensive evaluation of punchline comprehension. To enhance the evaluation accuracy, we generate synonymous and antonymous captions by modifying original captions, which mitigates the impact of shortcuts in the captions. To provide a comprehensive evaluation, PunchBench incorporates diverse question formats and image-captions from various domains. On this basis, we conduct extensive evaluations and reveal a significant gap between state-of-the-art MLLMs and humans in punchline comprehension. To improve punchline comprehension, we propose Simple-to-Complex Chain-of-Question (SC-CoQ) strategy, enabling the models to incrementally address complicated questions by first mastering simple ones. SC-CoQ effectively enhances the performance of various MLLMs on PunchBench, surpassing in-context learning and chain-of-thought.
pdf
bib
abs
ProcessBench: Identifying Process Errors in Mathematical Reasoning
Chujie Zheng
|
Zhenru Zhang
|
Beichen Zhang
|
Runji Lin
|
Keming Lu
|
Bowen Yu
|
Dayiheng Liu
|
Jingren Zhou
|
Junyang Lin
As language models regularly make mistakes when solving math problems, automated identification of errors in the reasoning process becomes increasingly significant for their scalable oversight. In this paper, we introduce ProcessBench for measuring the ability to identify erroneous steps in mathematical reasoning. It consists of 3,400 test cases, primarily focused on competition- and Olympiad-level math problems. Each test case contains a step-by-step solution with error location annotated by human experts. Models are required to identify the earliest step that contains an error, or conclude that all steps are correct. We conduct extensive evaluation on ProcessBench, involving two types of models: process reward models (PRMs) and critic models, where for the latter we prompt general language models to critique each solution step by step. We draw two main observations: (1) Existing PRMs typically fail to generalize to more challenging math problems beyond GSM8K and MATH. They underperform both critic models (i.e., prompted general language models) and our own trained PRM that is straightforwardly fine-tuned on the PRM800K dataset. (2) The best open-source model, QwQ-32B-Preview, demonstrates critique capability competitive with the proprietary model GPT-4o, although it still lags behind the reasoning-specialized o1-mini. We hope ProcessBench can foster future research in reasoning process assessment, paving the way toward scalable oversight of language models.
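The task format described above (predict the earliest erroneous step, or report that every step is correct) can be illustrated with a minimal scoring sketch. The sentinel `NO_ERROR`, the function name, and the use of plain exact-match accuracy are all assumptions made for illustration; the abstract does not state how predictions are scored.

```python
NO_ERROR = -1  # hypothetical sentinel meaning "all steps are correct"


def exact_match_accuracy(predicted, gold):
    """Fraction of test cases whose predicted step index equals the gold index.

    Each entry is the 0-based index of the earliest erroneous step,
    or NO_ERROR when the solution contains no mistake.
    """
    if len(predicted) != len(gold) or not gold:
        raise ValueError("need two non-empty lists of equal length")
    hits = sum(1 for p, g in zip(predicted, gold) if p == g)
    return hits / len(gold)


# Toy data: four solutions, one of which is fully correct.
gold = [2, NO_ERROR, 0, 5]
predicted = [2, NO_ERROR, 1, 5]  # the third prediction points at the wrong step
accuracy = exact_match_accuracy(predicted, gold)
```

Under this sketch, a prediction that flags a later step than the first real error counts as wrong, which matches the "earliest step" requirement in the abstract.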
pdf
bib
abs
Model Extrapolation Expedites Alignment
Chujie Zheng
|
Ziqi Wang
|
Heng Ji
|
Minlie Huang
|
Nanyun Peng
Given the high computational cost of preference alignment training of large language models (LLMs), exploring efficient methods to reduce the training overhead remains an important and compelling research problem. Motivated by the observation that alignment training typically involves only small parameter changes without injecting new knowledge into models, we propose a straightforward method called ExPO (model extrapolation) to expedite LLMs’ alignment with human preferences. Given a partially-trained model and its initial SFT checkpoint, ExPO improves the implicit optimization objective of alignment training by simply amplifying the parameter change based on a first-order approximation, without any additional training overhead. Through controlled experiments, we demonstrate that ExPO boosts a DPO model trained with only 20% steps to outperform the fully-trained one. Moreover, we show that ExPO notably improves existing open-source LLMs (ranging from 1.8B to 70B parameters) on the leading AlpacaEval 2.0 and MT-Bench benchmarks, which highlights ExPO’s broader utility in efficiently enhancing LLM alignment.
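One way to read "amplifying the parameter change" is a linear extrapolation from the SFT checkpoint through the partially trained checkpoint. The sketch below assumes that reading; the function name, the `alpha` hyperparameter, and the use of plain dictionaries of floats in place of real model weights are illustrative choices, not details taken from the abstract.

```python
def extrapolate(sft, aligned, alpha):
    """Return aligned + alpha * (aligned - sft) for every named parameter.

    sft     : parameters of the initial SFT checkpoint
    aligned : parameters of the (partially) alignment-trained checkpoint
    alpha   : assumed extrapolation strength; alpha = 0 returns `aligned` unchanged
    """
    if sft.keys() != aligned.keys():
        raise ValueError("checkpoints must share the same parameter names")
    return {
        name: aligned[name] + alpha * (aligned[name] - sft[name])
        for name in sft
    }


# Toy checkpoints with scalar "parameters".
sft_ckpt = {"w": 1.0, "b": 0.0}
partial_ckpt = {"w": 1.5, "b": -0.5}
extrapolated = extrapolate(sft_ckpt, partial_ckpt, alpha=0.5)
```

No gradient step is involved: the operation only combines two existing checkpoints, which is consistent with the abstract's claim of no additional training overhead.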
pdf
bib
abs
ATLANTIS: Weak-to-Strong Learning via Importance Sampling
Yi Liu
|
Guoyin Wang
|
Shicheng Li
|
Feifan Song
|
Xu Sun
Supervised fine-tuning (SFT) enables large language models to align with training data for better performance in many aspects. Nevertheless, the gap between the distribution of current datasets from human annotations or model generations and the real-world data distribution heavily limits the capacities and potentials of models. As a result, we propose a new SFT technique, ATLANTIS, to bridge the gap. We adopt importance sampling to estimate the optimal data distribution in the real world from existing training datasets because the former is hard to sample from. Furthermore, we introduce an extra small model and reference model to estimate the sampling ratio through the probability gap between them. We evaluate our method with benchmarks in knowledge & understanding and preference aspects. The experimental results show that ATLANTIS can bring consistent and significant improvements to models’ performance. Moreover, our method can be flexibly transferred among models with different structures. Our analyses demonstrate that our method is compatible with other SFT techniques to further enhance models’ capacities and has great potential to be combined with existing training frameworks.
pdf
bib
abs
MPVStance: Mitigating Hallucinations in Stance Detection with Multi-Perspective Verification
ZhaoDan Zhang
|
Zhao Zhang
|
Jin Zhang
|
Hui Xu
|
Xueqi Cheng
Stance detection is a pivotal task in Natural Language Processing (NLP), identifying textual attitudes toward various targets. Despite advances in using Large Language Models (LLMs), challenges persist due to hallucination, in which models generate plausible yet inaccurate content. Addressing these challenges, we introduce MPVStance, a framework that incorporates Multi-Perspective Verification (MPV) with Retrieval-Augmented Generation (RAG) across a structured five-step verification process. Our method enhances stance detection by rigorously validating each response from factual accuracy, logical consistency, contextual relevance, and other perspectives. Extensive testing on the SemEval-2016 and VAST datasets, including scenarios that challenge existing methods and comprehensive ablation studies, demonstrates that MPVStance significantly outperforms current models. It effectively mitigates hallucination issues and sets new benchmarks for reliability and accuracy in stance detection, particularly in zero-shot, few-shot, and challenging scenarios.
pdf
bib
abs
Personality-Guided Code Generation Using Large Language Models
Yaoqi Guo
|
Zhenpeng Chen
|
Jie M. Zhang
|
Yang Liu
|
Yun Ma
Code generation, the automatic creation of source code from natural language descriptions, has garnered significant attention due to its potential to streamline software development. Inspired by research that links task-personality alignment with improved development outcomes, we conduct an empirical study on personality-guided code generation using large language models (LLMs). Specifically, we investigate how emulating personality traits appropriate to the coding tasks affects LLM performance. We extensively evaluate this approach using seven widely adopted LLMs across four representative datasets. Our results show that personality guidance significantly enhances code generation accuracy, with improved pass rates in 23 out of 28 LLM-dataset combinations. Notably, in 11 cases, the improvement exceeds 5%, and in 5 instances, it surpasses 10%, with the highest gain reaching 12.9%. Additionally, personality guidance can be easily integrated with other prompting strategies to further boost performance.
pdf
bib
abs
PsyDT: Using LLMs to Construct the Digital Twin of Psychological Counselor with Personalized Counseling Style for Psychological Counseling
Haojie Xie
|
Yirong Chen
|
Xiaofen Xing
|
Jingkai Lin
|
Xiangmin Xu
Currently, large language models (LLMs) have made significant progress in the field of psychological counseling. However, existing mental health LLMs overlook a critical issue: different psychological counselors exhibit different personal styles, including linguistic style and therapy techniques. As a result, these LLMs fail to satisfy the individual needs of clients who seek different counseling styles. To help bridge this gap, we propose PsyDT, a novel framework using LLMs to construct the Digital Twin of a Psychological counselor with a personalized counseling style. Compared to the time-consuming and costly approach of collecting a large number of real-world counseling cases to create a specific counselor’s digital twin, our framework offers a faster and more cost-effective solution. To construct PsyDT, we utilize dynamic one-shot learning by using GPT-4 to capture a counselor’s unique counseling style, mainly focusing on linguistic style and therapy techniques. Subsequently, using existing single-turn long-text dialogues with clients’ questions, GPT-4 is guided to synthesize multi-turn dialogues of a specific counselor. Finally, we fine-tune the LLMs on the synthetic dataset, PsyDTCorpus, to achieve the digital twin of a psychological counselor with a personalized counseling style. Experimental results indicate that our proposed PsyDT framework can synthesize multi-turn dialogues that closely resemble real-world counseling cases and demonstrate better performance compared to other baselines, thereby showing that our framework can effectively construct the digital twin of a psychological counselor with a specific counseling style.
pdf
bib
abs
BIPro: Zero-shot Chinese Poem Generation via Block Inverse Prompting Constrained Generation Framework
Xu Zou
Recently, generative pre-trained models have made significant strides, particularly highlighted by the release of ChatGPT and GPT-4, which exhibit superior cross-domain capabilities. However, these models still face challenges on constrained writing tasks like poem generation under open-domain titles via direct generation. In response to this challenge, we introduce the Block Inverse Prompting (BIPro) constrained generation framework. BIPro leverages two block inverse prompting methods, revise and rewrite. This inference scaling approach mimics the process of human text writing using block generative models. It significantly improves the zero-shot generation quality on the constrained generation task of open-domain traditional-form Chinese poem generation. Based on a less powerful block generative model GLM-10B-Chinese, poems composed via BIPro without priming or additional training outperform both much larger direct generative systems like GPT-4 or GLM-4 and domain-specific systems such as Yusheng, Shisanbai, or Baidu Poetry Helper in human evaluation by proficient poets. BIPro considerably narrows the gap between AI-generated works and short-listed human literary arts in another human evaluation, unveiling the promising potential of inference scaling in improving the quality of constrained generation. It is open-sourced and available as an agent in the chatglm app.
pdf
bib
abs
LongDocURL: a Comprehensive Multimodal Long Document Benchmark Integrating Understanding, Reasoning, and Locating
Chao Deng
|
Jiale Yuan
|
Pi Bu
|
Peijie Wang
|
Zhong-Zhi Li
|
Jian Xu
|
Xiao-Hui Li
|
Yuan Gao
|
Jun Song
|
Bo Zheng
|
Cheng-Lin Liu
Large vision language models (LVLMs) have improved document understanding capabilities remarkably, enabling the handling of complex document elements, longer contexts, and a wider range of tasks. However, existing document understanding benchmarks have been limited to handling only a small number of pages and fail to provide a comprehensive analysis of layout elements locating. In this paper, we first define three primary task categories: Long Document Understanding, numerical Reasoning, and cross-element Locating, and then propose a comprehensive benchmark, LongDocURL, integrating the above three primary tasks and comprising 20 sub-tasks categorized based on different primary tasks and answer evidences. Furthermore, we develop a semi-automated construction pipeline and collect 2,325 high-quality question-answering pairs, covering more than 33,000 pages of documents, significantly exceeding the scale of existing benchmarks. Subsequently, we conduct comprehensive evaluation experiments on both open-source and closed-source models across 26 different configurations, revealing critical performance gaps in this field. The code and data: https://github.com/dengc2023/LongDocURL.
pdf
bib
abs
ObfusLM: Privacy-preserving Language Model Service against Embedding Inversion Attacks
Yu Lin
|
Ruining Yang
|
Yunlong Mao
|
Qizhi Zhang
|
Jue Hong
|
Quanwei Cai
|
Ye Wu
|
Huiqi Liu
|
Zhiyu Chen
|
Bing Duan
|
Sheng Zhong
With the rapid expansion of Machine Learning as a Service (MLaaS) for language models, concerns over the privacy of client inputs during inference or fine-tuning have correspondingly escalated. Recently, solutions have been proposed to safeguard client privacy by obfuscation techniques. However, these solutions incur a notable decline in model utility and mainly focus on classification tasks, rendering them impractical for real-world applications. Moreover, recent studies reveal that these obfuscation techniques, if not well designed, are susceptible to embedding inversion attacks (EIAs). In this paper, we devise ObfusLM, a privacy-preserving MLaaS framework for both classification and generation tasks. ObfusLM leverages a model obfuscation module to achieve privacy protection for both classification and generation tasks. Based on (k, 𝜖)-anonymity, ObfusLM includes novel obfuscation algorithms to reach provable security against EIAs. Extensive experiments show that ObfusLM outperforms existing works in utility by 10% with a nearly 80% resistance rate against EIAs.
pdf
bib
abs
Interlocking-free Selective Rationalization Through Genetic-based Learning
Federico Ruggeri
|
Gaetano Signorelli
A popular end-to-end architecture for selective rationalization is the select-then-predict pipeline, comprising a generator to extract highlights fed to a predictor. Such a cooperative system suffers from suboptimal equilibrium minima due to the dominance of one of the two modules, a phenomenon known as interlocking. While several contributions have aimed at addressing interlocking, they only mitigate its effect, often by introducing feature-based heuristics, sampling, and ad-hoc regularizations. We present GenSPP, the first interlocking-free architecture for selective rationalization that does not require any of the above-mentioned learning overhead. GenSPP avoids interlocking by performing disjoint training of the generator and predictor via genetic global search. Experiments on a synthetic and a real-world benchmark show that our model outperforms several state-of-the-art competitors.
pdf
bib
abs
Re-identification of De-identified Documents with Autoregressive Infilling
Lucas Georges Gabriel Charpentier
|
Pierre Lison
Documents revealing sensitive information about individuals must typically be de-identified. This de-identification is often done by masking all mentions of personally identifiable information (PII), thereby making it more difficult to uncover the identity of the person(s) in question. To investigate the robustness of de-identification methods, we present a novel, RAG-inspired approach that attempts the reverse process of re-identification based on a database of documents representing background knowledge. Given a text in which personal identifiers have been masked, the re-identification proceeds in two steps. A retriever first selects from the background knowledge passages deemed relevant for the re-identification. Those passages are then provided to an infilling model which seeks to infer the original content of each text span. This process is repeated until all masked spans are replaced. We evaluate the re-identification on three datasets (Wikipedia biographies, court rulings and clinical notes). Results show that (1) as many as 80% of de-identified text spans can be successfully recovered and (2) the re-identification accuracy increases along with the level of background knowledge.
pdf
bib
abs
Modeling Uncertainty in Composed Image Retrieval via Probabilistic Embeddings
Haomiao Tang
|
Jinpeng Wang
|
Yuang Peng
|
GuangHao Meng
|
Ruisheng Luo
|
Bin Chen
|
Long Chen
|
Yaowei Wang
|
Shu-Tao Xia
Composed Image Retrieval (CIR) enables users to search for images using multimodal queries that combine text and reference images. While metric learning methods have shown promise, they rely on deterministic point embeddings that fail to capture the inherent uncertainty in the input data, in which user intentions may be imprecisely specified or open to multiple interpretations. We address this challenge by reformulating CIR through our proposed Composed Probabilistic Embedding (CoPE) framework, which represents both queries and targets as Gaussian distributions in latent space rather than fixed points. Through careful design of probabilistic distance metrics and hierarchical learning objectives, CoPE explicitly captures uncertainty at both instance and feature levels, enabling more flexible, nuanced, and robust matching that can handle polysemy and ambiguity in search intentions. Extensive experiments across multiple benchmarks demonstrate that CoPE effectively quantifies both quality and semantic uncertainties within Composed Image Retrieval, achieving state-of-the-art performance on recall rate. Code: https://github.com/tanghme0w/ACL25-CoPE.
pdf
bib
abs
Untie the Knots: An Efficient Data Augmentation Strategy for Long-Context Pre-Training in Language Models
Junfeng Tian
|
Da Zheng
|
Yang Chen
|
Rui Wang
|
Colin Zhang
|
Debing Zhang
Large language models (LLMs) have prioritized expanding the context window from which models can incorporate more information. However, training models to handle long contexts presents significant challenges. These include the scarcity of high-quality natural long-context data, the potential for performance degradation on short-context tasks, and the reduced training efficiency associated with attention mechanisms. In this paper, we introduce Untie the Knots (UtK), a novel data augmentation strategy employed during the continued pre-training phase, designed to efficiently enable LLMs to gain long-context capabilities without the need to modify the existing data mixture. In particular, we chunk the documents, shuffle the chunks, and create a complex and knotted structure of long texts; LLMs are then trained to untie these knots and identify relevant segments within seemingly chaotic token sequences. This approach greatly improves the model’s performance by accurately attending to relevant information in long context, and the training efficiency is also largely increased. We conduct extensive experiments on models with 7B and 72B parameters, trained on 20 billion tokens, demonstrating that UtK achieves 75% and 84.5% accuracy on RULER at 128K context length, significantly outperforming other long-context strategies. The trained models will be open-sourced for further research.
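The chunk-and-shuffle idea in this abstract can be sketched in a few lines. Everything here is a simplified assumption for illustration: fixed-size chunks, a uniform shuffle across all documents, and the helper names `tie_knots` and `untie_knots` are hypothetical and are not taken from the paper, whose actual procedure may differ.

```python
import random


def chunk(tokens, size):
    """Split a token list into consecutive pieces of at most `size` tokens."""
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]


def tie_knots(documents, chunk_size, seed=0):
    """Chunk every document, then shuffle all chunks into one knotted sequence.

    Each piece keeps (doc_id, chunk_index) so the original order is recoverable.
    """
    pieces = [
        (doc_id, idx, part)
        for doc_id, tokens in enumerate(documents)
        for idx, part in enumerate(chunk(tokens, chunk_size))
    ]
    random.Random(seed).shuffle(pieces)  # seeded for reproducibility
    return pieces


def untie_knots(pieces, num_documents):
    """Reference solution: regroup chunks by document and restore their order."""
    restored = [[] for _ in range(num_documents)]
    for doc_id, _, part in sorted(pieces, key=lambda p: (p[0], p[1])):
        restored[doc_id].extend(part)
    return restored


# Two toy "documents" made of integer tokens.
docs = [list(range(10)), list(range(100, 107))]
knotted = tie_knots(docs, chunk_size=4)
```

In this toy version `untie_knots` plays the role of an oracle; in the setting the abstract describes, the model itself has to learn to locate the segments that belong together.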
pdf
bib
abs
APPL: A Prompt Programming Language for Harmonious Integration of Programs and Large Language Model Prompts
Honghua Dong
|
Qidong Su
|
Yubo Gao
|
Zhaoyu Li
|
Yangjun Ruan
|
Gennady Pekhimenko
|
Chris J. Maddison
|
Xujie Si
Large Language Models (LLMs) have become increasingly capable of handling diverse tasks with the aid of well-crafted prompts and integration of external tools, but as task complexity rises, the workflow involving LLMs can be complicated and thus challenging to implement and maintain. To address this challenge, we propose APPL, A Prompt Programming Language that acts as a bridge between computer programs and LLMs, allowing seamless embedding of prompts into Python functions, and vice versa. APPL provides an intuitive and Python-native syntax, an efficient parallelized runtime with asynchronous semantics, and a tracing module supporting effective failure diagnosis and replaying without extra costs. We demonstrate that APPL programs are intuitive, concise, and efficient through representative scenarios including Chain-of-Thought with self-consistency (CoT-SC) and ReAct tool-use agent. We further use LLMs to judge the language design between APPL and previous work, where the results indicate that codes written in APPL are more readable and intuitive. Our code, tutorial and documentation are available at https://github.com/appl-team/appl.
pdf
bib
abs
Evaluating Lexical Proficiency in Neural Language Models
Cristiano Ciaccio
|
Alessio Miaschi
|
Felice Dell’Orletta
We present a novel evaluation framework designed to assess the lexical proficiency and linguistic creativity of Transformer-based Language Models (LMs). We validate the framework by analyzing the performance of a set of LMs of different sizes, in both mono- and multilingual configuration, across tasks involving the generation, definition, and contextual usage of lexicalized words, neologisms, and nonce words. To support these evaluations, we developed a novel dataset of lexical entries for the Italian language, including curated definitions and usage examples sourced from various online platforms. The results highlight the robustness and effectiveness of our framework in evaluating multiple dimensions of LMs’ linguistic understanding and offer an insight, through the assessment of their linguistic creativity, on the lexical generalization abilities of LMs.
pdf
bib
abs
Autoregressive Speech Synthesis without Vector Quantization
Lingwei Meng
|
Long Zhou
|
Shujie Liu
|
Sanyuan Chen
|
Bing Han
|
Shujie Hu
|
Yanqing Liu
|
Jinyu Li
|
Sheng Zhao
|
Xixin Wu
|
Helen M. Meng
|
Furu Wei
We present MELLE, a novel continuous-valued token based language modeling approach for text-to-speech synthesis (TTS). MELLE autoregressively generates continuous mel-spectrogram frames directly from text condition, bypassing the need for vector quantization, which is typically designed for audio compression and sacrifices fidelity compared to continuous representations. Specifically, (i) instead of cross-entropy loss, we apply regression loss with a proposed spectrogram flux loss function to model the probability distribution of the continuous-valued tokens; (ii) we have incorporated variational inference into MELLE to facilitate sampling mechanisms, thereby enhancing the output diversity and model robustness. Experiments demonstrate that, compared to the two-stage codec language model VALL-E and its variants, the single-stage MELLE mitigates robustness issues by avoiding the inherent flaws of sampling vector-quantized codes, achieves superior performance across multiple metrics, and, most importantly, offers a more streamlined paradigm. The demos of our work are provided at https://aka.ms/melle.
pdf
bib
abs
Cuckoo: An IE Free Rider Hatched by Massive Nutrition in LLM’s Nest
Letian Peng
|
Zilong Wang
|
Feng Yao
|
Jingbo Shang
Massive high-quality data, both pre-training raw texts and post-training annotations, have been carefully prepared to incubate advanced large language models (LLMs). In contrast, for information extraction (IE), pre-training data, such as BIO-tagged sequences, are hard to scale up. We show that IE models can act as free riders on LLM resources by reframing next-token prediction into extraction for tokens already present in the context. Specifically, our proposed next tokens extraction (NTE) paradigm learns a versatile IE model, Cuckoo, with 102.6M extractive data converted from LLM’s pre-training and post-training data. Under the few-shot setting, Cuckoo adapts effectively to traditional and complex instruction-following IE with better performance than existing pre-trained IE models. As a free rider, Cuckoo can naturally evolve with the ongoing advancements in LLM data preparation, benefiting from improvements in LLM training pipelines without additional manual effort.
pdf
bib
abs
FedEx-LoRA: Exact Aggregation for Federated and Efficient Fine-Tuning of Large Language Models
Raghav Singhal
|
Kaustubh Ponkshe
|
Praneeth Vepakomma
Low-Rank Adaptation (LoRA) is a popular technique for efficient fine-tuning of foundation models. However, applying LoRA in federated learning environments, where data is distributed across multiple clients, presents unique challenges. Existing methods rely on traditional federated averaging of LoRA adapters, resulting in inexact updates. To address this, we propose Federated Exact LoRA, or FedEx-LoRA, which adds a residual error term to the pre-trained frozen weight matrix. Our approach achieves exact updates with minimal computational and communication overhead, preserving LoRA’s efficiency. We evaluate the method on various models across arithmetic reasoning, commonsense reasoning, natural language understanding and natural language generation tasks, showing consistent performance gains over state-of-the-art methods across multiple settings. Through extensive analysis, we quantify that the deviations in updates from the ideal solution are significant, highlighting the need for exact aggregation. Our method’s simplicity, efficiency, and broad applicability position it as a promising solution for accurate and effective federated fine-tuning of foundation models.
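A small numeric sketch shows why averaging LoRA factors separately is inexact, and how a residual folded into the frozen weight can restore the exact average. The residual used here, the mean of the per-client products minus the product of the averaged factors, is an assumption consistent with the abstract rather than a confirmed detail of the method; the helper names are illustrative, and pure-Python lists stand in for real weight matrices.

```python
def matmul(x, y):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]


def mean(mats):
    """Element-wise mean of equally shaped matrices."""
    rows, cols = len(mats[0]), len(mats[0][0])
    return [[sum(m[i][j] for m in mats) / len(mats) for j in range(cols)]
            for i in range(rows)]


def add(x, y):
    return [[a + b for a, b in zip(rx, ry)] for rx, ry in zip(x, y)]


def sub(x, y):
    return [[a - b for a, b in zip(rx, ry)] for rx, ry in zip(x, y)]


# Two clients with rank-1 adapters: each B is 2x1, each A is 1x2.
client_B = [[[1.0], [0.0]], [[0.0], [2.0]]]
client_A = [[[1.0, 0.0]], [[0.0, 1.0]]]

# Ideal update: average of the per-client products B_i A_i.
ideal = mean([matmul(b, a) for b, a in zip(client_B, client_A)])

# Naive federated averaging: average B and A separately, then multiply.
naive = matmul(mean(client_B), mean(client_A))

# Assumed residual, folded into the frozen weight W0.
residual = sub(ideal, naive)
W0 = [[0.0, 0.0], [0.0, 0.0]]
effective_update = add(add(W0, residual), naive)
```

Here `naive` differs from `ideal` because the mean of products is not the product of means, while `effective_update` matches `ideal` exactly.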
pdf
bib
abs
Measuring Social Biases in Masked Language Models by Proxy of Prediction Quality
Rahul Zalkikar
|
Kanchan Chandra
Innovative transformer-based language models produce contextually-aware token embeddings and have achieved state-of-the-art performance for a variety of natural language tasks, but have been shown to encode unwanted biases for downstream applications. In this paper, we evaluate the social biases encoded by transformers trained with the masked language modeling objective using proposed proxy functions within an iterative masking experiment to measure the quality of transformer models’ predictions and assess the preference of MLMs towards disadvantaged and advantaged groups. We find that all models encode concerning social biases. We compare bias estimations with those produced by other evaluation methods using benchmark datasets and assess their alignment with human annotated biases. We extend previous work by evaluating social biases introduced after retraining an MLM under the masked language modeling objective and find proposed measures produce more accurate and sensitive estimations of biases based on relative preference for biased sentences between models, while other methods tend to underestimate biases after retraining on sentences biased towards disadvantaged groups.
pdf
bib
abs
Capturing Author Self Beliefs in Social Media Language
Siddharth Mangalik
|
Adithya V Ganesan
|
Abigail B. Wheeler
|
Nicholas Kerry
|
Jeremy D. W. Clifton
|
H. Schwartz
|
Ryan L. Boyd
Measuring the prevalence and dimensions of self beliefs is essential for understanding human self-perception and various psychological outcomes. In this paper, we develop a novel task for classifying language that contains explicit or implicit mentions of the author’s self beliefs. We contribute a set of 2,000 human-annotated self beliefs, 100,000 LLM-labeled examples, and 10,000 surveyed self belief paragraphs. We then evaluate several encoder-based classifiers and training routines for this task. Our trained model, SelfAwareNet, achieved an AUC of 0.944, outperforming 0.839 from OpenAI’s state-of-the-art GPT-4o model. Using this model we derive data-driven categories of self beliefs and demonstrate their ability to predict valence, depression, anxiety, and stress. We release the resulting self belief classification model and annotated datasets for use in future research.
pdf
bib
abs
Neural Topic Modeling with Large Language Models in the Loop
Xiaohao Yang
|
He Zhao
|
Weijie Xu
|
Yuanyuan Qi
|
Jueqing Lu
|
Dinh Phung
|
Lan Du
Topic modeling is a fundamental task in natural language processing, allowing the discovery of latent thematic structures in text corpora. While Large Language Models (LLMs) have demonstrated promising capabilities in topic discovery, their direct application to topic modeling suffers from issues such as incomplete topic coverage, misalignment of topics, and inefficiency. To address these limitations, we propose LLM-ITL, a novel LLM-in-the-loop framework that integrates LLMs with Neural Topic Models (NTMs). In LLM-ITL, global topics and document representations are learned through the NTM. Meanwhile, an LLM refines these topics using an Optimal Transport (OT)-based alignment objective, where the refinement is dynamically adjusted based on the LLM’s confidence in suggesting topical words for each set of input words. With the flexibility of being integrated into many existing NTMs, the proposed approach enhances the interpretability of topics while preserving the efficiency of NTMs in learning topics and document representations. Extensive experiments demonstrate that LLM-ITL helps NTMs significantly improve their topic interpretability while maintaining the quality of document representation. Our code and datasets are available at https://github.com/Xiaohao-Yang/LLM-ITL
pdf
bib
abs
HALoGEN: Fantastic LLM Hallucinations and Where to Find Them
Abhilasha Ravichander
|
Shrusti Ghela
|
David Wadden
|
Yejin Choi
Despite their impressive ability to generate high-quality and fluent text, generative large language models (LLMs) also produce hallucinations: statements that are misaligned with established world knowledge or provided input context. However, measuring hallucination can be challenging, as having humans verify model generations on-the-fly is both expensive and time-consuming. In this work, we release HALoGEN, a comprehensive hallucination benchmark consisting of: (1) 10,923 prompts for generative models spanning nine domains including programming, scientific attribution, and summarization, and (2) automatic high-precision verifiers for each use case that decompose LLM generations into atomic units, and verify each unit against a high-quality knowledge source. We use this framework to evaluate ~150,000 generations from 14 language models, finding that even the best-performing models are riddled with hallucinations (sometimes up to 86% of generated atomic facts depending on the domain). We further define a novel error classification for LLM hallucinations based on whether they likely stem from incorrect recollection of training data (Type A errors), or incorrect knowledge in training data (Type B errors), or are fabrication (Type C errors). We hope our framework provides a foundation to enable the principled study of why generative models hallucinate, and advances the development of trustworthy large language models.
pdf
bib
abs
Synergizing LLMs with Global Label Propagation for Multimodal Fake News Detection
Shuguo Hu
|
Jun Hu
|
Huaiwen Zhang
Large Language Models (LLMs) can assist multimodal fake news detection by predicting pseudo labels. However, LLM-generated pseudo labels alone demonstrate poor performance compared to traditional detection methods, making their effective integration non-trivial. In this paper, we propose Global Label Propagation Network with LLM-based Pseudo Labeling (GLPN-LLM) for multimodal fake news detection, which integrates LLM capabilities via label propagation techniques. The global label propagation can utilize LLM-generated pseudo labels, enhancing prediction accuracy by propagating label information among all samples. For label propagation, a mask-based mechanism is designed to prevent label leakage during training by ensuring that training nodes do not propagate their own labels back to themselves. Experimental results on benchmark datasets show that by synergizing LLMs with label propagation, our model achieves superior performance over state-of-the-art baselines.
pdf
bib
abs
“Yes, My LoRD.” Guiding Language Model Extraction with Locality Reinforced Distillation
Zi Liang
|
Qingqing Ye
|
Yanyun Wang
|
Sen Zhang
|
Yaxin Xiao
|
RongHua Li
|
Jianliang Xu
|
Haibo Hu
Model extraction attacks (MEAs) on large language models (LLMs) have received increasing attention in recent research. However, existing attack methods typically adapt the extraction strategies originally developed for deep neural networks (DNNs). They neglect the underlying inconsistency between the training tasks of MEA and LLM alignment, leading to suboptimal attack performance. To tackle this issue, we propose Locality Reinforced Distillation (LoRD), a novel model extraction algorithm specifically designed for LLMs. In particular, LoRD employs a newly defined policy-gradient-style training task that uses the responses of the victim model as the signal to guide the crafting of preferences for the local model. Theoretical analyses demonstrate that (I) the convergence procedure of LoRD in model extraction is consistent with the alignment procedure of LLMs, and (II) LoRD can reduce query complexity while mitigating watermark protection through our exploration-based stealing. Extensive experiments validate the superiority of our method in extracting various state-of-the-art commercial LLMs. Our code is available at: https://github.com/liangzid/LoRD-MEA.
pdf
bib
abs
Jailbreak Large Vision-Language Models Through Multi-Modal Linkage
Yu Wang
|
Xiaofei Zhou
|
Yichen Wang
|
Geyuan Zhang
|
Tianxing He
With the rapid advancement of Large Vision-Language Models (VLMs), concerns about their potential misuse and abuse have grown rapidly. Prior research has exposed VLMs’ vulnerability to jailbreak attacks, where carefully crafted inputs can lead the model to produce content that violates ethical and legal standards. However, current jailbreak methods often fail against cutting-edge models such as GPT-4o. We attribute this to the over-exposure of harmful content and the absence of stealthy malicious guidance. In this work, we introduce a novel jailbreak framework: Multi-Modal Linkage (MML) Attack. Drawing inspiration from cryptography, MML employs an encryption-decryption process across text and image modalities to mitigate the over-exposure of malicious information. To covertly align the model’s output with harmful objectives, MML leverages a technique we term evil alignment, framing the attack within the narrative context of a video game development scenario. Extensive experiments validate the effectiveness of MML. Specifically, MML jailbreaks GPT-4o with attack success rates of 99.40% on SafeBench, 98.81% on MM-SafeBench, and 99.07% on HADES-Dataset. Our code is available at https://github.com/wangyu-ovo/MML.
pdf
bib
abs
Wait, that’s not an option: LLMs Robustness with Incorrect Multiple-Choice Options
Gracjan Góral
|
Emilia Wiśnios
|
Piotr Sankowski
|
Paweł Budzianowski
This work introduces a novel framework for evaluating LLMs’ capacity to balance instruction-following with critical reasoning when presented with multiple-choice questions containing no valid answers. Through systematic evaluation across arithmetic, domain-specific knowledge, and high-stakes medical decision tasks, we demonstrate that post-training aligned models often default to selecting invalid options, while base models exhibit improved refusal capabilities that scale with model size. Our analysis reveals that alignment techniques, though intended to enhance helpfulness, can inadvertently impair models’ reflective judgment–the ability to override default behaviors when faced with invalid options. We additionally conduct a parallel human study showing similar instruction-following biases, with implications for how these biases may propagate through human feedback datasets used in alignment. We provide extensive ablation studies examining the impact of model size, training techniques, and prompt engineering. Our findings highlight fundamental tensions between alignment optimization and preservation of critical reasoning capabilities, with important implications for developing more robust AI systems for real-world deployment.
pdf
bib
abs
The Hidden Attention of Mamba Models
Ameen Ali Ali
|
Itamar Zimerman
|
Lior Wolf
The Mamba layer offers an efficient selective state-space model (SSM) that is highly effective in modeling multiple domains, including NLP, long-range sequence processing, and computer vision. Selective SSMs are viewed as dual models, in which one trains in parallel on the entire sequence via an IO-aware parallel scan, and deploys in an autoregressive manner. We add a third view and show that such models can be viewed as attention-driven models. This new perspective enables us to empirically and theoretically compare the underlying mechanisms to that of the attention in transformers and allows us to peer inside the inner workings of the Mamba model with explainability methods. Our code is publicly available.
pdf
bib
abs
KV-Latent: Dimensional-level KV Cache Reduction with Frequency-aware Rotary Positional Embedding
Shi Luohe
|
Zuchao Li
|
Lefei Zhang
|
Baoyuan Qi
|
Liu Guoming
|
Hai Zhao
Large language models (LLMs) based on Transformer Decoders have become the preferred choice for conversational generative AI. Despite the overall superiority of the Decoder architecture, the Key-Value (KV) cache, which grows steadily during inference, has emerged as a primary efficiency bottleneck in terms of both memory consumption and data transfer bandwidth. To address these challenges, we propose a paradigm called KV-Latent. By down-sampling the Key-Value vector dimensions into a latent space, we can significantly reduce the KV Cache footprint and improve inference speed with only a small amount of extra training, less than 1% of what pre-training takes. In addition, we enhance the stability of Rotary Positional Embedding applied to lower-dimensional vectors by modifying its frequency sampling mechanism, avoiding noise introduced by higher frequencies while retaining position attenuation. Our experiments, covering models both with and without Grouped Query Attention, have yielded satisfactory results. Finally, we conduct comparative experiments to study the impact of separately reducing the Key and Value components on the model’s performance. Our approach allows for the construction of more efficient language model systems and opens new possibilities for KV Cache saving and efficient LLMs.
pdf
bib
abs
LEANCODE: Understanding Models Better for Code Simplification of Pre-trained Large Language Models
Yan Wang
|
Ling Ding
|
Tien N Nguyen
|
Shaohua Wang
|
Yanan Zheng
Large Language Models for code often entail significant computational complexity, which grows significantly with the length of the input code sequence. We propose LeanCode for code simplification to reduce training and prediction time, leveraging code contexts in utilizing attention scores to represent the tokens’ importance. We advocate for the selective removal of tokens based on the average context-aware attention scores rather than average scores across all inputs. LeanCode uses the attention scores of ‘CLS’ tokens within the encoder for classification tasks, such as code search. It also employs the encoder-decoder attention scores to determine token significance for sequence-to-sequence tasks like code summarization. Our evaluation shows LeanCode’s superiority over the state-of-the-art methods DietCode and SlimCode, with improvements of 60% and 16% for code search, and 29% and 27% for code summarization, respectively.
pdf
bib
abs
MARS: Benchmarking the Metaphysical Reasoning Abilities of Language Models with a Multi-task Evaluation Dataset
Weiqi Wang
|
Yangqiu Song
To enable Large Language Models (LLMs) to function as conscious agents with generalizable reasoning capabilities, it is crucial that they possess the ability to comprehend situational changes (transitions) in distribution triggered by environmental factors or actions from other agents. Despite its fundamental significance, this ability remains underexplored due to the complexity of modeling infinite possible changes in an event and their associated distributions, coupled with the lack of benchmark data with situational transitions. Addressing these gaps, we propose a novel formulation of reasoning with distributional changes as a three-step discriminative process, termed MetAphysical ReaSoning. We then introduce the first-ever benchmark, MARS, comprising three tasks corresponding to each step. These tasks systematically assess LLMs’ capabilities in reasoning the plausibility of (i) changes in actions, (ii) states caused by changed actions, and (iii) situational transitions driven by changes in action. Extensive evaluations with 20 (L)LMs of varying sizes and methods indicate that all three tasks in this process pose significant challenges, even after fine-tuning. Further analyses reveal potential causes for the underperformance of LLMs and demonstrate that pre-training on large-scale conceptualization taxonomies can potentially enhance LMs’ metaphysical reasoning capabilities. Our data and models are publicly accessible at https://github.com/HKUST-KnowComp/MARS.
pdf
bib
abs
Ask-Before-Detection: Identifying and Mitigating Conformity Bias in LLM-Powered Error Detector for Math Word Problem Solutions
Hang Li
|
Tianlong Xu
|
Kaiqi Yang
|
Yucheng Chu
|
Yanling Chen
|
Yichi Song
|
Qingsong Wen
|
Hui Liu
The rise of large language models (LLMs) offers new opportunities for automatic error detection in education, particularly for math word problems (MWPs). While prior studies demonstrate the promise of LLMs as error detectors, they overlook the presence of multiple valid solutions for a single MWP. Our preliminary analysis reveals a significant performance gap between conventional and alternative solutions in MWPs, a phenomenon we term conformity bias in this work. To mitigate this bias, we introduce the Ask-Before-Detect (AskBD) framework, which generates adaptive reference solutions using LLMs to enhance error detection. Experiments on 200 examples of GSM8K show that AskBD effectively mitigates bias and improves performance, especially when combined with reasoning-enhancing techniques like chain-of-thought prompting.
pdf
bib
abs
Real-time Factuality Assessment from Adversarial Feedback
Sanxing Chen
|
Yukun Huang
|
Bhuwan Dhingra
We show that existing evaluations for assessing the factuality of news from conventional sources, such as claims on fact-checking websites, result in high accuracies over time for LLM-based detectors—even after their knowledge cutoffs. This suggests that recent popular false information from such sources can be easily identified due to its likely presence in pre-training/retrieval corpora or the emergence of salient, yet shallow, patterns in these datasets. Instead, we argue that a proper factuality evaluation dataset should test a model’s ability to reason about current events by retrieving and reading related evidence. To this end, we develop a novel pipeline that leverages natural language feedback from a RAG-based detector to iteratively modify real-time news into deceptive variants that challenge LLMs. Our iterative rewrite decreases the binary classification ROC-AUC by an absolute 17.5 percent for a strong RAG-based GPT-4o detector. Our experiments reveal the important role of RAG in both evaluating and generating challenging news examples, as retrieval-free LLM detectors are vulnerable to unseen events and adversarial attacks, while feedback from RAG-based evaluation helps discover more deceitful patterns.
pdf
bib
abs
Improve Vision Language Model Chain-of-thought Reasoning
Ruohong Zhang
|
Bowen Zhang
|
Yanghao Li
|
Haotian Zhang
|
Zhiqing Sun
|
Zhe Gan
|
Yinfei Yang
|
Ruoming Pang
|
Yiming Yang
Chain-of-thought (CoT) reasoning in vision language models (VLMs) is crucial for improving interpretability and trustworthiness. However, current training recipes often rely on datasets dominated by short annotations with minimal rationales. In this work, we show that training VLMs on short answers leads to poor generalization on reasoning tasks that require more detailed explanations. To address this limitation, we propose a two-stage post-training strategy that extends the usage of short answer data for enhanced CoT reasoning. First, we augment short answers with CoT reasoning generated by GPT-4o, enhancing the VLM’s CoT capabilities through fine-tuning. Second, we leverage short answers as outcome rewards for reinforcement learning. Specifically, short answers are used as correctness indicators to construct positive (correct) and negative (incorrect) pairs from model-generated reasoning chains. These pairs are then used to calibrate the model’s reasoning via Direct Preference Optimization. Our experiments show significant improvements in CoT reasoning on benchmark datasets, along with enhanced generalization to direct answer prediction. This work provides a critical data resource for VLM CoT training and demonstrates the effectiveness of outcome rewards for post-training multimodal models.
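The pair-construction step described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the `Answer:` marker convention, the exact-match comparison, and the names `extract_answer` and `build_preference_pairs` are assumptions made purely for illustration.

```python
from itertools import product
from typing import Dict, List, Optional


def extract_answer(chain: str) -> str:
    """Return the text after the last 'Answer:' marker, lower-cased.

    The marker convention is hypothetical; an empty string means no answer was found.
    """
    marker = "Answer:"
    idx = chain.rfind(marker)
    if idx == -1:
        return ""
    return chain[idx + len(marker):].strip().lower()


def build_preference_pairs(
    chains: List[str], gold_answer: str, max_pairs: Optional[int] = None
) -> List[Dict[str, str]]:
    """Pair every chain whose final answer matches the gold short answer
    (chosen) with every chain whose final answer does not (rejected)."""
    gold = gold_answer.strip().lower()
    correct = [c for c in chains if extract_answer(c) == gold]
    incorrect = [c for c in chains if extract_answer(c) != gold]
    pairs = [{"chosen": c, "rejected": r} for c, r in product(correct, incorrect)]
    return pairs if max_pairs is None else pairs[:max_pairs]


sampled_chains = [
    "There are two pairs of apples, so 2 + 2. Answer: 4",
    "There are two apples and three pears. Answer: 5",
    "The image shows fruit on a table.",
]
preference_pairs = build_preference_pairs(sampled_chains, gold_answer="4")
```

Pairs in this `chosen`/`rejected` shape are what a preference-optimization trainer would typically consume; a question whose sampled chains are all correct or all incorrect yields no pairs.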
pdf
bib
abs
On the Mutual Influence of Gender and Occupation in LLM Representations
Haozhe An
|
Connor Baumler
|
Abhilasha Sancheti
|
Rachel Rudinger
We examine LLM representations of gender for first names in various occupational contexts to study how occupations and the gender perception of first names in LLMs influence each other mutually. We find that LLMs’ first-name gender representations correlate with real-world gender statistics associated with the name, and are influenced by the co-occurrence of stereotypically feminine or masculine occupations. Additionally, we study the influence of first-name gender representations on LLMs in a downstream occupation prediction task and their potential as an internal metric to identify extrinsic model biases. While feminine first-name embeddings often raise the probabilities for female-dominated jobs (and vice versa for male-dominated jobs), reliably using these internal gender representations for bias detection remains challenging.
pdf
bib
abs
Disentangling Memory and Reasoning Ability in Large Language Models
Mingyu Jin
|
Weidi Luo
|
Sitao Cheng
|
Xinyi Wang
|
Wenyue Hua
|
Ruixiang Tang
|
William Yang Wang
|
Yongfeng Zhang
Large Language Models (LLMs) have demonstrated strong performance in handling complex tasks that require both extensive knowledge and reasoning abilities. However, the existing LLM inference pipeline operates as an opaque process without explicit separation between knowledge retrieval and reasoning steps, making the model’s decision-making process unclear and disorganized. Recent research has shown that this ambiguity leads to issues such as knowledge forgetting, which significantly impact the reliability of LLMs. In this paper, we propose a novel language model inference paradigm that decomposes the complex inference process into two distinct and clear actions: (1) memory recall, which retrieves relevant knowledge in the LLM, and (2) reasoning, which performs reasoning steps based on the recalled knowledge. To facilitate this decomposition, we introduce two special tokens, memory and reason, guiding the model to distinguish between steps that require knowledge retrieval and those that involve reasoning. Our experimental results show that this decomposition not only improves LLMs’ performance across utility benchmarks but also enhances interpretability during the inference process, enabling users to identify sources of error and refine model responses effectively. The code is available at: https://github.com/MingyuJ666/Disentangling-Memory-and-Reasoning.
pdf
bib
abs
Open-World Attribute Mining for E-Commerce Products with Multimodal Self-Correction Instruction Tuning
Jiaqi Li
|
Yanming Li
|
Xiaoli Shen
|
Chuanyi Zhang
|
Guilin Qi
|
Sheng Bi
In e-commerce, effective product Attribute Mining (AM) is essential for improving product features and aiding consumer decisions. However, current AM methods often focus on extracting attributes from unimodal text, underutilizing multimodal data. In this paper, we propose a novel framework called Multimodal Self-Correction Instruction Tuning (MSIT) to mine new potential attributes from both images and text with Multimodal Large Language Models (MLLMs). The tuning process involves two datasets: Attribute Generation Tuning Data (AGTD) and Chain-of-Thought Tuning Data (CTTD). AGTD is constructed using in-context learning with a small set of seed attributes, aiding the MLLM in accurately extracting attribute-value pairs from multimodal information. To introduce explicit reasoning and improve extraction accuracy, we construct CTTD, which incorporates a structured 5-step reasoning process for self-correction. Finally, we employ a 3-stage inference process to filter out redundant attributes and sequentially validate each generated attribute. Comprehensive experimental results on two datasets show that MSIT outperforms state-of-the-art methods. We will release our code and data in the near future.
pdf
bib
abs
Normalized AOPC: Fixing Misleading Faithfulness Metrics for Feature Attributions Explainability
Joakim Edin
|
Andreas Geert Motzfeldt
|
Casper L. Christensen
|
Tuukka Ruotsalo
|
Lars Maaløe
|
Maria Maistro
Deep neural network predictions are notoriously difficult to interpret. Feature attribution methods aim to explain these predictions by identifying the contribution of each input feature. Faithfulness, often evaluated using the area over the perturbation curve (AOPC), reflects feature attributions’ accuracy in describing the internal mechanisms of deep neural networks. However, many studies rely on AOPC to compare faithfulness across different models, which we show can lead to false conclusions about models’ faithfulness. Specifically, we find that AOPC is sensitive to variations in the model, resulting in unreliable cross-model comparisons. Moreover, AOPC scores are difficult to interpret in isolation without knowing the model-specific lower and upper limits. To address these issues, we propose a normalization approach, Normalized AOPC (NAOPC), enabling consistent cross-model evaluations and more meaningful interpretation of individual scores. Our experiments demonstrate that this normalization can radically change AOPC results, questioning the conclusions of earlier studies and offering a more robust framework for assessing feature attribution faithfulness. Our code is available at https://github.com/JoakimEdin/naopc.
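To make the idea of normalizing AOPC by model-specific limits concrete, here is a small sketch. It assumes a min-max normalization in which the lower and upper limits are found by brute force over all feature orderings, which is only feasible for a handful of features; the abstract does not spell out the procedure, so the functions `aopc` and `normalized_aopc` and the linear `toy_model` are illustrative assumptions, not the released code.

```python
from itertools import permutations
from typing import Callable, List, Sequence

Model = Callable[[Sequence[bool]], float]  # maps a keep-mask to a prediction score


def aopc(model: Model, n_features: int, order: Sequence[int]) -> float:
    """Average drop in the model score as features are removed one by one in `order`."""
    keep: List[bool] = [True] * n_features
    full_score = model(keep)
    drops = []
    for idx in order:
        keep[idx] = False  # perturb (remove) the next feature in the ranking
        drops.append(full_score - model(keep))
    return sum(drops) / len(drops)


def normalized_aopc(model: Model, n_features: int, order: Sequence[int]) -> float:
    """Min-max normalize AOPC with limits taken over every possible ordering (assumption)."""
    scores = [aopc(model, n_features, p) for p in permutations(range(n_features))]
    lower, upper = min(scores), max(scores)
    if upper == lower:
        return 0.0  # every ordering is equally faithful for this model and input
    return (aopc(model, n_features, order) - lower) / (upper - lower)


WEIGHTS = [0.5, 0.3, 0.2]


def toy_model(keep: Sequence[bool]) -> float:
    """Hypothetical linear model: the score is the summed weight of the kept features."""
    return sum(w for w, k in zip(WEIGHTS, keep) if k)


best_ranking = normalized_aopc(toy_model, 3, [0, 1, 2])   # most important feature first
worst_ranking = normalized_aopc(toy_model, 3, [2, 1, 0])  # least important feature first
```

Under this normalization, a ranking that removes the most influential features first scores 1 and the reverse ranking scores 0 regardless of the model's raw score range, which is what makes scores comparable across models.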
pdf
bib
abs
Takin-VC: Expressive Zero-Shot Voice Conversion via Adaptive Hybrid Content Encoding and Enhanced Timbre Modeling
Yang Yuguang
|
Yu Pan
|
Jixun Yao
|
Xiang Zhang
|
Jianhao Ye
|
Hongbin Zhou
|
Lei Xie
|
Lei Ma
|
Jianjun Zhao
Expressive zero-shot voice conversion (VC) is a critical and challenging task that aims to transform the source timbre into an arbitrary unseen speaker while preserving the original content and expressive qualities. Despite recent progress in zero-shot VC, there remains considerable potential for improvements in speaker similarity and speech naturalness. Moreover, existing zero-shot VC systems struggle to fully reproduce paralinguistic information in highly expressive speech, such as breathing, crying, and emotional nuances, limiting their practical applicability. To address these issues, we propose Takin-VC, a novel expressive zero-shot VC framework via adaptive hybrid content encoding and memory-augmented context-aware timbre modeling. Specifically, we introduce an innovative hybrid content encoder that incorporates an adaptive fusion module, capable of effectively integrating quantized features of the pre-trained WavLM and HybridFormer in an implicit manner, so as to extract precise linguistic features while enriching paralinguistic elements. For timbre modeling, we propose advanced memory-augmented and context-aware modules to generate high-quality target timbre features and fused representations that seamlessly align source content with target timbre. To enhance real-time performance, we advocate a conditional flow matching model to reconstruct the Mel-spectrogram of the source speech. Experimental results show that our Takin-VC consistently surpasses state-of-the-art VC systems, achieving notable improvements in terms of speech naturalness, speech expressiveness, and speaker similarity, while offering enhanced inference speed.
pdf
bib
abs
LangSAMP: Language-Script Aware Multilingual Pretraining
Yihong Liu
|
Haotian Ye
|
Chunlan Ma
|
Mingyang Wang
|
Hinrich Schuetze
Recent multilingual pretrained language models (mPLMs) often avoid using language embeddings – learnable vectors assigned to individual languages. However, this places a significant burden on token representations to encode all language-specific information, which may hinder language neutrality. To address this limitation, we propose Language-Script Aware Multilingual Pretraining (LangSAMP), a method that incorporates both language and script embeddings to enhance representation learning. Specifically, we integrate these embeddings into the output of the Transformer blocks before passing the final representations to the language modeling head for prediction. We apply LangSAMP to the continual pretraining of XLM-R on a highly multilingual corpus covering more than 500 languages. The resulting model consistently outperforms the baseline in zero-shot crosslingual transfer across diverse downstream tasks. Extensive analysis reveals that language and script embeddings capture language- and script-specific nuances, which benefits more language-neutral representations, proven by improved pairwise cosine similarity. In our case study, we also show that language and script embeddings can be used to select better source languages for crosslingual transfer. We make our code and models publicly available at https://github.com/cisnlp/LangSAMP.
pdf
bib
abs
RelationalCoder: Rethinking Complex Tables via Programmatic Relational Transformation
Haoyu Dong
|
Yue Hu
|
Huailiang Peng
|
Yanan Cao
Semi-structured tables, with their varied layouts and formatting artifacts, remain a major obstacle for automated data processing and analytics. To address these challenges, we propose RelationalCoder, which uniformly converts semi-structured tables into relational data, enabling smooth integration with the rich ecosystem of data processing and analytics tools. By leveraging SQL code, RelationalCoder prevents schema errors and markedly improves normalization quality across multiple relational tables. To address the challenge of large tables, we propose a new technique called Loop Reference Decoding (LRD): it identifies expandable groups (repeating regions of similar structure and semantics) and replicates each group using a concise loop over its repetitive region by referencing cell addresses, rather than regenerating each individual cell. This design substantially reduces output length from 𝒪(N × M), proportional to the table’s height (N) and width (M), to approximately 𝒪(K), where K is the total number of unique cell types within detected expandable groups. As a result, LRD is highly scalable: the larger the input table, the greater the compression ratio. It scales seamlessly to extremely large tables, achieving output reductions of up to 100,000×. We further create the first human-labeled corpus for table transformation, built with a cost-efficient, actively supervised pipeline. Extensive experiments on HiTab and MultiHiertt show that RelationalCoder not only enables programmatic symbolic reasoning but also boosts QA accuracy, raising Llama-2 and Mistral models by more than 20%, and GPT-4o by over 4%. Project page: https://github.com/haoyudong/RelationalCoder.
pdf
bib
abs
Algorithmic Fidelity of Large Language Models in Generating Synthetic German Public Opinions: A Case Study
Bolei Ma
|
Berk Yoztyurk
|
Anna-Carolina Haensch
|
Xinpeng Wang
|
Markus Herklotz
|
Frauke Kreuter
|
Barbara Plank
|
Matthias Aßenmacher
In recent research, large language models (LLMs) have been increasingly used to investigate public opinions. This study investigates the algorithmic fidelity of LLMs, i.e., the ability to replicate the socio-cultural context and nuanced opinions of human participants. Using open-ended survey data from the German Longitudinal Election Studies (GLES), we prompt different LLMs to generate synthetic public opinions reflective of German subpopulations by incorporating demographic features into the persona prompts. Our results show that Llama performs better than other LLMs at representing subpopulations, particularly when there is lower opinion diversity within those groups. Our findings further reveal that the LLM performs better for supporters of left-leaning parties like The Greens and The Left compared to other parties, and matches the least with the right-party AfD. Additionally, the inclusion or exclusion of specific variables in the prompts can significantly impact the models’ predictions. These findings underscore the importance of aligning LLMs to more effectively model diverse public opinions while minimizing political biases and enhancing robustness in representativeness.
pdf
bib
abs
TUNA: Comprehensive Fine-grained Temporal Understanding Evaluation on Dense Dynamic Videos
Fanheng Kong
|
Jingyuan Zhang
|
Hongzhi Zhang
|
Shi Feng
|
Daling Wang
|
Linhao Yu
|
Xingguang Ji
|
Yu Tian
|
V. W.
|
Fuzheng Zhang
Videos are unique in their integration of temporal elements, including camera, scene, action, and attribute, along with their dynamic relationships over time. However, existing benchmarks for video understanding often treat these properties separately or narrowly focus on specific aspects, overlooking the holistic nature of video content. To address this, we introduce TUNA, a temporal-oriented benchmark for fine-grained understanding on dense dynamic videos, with two complementary tasks: captioning and QA. Our TUNA features diverse video scenarios and dynamics, assisted by interpretable and robust evaluation criteria. We evaluate several leading models on our benchmark, providing fine-grained performance assessments across various dimensions. This evaluation reveals key challenges in video temporal understanding, such as limited action description, inadequate multi-subject understanding, and insensitivity to camera motion, offering valuable insights for improving video understanding models.
pdf
bib
abs
Self-Instructed Derived Prompt Generation Meets In-Context Learning: Unlocking New Potential of Black-Box LLMs
Zhuo Li
|
Yuhao Du
|
Jinpeng Hu
|
Xiang Wan
|
Anningzhe Gao
Improving prompt quality is crucial for enhancing the performance of large language models (LLMs), particularly for Black-Box models like GPT4. Existing prompt refinement methods, while effective, often suffer from semantic inconsistencies between refined and original prompts, and fail to maintain users’ real intent. To address these challenges, we propose a self-instructed in-context learning framework that generates reliable derived prompts, keeping semantic consistency with the original prompts. Specifically, our framework incorporates a reinforcement learning mechanism, enabling direct interaction with the response model during prompt generation to better align with human preferences. We then formulate the querying as an in-context learning task, combining responses from LLMs with derived prompts to create a contextual demonstration for the original prompt. This approach effectively enhances alignment, reduces semantic discrepancies, and activates the LLM’s in-context learning ability to generate more beneficial responses. Extensive experiments demonstrate that the proposed method not only generates better derived prompts but also significantly enhances LLMs’ ability to deliver more effective responses, particularly for Black-Box models like GPT4.
pdf
bib
abs
Binary Classifier Optimization for Large Language Model Alignment
Seungjae Jung
|
Gunsoo Han
|
Daniel Wontae Nam
|
Kyoung-Woon On
In real-world services such as ChatGPT, aligning models based on user feedback is crucial for improving model performance. However, due to the simplicity and convenience of providing feedback, users typically offer only basic binary signals, such as ‘thumbs-up’ or ‘thumbs-down’. Most existing alignment research, on the other hand, relies on preference-based approaches that require both positive and negative responses as a pair. We propose Binary Classifier Optimization (BCO), a technique that effectively aligns LLMs using only binary feedback. BCO trains a binary classifier, where the logit serves as an implicit reward, effectively minimizing the Direct Preference Optimization (DPO) loss. We demonstrate that the binary cross-entropy loss employed in classifier training acts as an upper bound for the DPO loss. Additionally, a novel reward shift technique further minimizes the gap between the losses. We validate our methodology in two settings: first, on a paired preference dataset, where our method performs on par with DPO; and second, on a Likert-5 scale annotation dataset which stems from real users’ queries. Our model consistently demonstrates effective and robust alignment across four base LLMs and three different datasets, showcasing the strength of our approach to learning from binary signals.
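The upper-bound claim in this abstract can be checked numerically with a short sketch. It treats the implicit rewards as plain scalars and models the reward shift as a single scalar offset; both are simplifying assumptions of this illustration rather than the paper's exact formulation, and the function names are hypothetical. The bound follows from σ(a)·σ(−b) ≤ σ(a − b), so the summed binary cross-entropy terms can never fall below the DPO loss.

```python
import math
import random


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_loss(r_chosen: float, r_rejected: float) -> float:
    """DPO loss for one pair: -log sigmoid(r_chosen - r_rejected)."""
    return -log_sigmoid(r_chosen - r_rejected)


def bce_loss(r_chosen: float, r_rejected: float, shift: float = 0.0) -> float:
    """Binary cross-entropy with the reward as logit: the thumbs-up response is
    labelled 1 and the thumbs-down response 0. `shift` is a scalar reward
    offset (a simplification assumed for this sketch)."""
    return -log_sigmoid(r_chosen - shift) - log_sigmoid(-(r_rejected - shift))


def max_violation(num_trials: int = 10_000, seed: int = 0) -> float:
    """Largest observed value of max(0, dpo_loss - bce_loss) over random rewards."""
    rng = random.Random(seed)
    worst = 0.0
    for _ in range(num_trials):
        r_chosen = rng.uniform(-10.0, 10.0)
        r_rejected = rng.uniform(-10.0, 10.0)
        shift = rng.uniform(-5.0, 5.0)
        gap = dpo_loss(r_chosen, r_rejected) - bce_loss(r_chosen, r_rejected, shift)
        worst = max(worst, gap)
    return worst


violation = max_violation()  # stays at 0.0 when the bound holds on every sample
```

Because the shift cancels in the reward difference, it leaves the DPO loss unchanged while altering the classifier loss, which is why a well-chosen shift can narrow the gap between the two.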
pdf
bib
abs
UnSeenTimeQA: Time-Sensitive Question-Answering Beyond LLMs’ Memorization
Md Nayem Uddin
|
Amir Saeidi
|
Divij Handa
|
Agastya Seth
|
Tran Cao Son
|
Eduardo Blanco
|
Steven Corman
|
Chitta Baral
This paper introduces UnSeenTimeQA, a novel data contamination-free time-sensitive question-answering (TSQA) benchmark. It differs from existing TSQA benchmarks by avoiding web-searchable queries grounded in the real world. We present a series of time-sensitive event scenarios based on synthetically generated facts. The benchmark requires large language models (LLMs) to engage in genuine temporal reasoning without depending on the factual knowledge acquired during the pre-training phase. Our data generation framework enables on-demand generation of new samples, mitigating the risk of data leakage. We designed three types of time-sensitive questions to test LLMs’ temporal reasoning abilities over sequential and parallel event occurrences. Our evaluation of five LLMs on synthetic fact-based TSQA reveals mixed results: while they perform well on simpler subsets, their overall performance remains inferior to that on real-world fact-based TSQA. Error analysis indicates that LLMs face difficulties in reasoning over long-range event dependencies and parallel events.
pdf
bib
abs
From Information to Insight: Leveraging LLMs for Open Aspect-Based Educational Summarization
Yang Zhong
|
Diane Litman
This paper addresses the challenge of aspect-based summarization in education by introducing Reflective ASPect-based summarization (ReflectASP), a novel dataset that summarizes student reflections on STEM lectures. Despite the promising performance of large language models in general summarization, their application to nuanced aspect-based summaries remains under-explored. ReflectASP eases the exploration of open-aspect-based summarization (OABS), overcoming the limitations of current datasets and comes with ample human annotations. We benchmarked different types of zero-shot summarization methods and proposed two refinement methods to improve summaries, supported by both automatic and human manual evaluations. Additionally, we analyzed suggestions and revisions made during the refinement process, offering a fine-grained study of the editing strategies employed by these methods. We make our models, dataset, and all human evaluation results available at https://github.com/cs329yangzhong/ReflectASP.
pdf
bib
abs
AfriMed-QA: A Pan-African, Multi-Specialty, Medical Question-Answering Benchmark Dataset
Charles Nimo
|
Tobi Olatunji
|
Abraham Toluwase Owodunni
|
Tassallah Abdullahi
|
Emmanuel Ayodele
|
Mardhiyah Sanni
|
Ezinwanne C. Aka
|
Folafunmi Omofoye
|
Foutse Yuehgoh
|
Timothy Faniran
|
Bonaventure F. P. Dossou
|
Moshood O. Yekini
|
Jonas Kemp
|
Katherine A Heller
|
Jude Chidubem Omeke
|
Chidi Asuzu Md
|
Naome A Etori
|
Aïmérou Ndiaye
|
Ifeoma Okoh
|
Evans Doe Ocansey
|
Wendy Kinara
|
Michael L. Best
|
Irfan Essa
|
Stephen Edward Moore
|
Chris Fourie
|
Mercy Nyamewaa Asiedu
Recent advancements in large language model (LLM) performance on medical multiple-choice question (MCQ) benchmarks have stimulated interest from healthcare providers and patients globally. Particularly in low- and middle-income countries (LMICs) facing acute physician shortages and a lack of specialists, LLMs offer a potentially scalable pathway to enhance healthcare access and reduce costs. However, their effectiveness in the Global South, especially across the African continent, remains to be established. In this work, we introduce AfriMed-QA, the first large-scale Pan-African English multi-specialty medical Question-Answering (QA) dataset, comprising 15,000 questions (open and closed-ended) sourced from over 60 medical schools across 16 countries, covering 32 medical specialties. We further evaluate 30 LLMs across multiple axes including correctness and demographic bias. Our findings show significant performance variation across specialties and geographies, and MCQ performance clearly lags behind USMLE (MedQA). We find that biomedical LLMs underperform general models and smaller edge-friendly LLMs struggle to achieve a passing score. Interestingly, human evaluations show a consistent consumer preference for LLM answers and explanations when compared with clinician answers.
pdf
bib
abs
Root Defense Strategies: Ensuring Safety of LLM at the Decoding Level
Xinyi Zeng
|
Yuying Shang
|
Jiawei Chen
|
Jingyuan Zhang
|
Yu Tian
Large language models (LLMs) have demonstrated immense utility across various industries. However, as LLMs advance, the risk of harmful outputs increases due to incorrect or malicious prompts. While current methods effectively address jailbreak risks, they share common limitations: 1) Judging harmful outputs from the prefill-level lacks utilization of the model’s decoding outputs, leading to relatively lower effectiveness and robustness. 2) Rejecting potentially harmful outputs based on a single evaluation can significantly impair the model’s helpfulness. To address the above issues, we examine LLMs’ capability to recognize harmful outputs, revealing and quantifying their proficiency in assessing the danger of previous tokens. Motivated by pilot experiment results, we design a robust defense mechanism at the decoding level. Our novel decoder-oriented, step-by-step defense architecture corrects the outputs of harmful queries directly rather than rejecting them outright. We introduce speculative decoding to enhance usability and facilitate deployment to boost safe decoding speed. Extensive experiments demonstrate that our approach improves model security without compromising reasoning speed. Notably, our method leverages the model’s ability to discern hazardous information, maintaining its helpfulness compared to existing methods.
pdf
bib
abs
In-the-wild Audio Spatialization with Flexible Text-guided Localization
Tianrui Pan
|
Jie Liu
|
Zewen Huang
|
Jie Tang
|
Gangshan Wu
Binaural audio enriches immersive experiences by enabling the perception of the spatial locations of sounding objects in AR, VR, and embodied AI applications. While existing audio spatialization methods can generally map any available monaural audio to binaural audio signals, they often lack the flexible and interactive control needed in complex multi-object user-interactive environments. To address this, we propose a Text-guided Audio Spatialization (TAS) framework that utilizes diverse text prompts and evaluates our model from unified generation and comprehension perspectives. Due to the limited availability of high-quality, large-scale stereo data, we construct the SpatialTAS dataset, which encompasses 376,000 simulated binaural audio samples to facilitate the training of our model. Our model learns binaural differences guided by 3D spatial location and relative position prompts, enhanced with flipped-channel audio. Experimental results show that our model can generate high quality binaural audios for various audio types on both simulated and real-recorded datasets. Besides, we establish an assessment model based on Llama-3.1-8B, which evaluates the semantic accuracy of spatial locations through a spatial reasoning task. Results demonstrate that by utilizing text prompts for flexible and interactive control, we can generate binaural audio with both high quality and semantic consistency in spatial locations.
pdf
bib
abs
L4Q: Parameter Efficient Quantization-Aware Fine-Tuning on Large Language Models
Hyesung Jeon
|
Yulhwa Kim
|
Jae-Joon Kim
Due to the high memory and computational costs associated with large language models (LLMs), model compression techniques such as quantization, which reduces inference costs, and parameter-efficient fine-tuning (PEFT) methods like Low-Rank Adaptation (LoRA), which reduce training costs, have gained significant popularity. This trend has spurred active research into quantization-aware PEFT techniques, aimed at maintaining model accuracy while minimizing memory overhead during both inference and training. Previous quantization-aware PEFT methods typically apply post-training quantization (PTQ) to pre-trained LLMs, followed by PEFT to recover accuracy loss. However, this approach has limitations in recovering the accuracy loss. In this paper, we propose L4Q, a method that integrates Quantization-Aware Training (QAT) with LoRA. By employing a memory-optimized layer design, L4Q significantly reduces QAT’s memory overhead, making its training cost comparable to LoRA, while preserving the advantage of QAT in producing fully quantized LLMs with high accuracy. Our experiments demonstrate that this combined approach to quantization and fine-tuning achieves superior accuracy compared to decoupled fine-tuning schemes, particularly in 4-bit and 3-bit quantization, positioning L4Q as an efficient QAT solution. Using the LLaMA and Mistral models with instructional datasets, we showcase L4Q’s capabilities in language tasks and few-shot learning.
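To make the contrast between decoupled and combined schemes concrete, the following is a minimal sketch of quantizing a weight matrix after a low-rank update has been merged into it, rather than quantizing the base weights first and adding an unquantized update afterwards. It is a generic illustration only: the matrices, the single per-matrix scale, and the helper names `matmul` and `quantize` are assumptions made for this example, and it does not implement training, gradients, or the memory-optimized layer design described in the abstract.

```python
def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def quantize(weights, bits):
    """Symmetric uniform quantize-dequantize using one scale for the whole matrix.

    Returns the dequantized matrix together with the scale that was used.
    """
    qmax = 2 ** (bits - 1) - 1
    max_abs = max(abs(w) for row in weights for w in row)
    scale = max_abs / qmax if max_abs else 1.0
    dequantized = [
        [max(-qmax, min(qmax, round(w / scale))) * scale for w in row]
        for row in weights
    ]
    return dequantized, scale


# Hypothetical toy values: a 2x2 base weight and a rank-1 update B @ A.
W = [[0.50, -0.20], [0.10, 0.32]]
B = [[0.10], [-0.10]]
A = [[0.20, 0.40]]

delta = matmul(B, A)
merged = [[w + d for w, d in zip(w_row, d_row)] for w_row, d_row in zip(W, delta)]

# Quantize the merged weight so the low-rank update is part of what gets quantized.
w_q, scale = quantize(merged, bits=4)
levels = [[round(v / scale) for v in row] for row in w_q]
```

Here `levels` holds the integer codes a 4-bit signed format would store; every entry of `w_q` is one of those codes multiplied by `scale`.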
pdf
bib
abs
Second Language (Arabic) Acquisition of LLMs via Progressive Vocabulary Expansion
Jianqing Zhu
|
Huang Huang
|
Zhihang Lin
|
Juhao Liang
|
Zhengyang Tang
|
Khalid Almubarak
|
Mosen Alharthi
|
Bang An
|
Juncai He
|
Xiangbo Wu
|
Fei Yu
|
Junying Chen
|
Ma Zhuoheng
|
Yuhao Du
|
He Zhang
|
Saied Alshahrani
|
Emad A. Alghamdi
|
Lian Zhang
|
Ruoyu Sun
|
Haizhou Li
|
Benyou Wang
|
Jinchao Xu
This paper addresses the critical need for democratizing large language models (LLMs) in the Arab world, a region that has seen slower progress in developing models comparable to state-of-the-art offerings like GPT-4 or GPT-3.5, due to a predominant focus on mainstream languages (e.g., English and Chinese). One practical objective for Arabic LLMs is to utilize Arabic-specific vocabulary in the tokenizer to accelerate decoding. However, using a different vocabulary often leads to degradation of the model’s learned knowledge, since many words become out-of-vocabulary (OOV) at the beginning of training. Inspired by the vocabulary learning during Second Language (Arabic) Acquisition for humans, the released AraLLaMA employs progressive vocabulary expansion, which is implemented by a modified BPE algorithm that progressively extends the Arabic subwords in its dynamic vocabulary during training, thereby balancing the OOV ratio at every stage. The ablation study demonstrated the effectiveness of Progressive Vocabulary Expansion. Moreover, AraLLaMA achieves decent performance comparable to the best Arabic LLMs across a variety of Arabic benchmarks. Our model weights are available at https://github.com/FreedomIntelligence/AraLLaMa.
pdf
bib
abs
What Really Matters in Many-Shot Attacks? An Empirical Study of Long-Context Vulnerabilities in LLMs
Sangyeop Kim
|
Yohan Lee
|
Yongwoo Song
|
Kimin Lee
We investigate long-context vulnerabilities in Large Language Models (LLMs) through Many-Shot Jailbreaking (MSJ). Our experiments utilize context length of up to 128K tokens. Through comprehensive analysis with various many-shot attack settings with different instruction styles, shot density, topic, and format, we reveal that context length is the primary factor determining attack effectiveness. Critically, we find that successful attacks do not require carefully crafted harmful content. Even repetitive shots or random dummy text can circumvent model safety measures, suggesting fundamental limitations in long-context processing capabilities of LLMs. The safety behavior of well-aligned models becomes increasingly inconsistent with longer contexts. These findings highlight significant safety gaps in context expansion capabilities of LLMs, emphasizing the need for new safety mechanisms.
pdf
bib
abs
ECERC: Evidence-Cause Attention Network for Multi-Modal Emotion Recognition in Conversation
Tao Zhang
|
Zhenhua Tan
Multi-modal Emotion Recognition in Conversation (MMERC) aims to identify speakers’ emotional states using multi-modal conversational data, a task significant for various domains. MMERC requires addressing emotional causes: contextual factors that influence emotions, alongside emotional evidence directly expressed in the target utterance. Existing methods primarily model general conversational dependencies, such as sequential utterance relationships or inter-speaker dynamics, but fall short in capturing diverse and detailed emotional causes, including emotional contagion, influences from others, and self-referenced or externally introduced events. To address these limitations, we propose the Evidence-Cause Attention Network for Multi-Modal Emotion Recognition in Conversation (ECERC). ECERC integrates emotional evidence with contextual causes through five stages: Evidence Gating extracts and refines emotional evidence across modalities; Cause Encoding captures causes from conversational context; Evidence-Cause Interaction uses attention to integrate evidence with diverse causes, generating rich candidate features for emotion inference; Feature Gating adaptively weights contributions of candidate features; and Emotion Classification classifies emotions. We evaluate ECERC on two widely used benchmark datasets, IEMOCAP and MELD. Experimental results show that ECERC achieves competitive performance in weighted F1-score and accuracy, demonstrating its effectiveness in MMERC.
pdf
bib
abs
CompileAgent: Automated Real-World Repo-Level Compilation with Tool-Integrated LLM-based Agent System
Li Hu
|
Guoqiang Chen
|
Xiuwei Shang
|
Shaoyin Cheng
|
Benlong Wu
|
LiGangyang LiGangyang
|
Xu Zhu
|
Weiming Zhang
|
Nenghai Yu
With open-source projects growing in size and complexity, manual compilation becomes tedious and error-prone, highlighting the need for automation to improve efficiency and accuracy. However, the complexity of compilation instruction search and error resolution makes automatic compilation challenging. Inspired by the success of LLM-based agents in various fields, we propose CompileAgent, the first LLM-based agent framework dedicated to repo-level compilation. CompileAgent integrates five tools and a flow-based agent strategy, enabling interaction with software artifacts for compilation instruction search and error resolution. To measure the effectiveness of our method, we design a public repo-level benchmark CompileAgentBench, and we also design two baselines for comparison by combining two compilation-friendly schemes. The performance on this benchmark shows that our method significantly improves the compilation success rate, ranging from 10% to 71%. Meanwhile, we evaluate the performance of CompileAgent under different agent strategies and verify the effectiveness of the flow-based strategy. Additionally, we emphasize the scalability of CompileAgent, further expanding its application prospects. The complete code and data are available at https://github.com/Ch3nYe/AutoCompiler.
pdf
bib
abs
Beyond Demographics: Fine-tuning Large Language Models to Predict Individuals’ Subjective Text Perceptions
Matthias Orlikowski
|
Jiaxin Pei
|
Paul Röttger
|
Philipp Cimiano
|
David Jurgens
|
Dirk Hovy
People naturally vary in their annotations for subjective questions and some of this variation is thought to be due to the person’s sociodemographic characteristics. LLMs have also been used to label data, but recent work has shown that models perform poorly when prompted with sociodemographic attributes, suggesting limited inherent sociodemographic knowledge. Here, we ask whether LLMs can be trained to be accurate sociodemographic models of annotator variation. Using a curated dataset of five tasks with standardized sociodemographics, we show that models do improve in sociodemographic prompting when trained but that this performance gain is largely due to models learning annotator-specific behaviour rather than sociodemographic behaviours. Across all tasks, our results suggest that models learn little meaningful connection between sociodemographics and annotation, raising doubts about the current use of LLMs for simulating sociodemographic variation and behaviour.
pdf
bib
abs
Exploring Forgetting in Large Language Model Pre-Training
Chonghua Liao
|
Ruobing Xie
|
Xingwu Sun
|
Haowen Sun
|
Zhanhui Kang
Catastrophic forgetting remains a formidable obstacle to building an omniscient model in large language models (LLMs). Despite the pioneering research on task-level forgetting in LLM fine-tuning, there is scant focus on forgetting during pre-training. We systematically explored the existence and measurement of forgetting in pre-training, questioning traditional metrics such as perplexity (PPL) and introducing new metrics to better detect entity memory retention. Based on our revised assessment of forgetting metrics, we explored low-cost, straightforward methods to mitigate forgetting during the pre-training phase. In addition, we carefully analyzed the learning curves, offering insights into the dynamics of forgetting. Extensive evaluations and analyses on forgetting of pre-training could facilitate future research on LLMs.
pdf
bib
abs
Bias in the Mirror : Are LLMs opinions robust to their own adversarial attacks
Virgile Rennard
|
Christos Xypolopoulos
|
Michalis Vazirgiannis
Large language models (LLMs) inherit biases from their training data and alignment processes, influencing their responses in subtle ways. While many studies have examined these biases, little work has explored their robustness during interactions. In this paper, we introduce a novel approach where two instances of an LLM engage in self-debate, arguing opposing viewpoints to persuade a neutral version of the model. Through this, we evaluate how firmly biases hold and whether models are susceptible to reinforcing misinformation or shifting to harmful viewpoints. Our experiments span multiple LLMs of varying sizes, origins, and languages, providing deeper insights into bias persistence and flexibility across linguistic and cultural contexts.
pdf
bib
abs
AndroidLab: Training and Systematic Benchmarking of Android Autonomous Agents
Yifan Xu
|
Xiao Liu
|
Xueqiao Sun
|
Siyi Cheng
|
Hao Yu
|
Hanyu Lai
|
Shudan Zhang
|
Dan Zhang
|
Jie Tang
|
Yuxiao Dong
Autonomous agents have become increasingly important for interacting with the real world. Android agents, in particular, have been a frequently mentioned interaction method. However, existing studies for training and evaluating Android agents lack systematic research on both open-source and closed-source models. In this work, we propose AndroidLab as a systematic Android agent framework. It includes an operation environment with different modalities, an action space, and a reproducible benchmark. It supports both large language models (LLMs) and multimodal models (LMMs) in the same action space. The AndroidLab benchmark includes predefined Android virtual devices and 138 tasks across nine apps built on these devices. By using the AndroidLab environment, we develop an Android Instruction dataset and train six open-source LLMs and LMMs, lifting the average success rates from 4.59% to 21.50% for LLMs and from 1.93% to 13.28% for LMMs. AndroidLab is open-sourced and publicly available at https://github.com/THUDM/Android-Lab.
pdf
bib
abs
Modular Sentence Encoders: Separating Language Specialization from Cross-Lingual Alignment
Yongxin Huang
|
Kexin Wang
|
Goran Glavaš
|
Iryna Gurevych
Multilingual sentence encoders (MSEs) are commonly obtained by training multilingual language models to map sentences from different languages into a shared semantic space. As such, they are subject to the curse of multilinguality, a loss of monolingual representational accuracy due to parameter sharing. Another limitation of MSEs is the trade-off between performance on different tasks: cross-lingual alignment training distorts the optimal monolingual structure of semantic spaces of individual languages, harming the utility of sentence embeddings in monolingual tasks; cross-lingual tasks, such as cross-lingual semantic similarity and zero-shot transfer for sentence classification, may also require conflicting cross-lingual alignment strategies. In this work, we address both issues by means of modular training of sentence encoders. We first train language-specific monolingual modules to mitigate negative interference between languages (i.e., the curse). We then align all non-English sentence embeddings to the English ones by training cross-lingual alignment adapters, preventing interference with monolingual specialization from the first step. We train the cross-lingual adapters with two different types of data to resolve the conflicting requirements of different cross-lingual tasks. Monolingual and cross-lingual results on semantic text similarity and relatedness, bitext mining and sentence classification show that our modular solution achieves better and more balanced performance across all the tasks compared to full-parameter training of monolithic multilingual sentence encoders, especially benefiting low-resource languages.
pdf
bib
abs
Multimodal Transformers are Hierarchical Modal-wise Heterogeneous Graphs
Yijie Jin
|
Junjie Peng
|
Xuanchao Lin
|
Haochen Yuan
|
Lan Wang
|
Cangzhi Zheng
Multimodal Sentiment Analysis (MSA) is a rapidly developing field that integrates multimodal information to recognize sentiments, and existing models have made significant progress in this area. The central challenge in MSA is multimodal fusion, which is predominantly addressed by Multimodal Transformers (MulTs). Although they serve as the dominant paradigm, MulTs suffer from efficiency concerns. In this work, from the perspective of efficiency optimization, we propose and prove that MulTs are hierarchical modal-wise heterogeneous graphs (HMHGs), and we introduce the graph-structured representation pattern of MulTs. Based on this pattern, we propose an Interlaced Mask (IM) mechanism to design the Graph-Structured and Interlaced-Masked Multimodal Transformer (GsiT). GsiT is formally equivalent to MulTs and achieves an efficient weight-sharing mechanism without information disorder through IM, enabling All-Modal-In-One fusion with only 1/3 of the parameters of pure MulTs. A kernel called Decomposition is implemented to avoid additional computational overhead. Moreover, GsiT achieves significantly higher performance than traditional MulTs. To further validate the effectiveness of GsiT itself and the HMHG concept, we integrate them into multiple state-of-the-art models and demonstrate notable performance improvements and parameter reduction on widely used MSA datasets. Experimental results also demonstrate its effectiveness on other multimodal tasks. The code is available at https://github.com/drewjin/GsiT.git.
pdf
bib
abs
Have We Designed Generalizable Structural Knowledge Promptings? Systematic Evaluation and Rethinking
Yichi Zhang
|
Zhuo Chen
|
Lingbing Guo
|
Yajing Xu
|
Shaokai Chen
|
Mengshu Sun
|
Binbin Hu
|
Zhiqiang Zhang
|
Lei Liang
|
Wen Zhang
|
Huajun Chen
Large language models (LLMs) have demonstrated exceptional performance in text generation within current NLP research. However, the lack of factual accuracy is still a dark cloud hanging over the LLM skyscraper. Structural knowledge prompting (SKP) is a prominent paradigm to integrate external knowledge into LLMs by incorporating structural representations, achieving state-of-the-art results in many knowledge-intensive tasks. However, existing methods often focus on specific problems, lacking a comprehensive exploration of the generalization and capability boundaries of SKP. This paper aims to evaluate and rethink the generalization capability of the SKP paradigm from four perspectives including Granularity, Transferability, Scalability, and Universality. To provide a thorough evaluation, we introduce a novel multi-granular, multi-level benchmark called SUBARU, consisting of 9 different tasks with varying levels of granularity and difficulty. Through extensive experiments, we draw key conclusions regarding the generalization of SKP, offering insights to guide the future development and extension of the SKP paradigm.
pdf
bib
abs
LLäMmlein: Transparent, Compact and Competitive German-Only Language Models from Scratch
Jan Pfister
|
Julia Wunderle
|
Andreas Hotho
We transparently create two German-only decoder models, LLäMmlein 120M and 1B, from scratch and publish them, along with the training data, for the (German) NLP research community to use. The model training involved several key steps, including data preprocessing/filtering, the creation of a German tokenizer, the training itself, as well as the evaluation of the final models on various benchmarks, also against existing models. Throughout the training process, multiple checkpoints were saved in equal intervals and analyzed using the German SuperGLEBer benchmark to gain insights into the models’ learning process. Compared to state-of-the-art models on the SuperGLEBer benchmark, both LLäMmlein models performed competitively, consistently matching or surpassing models with similar parameter sizes. The results show that the models’ quality scales with size as expected, but performance improvements on some tasks plateaued early during training, offering valuable insights into resource allocation for future models.
pdf
bib
abs
Speaking Beyond Language: A Large-Scale Multimodal Dataset for Learning Nonverbal Cues from Video-Grounded Dialogues
Youngmin Kim
|
Jiwan Chung
|
Jisoo Kim
|
Sunghyun Lee
|
Sangkyu Lee
|
Junhyeok Kim
|
Cheoljong Yang
|
Youngjae Yu
Nonverbal communication is integral to human interaction, with gestures, facial expressions, and body language conveying critical aspects of intent and emotion. However, existing large language models (LLMs) fail to effectively incorporate these nonverbal elements, limiting their capacity to create fully immersive conversational experiences. We introduce MARS, a multimodal language model designed to understand and generate nonverbal cues alongside text, bridging this gap in conversational AI. Our key innovation is VENUS, a large-scale dataset comprising annotated videos with time-aligned text, facial expressions, and body language. Leveraging VENUS, we train MARS with a next-token prediction objective, combining text with vector-quantized nonverbal representations to achieve multimodal understanding and generation within a unified framework. Based on various analyses of the VENUS datasets, we validate its substantial scale and high effectiveness. Our quantitative and qualitative results demonstrate that MARS successfully generates text and nonverbal languages, corresponding to conversational input. Our dataset and code are available at https://github.com/winston1214/nonverbal-conversation.
pdf
bib
abs
How Much Do Encoder Models Know About Word Senses?
Simone Teglia
|
Simone Tedeschi
|
Roberto Navigli
Word Sense Disambiguation (WSD) is a key task in Natural Language Processing (NLP), involving selecting the correct meaning of a word based on its context. With Pretrained Language Models (PLMs) like BERT and DeBERTa now well established, significant progress has been made in understanding contextual semantics. Nevertheless, how well these models inherently disambiguate word senses remains uncertain. In this work, we evaluate several encoder-only PLMs across two popular inventories (i.e. WordNet and the Oxford Dictionary of English) by analyzing their ability to separate word senses without any task-specific fine-tuning. We compute centroids of word senses and measure similarity to assess performance across different layers. Our results show that DeBERTa-v3 delivers the best performance on the task, with the middle layers (specifically the 7th and 8th layers) achieving the highest accuracy, outperforming the output layer by approximately 15 percentage points. Our experiments also explore the inherent structure of WordNet and ODE sense inventories, highlighting their influence on the overall model behavior and performance. Finally, based on our findings, we develop a small, efficient model for the WSD task that attains robust performance while significantly reducing the carbon footprint.
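The centroid-and-similarity procedure mentioned in the abstract can be sketched as a nearest-centroid classifier. This is a minimal illustration under stated assumptions: the three-dimensional vectors stand in for contextual embeddings taken from some encoder layer, the sense labels are made up, cosine similarity is assumed as the similarity measure, and the function names are hypothetical rather than taken from the paper.

```python
import math
from collections import defaultdict


def mean_vector(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def build_centroids(examples):
    """Average the embeddings of each sense; examples are (sense, vector) pairs."""
    grouped = defaultdict(list)
    for sense, vector in examples:
        grouped[sense].append(vector)
    return {sense: mean_vector(vectors) for sense, vectors in grouped.items()}


def predict_sense(centroids, vector):
    """Pick the sense whose centroid is most similar to the query embedding."""
    return max(centroids, key=lambda sense: cosine(centroids[sense], vector))


# Toy embeddings (assumed values) for two senses of "bank".
train = [
    ("bank%river", [0.9, 0.1, 0.0]),
    ("bank%river", [0.8, 0.2, 0.1]),
    ("bank%finance", [0.1, 0.9, 0.2]),
    ("bank%finance", [0.0, 0.8, 0.3]),
]
centroids = build_centroids(train)
prediction = predict_sense(centroids, [0.7, 0.3, 0.0])
```

Repeating this with embeddings from each layer in turn would give one accuracy figure per layer, which is the kind of layer-wise comparison the abstract reports.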
pdf
bib
abs
When Backdoors Speak: Understanding LLM Backdoor Attacks Through Model-Generated Explanations
Huaizhi Ge
|
Yiming Li
|
Qifan Wang
|
Yongfeng Zhang
|
Ruixiang Tang
Large Language Models (LLMs) are known to be vulnerable to backdoor attacks, where triggers embedded in poisoned samples can maliciously alter LLMs’ behaviors. In this paper, we move beyond attacking LLMs and instead examine backdoor attacks through the novel lens of natural language explanations. Specifically, we leverage LLMs’ generative capabilities to produce human-readable explanations for their decisions, enabling direct comparisons between explanations for clean and poisoned samples. Our results show that backdoored models produce coherent explanations for clean inputs but diverse and logically flawed explanations for poisoned data, a pattern consistent across classification and generation tasks for different backdoor attacks. Further analysis reveals key insights into the explanation generation process. At the token level, explanation tokens associated with poisoned samples only appear in the final few transformer layers. At the sentence level, attention dynamics indicate that poisoned inputs shift attention away from the original input context during explanation generation. These findings enhance our understanding of backdoor mechanisms in LLMs and present a promising framework for detecting vulnerabilities through explainability.
pdf
bib
abs
HateDay: Insights from a Global Hate Speech Dataset Representative of a Day on Twitter
Manuel Tonneau
|
Diyi Liu
|
Niyati Malhotra
|
Scott A. Hale
|
Samuel Fraiberger
|
Victor Orozco-Olvera
|
Paul Röttger
To address the global challenge of online hate speech, prior research has developed detection models to flag such content on social media. However, due to systematic biases in evaluation datasets, the real-world effectiveness of these models remains unclear, particularly across geographies. We introduce HateDay, the first global hate speech dataset representative of social media settings, constructed from a random sample of all tweets posted on September 21, 2022 and covering eight languages and four English-speaking countries. Using HateDay, we uncover substantial variation in the prevalence and composition of hate speech across languages and regions. We show that evaluations on academic datasets greatly overestimate real-world detection performance, which we find is very low, especially for non-European languages. Our analysis identifies key drivers of this gap, including models’ difficulty in distinguishing hate from offensive speech and a mismatch between the target groups emphasized in academic datasets and those most frequently targeted in real-world settings. We argue that poor model performance makes public models ill-suited for automatic hate speech moderation and find that high moderation rates are only achievable with substantial human oversight. Our results underscore the need to evaluate detection systems on data that reflects the complexity and diversity of real-world social media.
pdf
bib
abs
LegalAgentBench: Evaluating LLM Agents in Legal Domain
Haitao Li
|
Junjie Chen
|
Jingli Yang
|
Qingyao Ai
|
Wei Jia
|
Youfeng Liu
|
Kai Lin
|
Yueyue Wu
|
Guozhi Yuan
|
Yiran Hu
|
Wuyue Wang
|
Yiqun Liu
|
Minlie Huang
With the increasing intelligence and autonomy of LLM Agents, their potential applications in the legal domain are becoming increasingly apparent. However, existing general-domain benchmarks are unable to fully capture the complexity and subtle nuances inherent in real-world judicial cognition and decision-making. Therefore, we propose LegalAgentBench, a comprehensive benchmark specifically designed to evaluate LLM Agents in the Chinese legal domain. LegalAgentBench includes 17 corpora from real-world legal scenarios and provides 37 tools for interacting with external knowledge. To cover tasks of varying difficulty and types, we designed a scalable task construction process that enables a more precise evaluation of performance in both tool utilization and reasoning. Moreover, beyond assessing performance through the success rate of final outcomes, LegalAgentBench incorporates keyword analysis during intermediate processes to calculate progress rates, facilitating a more fine-grained evaluation. We evaluated eight popular LLMs, highlighting the strengths, limitations, and potential areas for improvement of existing models and methods. LegalAgentBench sets a new benchmark for the practical application of LLMs in the legal domain, with its code and data available at https://github.com/CSHaitao/LegalAgentBench.
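One plausible reading of a keyword-based progress rate is the fraction of expected intermediate keywords that appear somewhere in the agent's trace. The sketch below follows that reading. The abstract does not give the exact formula, so the matching rule (case-insensitive substring match), the function name `progress_rate`, and the example trace and keywords are all assumptions made for illustration.

```python
def progress_rate(trace, keywords):
    """Fraction of expected keywords found (case-insensitively) in the agent trace.

    trace: list of strings, one per intermediate step.
    keywords: list of strings that a correct solution path is expected to mention.
    """
    if not keywords:
        return 0.0
    text = " ".join(trace).lower()
    hits = sum(1 for keyword in keywords if keyword.lower() in text)
    return hits / len(keywords)


# Hypothetical trace: the agent found two of the three expected facts.
trace = [
    "Looked up the company registration record for Acme Ltd.",
    "Found the legal representative: Zhang Wei.",
]
keywords = ["Acme Ltd", "Zhang Wei", "case number"]
rate = progress_rate(trace, keywords)
```

A score like this gives partial credit to a run that fails the final answer but retrieves most of the required intermediate facts, which a binary success rate cannot show.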
pdf
bib
abs
Inference Compute-Optimal Video Vision Language Models
Peiqi Wang
|
ShengYun Peng
|
Xuewen Zhang
|
Hanchao Yu
|
Yibo Yang
|
Lifu Huang
|
Fujun Liu
|
Qifan Wang
This work investigates the optimal allocation of inference compute across three key scaling factors in video vision language models: language model size, frame count, and the number of visual tokens per frame. While prior work typically focuses on optimizing model efficiency or improving performance without considering resource constraints, we instead identify the optimal model configuration under fixed inference compute budgets. We conduct large-scale training sweeps and careful parametric modeling of task performance to identify the inference compute-optimal frontier. Our experiments reveal how task performance depends on scaling factors and finetuning data size, as well as how changes in data size shift the compute-optimal frontier. These findings translate to practical tips for selecting these scaling factors.
pdf
bib
abs
Steering into New Embedding Spaces: Analyzing Cross-Lingual Alignment Induced by Model Interventions in Multilingual Language Models
Anirudh Sundar
|
Sinead Williamson
|
Katherine Metcalf
|
Barry-John Theobald
|
Skyler Seto
|
Masha Fedzechkina
Aligned representations across languages are a desired property in multilingual large language models (mLLMs), as alignment can improve performance in cross-lingual tasks. Typically, alignment requires fine-tuning a model, which is computationally expensive, and sizable language data, which often may not be available. A data-efficient alternative to fine-tuning is model interventions, a method for manipulating model activations to steer generation in the desired direction. We analyze the effect of a popular intervention (finding experts) on the alignment of cross-lingual representations in mLLMs. We identify the neurons to manipulate for a given language and introspect the embedding space of mLLMs pre- and post-manipulation. We show that modifying the mLLM’s activations changes its embedding space such that cross-lingual alignment is enhanced. Further, we show that the changes to the embedding space translate into improved downstream performance on retrieval tasks, with up to 2x improvements in top-1 accuracy on cross-lingual retrieval.
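For readers unfamiliar with the metric, top-1 accuracy on cross-lingual retrieval can be computed as sketched below: each query sentence embedding retrieves its nearest candidate in the other language, and a hit is counted when that candidate is its translation. The two-dimensional embeddings, the use of cosine similarity, the index-aligned gold pairing, and the function names are assumptions for this example, not details from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def top1_accuracy(queries, candidates):
    """Share of queries whose most similar candidate is the gold translation.

    queries[i] is assumed to be a translation of candidates[i].
    """
    correct = 0
    for i, query in enumerate(queries):
        best = max(range(len(candidates)), key=lambda j: cosine(query, candidates[j]))
        if best == i:
            correct += 1
    return correct / len(queries)


# Toy sentence embeddings (assumed values) for three parallel sentence pairs.
english = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
german = [[0.9, 0.1], [0.2, 0.8], [0.1, 0.9]]
accuracy = top1_accuracy(german, english)
```

In this toy setup the third German vector sits closer to the second English vector than to its own translation, so two of the three queries are retrieved correctly.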
pdf
bib
abs
Digital Gatekeepers: Google’s Role in Curating Hashtags and Subreddits
Amrit Poudel
|
Yifan Ding
|
Tim Weninger
|
Jürgen Pfeffer
Search engines play a crucial role as digital gatekeepers, shaping the visibility of Web and social media content through algorithmic curation. This study investigates how search engines like Google selectively promote or suppress certain hashtags and subreddits, impacting the information users encounter. By comparing search engine results with nonsampled data from Reddit and Twitter/X, we reveal systematic biases in content visibility. Google’s algorithms tend to suppress subreddits and hashtags related to sexually explicit material, conspiracy theories, advertisements, and cryptocurrencies, while promoting content associated with higher engagement. These findings suggest that Google’s gatekeeping practices influence public discourse by curating the social media narratives available to users.
pdf
bib
abs
Behind Closed Words: Creating and Investigating the forePLay Annotated Dataset for Polish Erotic Discourse
Anna Kołos
|
Katarzyna Lorenc
|
Emilia Wiśnios
|
Agnieszka Karlińska
The surge in online content has created an urgent demand for robust detection systems, especially in non-English contexts where current tools demonstrate significant limitations. We introduce forePLay, a novel Polish-language dataset for erotic content detection, comprising over 24,000 annotated sentences. The dataset features a multidimensional taxonomy that captures ambiguity, violence, and socially unacceptable behaviors. Our comprehensive evaluation demonstrates that specialized Polish language models achieve superior performance compared to multilingual alternatives, with transformer-based architectures showing particular strength in handling imbalanced categories. The dataset and accompanying analysis establish essential frameworks for developing linguistically-aware content moderation systems, while highlighting critical considerations for extending such capabilities to morphologically complex languages.
pdf
bib
abs
Assessment and manipulation of latent constructs in pre-trained language models using psychometric scales
Maor Reuben
|
Ortal Slobodin
|
Idan-Chaim Cohen
|
Aviad Elyashar
|
Orna Braun-Lewensohn
|
Odeya Cohen
|
Rami Puzis
Human-like personality traits have recently been discovered in large language models, raising the hypothesis that their (known and as yet undiscovered) biases conform with human latent psychological constructs. While large conversational models may be tricked into answering psychometric questionnaires, the latent psychological constructs of thousands of simpler transformers, trained for other tasks, cannot be assessed because appropriate psychometric methods are currently lacking. Here, we show how standard psychological questionnaires can be reformulated into natural language inference prompts, and we provide a code library to support the psychometric assessment of arbitrary models. We demonstrate, using a sample of 88 publicly available models, the existence of human-like mental health-related constructs—including anxiety, depression, and the sense of coherence—which conform with standard theories in human psychology and show similar correlations and mitigation strategies. The ability to interpret and rectify the performance of language models by using psychological tools can boost the development of more explainable, controllable, and trustworthy models.
pdf
bib
abs
Did Translation Models Get More Robust Without Anyone Even Noticing?
Ben Peters
|
Andre Martins
Neural machine translation (MT) models achieve strong results across a variety of settings, but it is widely believed that they are highly sensitive to “noisy” inputs, such as spelling errors, abbreviations, and other formatting issues. In this paper, we revisit this insight in light of recent multilingual MT models and large language models (LLMs) applied to machine translation. Somewhat surprisingly, we show through controlled experiments that these models are far more robust to many kinds of noise than previous models, even when they perform similarly on clean data. This is notable because, even though LLMs have more parameters and more complex training processes than past models, none of the open ones we consider use any techniques specifically designed to encourage robustness. Next, we show that similar trends hold for social media translation experiments – LLMs are more robust to social media text. We include an analysis of the circumstances in which source correction techniques can be used to mitigate the effects of noise. Altogether, we show that robustness to many types of noise has increased.
pdf
bib
abs
Nemotron-CC: Transforming Common Crawl into a Refined Long-Horizon Pretraining Dataset
Dan Su
|
Kezhi Kong
|
Ying Lin
|
Joseph Jennings
|
Brandon Norick
|
Markus Kliegl
|
Mostofa Patwary
|
Mohammad Shoeybi
|
Bryan Catanzaro
Recent English Common Crawl datasets like FineWeb-Edu and DCLM achieved significant benchmark gains via aggressive model-based filtering, but at the cost of removing 90% of data. This limits their suitability for long token horizon training, such as 15T tokens for Llama 3.1. In this paper, we show how to achieve better trade-offs between accuracy and data quantity by a combination of classifier ensembling, synthetic data rephrasing, and reduced reliance on heuristic filters. When training 8B parameter models for 1T tokens, using a high-quality subset of our data improves MMLU by 5.6 over DCLM, demonstrating the efficacy of our methods for boosting accuracies over a relatively short token horizon. Furthermore, our full 6.3T token dataset matches DCLM on MMLU, but contains four times more unique real tokens than DCLM. This unlocks state-of-the-art training over a long token horizon: an 8B parameter model trained for 15T tokens, of which 7.2T came from our dataset, is better than the Llama 3.1 8B model: +5 on MMLU, +3.1 on ARC-Challenge, and +0.5 on average across ten diverse tasks. The dataset is available at https://data.commoncrawl.org/contrib/Nemotron/Nemotron-CC/index.html.
pdf
bib
abs
Hierarchical Level-Wise News Article Clustering via Multilingual Matryoshka Embeddings
Hans William Alexander Hanley
|
Zakir Durumeric
Contextual large language model embeddings are increasingly utilized for topic modeling and clustering. However, current methods often scale poorly, rely on opaque similarity metrics, and struggle in multilingual settings. In this work, we present a novel, scalable, interpretable, hierarchical, and multilingual approach to clustering news articles and social media data. To do this, we first train multilingual Matryoshka embeddings that can determine story similarity at varying levels of granularity based on which subset of the dimensions of the embeddings is examined. This embedding model achieves state-of-the-art performance on the SemEval 2022 Task 8 test dataset (Pearson 𝜌 = 0.816). Once the model is trained, we develop an efficient hierarchical clustering algorithm that leverages the hierarchical nature of Matryoshka embeddings to identify unique news stories, narratives, and themes. We conclude by illustrating how our approach can identify and cluster stories, narratives, and overarching themes within real-world news datasets.
pdf
bib
abs
Contrastive Perplexity for Controlled Generation: An Application in Detoxifying Large Language Models
Tassilo Klein
|
Moin Nabi
The generation of toxic content by large language models (LLMs) remains a critical challenge for the safe deployment of language technology. We propose a novel framework for implicit knowledge editing and controlled text generation by fine-tuning LLMs with a prototype-based contrastive perplexity objective. Central to our method is the construction of hard negatives: toxic outputs that are generated through adversarial paraphrasing to be similar in semantics and model probability to their non-toxic counterparts. By training on these challenging and realistic pairs, our approach ensures robust and stable contrastive optimization. Experimental results in the domain of detoxification demonstrate that our method significantly reduces toxic generation while maintaining strong performance on downstream tasks such as commonsense reasoning and reading comprehension. Our findings highlight the effectiveness of exploiting hard negatives for attribute-aware fine-tuning.
pdf
bib
abs
INVESTORBENCH: A Benchmark for Financial Decision-Making Tasks with LLM-based Agent
Haohang Li
|
Yupeng Cao
|
Yangyang Yu
|
Shashidhar Reddy Javaji
|
Zhiyang Deng
|
Yueru He
|
Yuechen Jiang
|
Zining Zhu
|
K.p. Subbalakshmi
|
Jimin Huang
|
Lingfei Qian
|
Xueqing Peng
|
Jordan W. Suchow
|
Qianqian Xie
Recent advancements have underscored the potential of large language model (LLM)-based agents in financial decision-making. Despite this progress, the field currently encounters two main challenges: (1) the lack of a comprehensive LLM agent framework adaptable to a variety of financial tasks, and (2) the absence of standardized benchmarks and consistent datasets for assessing agent performance. To tackle these issues, we introduce InvestorBench, the first benchmark specifically designed for evaluating LLM-based agents in diverse financial decision-making contexts. InvestorBench enhances the versatility of LLM-enabled agents by providing a comprehensive suite of tasks applicable to different financial products, including single equities like stocks and cryptocurrencies, and exchange-traded funds (ETFs). Additionally, we assess the reasoning and decision-making capabilities of our agent framework using thirteen different LLMs as backbone models, across various market environments and tasks. Furthermore, we have curated a diverse collection of open-source datasets and developed a comprehensive suite of environments for financial decision-making. This establishes a highly accessible platform for evaluating financial agents’ performance across various scenarios.
pdf
bib
abs
Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Finetuning and Inference
Benjamin Warner
|
Antoine Chaffin
|
Benjamin Clavié
|
Orion Weller
|
Oskar Hallström
|
Said Taghadouini
|
Alexis Gallagher
|
Raja Biswas
|
Faisal Ladhak
|
Tom Aarsen
|
Griffin Thomas Adams
|
Jeremy Howard
|
Iacopo Poli
Encoder-only transformer models such as BERT offer a great performance-size tradeoff for retrieval and classification tasks with respect to larger decoder-only models. Despite being the workhorse of numerous production pipelines, there have been limited Pareto improvements to BERT since its release. In this paper, we introduce ModernBERT, bringing modern model optimizations to encoder-only models and representing a major Pareto improvement over older encoders. Trained on 2 trillion tokens with a native 8192 sequence length, ModernBERT models exhibit state-of-the-art results on a large pool of evaluations encompassing diverse classification tasks and both single and multi-vector retrieval on different domains (including code). In addition to strong downstream performance, ModernBERT is also the most speed and memory efficient encoder and is designed for inference on common GPUs.
pdf
bib
abs
Gender Inclusivity Fairness Index (GIFI): A Multilevel Framework for Evaluating Gender Diversity in Large Language Models
Zhengyang Shan
|
Emily Diana
|
Jiawei Zhou
We present a comprehensive evaluation of gender fairness in large language models (LLMs), focusing on their ability to handle both binary and non-binary genders. While previous studies primarily focus on binary gender distinctions, we introduce the Gender Inclusivity Fairness Index (GIFI), a novel and comprehensive metric that quantifies the diverse gender inclusivity of LLMs. GIFI consists of a wide range of evaluations at different levels, from simply probing the model with respect to provided gender pronouns to testing various aspects of model generation and cognitive behaviors under different gender assumptions, revealing biases associated with varying gender identifiers. We conduct extensive evaluations with GIFI on 20 prominent open-source and proprietary LLMs of varying sizes and capabilities, discovering significant variations in LLMs’ gender inclusivity. Our study highlights the importance of improving LLMs’ inclusivity, providing a critical benchmark for future advancements in gender fairness in generative models.
pdf
bib
abs
D.Va: Validate Your Demonstration First Before You Use It
Qi Zhang
|
Zhiqing Xiao
|
Ruixuan Xiao
|
Lirong Gao
|
Junbo Zhao
In-context learning (ICL) has demonstrated significant potential in enhancing the capabilities of large language models (LLMs) during inference. It’s well-established that ICL heavily relies on selecting effective demonstrations to achieve outputs that better align with the expected results. As for demonstration selection, previous approaches have typically relied on intuitive metrics to evaluate the effectiveness of demonstrations, which often results in limited robustness and poor cross-model generalization capabilities. To tackle these challenges, we propose a novel method, **D**emonstration **Va**lidation (**D.Va**), which integrates a demonstration validation perspective into this field. By introducing the demonstration validation mechanism, our method effectively identifies demonstrations that are both effective and highly generalizable. **D.Va** surpasses all existing retrieval-based in-context learning techniques across both natural language understanding (NLU) and natural language generation (NLG) tasks. Additionally, we demonstrate the robustness and generalizability of our approach across various language models and retrieval models.
pdf
bib
abs
Are Any-to-Any Models More Consistent Across Modality Transfers Than Specialists?
Jiwan Chung
|
Janghan Yoon
|
Junhyeong Park
|
Sangeyl Lee
|
Joowon Yang
|
Sooyeon Park
|
Youngjae Yu
Any-to-any generative models aim to enable seamless interpretation and generation across multiple modalities within a unified framework, yet their ability to preserve relationships across modalities remains uncertain. Do unified models truly achieve cross-modal coherence, or is this coherence merely perceived? To explore this, we introduce ACON, a dataset of 1,000 images (500 newly contributed) paired with captions, editing instructions, and Q&A pairs to evaluate cross-modal transfers rigorously. Using three consistency criteria—cyclic consistency, forward equivariance, and conjugated equivariance—our experiments reveal that any-to-any models do not consistently demonstrate greater cross-modal consistency than specialized models in pointwise evaluations such as cyclic consistency. However, equivariance evaluations uncover weak but observable consistency through structured analyses of the intermediate latent space enabled by multiple editing operations. We release our code and data at https://github.com/JiwanChung/ACON.
pdf
bib
abs
MAIN-RAG: Multi-Agent Filtering Retrieval-Augmented Generation
Chia-Yuan Chang
|
Zhimeng Jiang
|
Vineeth Rakesh
|
Menghai Pan
|
Chin-Chia Michael Yeh
|
Guanchu Wang
|
Mingzhi Hu
|
Zhichao Xu
|
Yan Zheng
|
Mahashweta Das
|
Na Zou
Large Language Models (LLMs) are becoming essential tools for various natural language processing tasks but often suffer from generating outdated or incorrect information. Retrieval-Augmented Generation (RAG) addresses this issue by incorporating external, real-time information retrieval to ground LLM responses. However, existing RAG systems frequently struggle with the quality of retrieved documents, as irrelevant or noisy documents degrade performance, increase computational overhead, and undermine response reliability. To tackle this problem, we propose Multi-Agent Filtering Retrieval-Augmented Generation (MAIN-RAG), a training-free RAG framework that leverages multiple LLM agents to collaboratively filter and score retrieved documents. Specifically, MAIN-RAG introduces an adaptive filtering mechanism that dynamically adjusts the relevance filtering threshold based on score distributions, effectively minimizing noise while maintaining high recall of relevant documents. The proposed approach leverages inter-agent consensus to ensure robust document selection without requiring additional training data or fine-tuning. Experimental results across four QA benchmarks demonstrate that MAIN-RAG consistently outperforms traditional RAG approaches, achieving a 2–11% improvement in answer accuracy while reducing the number of irrelevant retrieved documents. Quantitative analysis further reveals that our approach achieves superior response consistency and answer accuracy over baseline methods, offering a competitive and practical alternative to training-based solutions.
pdf
bib
abs
Unraveling the Mechanics of Learning-Based Demonstration Selection for In-Context Learning
Hui Liu
|
Wenya Wang
|
Hao Sun
|
Chris Xing Tian
|
Chenqi Kong
|
Xin Dong
|
Haoliang Li
Large Language Models (LLMs) have demonstrated impressive in-context learning (ICL) capabilities from few-shot demonstration exemplars. Recent learning-based demonstration selection methods have proven beneficial to ICL by choosing more useful exemplars. While these methods are generally assumed to learn better similarity measurements between exemplars and test cases from the proxy task, what kinds of similarities they capture, and which of these are vital to performing ICL, still need to be explored. To dive into this question, we analyze the working mechanism of learning-based demonstration selection methods and empirically identify two essential factors of their similarity measurements: 1) integrating task-agnostic similarities of different levels between the input of exemplars and test cases; 2) incorporating task-specific similarity between the output of exemplars and test cases. We validate these two findings through extensive quantitative analysis across ten datasets and various LLMs. Based on these insights, we introduce two simplified exemplar selection methods, MLSM and TTF, catering to task-agnostic and task-specific demands to eliminate costly data collection. The effectiveness of both methods further supports our findings and paves the way for future studies.
pdf
bib
abs
Direct Prompt Optimization with Continuous Representations
Yangkun Wang
|
Zihan Wang
|
Jingbo Shang
Prompt optimization for language models faces challenges due to the large discrete search space, the reliance on continuous gradient updates, and the need to round continuous representations into discrete prompts, which causes inflexibility and instability. Existing methods attempt to address these by constraining the search space and adopting greedy, incremental improvements, but they often fail to fully leverage historical gradient information. In this paper, we model the prompt optimization problem by the probability distribution of the prompt and present a novel approach that integrates greedy strategies into optimization with continuous representations. This approach can exploit historical gradient information to address the instability caused by rounding in existing methods. Our study indicates that using continuous representations can improve prompt optimization performance on both text classification and attack tasks across models including GPT-2, OPT, Vicuna, and LLaMA-2, and that the approach adapts to models of different sizes.
pdf
bib
abs
uMedSum: A Unified Framework for Clinical Abstractive Summarization
Aishik Nagar
|
Yutong Liu
|
Andy T. Liu
|
Viktor Schlegel
|
Vijay Prakash Dwivedi
|
Arun-Kumar Kaliya-Perumal
|
Guna Pratheep Kalanchiam
|
Yili Tang
|
Robby T. Tan
Clinical abstractive summarization struggles to balance faithfulness and informativeness, sacrificing key information or introducing confabulations. Techniques like in-context learning and fine-tuning have improved overall summary quality orthogonally, without considering the above issue. Conversely, methods aimed at improving faithfulness and informativeness, such as model reasoning and self-improvement, have not been systematically evaluated in the clinical domain. We address this gap by first performing a comprehensive benchmark and study of six advanced abstractive summarization methods across three datasets using five reference-based and reference-free metrics, with the latter specifically assessing faithfulness and informativeness. Based on these findings, we then develop uMedSum, a modular hybrid framework introducing novel approaches for sequential confabulation removal and key information addition. Our work outperforms previous GPT-4-based state-of-the-art (SOTA) methods in both quantitative metrics and expert evaluations, achieving an 11.8% average improvement in dedicated faithfulness metrics over the previous SOTA. Doctors prefer uMedSum’s summaries six times more often than those of the previous SOTA in difficult cases containing confabulations or missing information. These results highlight uMedSum’s effectiveness and generalizability across various datasets and metrics, marking a significant advancement in clinical summarization. The uMedSum toolkit is available on GitHub.
pdf
bib
abs
GigaSpeech 2: An Evolving, Large-Scale and Multi-domain ASR Corpus for Low-Resource Languages with Automated Crawling, Transcription and Refinement
Yifan Yang
|
Zheshu Song
|
Jianheng Zhuo
|
Mingyu Cui
|
Jinpeng Li
|
Bo Yang
|
Yexing Du
|
Ziyang Ma
|
Xunying Liu
|
Ziyuan Wang
|
Ke Li
|
Shuai Fan
|
Kai Yu
|
Wei-Qiang Zhang
|
Guoguo Chen
|
Xie Chen
The evolution of speech technology has been spurred by the rapid increase in dataset sizes. Traditional speech models generally depend on a large amount of labeled training data, which is scarce for low-resource languages. This paper presents GigaSpeech 2, a large-scale, multi-domain, multilingual speech recognition corpus. It is designed for low-resource languages and does not rely on paired speech and text data. GigaSpeech 2 comprises about 30,000 hours of automatically transcribed speech, including Thai, Indonesian, and Vietnamese, gathered from unlabeled YouTube videos. We also introduce an automated pipeline for data crawling, transcription, and label refinement. Specifically, this pipeline involves Whisper for initial transcription, MMS for forced alignment, and multi-dimensional filtering for data quality assurance. A modified Noisy Student Training is developed to further refine flawed pseudo labels iteratively, thereby enhancing model performance. Experimental results on our manually transcribed evaluation set and two public test sets from Common Voice and FLEURS confirm our corpus’s high quality and broad applicability. Notably, ASR models trained on GigaSpeech 2 can reduce the word error rate for Thai, Indonesian, and Vietnamese on our challenging and realistic YouTube test set by 25% to 40% compared to Whisper large-v3, with merely 10% of the model parameters. Furthermore, our ASR models trained on GigaSpeech 2 yield superior performance compared to commercial services. We hope that our newly introduced corpus and pipeline will open a new avenue for low-resource speech recognition and significantly facilitate research in this area.
pdf
bib
abs
Context-Aware Sentiment Forecasting via LLM-based Multi-Perspective Role-Playing Agents
Fanhang Man
|
Huandong Wang
|
Jianjie Fang
|
Zhaoyi Deng
|
Baining Zhao
|
Xinlei Chen
|
Yong Li
User sentiment on social media reveals underlying social trends, crises, and needs. Researchers have analyzed users’ past messages to track the evolution of sentiments and reconstruct sentiment dynamics. However, predicting the imminent sentiment response of users to ongoing events remains understudied. In this paper, we address the problem of sentiment forecasting on social media to predict users’ future sentiment based on event developments. We extract sentiment-related features to enhance modeling and propose a multi-perspective role-playing framework to simulate human response processes. Our preliminary results show significant improvements in sentiment forecasting at both microscopic and macroscopic levels.
pdf
bib
abs
TARGA: Targeted Synthetic Data Generation for Practical Reasoning over Structured Data
Xiang Huang
|
Jiayu Shen
|
Shanshan Huang
|
Sitao Cheng
|
Xiaxia Wang
|
Yuzhong Qu
Semantic parsing, which converts natural language queries into logic forms, plays a crucial role in reasoning within structured environments. However, existing methods encounter two significant challenges: reliance on extensive manually annotated datasets and limited generalization capability to unseen examples. To tackle these issues, we propose Targeted Synthetic Data Generation (Targa), a practical framework that dynamically generates high-relevance synthetic data without manual annotation. Starting from the pertinent entity and relation of a given question, we probe for potentially relevant queries through layer-wise expansion and cross-layer combination. Then, we generate corresponding natural language questions for these constructed queries to jointly serve as the synthetic demonstrations for in-context learning. Experiments on multiple knowledge-based question answering (KBQA) datasets demonstrate that Targa, using only a 7B-parameter model, substantially outperforms existing non-fine-tuned methods that utilize closed-source models, achieving notable improvements in F1 scores on GrailQA (+7.7) and KBQA-Agent (+12.2). Furthermore, Targa also exhibits superior sample efficiency, robustness, and generalization capabilities under non-I.I.D. settings.
pdf
bib
abs
AndroidGen: Building an Android Language Agent under Data Scarcity
Hanyu Lai
|
Junjie Gao
|
Xiao Liu
|
Yifan Xu
|
Shudan Zhang
|
Yuxiao Dong
|
Jie Tang
Large language models have opened up a world of possibilities for various NLP tasks, sparking optimism for the future. Despite their potential, LLMs have yet to be widely used as agents on real mobile devices. The main challenge is the need for high-quality data sources. Time constraints and labor intensity often hinder human annotation. On the other hand, existing LLMs exhibit inadequate completion rates and need a robust data filtration strategy. Given these challenges, we develop a framework called AndroidGen to enhance the capabilities of LLM-based agents under data scarcity. In addition, we leverage AndroidGen to collect trajectories given human tasks and train open-source LLMs on these trajectories to develop an open-source mobile agent without manually labeled trajectories. We extensively evaluate AndroidGen with AndroidWorld, AitW, and various popular applications, demonstrating its improvements and revealing potential areas for future improvement. Code, model, and data are available at https://github.com/THUDM/AndroidGen.
pdf
bib
abs
Prompt Candidates, then Distill: A Teacher-Student Framework for LLM-driven Data Annotation
Mingxuan Xia
|
Haobo Wang
|
Yixuan Li
|
Zewei Yu
|
Jindong Wang
|
Junbo Zhao
|
Runze Wu
Recently, Large Language Models (LLMs) have demonstrated significant potential for data annotation, markedly reducing the labor costs associated with downstream applications. However, existing methods mostly adopt an aggressive strategy by prompting the LLM to determine a single gold label for each unlabeled sample. Due to the inherent uncertainty within LLMs, they often produce incorrect labels for difficult samples, severely compromising the data quality for downstream applications. Motivated by ambiguity aversion in human behaviors, we propose a novel candidate annotation paradigm wherein large language models are encouraged to output all possible labels when they are uncertain. To ensure unique labels are provided for downstream tasks, we develop a teacher-student framework, CanDist, that distills candidate annotations with a Small Language Model (SLM). We further provide a rigorous justification demonstrating that distilling candidate annotations from the teacher LLM offers superior theoretical guarantees compared to directly using single annotations. Extensive experiments across six text classification tasks validate the effectiveness of our proposed method. The source code is available at https://github.com/MingxuanXia/CanDist.
pdf
bib
abs
A Survey of Post-Training Scaling in Large Language Models
Hanyu Lai
|
Xiao Liu
|
Junjie Gao
|
Jiale Cheng
|
Zehan Qi
|
Yifan Xu
|
Shuntian Yao
|
Dan Zhang
|
Jinhua Du
|
Zhenyu Hou
|
Xin Lv
|
Minlie Huang
|
Yuxiao Dong
|
Jie Tang
Large language models (LLMs) have achieved remarkable proficiency in understanding and generating human natural languages, mainly owing to the “scaling law” that optimizes relationships among language modeling loss, model parameters, and pre-trained tokens. However, with the exhaustion of high-quality internet corpora and increasing computational demands, the sustainability of pre-training scaling needs to be addressed. This paper presents a comprehensive survey of post-training scaling, an emergent paradigm aiming to relieve the limitations of traditional pre-training by focusing on the alignment phase, which traditionally accounts for a minor fraction of the total training computation. Our survey categorizes post-training scaling into three key methodologies: Supervised Fine-tuning (SFT), Reinforcement Learning from Feedback (RLxF), and Test-time Compute (TTC). We provide an in-depth analysis of the motivation behind post-training scaling, the scalable variants of these methodologies, and a comparative discussion against traditional approaches. By examining the latest advancements, identifying promising application scenarios, and highlighting unresolved issues, we seek a coherent understanding and map future research trajectories in the landscape of post-training scaling for LLMs.
pdf
bib
abs
Position-aware Automatic Circuit Discovery
Tal Haklay
|
Hadas Orgad
|
David Bau
|
Aaron Mueller
|
Yonatan Belinkov
A widely used strategy to discover and understand language model mechanisms is circuit analysis. A circuit is a minimal subgraph of a model’s computation graph that executes a specific task. We identify a gap in existing circuit discovery methods: they assume circuits are position-invariant, treating model components as equally relevant across input positions. This limits their ability to capture cross-positional interactions or mechanisms that vary across positions. To address this gap, we propose two improvements to incorporate positionality into circuits, even on tasks containing variable-length examples. First, we extend edge attribution patching, a gradient-based method for circuit discovery, to differentiate between token positions. Second, we introduce the concept of a dataset schema, which defines token spans with similar semantics across examples, enabling position-aware circuit discovery in datasets with variable-length examples. We additionally develop an automated pipeline for schema generation and application using large language models. Our approach enables fully automated discovery of position-sensitive circuits, yielding better trade-offs between circuit size and faithfulness compared to prior work.
pdf
bib
abs
HyperFM: Fact-Centric Multimodal Fusion for Link Prediction over Hyper-Relational Knowledge Graphs
Yuhuan Lu
|
Weijian Yu
|
Xin Jing
|
Dingqi Yang
With the ubiquity of hyper-relational facts in modern Knowledge Graphs (KGs), existing link prediction techniques mostly focus on learning the sophisticated relationships among multiple entities and relations contained in a fact, while ignoring the multimodal information, which often provides additional clues to boost link prediction performance. Nevertheless, traditional multimodal fusion approaches, which are mainly designed for triple facts under either entity-centric or relation-guided fusion schemes, fail to integrate the multimodal information with the rich context of the hyper-relational fact consisting of multiple entities and relations. Against this background, we propose **HyperFM**, a **Hyper**-relational **F**act-centric **M**ultimodal Fusion technique. It effectively captures the intricate interactions between different data modalities while accommodating the hyper-relational structure of the KG in a fact-centric manner via a customized Hypergraph Transformer. We evaluate HyperFM against a sizeable collection of baselines in link prediction tasks on two real-world KG datasets. Results show that HyperFM consistently achieves the best performance, yielding an average improvement of 6.0-6.8% over the best-performing baselines on the two datasets. Moreover, a series of ablation studies systematically validates our fact-centric fusion scheme.
pdf
bib
abs
Centurio: On Drivers of Multilingual Ability of Large Vision-Language Model
Gregor Geigle
|
Florian Schneider
|
Carolin Holtermann
|
Chris Biemann
|
Radu Timofte
|
Anne Lauscher
|
Goran Glavaš
Most Large Vision-Language Models (LVLMs) to date are trained predominantly on English data, which makes them struggle to understand non-English input and fail to generate output in the desired target language. Existing efforts mitigate these issues by adding multilingual training data, but do so in a largely ad-hoc manner, lacking insight into how different training mixes tip the scale for different groups of languages. In this work, we present a comprehensive investigation into the training strategies for massively multilingual LVLMs. First, we conduct a series of multi-stage experiments spanning 13 downstream vision-language tasks and 43 languages, systematically examining: (1) the number of training languages that can be included without degrading English performance and (2) optimal language distributions of pre-training as well as (3) instruction-tuning data. Further, we (4) investigate how to improve multilingual text-in-image understanding, and introduce a new benchmark for the task. Surprisingly, our analysis reveals that one can (i) include as many as 100 training languages simultaneously (ii) with as little as 25-50% of non-English data, to greatly improve multilingual performance while retaining strong English performance. We further find that (iii) including non-English OCR data in pre-training and instruction-tuning is paramount for improving multilingual text-in-image understanding. Finally, we put all our findings together and train Centurio, a 100-language LVLM, offering state-of-the-art performance in an evaluation covering 14 tasks and 56 languages.
pdf
bib
abs
Less for More: Enhanced Feedback-aligned Mixed LLMs for Molecule Caption Generation and Fine-Grained NLI Evaluation
Dimitris Gkoumas
|
Maria Liakata
Scientific language models drive research innovation but require extensive fine-tuning on large datasets. This work enhances such models by improving their inference and evaluation capabilities with minimal or no additional training. Focusing on molecule caption generation, we explore post-training synergies between alignment fine-tuning and model merging in a cross-modal setup. We reveal intriguing insights into the behaviour and suitability of such methods while significantly surpassing state-of-the-art models. Moreover, we propose a novel atomic-level evaluation method leveraging off-the-shelf Natural Language Inference (NLI) models for use in the unseen chemical domain. Our experiments demonstrate that our evaluation operates at the right level of granularity, effectively handling multiple content units and subsentence reasoning, while widely adopted NLI methods consistently misalign with assessment criteria.
pdf
bib
abs
Ensemble Watermarks for Large Language Models
Georg Niess
|
Roman Kern
As large language models (LLMs) reach human-like fluency, reliably distinguishing AI-generated text from human authorship becomes increasingly difficult. While watermarks already exist for LLMs, they often lack flexibility and struggle with attacks such as paraphrasing. To address these issues, we propose a multi-feature method for generating watermarks that combines multiple distinct watermark features into an ensemble watermark. Concretely, we combine acrostica and sensorimotor norms with the established red-green watermark to achieve a 98% detection rate. After a paraphrasing attack, the performance remains high with a 95% detection rate. In comparison, the red-green feature alone as a baseline achieves a detection rate of 49% after paraphrasing. The evaluation of all feature combinations reveals that the ensemble of all three consistently has the highest detection rate across several LLMs and watermark strength settings. Due to the flexibility of combining features in the ensemble, various requirements and trade-offs can be addressed. Additionally, the same detection function can be used without adaptations for all ensemble configurations. This method is of particular interest for facilitating accountability and preventing societal harm.
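For readers unfamiliar with the red-green feature mentioned in this abstract, the following is a minimal sketch of how a red-green detector is commonly scored: count the tokens that fall in a pseudo-random "green" set and compare that count with what chance would predict. The hashing rule in `is_green`, the green fraction `gamma`, and both function names are illustrative assumptions, not the authors' implementation, and the acrostic and sensorimotor features are not modeled here.

```python
import hashlib
import math

def is_green(prev_token: str, token: str, gamma: float = 0.5) -> bool:
    """Assumed partition rule: hash the (previous, current) token pair
    and call the token green if the first digest byte falls below gamma."""
    digest = hashlib.sha256(f"{prev_token}|{token}".encode()).digest()
    return digest[0] / 256 < gamma

def green_z_score(tokens, gamma: float = 0.5) -> float:
    """One-proportion z-score of the observed green count against the
    count expected for unwatermarked text (gamma * number of pairs)."""
    pairs = list(zip(tokens, tokens[1:]))
    if not pairs:
        return 0.0
    n = len(pairs)
    greens = sum(is_green(a, b, gamma) for a, b in pairs)
    return (greens - gamma * n) / math.sqrt(n * gamma * (1 - gamma))

# Hypothetical usage: a large positive z-score would suggest a watermark.
z = green_z_score("the quick brown fox jumps over the lazy dog".split())
```

A detector would flag text whose z-score exceeds a chosen threshold; an ensemble could, under similar assumptions, combine several such per-feature statistics.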
pdf
bib
abs
Con Instruction: Universal Jailbreaking of Multimodal Large Language Models via Non-Textual Modalities
Jiahui Geng
|
Thy Thy Tran
|
Preslav Nakov
|
Iryna Gurevych
Existing attacks against multimodal language models often communicate instruction through text, either as an explicit malicious instruction or a crafted generic prompt, and accompanied by a toxic image. In contrast, here we exploit the capabilities of MLLMs in following non-textual instruction, i.e., an adversarial image or audio, namely Con Instruction. It is a novel gray-box attack method that generates adversarial images or audio to convey specific harmful instructions to MLLMs. We also find that combining our adversarial examples with certain non-empty text inputs amplifies attack success, while appending these after malicious text has limited effects. To evaluate whether an attack is successful, we introduce a new attack response categorization (ARC) that considers the response quality and relevancy concerning the malicious instruction. The results show that Con Instruction effectively bypasses the safety mechanisms in various visual and audio-language models, including LLaVA-v1.5, InternVL, Qwen-VL, and Qwen-Audio, across two standard benchmarks: AdvBench and SafeBench. Specifically, our method achieves the highest attack success rates, reaching 81.3% and 86.6% on LLaVA-v1.5 (13B). We show that larger models are more susceptible to Con Instruction, contrasting observations in their underlying LLMs. On the defense side, we explore various methods against our attacks and find substantial gaps among existing techniques. The code will be made available upon publication.
pdf
bib
abs
TRACT: Regression-Aware Fine-tuning Meets Chain-of-Thought Reasoning for LLM-as-a-Judge
Cheng-Han Chiang
|
Hung-yi Lee
|
Michal Lukasik
The LLM-as-a-judge paradigm uses large language models (LLMs) for automated text evaluation, assigning a score to the input based on scoring rubrics. Existing methods for fine-tuning LLM-as-a-judge use cross-entropy (CE) loss, which neglects the numeric nature of score prediction. Recent work addresses numerical prediction limitations of LLM fine-tuning through regression-aware fine-tuning but does not consider chain-of-thought (CoT) reasoning for score prediction. In this paper, we introduce TRACT (Two-stage Regression-Aware fine-tuning with CoT), which combines CoT reasoning with regression-aware training. TRACT uses a two-stage process: first, it fine-tunes the seed LLM to generate CoTs, which serve as the training data for the second stage; next, it uses these self-generated CoTs to retrain the seed LLM. The fine-tuning objective of TRACT applies CE loss for CoT reasoning and regression-aware loss for the score. Experiments across four LLM-as-a-judge datasets and two LLMs show that TRACT significantly outperforms existing methods. Extensive ablation studies validate the effectiveness of each component in TRACT.
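The objective described in this abstract (CE loss on the CoT tokens plus a regression-aware loss on the score) can be illustrated with a small numeric sketch. It assumes that the regression-aware term is a squared error between the gold score and the probability-weighted mean over candidate scores; that choice, the `weight` parameter, and all function names are assumptions for illustration rather than the paper's exact formulation.

```python
import math

def cross_entropy(token_probs):
    """Mean negative log-likelihood of the gold CoT tokens, given the
    probability the model assigned to each gold token."""
    return -sum(math.log(p) for p in token_probs) / len(token_probs)

def expected_score(score_probs):
    """Probability-weighted mean over candidate scores,
    e.g. {1: 0.1, 2: 0.2, 3: 0.7}; probabilities are renormalized."""
    total = sum(score_probs.values())
    return sum(score * p for score, p in score_probs.items()) / total

def combined_loss(cot_token_probs, score_probs, gold_score, weight=1.0):
    """CE on the reasoning tokens plus an assumed squared-error term
    on the expected score."""
    ce = cross_entropy(cot_token_probs)
    regression = (expected_score(score_probs) - gold_score) ** 2
    return ce + weight * regression

# Hypothetical usage with made-up probabilities.
loss = combined_loss([0.9, 0.8, 0.95], {3: 0.2, 4: 0.6, 5: 0.2}, gold_score=4)
```

Unlike plain CE on the score token, the squared-error term penalizes a prediction of 1 more than a prediction of 3 when the gold score is 4, which is the numeric structure the abstract says CE neglects.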
pdf
bib
abs
DioR: Adaptive Cognitive Detection and Contextual Retrieval Optimization for Dynamic Retrieval-Augmented Generation
Hanghui Guo
|
Jia Zhu
|
Shimin Di
|
Weijie Shi
|
Zhangze Chen
|
Jiajie Xu
Dynamic Retrieval-augmented Generation (RAG) has shown great success in mitigating hallucinations in large language models (LLMs) during generation. However, existing dynamic RAG methods face significant limitations in two key aspects: 1) Lack of an effective mechanism to control retrieval triggers, and 2) Lack of effective scrutiny of retrieval content. To address these limitations, we propose an innovative dynamic RAG method, DioR (Adaptive Cognitive Detection and Contextual Retrieval Optimization), which consists of two main components: adaptive cognitive detection and contextual retrieval optimization, designed to determine when retrieval is needed and which retrieved content is useful for LLMs. Experimental results demonstrate that DioR achieves superior performance on all tasks, demonstrating the effectiveness of our work.
pdf
bib
abs
Unveiling the Power of Source: Source-based Minimum Bayes Risk Decoding for Neural Machine Translation
Boxuan Lyu
|
Hidetaka Kamigaito
|
Kotaro Funakoshi
|
Manabu Okumura
Maximum a posteriori decoding, a commonly used method for neural machine translation (NMT), aims to maximize the estimated posterior probability. However, high estimated probability does not always lead to high translation quality. Minimum Bayes Risk (MBR) decoding offers an alternative by seeking hypotheses with the highest expected utility. Inspired by Quality Estimation (QE) reranking which uses the QE model as a ranker, we propose source-based MBR (sMBR) decoding, a novel approach that utilizes quasi-sources (generated via paraphrasing or back-translation) as “support hypotheses” and a reference-free quality estimation metric as the utility function, marking the first work to solely use sources in MBR decoding. Experiments show that sMBR outperforms QE reranking and the standard MBR decoding. Our findings suggest that sMBR is a promising approach for NMT decoding.
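The selection rule behind MBR decoding, picking the hypothesis with the highest expected utility over a support set, can be sketched in a few lines. The unigram-F1 `overlap_utility` below is a toy stand-in chosen only to keep the example self-contained; in sMBR as described above, the support set would be quasi-sources and the utility a reference-free QE metric, neither of which is modeled here. All names are hypothetical.

```python
from collections import Counter

def overlap_utility(hypothesis: str, support: str) -> float:
    """Toy utility: unigram F1 between two strings (a stand-in for a
    learned metric, not the QE model used in the paper)."""
    h, s = Counter(hypothesis.split()), Counter(support.split())
    common = sum((h & s).values())
    if common == 0:
        return 0.0
    precision = common / sum(h.values())
    recall = common / sum(s.values())
    return 2 * precision * recall / (precision + recall)

def mbr_select(candidates, supports, utility=overlap_utility):
    """Return the candidate with the highest mean utility over the supports."""
    def expected_utility(candidate):
        return sum(utility(candidate, s) for s in supports) / len(supports)
    return max(candidates, key=expected_utility)

# Hypothetical usage: standard MBR reuses the candidates as supports.
candidates = ["the cat sat on the mat", "a cat is on the mat", "dogs bark loudly"]
best = mbr_select(candidates, supports=candidates)
```

Swapping `supports` and `utility` is the only change needed to move between the standard and source-based variants in this sketch, which is why both are passed in as arguments.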
pdf
bib
abs
ToolHop: A Query-Driven Benchmark for Evaluating Large Language Models in Multi-Hop Tool Use
Junjie Ye
|
Zhengyin Du
|
Xuesong Yao
|
Weijian Lin
|
Yufei Xu
|
Zehui Chen
|
Zaiyuan Wang
|
Sining Zhu
|
Zhiheng Xi
|
Siyu Yuan
|
Tao Gui
|
Qi Zhang
|
Xuanjing Huang
|
Jiecao Chen
Effective evaluation of multi-hop tool use is critical for analyzing the understanding, reasoning, and function-calling capabilities of large language models (LLMs). However, progress has been hindered by a lack of reliable evaluation datasets. To address this, we present ToolHop, a dataset comprising 995 user queries and 3,912 associated tools, specifically designed for rigorous evaluation of multi-hop tool use. ToolHop ensures diverse queries, meaningful interdependencies, locally executable tools, detailed feedback, and verifiable answers through a novel query-driven data construction approach that includes tool creation, document refinement, and code generation. We evaluate 14 LLMs across five model families (i.e., LLaMA3.1, Qwen2.5, Gemini1.5, Claude3.5, and GPT), uncovering significant challenges in handling multi-hop tool-use scenarios. The leading model, GPT-4o, achieves an accuracy of 49.04%, underscoring substantial room for improvement. Further analysis reveals variations in tool-use strategies for various families, offering actionable insights to guide the development of more effective approaches. Code and data can be found in https://huggingface.co/datasets/bytedance-research/ToolHop.
pdf
bib
abs
Mixture of insighTful Experts (MoTE): The Synergy of Reasoning Chains and Expert Mixtures in Self-Alignment
Zhili Liu
|
Yunhao Gou
|
Kai Chen
|
Lanqing Hong
|
Jiahui Gao
|
Fei Mi
|
Yu Zhang
|
Zhenguo Li
|
Xin Jiang
|
Qun Liu
|
James Kwok
As the capabilities of large language models (LLMs) continue to expand, aligning these models with human values remains a significant challenge. Recent studies show that reasoning abilities contribute significantly to model safety, while integrating Mixture-of-Experts (MoE) architectures can further enhance alignment. In this work, we address a fundamental question: How to effectively incorporate reasoning abilities and MoE architectures into the self-alignment process in LLMs? We propose Mixture of insighTful Experts (MoTE), a novel framework that synergistically combines reasoning chains and expert mixtures to improve self-alignment. From a data perspective, MoTE employs a structured reasoning chain comprising four key stages: Question Analysis, Answer Guidance, Safe Answer, and Safety Checking. This approach enhances safety through multi-step reasoning and proves effective even for smaller and less powerful LLMs (e.g., 7B models). From an architectural perspective, MoTE adopts a multi-LoRA framework with step-level routing, where each expert is dedicated to a specific reasoning step. This design eliminates the need for balance losses, ensures stable training, and supports adaptive inference lengths. Experimental results demonstrate that MoTE significantly improves model safety, jailbreak resistance, and over-refusal capabilities, achieving performance comparable to OpenAI’s state-of-the-art o1 model.
pdf
bib
abs
MAPS: Motivation-Aware Personalized Search via LLM-Driven Consultation Alignment
Weicong Qin
|
Yi Xu
|
Weijie Yu
|
Chenglei Shen
|
Ming He
|
Jianping Fan
|
Xiao Zhang
|
Jun Xu
Personalized product search aims to retrieve and rank items that match users’ preferences and search intent. Despite their effectiveness, existing approaches typically assume that users’ queries fully capture their real motivation. However, our analysis of a real-world e-commerce platform reveals that users often engage in relevant consultations before searching, indicating they refine intents through consultations based on motivation and need. The implied motivation in consultations is a key enhancing factor for personalized search. This unexplored area comes with new challenges including aligning contextual motivations with concise queries, bridging the category-text gap, and filtering noise within sequence history. To address these, we propose a Motivation-Aware Personalized Search (MAPS) method. It embeds queries and consultations into a unified semantic space via LLMs, utilizes a Mixture of Attention Experts (MoAE) to prioritize critical semantics, and introduces dual alignment: (1) contrastive learning aligns consultations, reviews, and product features; (2) bidirectional attention integrates motivation-aware embeddings with user preferences. Extensive experiments on real and synthetic data show MAPS outperforms existing methods in both retrieval and ranking tasks. Code and supplementary materials are available at: https://github.com/E-qin/MAPS.
pdf
bib
abs
Aristotle: Mastering Logical Reasoning with A Logic-Complete Decompose-Search-Resolve Framework
Jundong Xu
|
Hao Fei
|
Meng Luo
|
Qian Liu
|
Liangming Pan
|
William Yang Wang
|
Preslav Nakov
|
Mong-Li Lee
|
Wynne Hsu
In the context of large language models (LLMs), current advanced reasoning methods have made impressive strides in various reasoning tasks. However, when it comes to logical reasoning tasks, significant challenges remain in both efficacy and efficiency. This is rooted in the fact that these systems fail to fully leverage the inherent structure of logical tasks throughout the reasoning processes, including decomposition, search, and resolution. To address this, this paper proposes a logic-complete reasoning framework, Aristotle. The framework consists of three key components: Logical Decomposer, Logical Search Router, and Logical Resolver, in which symbolic expressions and logical rules are comprehensively integrated into the entire reasoning process, significantly alleviating the bottlenecks of logical reasoning, i.e., reducing sub-task complexity, minimizing search errors, and resolving logical contradictions. Experimental results demonstrate that Aristotle consistently outperforms state-of-the-art reasoning frameworks in both accuracy and efficiency, particularly excelling in complex logical reasoning scenarios.
pdf
bib
abs
LADM: Long-context Training Data Selection with Attention-based Dependency Measurement for LLMs
Jianghao Chen
|
Junhong Wu
|
Yangyifan Xu
|
Jiajun Zhang
Long-context modeling has drawn more and more attention in the area of Large Language Models (LLMs). Continual training with long-context data becomes the de-facto method to equip LLMs with the ability to process long inputs. However, it still remains an open challenge to measure the quality of long-context training data. To address this issue, we propose a Long-context data selection framework with Attention-based Dependency Measurement (LADM), which can efficiently identify high-quality long-context data from a large-scale, multi-domain pre-training corpus. LADM leverages the retrieval capabilities of the attention mechanism to capture contextual dependencies, ensuring a comprehensive quality measurement of long-context data. Experimental results show that our LADM framework significantly boosts the performance of LLMs on multiple long-context tasks with only 1B tokens for continual training.
pdf
bib
abs
Iron Sharpens Iron: Defending Against Attacks in Machine-Generated Text Detection with Adversarial Training
Yuanfan Li
|
Zhaohan Zhang
|
Chengzhengxu Li
|
Chao Shen
|
Xiaoming Liu
Machine-generated Text (MGT) detection is crucial for regulating and attributing online texts. While the existing MGT detectors achieve strong performance, they remain vulnerable to simple perturbations and adversarial attacks. To build an effective defense against malicious perturbations, we view MGT detection from a threat modeling perspective, that is, analyzing the model’s vulnerability from an adversary’s point of view and exploring effective mitigations. To this end, we introduce an adversarial framework for training a robust MGT detector, named GREedy Adversary PromoTed DefendER (GREATER). GREATER consists of two key components: an adversary GREATER-A and a detector GREATER-D. GREATER-D learns to defend against the adversarial attack from GREATER-A and generalizes the defense to other attacks. GREATER-A identifies and perturbs the critical tokens in embedding space, along with greedy search and pruning to generate stealthy and disruptive adversarial examples. In addition, we update GREATER-A and GREATER-D synchronously, encouraging GREATER-D to generalize its defense to different attacks and varying attack intensities. Our experimental results across 10 text perturbation strategies and 6 adversarial attacks show that our GREATER-D reduces the Attack Success Rate (ASR) by 0.67% compared with SOTA defense methods while our GREATER-A is demonstrated to be more effective and efficient than SOTA attack approaches. Code and dataset are available at https://github.com/Liyuuuu111/GREATER.
pdf
bib
abs
Cultural Learning-Based Culture Adaptation of Language Models
Chen Cecilia Liu
|
Anna Korhonen
|
Iryna Gurevych
Adapting large language models (LLMs) to diverse cultural values is a challenging task, as existing LLMs often reflect the values of specific groups by default, and potentially cause harm to others. In this paper, we present CLCA, a novel framework for enhancing LLM alignment with cultural values based on cultural learning. The framework leverages simulated social interactions to generate conversations in which LLMs engage in role-playing within culturally adapted social scenarios, capturing implicit cultural norms for model fine-tuning. CLCA improves cultural value alignment across various model architectures measured using World Value Survey data, demonstrating the effectiveness of our proposed approach. Our results provide early evidence that understanding intent and social interactions can enhance cultural value adaptation in LLMs, highlighting the promise of training approaches based on cultural learning.
pdf
bib
abs
A-TASC: Asian TED-Based Automatic Subtitling Corpus
Yuhan Zhou
|
Naoki Yoshinaga
Subtitles play a crucial role in improving the accessibility of the vast amount of audiovisual content available on the Internet, allowing audiences worldwide to comprehend and engage with this content in various languages. Automatic subtitling (AS) systems are essential for alleviating the substantial workload of human transcribers and translators. However, existing AS corpora and the primary metric SubER focus on European languages. This paper introduces A-TASC, an Asian TED-based automatic subtitling corpus derived from English TED Talks, comprising nearly 800 hours of audio segments, aligned English transcripts, and subtitles in Chinese, Japanese, Korean, and Vietnamese. We then present SacreSubER, a modification of SubER, to enable the reliable evaluation of subtitle quality for languages without explicit word boundaries. Experimental results, using both end-to-end systems and pipeline approaches built on strong ASR and LLM components, validate the quality of the proposed corpus and reveal differences in AS performance between European and Asian languages. The code to build our corpus is released.
pdf
bib
abs
Refuse Whenever You Feel Unsafe: Improving Safety in LLMs via Decoupled Refusal Training
Youliang Yuan
|
Wenxiang Jiao
|
Wenxuan Wang
|
Jen-tse Huang
|
Jiahao Xu
|
Tian Liang
|
Pinjia He
|
Zhaopeng Tu
This study addresses a critical gap in safety tuning practices for Large Language Models (LLMs) by identifying and tackling a refusal position bias within safety tuning data, which compromises the models’ ability to appropriately refuse generating unsafe content. We introduce a novel approach, Decoupled Refusal Training (DeRTa), designed to empower LLMs to refuse compliance to harmful prompts at any response position, significantly enhancing their safety capabilities. DeRTa incorporates two novel components: (1) Maximum Likelihood Estimation (MLE) with Harmful Response Prefix, which trains models to recognize and avoid unsafe content by appending a segment of harmful response to the beginning of a safe response, and (2) Reinforced Transition Optimization (RTO), which equips models with the ability to transition from potential harm to safety refusal consistently throughout the harmful response sequence. Our empirical evaluation, conducted using LLaMA3 and Mistral model families across six attack scenarios, demonstrates that our method not only improves model safety without compromising performance but also surpasses baseline methods in defending against attacks.
pdf
bib
abs
Token Prepending: A Training-Free Approach for Eliciting Better Sentence Embeddings from LLMs
Yuchen Fu
|
Zifeng Cheng
|
Zhiwei Jiang
|
Zhonghui Wang
|
Yafeng Yin
|
Zhengliang Li
|
Qing Gu
Extracting sentence embeddings from large language models (LLMs) is a promising direction, as LLMs have demonstrated stronger semantic understanding capabilities. Previous studies typically focus on prompt engineering to elicit sentence embeddings from LLMs by prompting the model to encode sentence information into the embedding of the last token. However, LLMs are mostly decoder-only models with causal attention and the earlier tokens in the sentence cannot attend to the latter tokens, resulting in biased encoding of sentence information and cascading effects on the final decoded token. To this end, we propose a novel Token Prepending (TP) technique that prepends each layer’s decoded sentence embedding to the beginning of the sentence in the next layer’s input, allowing earlier tokens to attend to the complete sentence information under the causal attention mechanism. The proposed TP technique is a plug-and-play and training-free technique, which means it can be seamlessly integrated with various prompt-based sentence embedding methods and autoregressive LLMs. Extensive experiments on various Semantic Textual Similarity (STS) tasks and downstream classification tasks demonstrate that our proposed TP technique can significantly improve the performance of existing prompt-based sentence embedding methods across different LLMs, while incurring negligible additional inference cost.
pdf
bib
abs
No Questions are Stupid, but some are Poorly Posed: Understanding Poorly-Posed Information-Seeking Questions
Neha Srikanth
|
Rachel Rudinger
|
Jordan Lee Boyd-Graber
Questions help unlock information to satisfy users’ information needs. However, when the question is poorly posed, answerers (whether human or computer) may struggle to answer the question in a way that satisfies the asker, despite possibly knowing everything necessary to address the asker’s latent information need. Using Reddit question-answer interactions from r/NoStupidQuestions, we develop a computational framework grounded in linguistic theory to study poorly-posedness of questions by generating spaces of potential interpretations of questions and computing distributions over these spaces based on interpretations chosen by both human answerers in the Reddit question thread, as well as by a suite of large language models. Both humans and models struggle to converge on dominant interpretations when faced with poorly-posed questions, but employ different strategies: humans focus on specific interpretations through question negotiation, while models attempt comprehensive coverage by addressing many interpretations simultaneously.
pdf
bib
abs
Understanding Common Ground Misalignment in Goal-Oriented Dialog: A Case-Study with Ubuntu Chat Logs
Rupak Sarkar
|
Neha Srikanth
|
Taylor Pellegrin
|
Rachel Rudinger
|
Claire Bonial
|
Philip Resnik
While it is commonly accepted that maintaining common ground plays a role in conversational success, little prior research exists connecting conversational grounding to success in task-oriented conversations. We study failures of grounding in the Ubuntu IRC dataset, where participants use text-only communication to resolve technical issues. We find that disruptions in conversational flow often stem from a misalignment in common ground, driven by a divergence in beliefs and assumptions held by participants. These disruptions, which we call conversational friction, significantly correlate with task success. While LLMs can identify overt cases of conversational friction, they struggle with subtler and more context-dependent instances that require pragmatic or domain-specific reasoning.
pdf
bib
abs
Addressing Blind Guessing: Calibration of Selection Bias in Multiple-Choice Question Answering by Video Language Models
Olga Loginova
|
Oleksandr Bezrukov
|
Ravi Shekhar
|
Alexey Kravets
Evaluating Video Language Models (VLMs) is a challenging task. Due to its transparency, Multiple-Choice Question Answering (MCQA) is widely used to measure the performance of these models through accuracy. However, existing MCQA benchmarks fail to capture the full reasoning capabilities of VLMs due to selection bias, when models disproportionately favor certain answer options based on positional patterns observed during training. In this work, we conduct a comprehensive empirical analysis of several VLM architectures across major datasets designed to assess complex video-focused reasoning. We identify where the bias is most pronounced and demonstrate to what extent model responses reflect genuine understanding of video content and related questions, as opposed to reliance on arbitrary patterns or superficial cues, such as answer position. By decomposing the MCQA task and adapting fairness bias metrics to VLMs, we introduce a post-processing calibration technique BOLD to balance this bias. Our results show that reducing selection bias improves not only debiasing metrics but also overall model performance, including Accuracy and F1 Mean score. Our method, by suppressing “blind guessing”, offers a more cost- and time-effective approach to mitigating selection bias compared to existing techniques. This study represents the first focused investigation of selection bias in video-to-text LLM-powered models.
pdf
bib
abs
Towards Reward Fairness in RLHF: From a Resource Allocation Perspective
Sheng Ouyang
|
Yulan Hu
|
Ge Chen
|
Qingyang Li
|
Fuzheng Zhang
|
Yong Liu
Rewards serve as proxies for human preferences and play a crucial role in Reinforcement Learning from Human Feedback (RLHF). However, if these rewards are inherently imperfect, exhibiting various biases, they can adversely affect the alignment of large language models (LLMs). In this paper, we collectively define the various biases present in rewards as the problem of reward unfairness. We propose a bias-agnostic method to address the issue of reward fairness from a resource allocation perspective, without specifically designing for each type of bias, yet effectively mitigating them. Specifically, we model preference learning as a resource allocation problem, treating rewards as resources to be allocated while considering the trade-off between utility and fairness in their distribution. We propose two methods, Fairness Regularization and Fairness Coefficient, to achieve fairness in rewards. We apply our methods in both verification and reinforcement learning scenarios to obtain a fairness reward model and a policy model, respectively. Experiments conducted in these scenarios demonstrate that our approach aligns LLMs with human preferences in a more fair manner. Our data and code are available at https://github.com/shoyua/Towards-Reward-Fairness.
pdf
bib
abs
Taming LLMs with Gradient Grouping
Siyuan Li
|
Juanxi Tian
|
Zedong Wang
|
Xin Jin
|
Zicheng Liu
|
Wentao Zhang
|
Dan Xu
Training large language models (LLMs) poses challenges due to their massive scale and heterogeneous architectures. While adaptive optimizers like AdamW help address gradient variations, they still struggle with efficient and effective parameter-wise learning rate estimation, resulting in training instability, slow convergence, and poor compatibility with parameter-efficient fine-tuning (PEFT) techniques. This work introduces Scaling with Gradient Grouping (SGG), an optimizer wrapper that improves adaptive learning rate estimation by dynamic grouping and group-specific scaling. SGG first groups gradient statistics in each layer into clusters and then applies cluster-specific scaling to calibrate learning rates for each parameter, thus imposing collective group-wise constraints while maintaining precise per-parameter adaptation. Experiments on diverse (M)LLM benchmarks show that SGG integrates seamlessly with existing optimizers, and offers consistent gains and faster convergence over baselines, with various model sizes. Its stability across varying batch sizes and learning rates establishes SGG as a robust choice for LLM optimization.
pdf
bib
abs
LazyReview: A Dataset for Uncovering Lazy Thinking in NLP Peer Reviews
Sukannya Purkayastha
|
Zhuang Li
|
Anne Lauscher
|
Lizhen Qu
|
Iryna Gurevych
Peer review is a cornerstone of quality control in scientific publishing. With the increasing workload, the unintended use of ‘quick’ heuristics, referred to as lazy thinking, has emerged as a recurring issue compromising review quality. Automated methods to detect such heuristics can help improve the peer-reviewing process. However, there is limited NLP research on this issue, and no real-world dataset exists to support the development of detection tools. This work introduces LazyReview, a dataset of peer-review sentences annotated with fine-grained lazy thinking categories. Our analysis reveals that Large Language Models (LLMs) struggle to detect these instances in a zero-shot setting. However, instruction-based fine-tuning on our dataset significantly boosts performance by 10-20 performance points, highlighting the importance of high-quality training data. Furthermore, a controlled experiment demonstrates that reviews revised with lazy thinking feedback are more comprehensive and actionable than those written without such feedback. We will release our dataset and the enhanced guidelines that can be used to train junior reviewers in the community.
pdf
bib
abs
Revisiting Common Assumptions about Arabic Dialects in NLP
Amr Keleg
|
Sharon Goldwater
|
Walid Magdy
Arabic has diverse dialects, where one dialect can be substantially different from the others. In the NLP literature, some assumptions about these dialects are widely adopted (e.g., “Arabic dialects can be grouped into distinguishable regional dialects”) and are manifested in different computational tasks such as Arabic Dialect Identification (ADI). However, these assumptions are not quantitatively verified. We identify four of these assumptions and examine them by extending and analyzing a multi-label dataset, where the validity of each sentence in 11 different country-level dialects is manually assessed by speakers of these dialects. Our analysis indicates that the four assumptions oversimplify reality, and some of them are not always accurate. This in turn might be hindering further progress in different Arabic NLP tasks.
pdf
bib
abs
Retrieve to Explain: Evidence-driven Predictions for Explainable Drug Target Identification
Ravi Patel
|
Angus Brayne
|
Rogier Hintzen
|
Daniel Jaroslawicz
|
Georgiana Neculae
|
Dane S. Corneil
Language models hold incredible promise for enabling scientific discovery by synthesizing massive research corpora. Many complex scientific research questions have multiple plausible answers, each supported by evidence of varying strength. However, existing language models lack the capability to quantitatively and faithfully compare answer plausibility in terms of supporting evidence. To address this, we introduce Retrieve to Explain (R2E), a retrieval-based model that scores and ranks all possible answers to a research question based on evidence retrieved from a document corpus. The architecture represents each answer only in terms of its supporting evidence, with the answer itself masked. This allows us to extend feature attribution methods, such as Shapley values, to transparently attribute answer scores to supporting evidence at inference time. The architecture also allows incorporation of new evidence without retraining, including non-textual data modalities templated into natural language. We developed R2E for the challenging scientific discovery task of drug target identification, a human-in-the-loop process where failures are extremely costly and explainability is paramount. When predicting whether drug targets will subsequently be confirmed as efficacious in clinical trials, R2E not only matches non-explainable literature-based models but also surpasses a genetics-based target identification approach used throughout the pharmaceutical industry.
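As a rough illustration of the attribution idea, the sketch below computes exact Shapley values for a toy answer score over three pieces of evidence. The scoring function, evidence names, and weights are hypothetical and are not taken from R2E; the abstract only states that Shapley-style attribution is applied to the retrieved evidence.

```python
from itertools import permutations


def shapley_values(items, score):
    """Exact Shapley values: average each item's marginal contribution
    to `score` over every ordering of the items."""
    totals = {item: 0.0 for item in items}
    orderings = list(permutations(items))
    for order in orderings:
        seen = frozenset()
        for item in order:
            with_item = seen | {item}
            totals[item] += score(with_item) - score(seen)
            seen = with_item
    return {item: total / len(orderings) for item, total in totals.items()}


def toy_score(evidence):
    """Hypothetical answer score: additive weights per evidence passage,
    plus a bonus when two passages corroborate each other."""
    weights = {"paper_a": 0.5, "paper_b": 0.3, "paper_c": 0.0}
    base = sum(weights[e] for e in evidence)
    bonus = 0.2 if {"paper_a", "paper_b"} <= evidence else 0.0
    return base + bonus


values = shapley_values(["paper_a", "paper_b", "paper_c"], toy_score)
for name, value in values.items():
    print(f"{name}: {value:.3f}")
```

The attributions sum to the score of the full evidence set, which is the property that makes per-evidence explanations add up to the answer score. Exact enumeration is only feasible for a handful of items; larger evidence sets would need a sampling approximation.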
pdf
bib
abs
Whose Boat Does it Float? Improving Personalization in Preference Tuning via Inferred User Personas
Nishant Balepur
|
Vishakh Padmakumar
|
Fumeng Yang
|
Shi Feng
|
Rachel Rudinger
|
Jordan Lee Boyd-Graber
LLMs are aligned to follow input instructions by learning which of two responses users prefer for a prompt. However, such preference data do not convey *why* users prefer responses that are chosen or rejected, so LLMs trained on these datasets cannot tailor responses to varied user needs. To surface these parameters of personalization, we apply *abductive reasoning* to preference data, inferring needs and interests of users, i.e., personas, that may prefer either response. We test this idea in two steps: **Persona Inference (PI)**, abductively inferring personas of users who prefer chosen or rejected outputs, and **Persona Tailoring (PT)**, training models to tailor outputs to personas from PI. We show: 1) LLMs infer personas accurately explaining why different users may prefer *both* chosen and rejected outputs; 2) Training on preference data augmented with PI personas via PT boosts personalization and generalizes to supporting user-written personas; and 3) Rejected response personas form harder personalization evaluations, showing PT better aids users with uncommon preferences versus typical alignment methods. We argue for an abductive view of preferences for personalization, asking not only which response is better but when, why, and for whom.
pdf
bib
abs
Which of These Best Describes Multiple Choice Evaluation with LLMs? A) Forced B) Flawed C) Fixable D) All of the Above
Nishant Balepur
|
Rachel Rudinger
|
Jordan Lee Boyd-Graber
Multiple choice question answering (MCQA) is popular for LLM evaluation due to its simplicity and human-like testing, but we argue for its reform. We first reveal flaws in MCQA’s format, as it struggles to: 1) test generation/subjectivity; 2) match LLM use cases; and 3) fully test knowledge. We instead advocate for generative formats based on human testing—where LLMs construct and explain answers—better capturing user needs and knowledge while remaining easy to score. We then show even when MCQA is a useful format, its datasets suffer from: leakage; unanswerability; shortcuts; and saturation. In each issue, we give fixes from education, like rubrics to guide MCQ writing; scoring methods to bridle guessing; and Item Response Theory to build harder MCQs. Lastly, we discuss LLM errors in MCQA—robustness, biases, and unfaithful explanations—showing how our prior solutions better measure or address these issues. While we do not need to desert MCQA, we encourage more efforts in refining the task based on educational testing, advancing evaluations.
pdf
bib
abs
Detection of Human and Machine-Authored Fake News in Urdu
Muhammad Zain Ali
|
Yuxia Wang
|
Bernhard Pfahringer
|
Tony C Smith
The rise of social media has amplified the spread of fake news, now further complicated by large language models (LLMs) like ChatGPT, which ease the generation of highly convincing, error-free misinformation, making it increasingly challenging for the public to discern truth from falsehood. Traditional fake news detection methods relying on linguistic cues have also become less effective. Moreover, current detectors primarily focus on binary classification and English texts, often overlooking the distinction between machine-generated true vs. fake news and the detection in low-resource languages. To this end, we updated the detection schema to include machine-generated news focusing on Urdu. We further propose a conjoint detection strategy to improve the accuracy and robustness. Experiments show its effectiveness across four datasets in various settings.
pdf
bib
abs
An Efficient Task-Oriented Dialogue Policy: Evolutionary Reinforcement Learning Injected by Elite Individuals
Yangyang Zhao
|
Ben Niu
|
Libo Qin
|
Shihan Wang
Deep Reinforcement Learning (DRL) is widely used in task-oriented dialogue systems to optimize dialogue policy, but it struggles to balance exploration and exploitation due to the high dimensionality of state and action spaces. This challenge often results in local optima or poor convergence. Evolutionary Algorithms (EAs) have been proven to effectively explore the solution space of neural networks by maintaining population diversity. Inspired by this, we combine the global search capabilities of EA with the local optimization of DRL to achieve a balance between exploration and exploitation. Nevertheless, the inherent flexibility of natural language in dialogue tasks complicates this direct integration, leading to prolonged evolutionary times. Thus, we further propose an elite individual injection (EII) mechanism to enhance EA’s search efficiency by adaptively introducing best-performing individuals into the population. Experiments across four datasets show that our approach significantly improves the balance between exploration and exploitation, boosting performance. Moreover, the EII mechanism is shown to reduce exploration time, achieving an efficient integration of EA and DRL on task-oriented dialogue policy tasks.
pdf
bib
abs
SR-LLM: Rethinking the Structured Representation in Large Language Model
Jiahuan Zhang
|
Tianheng Wang
|
Ziyi Huang
|
Yulong Wu
|
Hanqing Wu
|
DongbaiChen DongbaiChen
|
Linfeng Song
|
Yue Zhang
|
Guozheng Rao
|
Kaicheng Yu
Structured representations, exemplified by Abstract Meaning Representation (AMR), have long been pivotal in computational linguistics. However, their role remains ambiguous in the era of Large Language Models (LLMs). Initial attempts to integrate structured representations into LLMs in a zero-shot setting yielded inferior performance. We hypothesize that this decline stems from the structural information being passed to LLMs in a code format unfamiliar to LLMs’ training corpora. Consequently, we propose SR-LLM, a framework with two settings to explore a better way of integrating structured representations with LLMs from training-free and training-dependent perspectives. The former integrates structural information through natural language descriptions in LLM prompts, whereas its counterpart augments the model’s inference capability through fine-tuning on linguistically described structured representations. Performance improvements were observed across a wide range of downstream datasets, with particularly notable gains of 3.17% and 12.38% on PAWS. To the best of our knowledge, this work is the first demonstration that leveraging structural representations can substantially enhance LLMs’ inference capability. We hope that our work sheds light on this direction and encourages future research to enhance the reasoning and interoperability of LLMs with structured data.
pdf
bib
abs
Taming Language Models for Text-attributed Graph Learning with Decoupled Aggregation
Chuang Zhou
|
Zhu Wang
|
Shengyuan Chen
|
Jiahe Du
|
Qiyuan Zheng
|
Zhaozhuo Xu
|
Xiao Huang
Text-attributed graphs (TAGs) are prevalent in various real-world applications, including academic networks, e-commerce platforms, and social networks. Effective learning on TAGs requires leveraging both textual node features and structural graph information. While language models (LMs) excel at processing text and graph neural networks (GNNs) effectively capture relational structures, their direct integration is computationally prohibitive due to the high cost of text and graph representation learning. Existing approaches address this challenge by adopting a two-step pipeline where LMs generate fixed node embeddings, which are then used for GNN training. However, this method neglects the interaction between textual and structural information, leading to suboptimal learning outcomes. To overcome these limitations, we propose SKETCH (Semantic Knowledge and Structure Enrichment), a novel framework that decouples node aggregation from graph convolution and integrates it into the text representation learning process. SKETCH enhances TAG learning by incorporating two key aggregation mechanisms: (1) Semantic aggregation, which retrieves semantically relevant node texts for contextual enrichment, and (2) Structural aggregation, which propagates textual features beyond immediate neighbors to capture broader graph relationships. Extensive experiments demonstrate that SKETCH outperforms state-of-the-art TAG learning methods while requiring fewer computational resources. By enabling a more efficient and effective fusion of textual and structural information, SKETCH provides new insights into TAG problems and offers a practical solution for real applications.
pdf
bib
abs
Contrastive Prompting Enhances Sentence Embeddings in LLMs through Inference-Time Steering
Zifeng Cheng
|
Zhonghui Wang
|
Yuchen Fu
|
Zhiwei Jiang
|
Yafeng Yin
|
Cong Wang
|
Qing Gu
Extracting sentence embeddings from large language models (LLMs) is a practical direction, as it requires neither additional data nor fine-tuning. Previous studies usually focus on prompt engineering to guide LLMs to encode the core semantic information of the sentence into the embedding of the last token. However, the last token in these methods still encodes an excess of non-essential information, such as stop words, limiting its encoding capacity. To this end, we propose a Contrastive Prompting (CP) technique that introduces an extra auxiliary prompt to elicit better sentence embeddings. By contrasting with the auxiliary prompt, CP can steer existing prompts to encode the core semantics of the sentence, rather than non-essential information. CP is a plug-and-play inference-time intervention method that can be combined with various prompt-based methods. Extensive experiments on Semantic Textual Similarity (STS) tasks and downstream classification tasks demonstrate that our method can improve the performance of existing prompt-based methods across different LLMs.
pdf
bib
abs
Cracking the Code of Hallucination in LVLMs with Vision-aware Head Divergence
Jinghan He
|
Kuan Zhu
|
Haiyun Guo
|
Junfeng Fang
|
Zhenglin Hua
|
Yuheng Jia
|
Ming Tang
|
Tat-Seng Chua
|
Jinqiao Wang
Large vision-language models (LVLMs) have made substantial progress in integrating large language models (LLMs) with visual inputs, enabling advanced multimodal reasoning. Despite their success, a persistent challenge is hallucination—where generated text fails to accurately reflect visual content—undermining both accuracy and reliability. Existing methods focus on alignment training or decoding refinements but primarily address symptoms at the generation stage without probing the underlying causes. In this work, we investigate the internal mechanisms driving hallucination in LVLMs, with an emphasis on the multi-head attention module. Specifically, we introduce Vision-aware Head Divergence (VHD), a metric that quantifies the sensitivity of attention head outputs to visual context. Based on this, our findings reveal the presence of vision-aware attention heads that are more attuned to visual information; however, the model’s overreliance on its prior language patterns is closely related to hallucinations. Building on these insights, we propose Vision-aware Head Reinforcement (VHR), a training-free approach to mitigate hallucination by enhancing the role of vision-aware attention heads. Extensive experiments demonstrate that our method achieves superior performance compared to state-of-the-art approaches in mitigating hallucinations, while maintaining high efficiency with negligible additional time overhead. The code is available at https://github.com/jinghan1he/VHR.
pdf
bib
abs
Hierarchical Document Refinement for Long-context Retrieval-augmented Generation
Jiajie Jin
|
Xiaoxi Li
|
Guanting Dong
|
Yuyao Zhang
|
Yutao Zhu
|
Yongkang Wu
|
Zhonghua Li
|
Ye Qi
|
Zhicheng Dou
Real-world RAG applications often encounter long-context input scenarios, where redundant information and noise result in higher inference costs and reduced performance. To address these challenges, we propose LongRefiner, an efficient plug-and-play refiner that leverages the inherent structural characteristics of long documents. LongRefiner employs dual-level query analysis, hierarchical document structuring, and adaptive refinement through multi-task learning on a single foundation model. Experiments on seven QA datasets demonstrate that LongRefiner achieves competitive performance in various scenarios with 10x lower computational cost and latency than the best baseline. Further analysis validates that LongRefiner is scalable, efficient, and effective, providing practical insights for real-world long-text RAG applications. Our code is available at https://github.com/ignorejjj/LongRefiner.
pdf
bib
abs
Comparing Moral Values in Western English-speaking societies and LLMs with Word Associations
Chaoyi Xiang
|
Chunhua Liu
|
Simon De Deyne
|
Lea Frermann
As the impact of large language models increases, understanding the moral values they encode becomes ever more important. Assessing moral values encoded in these models via direct prompting is challenging due to potential leakage of human norms into model training data, and their sensitivity to prompt formulation. Instead, we propose to use word associations, which have been shown to reflect moral reasoning in humans, as low-level underlying representations to obtain a more robust picture of LLMs’ moral reasoning. We study moral differences in associations from western English-speaking communities and LLMs trained predominantly on English data. First, we create a large dataset of LLM-generated word associations, resembling an existing data set of human word associations. Next, we propose a novel method to propagate moral values based on seed words derived from Moral Foundation Theory through the human and LLM-generated association graphs. Finally, we compare the resulting moral representations, highlighting detailed but systematic differences between moral values emerging from English speakers and LLM associations.
pdf
bib
abs
TEACH: A Contrastive Knowledge Adaptive Distillation Framework for Classical Chinese Understanding
Yuting Wei
|
Qi Meng
|
Yuanxing Xu
|
Bin Wu
Traditional methods for processing classical Chinese typically segment language understanding into discrete tasks, which overlook crucial background information and reduce user engagement. Large language models (LLMs) provide integrated solutions, yet they entail high computational costs and risks of generating inaccurate historical information. To tackle these challenges, we propose a novel framework, TEACH (conTrastive knowlEdge Adaptive distillation with enhanCed Historical interpretability), which focuses on classical Chinese understanding by integrating word sense disambiguation with sentence translation. This integration leverages a confidence-annotated knowledge base and a step-by-step Chain-of-Thought prompting mechanism to minimize hallucinations and improve semantic analysis. Moreover, TEACH employs contrastive distillation learning to efficiently transfer capabilities from larger models to smaller ones (e.g., Qwen2-1.5B), addressing overly liberal translations. Additionally, we introduce an innovative generation evaluation metric using iterative word alignment, enhancing LLM performance assessments by distinguishing additional information and addressing excessive translation issues. Experiments conducted on real-world datasets validate TEACH’s efficacy in classical Chinese educational scenarios.
pdf
bib
abs
RAG-Critic: Leveraging Automated Critic-Guided Agentic Workflow for Retrieval Augmented Generation
Guanting Dong
|
Jiajie Jin
|
Xiaoxi Li
|
Yutao Zhu
|
Zhicheng Dou
|
Ji-Rong Wen
Retrieval-augmented generation (RAG) has emerged as a pivotal technology in natural language processing, owing to its efficacy in generating factual content. However, its informative inputs and complex paradigms often lead to a greater variety of errors. Consequently, achieving automated on-policy assessment and error-oriented correction remain unresolved issues. In this paper, we propose RAG-Critic, a novel framework that leverages a critic-guided agentic workflow to improve RAG capabilities autonomously. Specifically, we initially design a data-driven error mining pipeline to establish a hierarchical RAG error system. Based on this system, we progressively align an error-critic model using a coarse-to-fine training objective, which automatically provides fine-grained error feedback. Finally, we design a critic-guided agentic RAG workflow that customizes executor-based solution flows based on the error-critic model’s feedback, facilitating an error-driven self-correction process. Experimental results across seven RAG-related datasets confirm the effectiveness of RAG-Critic, while qualitative analysis offers practical insights for achieving reliable RAG systems. Our dataset and code are available at https://github.com/RUC-NLPIR/RAG-Critic.
pdf
bib
abs
Progressive Multimodal Reasoning via Active Retrieval
Guanting Dong
|
Chenghao Zhang
|
Mengjie Deng
|
Yutao Zhu
|
Zhicheng Dou
|
Ji-Rong Wen
Multi-step multimodal reasoning tasks pose significant challenges for multimodal large language models (MLLMs), and finding effective ways to enhance their performance in such scenarios remains an unresolved issue. In this paper, we propose AR-MCTS, a universal framework designed to progressively improve the reasoning capabilities of MLLMs through Active Retrieval (AR) and Monte Carlo Tree Search (MCTS). AR-MCTS follows the MCTS algorithm and heuristically integrates an active retrieval mechanism during the expansion stage to automatically acquire high-quality step-wise reasoning annotations. We further introduce curriculum training objectives to progressively align a process reward model, ultimately achieving process-level multimodal reasoning verification. Experimental results across three complex multimodal reasoning benchmarks confirm the effectiveness of AR-MCTS. Further analysis demonstrates that it can optimize sampling diversity and accuracy, yielding reliable multimodal reasoning.
pdf
bib
abs
Pre-training Distillation for Large Language Models: A Design Space Exploration
Hao Peng
|
Xin Lv
|
Yushi Bai
|
Zijun Yao
|
Jiajie Zhang
|
Lei Hou
|
Juanzi Li
Knowledge distillation (KD) aims to transfer knowledge from a large teacher model to a smaller student model. Previous work applying KD in the field of large language models (LLMs) typically focused on the post-training phase, where the student LLM learns directly from instructions and corresponding responses generated by the teacher model. In this paper, we extend KD to the pre-training phase of LLMs, named pre-training distillation (PD). We first conduct a preliminary experiment using GLM-4-9B as the teacher LLM to distill a 1.9B parameter student LLM, validating the effectiveness of PD. Considering the key impact factors of distillation, we systematically explore the design space of pre-training distillation across four aspects: logits processing, loss selection, scaling law, and offline or online logits. We conduct extensive experiments to explore the design space of pre-training distillation and find better configurations and interesting conclusions, such as larger student LLMs generally benefiting more from pre-training distillation, while a larger teacher LLM does not necessarily guarantee better results. We hope our exploration of the design space will inform future practices in pre-training distillation.
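To make the logits-based setup concrete, here is a minimal sketch of one common distillation objective: the KL divergence between temperature-softened teacher and student distributions over the vocabulary. The abstract does not say which loss or logits processing works best, so the function names, the temperature parameter, and the choice of KL are assumptions for illustration only.

```python
import math


def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax of a list of logits."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    log_z = m + math.log(sum(math.exp(x - m) for x in scaled))
    return [x - log_z for x in scaled]


def distillation_kl(teacher_logits, student_logits, temperature=1.0):
    """KL(teacher || student) for a single token position.

    Assumed objective: the student is trained to match the teacher's
    next-token distribution; the loss is zero when the two agree.
    """
    log_p = log_softmax(teacher_logits, temperature)
    log_q = log_softmax(student_logits, temperature)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


# Hypothetical logits over a three-token vocabulary.
teacher = [2.0, 1.0, 0.0]
print(distillation_kl(teacher, [2.0, 1.0, 0.0]))  # student agrees with teacher
print(distillation_kl(teacher, [0.0, 1.0, 2.0]))  # student disagrees
```

In practice this per-position term would be averaged over all tokens in a batch and possibly mixed with the ordinary language modeling loss; those details belong to the design space the abstract describes and are not fixed here.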
pdf
bib
abs
Teaching Vision-Language Models to Ask: Resolving Ambiguity in Visual Questions
Pu Jian
|
Donglei Yu
|
Wen Yang
|
Shuo Ren
|
Jiajun Zhang
In the visual question answering (VQA) context, users often pose ambiguous questions to visual language models (VLMs) due to varying expression habits. Existing research addresses such ambiguities primarily by rephrasing questions. These approaches neglect the inherently interactive nature of user interactions with VLMs, where ambiguities can be clarified through user feedback. However, research on interactive clarification faces two major challenges: (1) Benchmarks are absent to assess VLMs’ capacity for resolving ambiguities through interaction; (2) VLMs are trained to prefer answering rather than asking, preventing them from seeking clarification. To overcome these challenges, we introduce the ClearVQA benchmark, which targets three common categories of ambiguity in the VQA context and encompasses various VQA scenarios. Furthermore, we propose an automated pipeline to generate ambiguity-clarification question pairs, enabling VLMs to ask reasonable clarification questions and generate more accurate and specific answers based on user feedback, as demonstrated by experimental results.
pdf
bib
abs
LongBench v2: Towards Deeper Understanding and Reasoning on Realistic Long-context Multitasks
Yushi Bai
|
Shangqing Tu
|
Jiajie Zhang
|
Hao Peng
|
Xiaozhi Wang
|
Xin Lv
|
Shulin Cao
|
Jiazheng Xu
|
Lei Hou
|
Yuxiao Dong
|
Jie Tang
|
Juanzi Li
This paper introduces LongBench v2, a benchmark designed to assess the ability of LLMs to handle long-context problems requiring deep understanding and reasoning across real-world multitasks. LongBench v2 consists of 503 challenging multiple-choice questions, with contexts ranging from 8k to 2M words, across six major task categories: single-document QA, multi-document QA, long in-context learning, long-dialogue history understanding, code repository understanding, and long structured data understanding. To ensure breadth and practicality, we collect data from nearly 100 highly educated individuals with diverse professional backgrounds. We employ both automated and manual review processes to maintain high quality and difficulty, resulting in human experts achieving only 53.7% accuracy under a 15-minute time constraint. Our evaluation reveals that the best-performing model, when directly answering the questions, achieves only 50.1% accuracy. In contrast, the o1-preview model, which includes longer reasoning, achieves 57.7%, surpassing the human baseline by 4 percentage points. These results highlight the importance of enhanced reasoning ability and scaling inference-time compute to tackle the long-context challenges in LongBench v2.
pdf
bib
abs
Battling against Tough Resister: Strategy Planning with Adversarial Game for Non-collaborative Dialogues
Haiyang Wang
|
Zhiliang Tian
|
Yuchen Pan
|
Xin Song
|
Xin Niu
|
Minlie Huang
|
Bin Zhou
Non-collaborative dialogue involves two participants with conflicting interests engaging in a multi-round dialogue to achieve their own goals. Strategy planning is the key to guiding both participants towards a consensus. Most LLM-based methods use stimulus prompts or external strategy planners for strategy planning. However, stimulus prompts fail to teach LLMs to plan dialogue strategies explicitly. Moreover, training external strategy planners does not fully account for adversarial interactions, thereby limiting their effectiveness against tough resisters. In this paper, to mitigate the above issues, we propose GAIA, a Game-based Adversarial self-play InterActive training paradigm, which constructs an adversarial two-player (a persuader and a resister) zero-sum game and guides the game to approximate Nash Equilibrium (NE) via reinforcement learning (RL) for non-collaborative dialogues. First, we design a Chain-of-Mind prompt to reason about the resister’s dialogue act step by step and plan the persuasive strategies. Second, to adversarially improve the persuader, we construct diverse resistant planners and theoretically improve the persuader’s optimal lower bound. Finally, we iteratively optimise their policies via adversarial self-play interactive RL and design an 𝜖-NE verification algorithm to approximate the game’s NE. Experiments on three datasets show that our model obtains state-of-the-art performance.
pdf
bib
abs
Cross-model Transferability among Large Language Models on the Platonic Representations of Concepts
Youcheng Huang
|
Chen Huang
|
Duanyu Feng
|
Wenqiang Lei
|
Jiancheng Lv
Understanding the inner workings of Large Language Models (LLMs) is a critical research frontier. Prior research has shown that a single LLM’s concept representations can be captured as steering vectors (SVs), enabling the control of LLM behavior (e.g., towards generating harmful content). Our work takes a novel approach by exploring the intricate relationships between concept representations across different LLMs, drawing an intriguing parallel to Plato’s Allegory of the Cave. In particular, we introduce a linear transformation method to bridge these representations and present three key findings: 1) Concept representations across different LLMs can be effectively aligned using simple linear transformations, enabling efficient cross-model transfer and behavioral control via SVs. 2) This linear transformation generalizes across concepts, facilitating alignment and control of SVs representing different concepts across LLMs. 3) A weak-to-strong transferability exists between LLM concept representations, whereby SVs extracted from smaller LLMs can effectively control the behavior of larger LLMs. Our code is provided in the supplementary file and will be openly released.
pdf
bib
abs
FoldMoE: Efficient Long Sequence MoE Training via Attention-MoE Pipelining
Guichao Zhu
|
Lintian Lei
|
Yuhao Qing
|
Yichao Fu
|
Fanxin Li
|
Dong Huang
|
Zekai Sun
|
Heming Cui
Training LLMs with Mixture-of-Experts (MoE) architecture on long sequences poses significant challenges due to the all-to-all communication bottleneck of expert parallelism. While existing approaches attempt to hide the communication costs in computation through token-level pipelining within MoE layers, their effectiveness is limited by insufficient computation to overlap with. We present FoldMoE, a high-performance MoE training system that enables token-level overlapping across entire Transformer blocks through novel attention-MoE pipelining. We propose an efficient pipeline schedule and a novel token buffering design to decouple attention and MoE layer partitioning, along with a time-uniform micro-batching strategy for enhanced efficiency. Evaluations on GPT-MoE models with sequences up to 32K tokens show that FoldMoE achieves up to 1.49x and 2.72x speedup over state-of-the-art token-level overlapping and non-overlapping baselines, respectively.
pdf
bib
abs
LongReward: Improving Long-context Large Language Models with AI Feedback
Jiajie Zhang
|
Zhongni Hou
|
Xin Lv
|
Shulin Cao
|
Zhenyu Hou
|
Yilin Niu
|
Lei Hou
|
Yuxiao Dong
|
Ling Feng
|
Juanzi Li
Though significant advancements have been achieved in developing long-context large language models (LLMs), the compromised quality of LLM-synthesized data for supervised fine-tuning (SFT) often affects the long-context performance of SFT models and leads to inherent limitations. In principle, reinforcement learning (RL) with appropriate reward signals can further enhance models’ capacities. However, how to obtain reliable rewards in long-context scenarios remains unexplored. To this end, we propose LongReward, a novel method that utilizes an off-the-shelf LLM to provide rewards for long-context model responses along four human-valued dimensions: helpfulness, logicality, faithfulness, and completeness, each with a carefully designed assessment pipeline. By combining LongReward with the offline RL algorithm DPO, we are able to effectively improve long-context SFT models. Our experiments indicate that LongReward not only significantly improves models’ long-context performance but also enhances their ability to follow short instructions. We also find that long-context DPO with LongReward and conventional short-context DPO can be used together without hurting either one’s performance.
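For readers unfamiliar with DPO, the sketch below shows the standard per-pair DPO loss in plain Python. It assumes the reward scores are used to pick a preferred ("chosen") and a dispreferred ("rejected") response for each prompt; how LongReward actually builds such pairs is not described in the abstract, and the argument names and the default `beta` are illustrative.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one preference pair.

    Inputs are summed log-probabilities of each response under the
    policy being trained and under a frozen reference model.
    """
    margin = beta * ((policy_chosen_logp - ref_chosen_logp)
                     - (policy_rejected_logp - ref_rejected_logp))
    # -log(sigmoid(margin)), computed in a numerically stable way.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Policy identical to the reference: the loss equals log(2).
print(dpo_loss(-10.0, -12.0, -10.0, -12.0))
# Policy shifted toward the chosen response: the loss drops below log(2).
print(dpo_loss(-9.0, -13.0, -10.0, -12.0))
```

The loss falls as the policy raises the likelihood of the chosen response relative to the rejected one, measured against the reference model, which is why no explicit reward model is needed at training time once the pairs are fixed.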
pdf
bib
abs
Influences on LLM Calibration: A Study of Response Agreement, Loss Functions, and Prompt Styles
Yuxi Xia
|
Pedro Henrique Luz De Araujo
|
Klim Zaporojets
|
Benjamin Roth
Calibration, the alignment between model confidence and prediction accuracy, is critical for the reliable deployment of large language models (LLMs). Existing works neglect to measure the generalization of their methods to other prompt styles and different sizes of LLMs. To address this, we define a controlled experimental setting covering 12 LLMs and four prompt styles. We additionally investigate whether incorporating the response agreement of multiple LLMs and an appropriate loss function can improve calibration performance. Concretely, we build Calib-n, a novel framework that trains an auxiliary model for confidence estimation that aggregates responses from multiple LLMs to capture inter-model agreement. To optimize calibration, we integrate focal and AUC surrogate losses alongside binary cross-entropy. Experiments across four datasets demonstrate that both response agreement and focal loss improve calibration over baselines. We find that few-shot prompts are the most effective for auxiliary model-based methods, and auxiliary models demonstrate robust calibration performance across accuracy variations, outperforming LLMs’ internal probabilities and verbalized confidences. These insights deepen the understanding of the factors that influence LLM calibration, supporting their reliable deployment in diverse applications.
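One common way to quantify the gap between confidence and accuracy is the Expected Calibration Error (ECE). The sketch below is a minimal equal-width-bin version; the abstract does not state which calibration metrics the paper reports, so the metric choice, function name, and bin count are assumptions.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE with equal-width confidence bins.

    confidences: predicted probabilities in [0, 1]
    correct:     booleans, True where the prediction was right
    """
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # A confidence of exactly 1.0 goes into the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))

    total = len(confidences)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_conf = sum(c for c, _ in bucket) / len(bucket)
        accuracy = sum(1 for _, ok in bucket if ok) / len(bucket)
        # Weight each bin's confidence/accuracy gap by its share of samples.
        ece += (len(bucket) / total) * abs(avg_conf - accuracy)
    return ece


# Well calibrated: 50% confidence, half of the answers correct.
print(expected_calibration_error([0.5, 0.5], [True, False]))
# Overconfident: full confidence, every answer wrong.
print(expected_calibration_error([1.0, 1.0], [False, False]))
```

A lower value means confidence tracks accuracy more closely. ECE is sensitive to the number of bins, so comparisons across methods should hold `n_bins` fixed.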
pdf
bib
abs
UTBoost: Rigorous Evaluation of Coding Agents on SWE-Bench
Boxi Yu
|
Yuxuan Zhu
|
Pinjia He
|
Daniel Kang
The advent of Large Language Models (LLMs) has spurred the development of coding agents for real-world code generation. As a widely used benchmark for evaluating the code generation capabilities of these agents, SWE-Bench uses real-world problems based on GitHub issues and their corresponding pull requests. However, the manually written test cases included in these pull requests are often insufficient, allowing generated patches to pass the tests without resolving the underlying issue. To address this challenge, we introduce UTGenerator, an LLM-driven test case generator that automatically analyzes codebases and dependencies to generate test cases for real-world Python projects. Building on UTGenerator, we propose UTBoost, a comprehensive framework for test case augmentation. In our evaluation, we identified 36 task instances with insufficient test cases and uncovered 345 erroneous patches incorrectly labeled as passed in the original SWE-Bench. These corrections, impacting 40.9% of SWE-Bench Lite and 24.4% of SWE-Bench Verified leaderboard entries, yield 18 and 11 ranking changes, respectively.
pdf
bib
abs
Towards Better Evaluation for Generated Patent Claims
Lekang Jiang
|
Pascal A. Scherz
|
Stefan Goetz
Patent claims define the scope of protection and establish the legal boundaries of an invention. Drafting these claims is a complex and time-consuming process that usually requires the expertise of skilled patent attorneys, which can form a large access barrier for many small enterprises. To solve these challenges, researchers have investigated large language models (LLMs) for automating patent claim generation. However, existing studies highlight inconsistencies between automated evaluation metrics and human expert assessments. To bridge this gap, we introduce Patent-CE, the first comprehensive benchmark for evaluating patent claims. Patent-CE includes comparative claim evaluations annotated by patent experts, focusing on five key criteria: feature completeness, conceptual clarity, terminology consistency, logical linkage, and overall quality. Additionally, we propose PatClaimEval, a novel multi-dimensional evaluation method specifically designed for patent claims. Our experiments demonstrate that PatClaimEval achieves the highest correlation with human expert evaluations across all assessment criteria among all tested metrics. This research provides the groundwork for more accurate evaluations of automated patent claim generation systems.
pdf
bib
abs
Fine-Tuning on Diverse Reasoning Chains Drives Within-Inference CoT Refinement in LLMs
Haritz Puerto
|
Tilek Chubakov
|
Xiaodan Zhu
|
Harish Tayyar Madabushi
|
Iryna Gurevych
Requiring a large language model (LLM) to generate intermediary reasoning steps, known as Chain of Thought (CoT), has been shown to be an effective way of boosting performance. Previous approaches have focused on generating multiple independent CoTs, combining them through ensembling or other post-hoc strategies to enhance reasoning. In this work, we introduce a novel approach where LLMs are fine-tuned to generate a sequence of Diverse Chains of Thought (DCoT) within a single inference step, which is fundamentally different from prior work that primarily operates on parallel CoT generations. DCoT allows LLMs to gain the ability to perform within-inference refinement of reasoning chains without requiring external feedback. Through a rigorous set of experiments spanning a wide range of tasks that require various reasoning types, we show that fine-tuning on DCoT improves performance over the CoT baseline across model families and scales (1.3B to 70B). These improvements are particularly impactful for tasks with a large result state space, such as those involving numeric answers. Our work is also significant because both quantitative analyses and manual evaluations reveal that the observed gains stem from the models’ ability to refine an initial reasoning chain by generating a second, improved chain within the same inference step, demonstrating previously elusive self-improvement. Our code and data are publicly available.
pdf
bib
abs
Establishing Trustworthy LLM Evaluation via Shortcut Neuron Analysis
Kejian Zhu
|
Shangqing Tu
|
Zhuoran Jin
|
Lei Hou
|
Juanzi Li
|
Jun Zhao
The development of large language models (LLMs) depends on **trustworthy evaluation**. However, most current evaluations rely on public benchmarks, which are prone to data contamination issues that significantly compromise fairness. Previous research has focused on constructing dynamic benchmarks to address contamination. However, continuously building new benchmarks is costly and cyclical. In this work, we aim to tackle contamination by analyzing the mechanisms of contaminated models themselves. Through our experiments, we discover that the overestimation of contaminated models is likely due to parameters acquiring shortcut solutions in training. We further propose a novel method for identifying shortcut neurons through **comparative and causal analysis**. Building on this, we introduce an evaluation method called **shortcut neuron patching** to suppress shortcut neurons. Experiments validate the effectiveness of our approach in mitigating contamination. Additionally, our evaluation results exhibit a strong linear correlation with MixEval, a recently released trustworthy benchmark, achieving a Spearman coefficient (𝜌) exceeding 0.95. This high correlation indicates that our method closely reveals the true capabilities of the models and is trustworthy. We conduct further experiments to demonstrate the generalizability of our method across various benchmarks and hyperparameter settings. **Code**: https://github.com/GaryStack/Trustworthy-Evaluation.
pdf
bib
abs
Do Large Language Models have an English Accent? Evaluating and Improving the Naturalness of Multilingual LLMs
Yanzhu Guo
|
Simone Conia
|
Zelin Zhou
|
Min Li
|
Saloni Potdar
|
Henry Xiao
Current Large Language Models (LLMs) are predominantly designed with English as the primary language, and even the few that are multilingual tend to exhibit strong English-centric biases. Much like speakers who might produce awkward expressions when learning a second language, LLMs often generate unnatural outputs in non-English languages, reflecting English-centric patterns in both vocabulary and grammar. Despite the importance of this issue, the naturalness of multilingual LLM outputs has received limited attention. In this paper, we address this gap by introducing novel automatic corpus-level metrics to assess the lexical and syntactic naturalness of LLM outputs in a multilingual context. Using our new metrics, we evaluate state-of-the-art LLMs on a curated benchmark in French and Chinese, revealing a tendency towards English-influenced patterns. To mitigate this issue, we also propose a simple and effective alignment method to improve the naturalness of an LLM in a target language and domain, achieving consistent improvements in naturalness without compromising the performance on general-purpose benchmarks. Our work highlights the importance of developing multilingual metrics, resources and methods for the new wave of multilingual LLMs.
pdf
bib
abs
Enhancing Character-Level Understanding in LLMs through Token Internal Structure Learning
Zhu Xu
|
Zhiqiang Zhao
|
Zihan Zhang
|
Yuchi Liu
|
Quanwei Shen
|
Fei Liu
|
Yu Kuang
|
Jian He
|
Conglin Liu
Tokenization methods like Byte-Pair Encoding (BPE) enhance computational efficiency in large language models (LLMs) but often obscure internal character structures within tokens. This limitation hinders LLMs’ ability to predict precise character positions, which is crucial in tasks like Chinese Spelling Correction (CSC) where identifying the positions of misspelled characters accelerates correction processes. We propose Token Internal Position Awareness (TIPA), a method that significantly improves models’ ability to capture character positions within tokens by training them on reverse character prediction tasks using the tokenizer’s vocabulary. Experiments demonstrate that TIPA enhances position prediction accuracy in LLMs, enabling more precise identification of target characters in original text. Furthermore, when applied to downstream tasks that do not require exact position prediction, TIPA still boosts performance in tasks needing character-level information, validating its versatility and effectiveness.
pdf
bib
abs
Conformity in Large Language Models
Xiaochen Zhu
|
Caiqi Zhang
|
Tom Stafford
|
Nigel Collier
|
Andreas Vlachos
The conformity effect describes the tendency of individuals to align their responses with the majority. Studying this bias in large language models (LLMs) is crucial, as LLMs are increasingly used in various information-seeking and decision-making tasks as conversation partners to improve productivity. Thus, conformity to incorrect responses can compromise their effectiveness. In this paper, we adapt psychological experiments to examine the extent of conformity in state-of-the-art LLMs. Our findings reveal that all models tested exhibit varying levels of conformity toward the majority, regardless of their initial choice or correctness, across different knowledge domains. Notably, we are the first to show that LLMs are more likely to conform when they are more uncertain in their own prediction. We further explore factors that influence conformity, such as training paradigms and input characteristics, finding that instruction-tuned models are less susceptible to conformity, while increasing the naturalness of majority tones amplifies conformity. Finally, we propose two interventions—Devil’s Advocate and Question Distillation—to mitigate conformity, providing insights into building more robust language models.
pdf
bib
abs
Interpret and Improve In-Context Learning via the Lens of Input-Label Mappings
Chenghao Sun
|
Zhen Huang
|
Yonggang Zhang
|
Le Lu
|
Houqiang Li
|
Xinmei Tian
|
Xu Shen
|
Jieping Ye
Large language models (LLMs) excel at downstream NLP tasks through in-context learning (ICL) with a few demonstrations of input–label pairs. However, the internal mechanisms behind ICL remain under-explored, particularly the mappings between inputs and labels. In this work, we reverse-engineer ICL by examining input-label mappings: what they are within LLMs, where they function, and how LLMs utilize them. (1) what: We discover input-label mappings stored within a few specific layers in the form of principal components (PCs), which capture human-interpretable and task-related words. (2) where: We propose a PC patching approach to identify the modules where input-label mappings function. Specifically, PC patching automatically crafts counterfactual representations using identified semantic PCs, rather than manually designing counterfactual text, to suppress the behavior related to LLM capability for ICL-related modules. Utilizing PC patching, we identify LLMs apply input-label mappings in a small fraction of attention heads. (3) how: We observe and verify that the identified key heads utilize input-label mappings from demonstrations to generate target labels for new queries. Based on these discoveries, we further show that precisely fine-tuning key ICL-related modules leads to significant improvements across diverse tasks.
pdf
bib
abs
Positional Overload: Positional Debiasing and Context Window Extension for Large Language Models using Set Encoding
Lukas Kinder
|
Lukas Edman
|
Alexander Fraser
|
Tobias Käfer
Large Language Models (LLMs) typically track the order of tokens using positional encoding, which causes the following problems: positional bias, where the model is influenced by an ordering within the prompt, and a fixed context window, as models struggle to generalize to positions beyond those encountered during training. To address these limitations, we developed a novel method called set encoding. This method allows multiple pieces of text to be encoded in the same position, thereby eliminating positional bias entirely. Another promising use case for set encoding is to increase the size of the input an LLM can handle. Our experiments demonstrate that set encoding allows an LLM to solve tasks with far more tokens than without set encoding. To our knowledge, set encoding is the first technique to effectively extend an LLM’s context window without requiring any additional training.
pdf
bib
abs
FR-Spec: Accelerating Large-Vocabulary Language Models via Frequency-Ranked Speculative Sampling
Weilin Zhao
|
Tengyu Pan
|
Xu Han
|
Yudi Zhang
|
Sun Ao
|
Yuxiang Huang
|
Kaihuo Zhang
|
Weilun Zhao
|
Yuxuan Li
|
Jie Zhou
|
Hao Zhou
|
Jianyong Wang
|
Maosong Sun
|
Zhiyuan Liu
Speculative sampling has emerged as an important technique for accelerating the auto-regressive generation process of large language models (LLMs) by utilizing a draft-then-verify mechanism to produce multiple tokens per forward pass. While state-of-the-art speculative sampling methods use only a single layer and a language modeling (LM) head as the draft model to achieve impressive layer compression, their efficiency gains are substantially reduced for large-vocabulary LLMs, such as Llama-3-8B with a vocabulary of 128k tokens. To address this, we present FR-Spec, a frequency-ranked speculative sampling framework that optimizes draft candidate selection through vocabulary space compression. By constraining the draft search to a frequency-prioritized token subset, our method reduces LM Head computation overhead by 75% while ensuring the equivalence of the final output distribution. Experiments across multiple datasets demonstrate an average of 1.12× speedup over the state-of-the-art speculative sampling method EAGLE-2. Code is available at https://github.com/thunlp/FR-Spec.
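The idea of constraining the draft search to a frequency-prioritized token subset can be sketched in a few lines. This is a toy illustration under stated assumptions, not the FR-Spec implementation: the helper names, the dictionary of token counts, and the `keep_fraction` parameter are all hypothetical, and the LM head is modeled as a plain list of weight rows. Only the draft side is restricted here; verification by the full model is left out.

```python
def frequency_ranked_subset(token_counts, keep_fraction):
    """Return the ids of the most frequent tokens (hypothetical helper).

    token_counts maps token id -> corpus frequency; ties break on the lower id.
    """
    ranked = sorted(token_counts, key=lambda t: (-token_counts[t], t))
    k = max(1, int(len(ranked) * keep_fraction))
    return ranked[:k]


def lm_head_logits(hidden, head_weights, token_ids):
    """Compute logits only for the selected rows of the LM head."""
    return {
        t: sum(h * w for h, w in zip(hidden, head_weights[t]))
        for t in token_ids
    }


def draft_next_token(hidden, head_weights, subset):
    """Greedy draft step restricted to the frequent-token subset."""
    logits = lm_head_logits(hidden, head_weights, subset)
    return max(logits, key=logits.get)
```

Because the number of dot products scales with the size of the subset, keeping a quarter of the vocabulary in this sketch would cut the draft head's work to roughly a quarter of the full computation.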
pdf
bib
abs
VReST: Enhancing Reasoning in Large Vision-Language Models through Tree Search and Self-Reward Mechanism
Congzhi Zhang
|
Jiawei Peng
|
Zhenglin Wang
|
Yilong Lai
|
Haowen Sun
|
Heng Chang
|
Fei Ma
|
Weijiang Yu
Large Vision-Language Models (LVLMs) have shown exceptional performance in multimodal tasks, but their effectiveness in complex visual reasoning is still constrained, especially when employing Chain-of-Thought prompting techniques. In this paper, we propose VReST, a novel training-free approach that enhances Reasoning in LVLMs through Monte Carlo Tree Search and Self-Reward mechanisms. VReST meticulously traverses the reasoning landscape by establishing a search tree, where each node encapsulates a reasoning step, and each path delineates a comprehensive reasoning sequence. Our innovative multimodal Self-Reward mechanism assesses the quality of reasoning steps by integrating the utility of sub-questions, answer correctness, and the relevance of vision-language clues, all without the need for additional models. VReST surpasses current prompting methods and secures state-of-the-art performance across three multimodal mathematical reasoning benchmarks. Furthermore, it substantiates the efficacy of test-time scaling laws in multimodal tasks, offering a promising direction for future research.
pdf
bib
abs
Past Meets Present: Creating Historical Analogy with Large Language Models
Nianqi Li
|
Siyu Yuan
|
Jiangjie Chen
|
Jiaqing Liang
|
Feng Wei
|
Zujie Liang
|
Deqing Yang
|
Yanghua Xiao
Historical analogies, which compare known past events with contemporary but unfamiliar events, are important tools that help people make decisions and understand the world. However, research in applied history suggests that people have difficulty finding appropriate analogies, and previous studies in the AI community have also overlooked historical analogies. To fill this gap, in this paper, we focus on the historical analogy acquisition task, which aims to acquire analogous historical events for a given event. We explore retrieval and generation methods for acquiring historical analogies based on different large language models (LLMs). Furthermore, we propose a self-reflection method to mitigate hallucinations and stereotypes when LLMs generate historical analogies. Through human evaluations and our specially designed automatic multi-dimensional assessment, we find that LLMs generally have good potential for historical analogies, and that the performance of the models can be further improved by using our self-reflection method. Resources of this paper can be found at https://anonymous.4open.science/r/Historical-Analogy-of-LLMs-FC17
pdf
bib
abs
Meta-Reflection: A Feedback-Free Reflection Learning Framework
Yaoke Wang
|
Yun Zhu
|
XintongBao XintongBao
|
Wenqiao Zhang
|
Suyang Dai
|
Kehan Chen
|
Wenqiang Li
|
Gang Huang
|
Siliang Tang
|
Yueting Zhuang
Despite the remarkable capabilities of large language models (LLMs) in natural language understanding and reasoning, they often display undesirable behaviors, such as generating hallucinations and unfaithful reasoning. A prevalent strategy to mitigate these issues is the use of reflection, which refines responses through an iterative process. However, while promising, reflection heavily relies on high-quality external feedback and requires iterative multi-agent inference processes, thus hindering its practical application. In this paper, we propose Meta-Reflection, a novel feedback-free reflection mechanism that necessitates only a single inference pass without external feedback. Motivated by the human ability to remember and retrieve reflections from past experiences when encountering similar problems, Meta-Reflection integrates reflective insights into a codebook, allowing the historical insights to be stored, retrieved, and used to guide LLMs in problem-solving. To thoroughly investigate and evaluate the practicality of Meta-Reflection in real-world scenarios, we introduce an industrial e-commerce benchmark named E-commerce Customer Intent Detection. Extensive experiments conducted on both public datasets and the ECID benchmark highlight the effectiveness and efficiency of our proposed approach. Project is available at https://github.com/DCDmllm/Meta-Reflection
pdf
bib
abs
Read it in Two Steps: Translating Extremely Low-Resource Languages with Code-Augmented Grammar Books
Chen Zhang
|
Jiuheng Lin
|
Xiao Liu
|
Zekai Zhang
|
Yansong Feng
While large language models (LLMs) have shown promise in translating extremely low-resource languages using resources like dictionaries, the effectiveness of grammar books remains debated. This paper investigates the role of grammar books in translating extremely low-resource languages by decomposing it into two key steps: grammar rule retrieval and application. To facilitate the study, we introduce ZhuangRules, a modularized dataset of grammar rules and their corresponding test sentences. Our analysis reveals that rule retrieval constitutes a primary bottleneck in grammar-based translation. Moreover, although LLMs can apply simple rules for translation when explicitly provided, they encounter difficulties in handling more complex rules. To address these challenges, we propose representing grammar rules as code functions, considering their similarities in structure and the benefit of code in facilitating LLM reasoning. Our experiments show that using code rules significantly boosts both rule retrieval and application, ultimately resulting in a 13.1% BLEU improvement in translation.
pdf
bib
abs
Confidence v.s. Critique: A Decomposition of Self-Correction Capability for LLMs
Zhe Yang
|
Yichang Zhang
|
Yudong Wang
|
Ziyao Xu
|
Junyang Lin
|
Zhifang Sui
Large Language Models (LLMs) can correct their self-generated responses, but a decline in accuracy after self-correction is also observed. To gain a deeper understanding of self-correction, we decompose, evaluate, and analyze the self-correction behaviors of LLMs. By enumerating and analyzing answer correctness before and after self-correction, we decompose the self-correction capability into confidence (remaining confident in correct answers) and critique (turning wrong answers into correct ones) capabilities, and propose two metrics from a probabilistic perspective to measure these two capabilities, along with another metric for overall self-correction capability evaluation. Based on our decomposition and evaluation metrics, we conduct extensive experiments and draw some empirical conclusions. For example, we find that different models can exhibit distinct behaviors: some models are confident while others are more critical. We also find a trade-off between the two capabilities (i.e., improving one can lead to a decline in the other) when manipulating model self-correction behavior by prompts or in-context learning. Further, we find a simple yet efficient strategy to improve self-correction capability by transforming the Supervised Fine-Tuning (SFT) data format, and our strategy outperforms vanilla SFT in both capabilities and achieves much higher accuracy after self-correction.
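One simple way to read the decomposition described above is as two conditional frequencies over per-question correctness before and after self-correction. The sketch below is an assumption for illustration only: the abstract does not give the paper's exact metric definitions, and the function name and returned keys are hypothetical.

```python
def self_correction_rates(before, after):
    """Estimate confidence- and critique-style rates from correctness flags.

    before[i] / after[i] are booleans: was answer i correct before / after
    self-correction. A rate is None when its conditioning set is empty.
    """
    if len(before) != len(after) or not before:
        raise ValueError("inputs must be non-empty and of equal length")
    pairs = list(zip(before, after))
    n_correct_before = sum(1 for b, _ in pairs if b)
    n_wrong_before = len(pairs) - n_correct_before
    kept = sum(1 for b, a in pairs if b and a)        # correct -> still correct
    fixed = sum(1 for b, a in pairs if not b and a)   # wrong -> corrected
    return {
        "confidence": kept / n_correct_before if n_correct_before else None,
        "critique": fixed / n_wrong_before if n_wrong_before else None,
        "accuracy_before": n_correct_before / len(pairs),
        "accuracy_after": sum(1 for _, a in pairs if a) / len(pairs),
    }
```

Under this reading, accuracy after self-correction equals `accuracy_before * confidence + (1 - accuracy_before) * critique`, which makes clear why a drop in either rate can lower final accuracy.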
pdf
bib
abs
Automating Legal Interpretation with LLMs: Retrieval, Generation, and Evaluation
Kangcheng Luo
|
Quzhe Huang
|
Cong Jiang
|
Yansong Feng
Interpreting the law is always essential for the law to adapt to the ever-changing society. It is a critical and challenging task even for legal practitioners, as it requires meticulous and professional annotations and summarizations by legal experts, which are admittedly time-consuming and expensive to collect at scale. To alleviate the burden on legal experts, we propose a method for automated legal interpretation. Specifically, by emulating doctrinal legal research, we introduce a novel framework, **ATRIE**, to address Legal Concept Interpretation, a typical task in legal interpretation. **ATRIE** utilizes large language models (LLMs) to **A**u**T**omatically **R**etrieve concept-related information, **I**nterpret legal concepts, and **E**valuate generated interpretations, eliminating dependence on legal experts. ATRIE comprises a legal concept interpreter and a legal concept interpretation evaluator. The interpreter uses LLMs to retrieve relevant information from previous cases and interpret legal concepts. The evaluator uses performance changes on Legal Concept Entailment, a downstream task we propose, as a proxy of interpretation quality. Automated and multifaceted human evaluations indicate that the quality of our interpretations is comparable to those written by legal experts, with superior comprehensiveness and readability. Although there remains a slight gap in accuracy, it can already assist legal practitioners in improving the efficiency of legal interpretation.
pdf
bib
abs
Visual Evidence Prompting Mitigates Hallucinations in Large Vision-Language Models
Wei Li
|
Zhen Huang
|
Houqiang Li
|
Le Lu
|
Yang Lu
|
Xinmei Tian
|
Xu Shen
|
Jieping Ye
Large Vision-Language Models (LVLMs) have shown impressive progress by integrating visual perception with linguistic understanding to produce contextually grounded outputs. Despite these advancements, LVLMs still suffer from the hallucination problem, e.g., they tend to produce content that does not exist in the input images. Our investigation suggests that such hallucinations often stem from deficiencies in fine-grained visual comprehension, particularly when visual scenes exhibit appearance or semantic similarities (e.g., bicycle vs. motorcycles, baseball bat vs. baseball). In this work, we show that such hallucination is naturally mitigated via a novel method called visual evidence prompting, which utilizes small visual models to complement the LVLMs. While traditional visual models are not adept at interacting with humans, they excel at perceiving fine-grained image contents. By symbolizing the professional outputs of domain-expert models as prompts, the LVLM generalists are able to refer to this evidence as visual knowledge to generate more precise answers. Detailed analysis shows that visual evidence enables models to adjust and rectify the attribution and attention on the images, reducing visual confusion by suppressing false activations while enhancing correct ones. Extensive experiments and in-depth analysis demonstrate the effectiveness of our method. We hope our straightforward but insightful work enhances the comprehension of hallucination in LVLMs and offers valuable perspectives on addressing such challenges.
pdf
bib
abs
Leveraging Dual Process Theory in Language Agent Framework for Real-time Simultaneous Human-AI Collaboration
Shao Zhang
|
Xihuai Wang
|
Wenhao Zhang
|
Chaoran Li
|
Junru Song
|
Tingyu Li
|
Lin Qiu
|
Xuezhi Cao
|
Xunliang Cai
|
Wen Yao
|
Weinan Zhang
|
Xinbing Wang
|
Ying Wen
Agents built on large language models (LLMs) have excelled in turn-by-turn human-AI collaboration but struggle with simultaneous tasks requiring real-time interaction. Latency issues and the challenge of inferring variable human strategies hinder their ability to make autonomous decisions without explicit instructions. Through experiments with current independent *System 1* and *System 2* methods, we validate the necessity of using Dual Process Theory (DPT) in real-time tasks. We propose DPT-Agent, a novel language agent framework that integrates *System 1* and *System 2* for efficient real-time simultaneous human-AI collaboration. DPT-Agent’s *System 1* uses a Finite-state Machine (FSM) and code-as-policy for fast, intuitive, and controllable decision-making. DPT-Agent’s *System 2* integrates Theory of Mind (ToM) and asynchronous reflection to infer human intentions and perform reasoning-based autonomous decisions. We demonstrate the effectiveness of DPT-Agent through further experiments with rule-based agents and human collaborators, showing significant improvements over mainstream LLM-based frameworks. To the best of our knowledge, DPT-Agent is the first language agent framework that achieves successful real-time simultaneous human-AI collaboration autonomously. Code of DPT-Agent can be found in https://github.com/sjtu-marl/DPT-Agent.
pdf
bib
abs
TokAlign: Efficient Vocabulary Adaptation via Token Alignment
Chong Li
|
Jiajun Zhang
|
Chengqing Zong
Tokenization serves as a foundational step for Large Language Models (LLMs) to process text. In new domains or languages, the inefficiency of the tokenizer slows down the training and generation of LLMs. The mismatch in vocabulary also hinders deep knowledge transfer between LLMs, such as token-level distillation. To mitigate this gap, we propose an efficient method named **TokAlign** to replace the vocabulary of an LLM from the token co-occurrence view, and further transfer token-level knowledge between models. It first aligns the source vocabulary to the target one by learning a one-to-one mapping matrix for token IDs. Model parameters, including embeddings, are rearranged and progressively fine-tuned for the new vocabulary. Our method significantly improves multilingual text compression rates and vocabulary initialization for LLMs, decreasing the perplexity from 3.4e2 of strong baseline methods to 1.2e2 after initialization. Experimental results on models across multiple parameter scales demonstrate the effectiveness and generalization of TokAlign, which costs as few as 5k steps to restore the performance of the vanilla model. After unifying vocabularies between LLMs, token-level distillation can remarkably boost the base model (+4.4% over sentence-level distillation), costing only 235M tokens.
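The rearrangement step mentioned above can be pictured as re-indexing embedding rows through a one-to-one token ID mapping. The sketch below is a hypothetical illustration: the function name and the direction of the mapping (target ID to source ID) are assumptions, and how the mapping is learned from token co-occurrences is not shown.

```python
def rearrange_embeddings(source_embeddings, target_to_source):
    """Build an embedding table for the target vocabulary.

    target_to_source[i] is the source token id aligned with target token i
    (an assumed convention). Row i of the result is a copy of that source row.
    """
    if len(set(target_to_source)) != len(target_to_source):
        raise ValueError("mapping must be one-to-one")
    for s in target_to_source:
        if not 0 <= s < len(source_embeddings):
            raise ValueError("source id out of range")
    return [list(source_embeddings[s]) for s in target_to_source]
```

In this picture the rearranged table only serves as an initialization; the abstract states that the parameters are then progressively fine-tuned for the new vocabulary.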
pdf
bib
abs
AdaEdit: Advancing Continuous Knowledge Editing For Large Language Models
Qi Li
|
Xiaowen Chu
Knowledge editing (KE) has emerged as a prominent alternative that enables efficient and precise information modification inside language models. However, a critical challenge arises in continuous language model editing: performance declines significantly in both knowledge update and retention as the number of edits increases. By dissecting the perturbation weights of language models in continuous KE, we uncover that disentangled and sparsified knowledge representation can significantly alleviate the performance decline. Building on these insights, we introduce AdaEdit, a novel knowledge editing method. Extensive empirical evaluations on multiple LLMs demonstrate that our proposed method can enhance the performance of edited LLMs in large-scale continuous editing regimes, outperforming existing ones without substantially compromising the general abilities of these models.
pdf
bib
abs
The Impact of Token Granularity on the Predictive Power of Language Model Surprisal
Byung-Doh Oh
|
William Schuler
Word-by-word language model surprisal is often used to model the incremental processing of human readers, which raises questions about how various choices in language modeling influence its predictive power. One factor that has been overlooked in cognitive modeling is the granularity of subword tokens, which explicitly encodes information about word length and frequency, and ultimately influences the quality of vector representations that are learned. This paper presents experiments that manipulate the token granularity and evaluate its impact on the ability of surprisal to account for processing difficulty of naturalistic text and garden-path constructions. Experiments with naturalistic reading times reveal a substantial influence of token granularity on surprisal, with tokens defined by a vocabulary size of 8,000 resulting in surprisal that is most predictive. In contrast, on garden-path constructions, language models trained on coarser-grained tokens generally assigned higher surprisal to critical regions, suggesting a greater sensitivity to garden-path effects than previously reported. Taken together, these results suggest a large role of token granularity on the quality of language model surprisal for cognitive modeling.
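For readers less familiar with the quantity under study, the sketch below shows how word-level surprisal is commonly obtained from subword token probabilities: surprisal is the negative log probability, and by the chain rule a word's surprisal is the sum of the surprisals of its tokens. The function names and the use of base-2 logarithms (bits) are choices made for this example, and the probabilities are made up.

```python
import math


def token_surprisal(prob):
    """Surprisal in bits of one token given its conditional probability."""
    if not 0.0 < prob <= 1.0:
        raise ValueError("probability must be in (0, 1]")
    return -math.log2(prob)


def word_surprisal(token_probs):
    """Sum token surprisals; equals -log2 of the product of the probabilities."""
    return sum(token_surprisal(p) for p in token_probs)
```

A word kept as one token with probability 0.125 and the same word split into two tokens with probabilities 0.5 and 0.25 both have a surprisal of 3 bits, so differences across granularities arise from how the model distributes probability, not from the summation itself.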
pdf
bib
abs
Segment-Level Diffusion: A Framework for Controllable Long-Form Generation with Diffusion Language Models
Xiaochen Zhu
|
Georgi Karadzhov
|
Chenxi Whitehouse
|
Andreas Vlachos
Diffusion models have shown promise in text generation, but often struggle with generating long, coherent, and contextually accurate text. Token-level diffusion doesn’t model word-order dependencies explicitly and operates on short, fixed output windows, while passage-level diffusion struggles with learning robust representations for long-form text. To address these challenges, we propose Segment-Level Diffusion (SLD), a framework that enhances diffusion-based text generation through text segmentation, robust representation training with adversarial and contrastive learning, and improved latent-space guidance. By segmenting long-form outputs into multiple latent representations and decoding them with an autoregressive decoder, SLD simplifies diffusion predictions and improves scalability. Experiments on four datasets demonstrate that, when compared to other diffusion and autoregressive baselines, SLD achieves competitive or superior fluency, coherence, and contextual compatibility in automatic and human evaluations.
pdf
bib
abs
BELLE: A Bi-Level Multi-Agent Reasoning Framework for Multi-Hop Question Answering
Taolin Zhang
|
Dongyang Li
|
Qizhou Chen
|
Chengyu Wang
|
Xiaofeng He
Multi-hop question answering (QA) involves finding multiple relevant passages and performing step-by-step reasoning to answer complex questions. Previous works on multi-hop QA employ specific methods from different modeling perspectives based on large language models (LLMs), regardless of the question types. In this paper, we first conduct an in-depth analysis of public multi-hop QA benchmarks, dividing the questions into four types and evaluating five types of cutting-edge methods for multi-hop QA: Chain-of-Thought (CoT), Single-step, Iterative-step, Sub-step, and Adaptive-step. We find that different types of multi-hop questions have varying degrees of sensitivity to different types of methods. Thus, we propose a Bi-levEL muLti-agEnt reasoning (BELLE) framework to address multi-hop QA by specifically focusing on the correspondence between question types and methods, where each type of method is regarded as an “operator” by prompting LLMs differently. The first level of BELLE includes multiple agents that debate to obtain an executive plan of combined “operators” to address the multi-hop QA task comprehensively. During the debate, in addition to the basic roles of affirmative debater, negative debater, and judge, at the second level, we further leverage fast and slow debaters to monitor whether changes in viewpoints are reasonable. Extensive experiments demonstrate that BELLE significantly outperforms strong baselines on various datasets. Additionally, BELLE is more cost-effective in model consumption than single models in more complex multi-hop QA scenarios.
pdf
bib
abs
Dynamic and Generalizable Process Reward Modeling
Zhangyue Yin
|
Qiushi Sun
|
Zhiyuan Zeng
|
Qinyuan Cheng
|
Xipeng Qiu
|
Xuanjing Huang
Process Reward Models (PRMs) are crucial for guiding Large Language Models (LLMs) in complex scenarios by providing dense reward signals. However, existing PRMs primarily rely on heuristic approaches, which struggle with cross-domain generalization. While LLM-as-judge has been proposed to provide generalized rewards, current research has focused mainly on feedback results, overlooking the meaningful guidance embedded within the text. Additionally, static and coarse-grained evaluation criteria struggle to adapt to complex process supervision. To tackle these challenges, we propose Dynamic and Generalizable Process Reward Modeling (DG-PRM), which features a reward tree to capture and store fine-grained, multi-dimensional reward criteria. DG-PRM dynamically selects reward signals for step-wise reward scoring. To handle multifaceted reward signals, we pioneeringly adopt Pareto dominance estimation to identify discriminative positive and negative pairs. Experimental results show that DG-PRM achieves stunning performance on prevailing benchmarks, significantly boosting model performance across tasks with dense rewards. Further analysis reveals that DG-PRM adapts well to out-of-distribution scenarios, demonstrating exceptional generalizability.
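Pareto dominance, which the abstract above uses to pick positive and negative pairs from multi-dimensional rewards, has a compact definition: one reward vector dominates another if it is at least as good on every criterion and strictly better on at least one. The sketch below illustrates that definition only; the function names and the example scores are hypothetical, and the paper's estimation procedure is not reproduced here.

```python
def dominates(a, b):
    """True if reward vector a Pareto-dominates b (higher is better)."""
    if len(a) != len(b):
        raise ValueError("reward vectors must have the same length")
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_pairs(candidates):
    """Return (winner, loser) name pairs where winner dominates loser.

    candidates maps a step name to its tuple of per-criterion scores.
    """
    names = sorted(candidates)
    return [
        (w, l)
        for w in names
        for l in names
        if w != l and dominates(candidates[w], candidates[l])
    ]
```

Steps that trade off against each other, one higher on the first criterion and the other higher on the second, produce no pair, so only unambiguous preferences are kept in this sketch.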
pdf
bib
abs
AdamMeme: Adaptively Probe the Reasoning Capacity of Multimodal Large Language Models on Harmfulness
Zixin Chen
|
Hongzhan Lin
|
Kaixin Li
|
Ziyang Luo
|
Zhen Ye
|
Guang Chen
|
Zhiyong Huang
|
Jing Ma
The proliferation of multimodal memes in the social media era demands that multimodal Large Language Models (mLLMs) effectively understand meme harmfulness. Existing benchmarks for assessing mLLMs on harmful meme understanding rely on accuracy-based, model-agnostic evaluations using static datasets. These benchmarks are limited in their ability to provide up-to-date and thorough assessments, as online memes evolve dynamically. To address this, we propose AdamMeme, a flexible, agent-based evaluation framework that adaptively probes the reasoning capabilities of mLLMs in deciphering meme harmfulness. Through multi-agent collaboration, AdamMeme provides comprehensive evaluations by iteratively updating the meme data with challenging samples, thereby exposing specific limitations in how mLLMs interpret harmfulness. Extensive experiments show that our framework systematically reveals the varying performance of different target mLLMs, offering in-depth, fine-grained analyses of model-specific weaknesses. Our code is available at https://github.com/Lbotirx/AdamMeme.
pdf
bib
abs
Towards Text-Image Interleaved Retrieval
Xin Zhang
|
Ziqi Dai
|
Yongqi Li
|
Yanzhao Zhang
|
Dingkun Long
|
Pengjun Xie
|
Meishan Zhang
|
Jun Yu
|
Wenjie Li
|
Min Zhang
Current multimodal information retrieval studies mainly focus on single-image inputs, which limits real-world applications involving multiple images and text-image interleaved content. In this work, we introduce the text-image interleaved retrieval (TIIR) task, where the query and document are interleaved text-image sequences, and the model is required to understand the semantics from the interleaved context for effective retrieval. We construct a TIIR benchmark based on naturally interleaved wikiHow tutorials, where a specific pipeline is designed to generate interleaved queries. To explore the task, we adapt several off-the-shelf retrievers and build a dense baseline with an interleaved multimodal large language model (MLLM). We then propose a novel Matryoshka Multimodal Embedder (MME), which compresses the number of visual tokens at different granularities, to address the challenge of excessive visual tokens in MLLM-based TIIR models. Experiments demonstrate that simple adaptation of existing models does not consistently yield effective results. Our MME achieves significant improvements over the baseline with substantially fewer visual tokens. We provide extensive analysis and will release the dataset and code to facilitate future research.
pdf
bib
abs
Large Margin Representation Learning for Robust Cross-lingual Named Entity Recognition
Guangcheng Zhu
|
Ruixuan Xiao
|
Haobo Wang
|
Zhen Zhu
|
Gengyu Lyu
|
Junbo Zhao
Cross-lingual named entity recognition (NER) aims to build an NER model that generalizes to the low-resource target language with labeled data from the high-resource source language. Current state-of-the-art methods typically combine a self-training mechanism with a contrastive learning paradigm in order to develop discriminative entity clusters for cross-lingual adaptation. Despite their promise, we identify that these methods neglect two key problems: distribution skewness and pseudo-label bias, leading to indistinguishable entity clusters with small margins. To this end, we propose a novel framework, MARAL, which optimizes an adaptively reweighted contrastive loss to handle the class skewness and theoretically guarantees the optimal feature arrangement with maximum margin. To further mitigate the adverse effects of unreliable pseudo-labels, MARAL integrates a progressive cross-lingual adaptation strategy, which first selects reliable samples as anchors and then refines the remaining unreliable ones. Extensive experiments demonstrate that MARAL significantly outperforms the current state-of-the-art methods on multiple benchmarks, e.g., +2.04% on the challenging MultiCoNER dataset.
pdf
bib
abs
An Efficient and Precise Training Data Construction Framework for Process-supervised Reward Model in Mathematical Reasoning
Wei Sun
|
Qianlong Du
|
Fuwei Cui
|
Jiajun Zhang
Enhancing the mathematical reasoning capabilities of Large Language Models (LLMs) is of great scientific and practical significance. Researchers typically employ process-supervised reward models (PRMs) to guide the reasoning process, effectively improving the models’ reasoning abilities. However, existing methods for constructing process supervision training data, such as manual annotation and per-step Monte Carlo estimation, are often costly or suffer from poor quality. To address these challenges, this paper introduces a framework called EpicPRM (Efficient, Precise, Cheap), which annotates each intermediate reasoning step based on its quantified contribution and uses an adaptive binary search algorithm to enhance both annotation precision and efficiency. Using this approach, we efficiently construct a high-quality process supervision training dataset named Epic50k, consisting of 50k annotated intermediate steps. Compared to other publicly available datasets, the PRM trained on Epic50k demonstrates significantly superior performance.
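One common way binary search is used in process-supervision annotation is to locate the first step that breaks a solution without checking every prefix. The sketch below shows that generic idea only; the abstract does not describe EpicPRM's adaptive variant, and `first_bad_step` and the `prefix_ok` oracle (in practice something like a rollout-based check) are hypothetical names introduced for illustration.

```python
from typing import Callable, Sequence


def first_bad_step(steps: Sequence[str], prefix_ok: Callable[[int], bool]) -> int:
    """Return the 0-based index of the first step whose inclusion makes the
    prefix fail, or len(steps) if every prefix passes.

    Assumes `prefix_ok(k)` judges the first k steps and is monotone:
    once a prefix fails, every longer prefix fails too."""
    lo, hi = 0, len(steps)  # invariant: the prefix of length `lo` passes
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if prefix_ok(mid):
            lo = mid        # first `mid` steps are fine; look further right
        else:
            hi = mid - 1    # failure is at or before step `mid - 1`
    return lo


# Hypothetical solution of five steps where step index 3 is the first error.
steps = ["s0", "s1", "s2", "s3", "s4"]
calls = []


def oracle(k: int) -> bool:
    calls.append(k)   # record how many prefix checks were needed
    return k <= 3     # prefixes that stop before the bad step pass


bad_index = first_bad_step(steps, oracle)
print(bad_index, len(calls))
```

With a monotone oracle, the number of prefix checks grows logarithmically with the number of steps rather than linearly, which is the efficiency argument for search-based annotation.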
pdf
bib
abs
QAEncoder: Towards Aligned Representation Learning in Question Answering Systems
Zhengren Wang
|
Qinhan Yu
|
Shida Wei
|
Zhiyu Li
|
Feiyu Xiong
|
Xiaoxing Wang
|
Simin Niu
|
Hao Liang
|
Wentao Zhang
Modern QA systems entail retrieval-augmented generation (RAG) for accurate and trustworthy responses. However, the inherent gap between user queries and relevant documents hinders precise matching. We introduce QAEncoder, a training-free approach to bridge this gap. Specifically, QAEncoder estimates the expectation of potential queries in the embedding space as a robust surrogate for the document embedding, and attaches document fingerprints to effectively distinguish these embeddings. Extensive experiments across diverse datasets, languages, and embedding models confirmed QAEncoder’s alignment capability, which offers a simple-yet-effective solution with zero additional index storage, retrieval latency, training costs, or catastrophic forgetting and hallucination issues. The repository is publicly available at https://github.com/IAAR-Shanghai/QAEncoder.
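The idea of using the expected query embedding as a surrogate for a document embedding can be sketched as follows. This is a minimal illustration, not QAEncoder's implementation: the toy vectors are made up, and the interpolation weight `alpha` used to keep some of the document's own embedding (standing in for a "fingerprint") is an assumption, since the abstract does not define the fingerprint mechanism.

```python
from typing import List

Vector = List[float]


def mean_vector(vectors: List[Vector]) -> Vector:
    """Component-wise mean, a Monte Carlo estimate of the expected embedding."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def surrogate_embedding(doc_vec: Vector, query_vecs: List[Vector], alpha: float = 0.5) -> Vector:
    """Blend the document embedding with the mean of embeddings of queries
    that the document could answer. `alpha` is a hypothetical mixing weight."""
    q_mean = mean_vector(query_vecs)
    return [alpha * d + (1 - alpha) * q for d, q in zip(doc_vec, q_mean)]


# Toy 2-dimensional embeddings (assumed values, not model outputs).
doc = [1.0, 1.0]
generated_queries = [[1.0, 0.0], [0.0, 1.0]]
index_vec = surrogate_embedding(doc, generated_queries, alpha=0.5)
print(index_vec)
```

Because the surrogate replaces the stored document vector rather than adding new ones, an approach of this shape needs no extra index entries, which matches the zero-additional-storage claim in the abstract.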
pdf
bib
abs
Game Development as Human-LLM Interaction
Jiale Hong
|
Hongqiu Wu
|
Hai Zhao
Game development is a highly specialized task that relies on a complex game engine powered by complex programming languages, preventing many gaming enthusiasts from handling it. This paper introduces the Chat Game Engine (ChatGE) powered by LLM, which allows everyone to develop a custom game using natural language through Human-LLM interaction. To enable an LLM to function as a ChatGE, we instruct it to perform the following processes in each turn: (1) P_script: configure the game script segment based on the user’s input; (2) P_code: generate the corresponding code snippet based on the game script segment; (3) P_utter: interact with the user, including guidance and feedback. We propose a data synthesis pipeline based on LLM to generate game script-code pairs and interactions from a few manually crafted seed data. We propose a three-stage training strategy following curriculum learning principles to transfer the dialogue-based LLM to our ChatGE smoothly. We construct a ChatGE for poker games as a case study and comprehensively evaluate it from two perspectives: interaction quality and code correctness.
pdf
bib
abs
Can LLMs Simulate L2-English Dialogue? An Information-Theoretic Analysis of L1-Dependent Biases
Rena Wei Gao
|
Xuetong Wu
|
Tatsuki Kuribayashi
|
Mingrui Ye
|
Siya Qi
|
Carsten Roever
|
Yuanxing Liu
|
Zheng Yuan
|
Jey Han Lau
This study evaluates Large Language Models’ (LLMs) ability to simulate non-native-like English use observed in human second language (L2) learners interfered with by their native first language (L1). In dialogue-based interviews, we prompt LLMs to mimic L2 English learners with specific L1s (e.g., Japanese, Thai, Urdu) across seven languages, comparing their outputs to real L2 learner data. Our analysis examines L1-driven linguistic biases, such as reference word usage and avoidance behaviors, using information-theoretic and distributional density measures. Results show that modern LLMs (e.g., Qwen2.5, LLAMA3, DeepseekV3, GPT 4o) replicate L1-dependent patterns observed in human L2 data, with distinct influences from various languages (e.g., Japanese, Korean, and Mandarin significantly affect tense agreement, and Urdu influences noun-verb collocations). Our results reveal LLMs’ potential for L2 dialogue generation and evaluation for future educational applications.
pdf
bib
abs
DeepSolution: Boosting Complex Engineering Solution Design via Tree-based Exploration and Bi-point Thinking
Zhuoqun Li
|
Haiyang Yu
|
Xuanang Chen
|
Hongyu Lin
|
Yaojie Lu
|
Fei Huang
|
Xianpei Han
|
Yongbin Li
|
Le Sun
Designing solutions for complex engineering challenges is crucial in human production activities. However, previous research in the retrieval-augmented generation (RAG) field has not sufficiently addressed tasks related to the design of complex engineering solutions. To fill this gap, we introduce a new benchmark, SolutionBench, to evaluate a system’s ability to generate complete and feasible solutions for engineering problems with multiple complex constraints. To further advance the design of complex engineering solutions, we propose a novel system, SolutionRAG, that leverages the tree-based exploration and bi-point thinking mechanism to generate reliable solutions. Extensive experimental results demonstrate that SolutionRAG achieves state-of-the-art (SOTA) performance on the SolutionBench, highlighting its potential to enhance the automation and reliability of complex engineering solution design in real-world applications.
pdf
bib
abs
SurveyPilot: an Agentic Framework for Automated Human Opinion Collection from Social Media
Viet Thanh Pham
|
Lizhen Qu
|
Zhuang Li
|
Suraj Sharma
|
Gholamreza Haffari
Opinion survey research is a crucial method used by social scientists for understanding societal beliefs and behaviors. Traditional methodologies often entail high costs and limited scalability, while current automated methods such as opinion synthesis exhibit severe biases and lack traceability. In this paper, we introduce SurveyPilot, a novel finite-state orchestrated agentic framework that automates the collection and analysis of human opinions from social media platforms. SurveyPilot addresses the limitations of pioneering approaches by (i) providing transparency and traceability in each state of opinion collection and (ii) incorporating several techniques for mitigating biases, notably with a novel genetic algorithm for improving result diversity. Our extensive experiments reveal that SurveyPilot achieves a close alignment with authentic survey results across multiple domains, with average relative improvements of 68.98% and 51.37% compared to opinion synthesis and agent-based approaches, respectively. The implementation of SurveyPilot is available at https://github.com/thanhpv2102/SurveyPilot.
pdf
bib
abs
Sharper and Faster mean Better: Towards More Efficient Vision-Language Model for Hour-scale Long Video Understanding
Daoze Zhang
|
Yuze Zhao
|
Jintao Huang
|
Yingda Chen
Although existing multimodal language models show impressive performance on the video understanding task, extremely long videos still pose significant challenges to the language model’s context length, memory consumption, and computational complexity. To address these issues, we propose a vision-language model named Sophia for long video understanding, which can efficiently handle hour-scale long videos. First, we employ a Shot-adaptive Frame Pruning technique, which naturally segments long videos into multiple camera shots, to more sharply identify and focus on the frames relevant to the query. Additionally, we introduce a Hierarchical Attention mechanism to effectively model the long-term temporal dependencies between video frames, which achieves a time and space complexity of O(N) w.r.t. the input sequence length N while theoretically maintaining the global modeling efficiency. Experimentally, Sophia exhibits competitive performance compared to existing video understanding baselines across various benchmarks for long video understanding with reduced time and memory consumption. The model code and weights are available at https://huggingface.co/Tao-tse/Sophia.
pdf
bib
abs
Auto-Arena: Automating LLM Evaluations with Agent Peer Battles and Committee Discussions
Ruochen Zhao
|
Wenxuan Zhang
|
Yew Ken Chia
|
Weiwen Xu
|
Deli Zhao
|
Lidong Bing
As LLMs continuously evolve, there is an urgent need for a reliable evaluation method that delivers trustworthy results promptly. Currently, static benchmarks suffer from inflexibility and unreliability, leading users to prefer human voting platforms like Chatbot Arena. However, human evaluations require significant manual effort. Therefore, we propose Auto-Arena, an innovative framework that automates the entire evaluation process using LLM-powered agents. Firstly, an LLM examiner generates questions. Then, two LLM candidates engage in a multi-round peer battle based on the questions, aiming at revealing their true performance differences. Finally, a committee of LLM judges collaboratively discusses and decides the winner, reducing bias and enhancing fairness. During the peer battles, we observe intriguing scenarios where the LLM candidates display competitive behaviors and learn from the opponents. In our extensive experiments involving 15 recent LLMs, Auto-Arena shows a 92.14% correlation with human preferences, surpassing all previous expert-annotated benchmarks without any manual efforts. Auto-Arena offers a promising alternative to current human evaluation platforms for evaluating LLMs automatically.
pdf
bib
abs
How Humans and LLMs Organize Conceptual Knowledge: Exploring Subordinate Categories in Italian
Andrea Pedrotti
|
Giulia Rambelli
|
Caterina Villani
|
Marianna Bolognesi
People can categorize the same entity at multiple taxonomic levels, such as basic (bear), superordinate (animal), and subordinate (grizzly bear). While prior research has focused on basic-level categories, this study is the first attempt to examine the organization of categories by analyzing exemplars produced at the subordinate level. We present a new Italian psycholinguistic dataset of human-generated exemplars for 187 concrete words. We then leverage these data to evaluate whether textual and vision LLMs produce meaningful exemplars that align with human category organization across three key tasks: exemplar generation, category induction, and typicality judgment. Our findings show a low alignment between humans and LLMs, consistent with previous studies. However, their performance varies notably across different semantic domains. Ultimately, this study highlights both the promises and the constraints of using AI-generated exemplars to support psychological and linguistic research.
pdf
bib
abs
PTQ1.61: Push the Real Limit of Extremely Low-Bit Post-Training Quantization Methods for Large Language Models
Jiaqi Zhao
|
Miao Zhang
|
Ming Wang
|
Yuzhang Shang
|
Kaihao Zhang
|
Weili Guan
|
Yaowei Wang
|
Min Zhang
Large Language Models (LLMs) suffer severe performance degradation when facing extremely low-bit (sub 2-bit) quantization. Several existing sub 2-bit post-training quantization (PTQ) methods utilize a mixed-precision scheme by leveraging an unstructured fine-grained mask to explicitly distinguish salient weights, which introduces an extra 1 bit or more per weight. To explore the real limit of PTQ, we propose an extremely low-bit PTQ method called PTQ1.61, which enables weight quantization to 1.61-bit for the first time. Specifically, we first introduce a one-dimensional structured mask, costing a negligible additional 0.0002 bits per weight, based on input activations from the perspective of reducing the upper bound of quantization error, to allocate the corresponding salient weight channels to 4-bit. For binarizing non-salient channels, an efficient block-wise scaling factor optimization framework is then presented to take implicit row-wise correlations and angular biases into account. Different from prior works that concentrate on adjusting quantization methodologies, we further propose a novel paradigm called quantization preprocessing, where we argue that transforming the weight distribution of the pretrained model before quantization can alleviate the difficulty in per-channel extremely low-bit PTQ. Extensive experiments indicate that PTQ1.61 achieves state-of-the-art performance in extremely low-bit quantization. Code is available at https://github.com/zjq0455/PTQ1.61.
pdf
bib
abs
ProtoLens: Advancing Prototype Learning for Fine-Grained Interpretability in Text Classification
Bowen Wei
|
Ziwei Zhu
In this work, we propose ProtoLens, a novel prototype-based model that provides fine-grained, sub-sentence level interpretability for text classification. ProtoLens uses a Prototype-aware Span Extraction module to identify relevant text spans associated with learned prototypes and a Prototype Alignment mechanism to ensure prototypes are semantically meaningful throughout training. By aligning the prototype embeddings with human-understandable examples, ProtoLens provides interpretable predictions while maintaining competitive accuracy. Extensive experiments demonstrate that ProtoLens outperforms both prototype-based and non-interpretable baselines on multiple text classification benchmarks. Code and data are available at https://github.com/weibowen555/ProtoLens.
pdf
bib
abs
Fine-grained Video Dubbing Duration Alignment with Segment Supervised Preference Optimization
Chaoqun Cui
|
Liangbin Huang
|
Shijing Wang
|
Zhe Tong
|
Zhaolong Huang
|
Xiao Zeng
|
Xiaofeng Liu
Video dubbing aims to translate original speech in visual media programs from the source language to the target language, relying on neural machine translation and text-to-speech technologies. Due to varying information densities across languages, target speech often mismatches the source speech duration, causing audio-video synchronization issues that significantly impact viewer experience. In this study, we approach duration alignment in LLM-based video dubbing machine translation as a preference optimization problem. We propose the Segment Supervised Preference Optimization (SSPO) method, which employs a segment-wise sampling strategy and fine-grained loss to mitigate duration mismatches between source and target lines. Experimental results demonstrate that SSPO achieves superior performance in duration alignment tasks.
pdf
bib
abs
Sparse Latents Steer Retrieval-Augmented Generation
Chunlei Xin
|
Shuheng Zhou
|
Huijia Zhu
|
Weiqiang Wang
|
Xuanang Chen
|
Xinyan Guan
|
Yaojie Lu
|
Hongyu Lin
|
Xianpei Han
|
Le Sun
Understanding the mechanisms underlying Large Language Model (LLM) behavior in Retrieval-Augmented Generation (RAG) systems is critical for enhancing reliability. In this paper, we leverage Sparse Autoencoders (SAEs) within the LLaMA Scope to uncover sparse, interpretable latents that govern RAG behaviors. Through systematic analysis of SAE activations, we identify specific latents associated with two fundamental RAG decisions: (1) context versus memory prioritization, and (2) response generation versus query rejection. Intervention experiments demonstrate that these latents enable precise control over model behavior and maintain generalizability across various experimental settings. Mechanistic analysis reveals that manipulating these latents influences model behavior by reconfiguring attention patterns of retrieval heads. Our findings establish SAEs as a principled tool for understanding and controlling RAG behaviors, demonstrating capabilities in precise behavior steering without architectural modifications.
pdf
bib
abs
Unveiling Language-Specific Features in Large Language Models via Sparse Autoencoders
Boyi Deng
|
Yu Wan
|
Baosong Yang
|
Yidan Zhang
|
Fuli Feng
The mechanisms behind multilingual capabilities in Large Language Models (LLMs) have been examined using neuron-based or internal-activation-based methods. However, these methods often face challenges such as superposition and layer-wise activation variance, which limit their reliability. Sparse Autoencoders (SAEs) offer a more nuanced analysis by decomposing the activations of LLMs into a sparse linear combination of SAE features. We introduce a novel metric to assess the monolinguality of features obtained from SAEs, discovering that some features are strongly related to specific languages. Additionally, we show that ablating these SAE features significantly reduces LLM abilities in only one language, leaving others almost unaffected. Interestingly, we find some languages have multiple synergistic SAE features, and ablating them together yields greater improvement than ablating individually. Moreover, we leverage these SAE-derived language-specific features to enhance steering vectors, achieving control over the language generated by LLMs. The code is publicly available at https://github.com/Aatrox103/multilingual-llm-features.
pdf
bib
abs
SafeRAG: Benchmarking Security in Retrieval-Augmented Generation of Large Language Model
Xun Liang
|
Simin Niu
|
Zhiyu Li
|
Sensen Zhang
|
Hanyu Wang
|
Feiyu Xiong
|
Zhaoxin Fan
|
Bo Tang
|
Jihao Zhao
|
Jiawei Yang
|
Shichao Song
|
Mengwei Wang
The indexing-retrieval-generation paradigm of retrieval-augmented generation (RAG) has been highly successful in solving knowledge-intensive tasks by integrating external knowledge into large language models (LLMs). However, the incorporation of external and unverified knowledge increases the vulnerability of LLMs because attackers can perform attack tasks by manipulating knowledge. In this paper, we introduce a benchmark named SafeRAG designed to evaluate RAG security. First, we classify attack tasks into silver noise, inter-context conflict, soft ad, and white Denial-of-Service. Next, we construct a RAG security evaluation dataset (i.e., the SafeRAG dataset), primarily manually, for each task. We then utilize the SafeRAG dataset to simulate various attack scenarios that RAG may encounter. Experiments conducted on 14 representative RAG components demonstrate that RAG exhibits significant vulnerability to all attack tasks, and even the most apparent attack task can easily bypass existing retrievers, filters, or advanced LLMs, resulting in the degradation of RAG service quality. Code is available at: https://github.com/IAAR-Shanghai/SafeRAG.
pdf
bib
abs
AnRe: Analogical Replay for Temporal Knowledge Graph Forecasting
Guo Tang
|
Zheng Chu
|
Wenxiang Zheng
|
Junjia Xiang
|
Yizhuo Li
|
Weihao Zhang
|
Ming Liu
|
Bing Qin
Temporal Knowledge Graphs (TKGs) are vital for event prediction, yet current methods face limitations. Graph neural networks mainly depend on structural information, often overlooking semantic understanding and requiring high computational costs. Meanwhile, Large Language Models (LLMs) support zero-shot reasoning but lack sufficient capabilities to grasp the laws of historical event development. To tackle these challenges, we introduce a training-free Analogical Replay (AnRe) reasoning framework. Our approach retrieves similar events for queries through semantic-driven clustering and builds comprehensive historical contexts using a dual history extraction module that integrates long-term and short-term history. It then uses LLMs to generate analogical reasoning examples as contextual inputs, enabling the model to deeply understand historical patterns of similar events and improve its ability to predict unknown ones. Our experiments on four benchmarks show that AnRe significantly exceeds traditional training and existing LLM-based methods. Further ablation studies also confirm the effectiveness of the dual history extraction and analogical replay mechanisms.
pdf
bib
abs
Revisiting the Test-Time Scaling of o1-like Models: Do they Truly Possess Test-Time Scaling Capabilities?
Zhiyuan Zeng
|
Qinyuan Cheng
|
Zhangyue Yin
|
Yunhua Zhou
|
Xipeng Qiu
The advent of test-time scaling in large language models (LLMs), exemplified by OpenAI’s o1 series, has advanced reasoning capabilities by scaling computational resource allocation during inference. While successors like QwQ, Deepseek-R1 (R1) and LIMO replicate these advancements, whether these models truly possess test-time scaling capabilities remains underexplored. This study found that longer CoTs of these o1-like models do not consistently enhance accuracy; in fact, correct solutions are often shorter than incorrect ones for the same questions. Further investigation shows this phenomenon is closely related to models’ self-revision capabilities - longer CoTs contain more self-revisions, which often lead to performance degradation. We then compare sequential and parallel scaling strategies on QwQ, R1 and LIMO, finding that parallel scaling achieves better coverage and scalability. Based on these insights, we propose “Shortest Majority Vote”, a method that combines parallel scaling strategies with CoT length characteristics, significantly improving models’ test-time scalability compared to conventional majority voting approaches.
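A voting rule that combines answer frequency with chain-of-thought length could look like the sketch below. The exact scoring used by “Shortest Majority Vote” is not given in the abstract, so this is an assumed variant: answers are ranked by vote count first, with ties broken in favor of the shorter mean CoT length. The function name and the sample data are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def shortest_majority_vote(samples: List[Tuple[str, int]]) -> str:
    """Pick an answer from parallel samples of (answer, cot_length_in_tokens).

    Assumed rule: most votes wins; among equally voted answers,
    prefer the one whose solutions are shorter on average."""
    lengths: Dict[str, List[int]] = defaultdict(list)
    for answer, cot_length in samples:
        lengths[answer].append(cot_length)
    return min(
        lengths,
        key=lambda a: (-len(lengths[a]), sum(lengths[a]) / len(lengths[a])),
    )


# Hypothetical parallel samples for one question.
samples = [("42", 120), ("42", 150), ("17", 900), ("17", 1100)]
winner = shortest_majority_vote(samples)
print(winner)
```

Plain majority voting cannot separate the two answers above because each has two votes; the length tie-break reflects the abstract's observation that correct solutions are often shorter than incorrect ones for the same question.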
pdf
bib
abs
Text is All You Need: LLM-enhanced Incremental Social Event Detection
Zitai Qiu
|
Congbo Ma
|
Jia Wu
|
Jian Yang
Social event detection (SED) is the task of identifying, categorizing, and tracking events from social data sources such as social media posts, news articles, and online discussions. Existing state-of-the-art (SOTA) SED models predominantly rely on graph neural networks (GNNs), which involve complex graph construction and time-consuming training processes, limiting their practicality in real-world scenarios. In this paper, we rethink the key challenge in SED: the informal and noisy nature of short texts on social media platforms, which impacts clustering accuracy. We propose a novel framework, LLM-enhanced Social Event Detection (LSED), which leverages the rich background knowledge of large language models (LLMs) to address this challenge. Specifically, LSED utilizes LLMs to formalize and disambiguate short texts by completing abbreviations and summarizing informal expressions. Furthermore, we introduce hyperbolic space embeddings, which are more suitable for natural language sentence representations, to enhance clustering performance. Extensive experiments on two challenging real-world datasets demonstrate that LSED outperforms existing SOTA models, achieving improvements in effectiveness, efficiency, and stability. Our work highlights the potential of LLMs in SED and provides a practical solution for real-world applications.
pdf
bib
abs
Multimodal Pragmatic Jailbreak on Text-to-image Models
Tong Liu
|
Zhixin Lai
|
Jiawen Wang
|
Gengyuan Zhang
|
Shuo Chen
|
Philip Torr
|
Vera Demberg
|
Volker Tresp
|
Jindong Gu
Diffusion models have recently achieved remarkable advancements in terms of image quality and fidelity to textual prompts. Concurrently, the safety of such generative models has become an area of growing concern. This work introduces a novel type of jailbreak, which triggers T2I models to generate an image with visual text, where the image and the text, although considered to be safe in isolation, combine to form unsafe content. To systematically explore this phenomenon, we propose a dataset to evaluate the current diffusion-based text-to-image (T2I) models under such jailbreak. We benchmark nine representative T2I models, including two closed-source commercial models. Experimental results reveal a concerning tendency to produce unsafe content: all tested models suffer from this type of jailbreak, with rates of unsafe generation ranging from around 10% to 70%, where DALL·E 3 demonstrates almost the highest unsafety. In real-world scenarios, various filters, such as keyword blocklists, customized prompt filters, and NSFW image filters, are commonly employed to mitigate these risks. We evaluate the effectiveness of such filters against our jailbreak and find that, while these filters may be effective for single-modality detection, they fail to work against our jailbreak. We also investigate the underlying reason for such jailbreaks, from the perspective of text rendering capability and training data. Our work provides a foundation for further development towards more secure and reliable T2I models.
pdf
bib
abs
Principled Understanding of Generalization for Generative Transformer Models in Arithmetic Reasoning Tasks
Xingcheng Xu
|
Zibo Zhao
|
Haipeng Zhang
|
Yanqing Yang
Transformer-based models excel in various tasks, but their generalization capabilities, especially in arithmetic reasoning, remain incompletely understood. Arithmetic tasks provide a controlled framework to explore these capabilities, yet performance anomalies persist, such as inconsistent effectiveness in multiplication and erratic generalization in modular addition (e.g., modulo 100 vs. 101). This paper develops a unified theoretical framework for understanding the generalization behaviors of transformers in arithmetic tasks, focusing on length generalization. Through detailed analysis of addition, multiplication, and modular operations, we reveal that translation invariance in addition aligns with relative positional encoding for robust generalization, while base mismatch in modular operations disrupts this alignment. Experiments across GPT-family models validate our framework, confirming its ability to predict generalization behaviors. Our work highlights the importance of task structure and training data distribution for achieving data-efficient and structure-aware training, providing a systematic approach to understanding length generalization in transformers.
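The contrast between modulo 100 and modulo 101 can be checked with plain arithmetic. In base 10, a sum modulo 100 depends only on the last two digits of each operand, so longer inputs add nothing new, whereas modulo 101 every digit contributes. The sketch below illustrates that number-theoretic fact only; the helper name and the operands are made up, and it says nothing about how the paper's models were trained or evaluated.

```python
def sum_mod_from_low_digits(a: int, b: int, modulus: int, digits: int = 2) -> int:
    """Compute (a + b) mod `modulus` while looking only at the last
    `digits` base-10 digits of each operand."""
    base = 10 ** digits
    return ((a % base) + (b % base)) % modulus


a, b = 1234, 5678  # arbitrary example operands

# Modulo 100 matches the base: the low two digits are sufficient.
exact_100 = (a + b) % 100
truncated_100 = sum_mod_from_low_digits(a, b, 100)

# Modulo 101 does not match the base: dropping high digits changes the answer.
exact_101 = (a + b) % 101
truncated_101 = sum_mod_from_low_digits(a, b, 101)

print(exact_100, truncated_100)  # equal
print(exact_101, truncated_101)  # differ
```

A rule learned on short operands therefore transfers unchanged to longer ones for modulo 100, while for modulo 101 the higher digits carry information that short training examples never exercise.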
pdf
bib
abs
Discourse Relation-Enhanced Neural Coherence Modeling
Wei Liu
|
Michael Strube
Discourse coherence theories posit relations between text spans as a key feature of coherent texts. However, existing work on coherence modeling has paid little attention to discourse relations. In this paper, we provide empirical evidence to demonstrate that relation features are correlated with text coherence. Then, we investigate a novel fusion model that uses position-aware attention and a visible matrix to combine text- and relation-based features for coherence assessment. Experimental results on two benchmarks show that our approaches can significantly improve baselines, demonstrating the importance of relation features for coherence modeling.
pdf
bib
abs
Benchmarking Open-ended Audio Dialogue Understanding for Large Audio-Language Models
Kuofeng Gao
|
Shu-Tao Xia
|
Ke Xu
|
Philip Torr
|
Jindong Gu
Large Audio-Language Models (LALMs), such as GPT-4o, have recently unlocked audio dialogue capabilities, enabling direct spoken exchanges with humans. The potential of LALMs broadens their applicability across a wide range of practical scenarios supported by audio dialogues. However, given these advancements, a comprehensive benchmark to evaluate the performance of LALMs in open-ended audio dialogue understanding is currently absent. To address this gap, we propose an Audio Dialogue Understanding Benchmark (ADU-Bench), which consists of 4 benchmark datasets. They assess the open-ended audio dialogue ability of LALMs in 3 general scenarios, 12 skills, 9 multilingual languages, and 4 categories of ambiguity handling. Notably, we are the first to propose the evaluation of ambiguity handling in audio dialogues that express different intentions beyond the same literal meaning of sentences, e.g., “Really!?” with different intonations. In summary, ADU-Bench includes over 20,000 open-ended audio dialogues for the assessment of LALMs. Through extensive experiments conducted on 16 LALMs, our analysis reveals that existing LALMs struggle with mathematical symbols and formulas, understanding human behavior such as roleplay, comprehending multiple languages, and handling audio dialogue ambiguities from different phonetic elements, such as intonations, pause positions, and homophones. The benchmark is available at https://adu-bench.github.io/.
pdf
bib
abs
from Benign import Toxic: Jailbreaking the Language Model via Adversarial Metaphors
Yu Yan
|
Sheng Sun
|
Zenghao Duan
|
Teli Liu
|
Min Liu
|
Zhiyi Yin
|
LeiJingyu LeiJingyu
|
Qi Li
Current studies have exposed the risk of Large Language Models (LLMs) generating harmful content by jailbreak attacks. However, they overlook that the direct generation of harmful content from scratch is more difficult than inducing an LLM to calibrate benign content into harmful forms. In our study, we introduce a novel attack framework that exploits AdVersArial meTAphoR (AVATAR) to induce the LLM to calibrate malicious metaphors for jailbreaking. Specifically, to answer harmful queries, AVATAR adaptively identifies a set of benign but logically related metaphors as the initial seed. Then, driven by these metaphors, the target LLM is induced to reason and calibrate about the metaphorical content, and is thus jailbroken by either directly outputting harmful responses or calibrating residuals between metaphorical and professional harmful content. Experimental results demonstrate that AVATAR can effectively and transferably jailbreak LLMs and achieve a state-of-the-art attack success rate across multiple advanced LLMs.
pdf
bib
abs
ShifCon: Enhancing Non-Dominant Language Capabilities with a Shift-based Multilingual Contrastive Framework
Hengyuan Zhang
|
Chenming Shang
|
Sizhe Wang
|
Dongdong Zhang
|
Yiyao Yu
|
Feng Yao
|
Renliang Sun
|
Yujiu Yang
|
Furu Wei
Although fine-tuning Large Language Models (LLMs) with multilingual data can rapidly enhance the multilingual capabilities of LLMs, they still exhibit a performance gap between the dominant language (e.g., English) and non-dominant ones due to the imbalance of training data across languages. To further enhance the performance of non-dominant languages, we propose ShifCon, a Shift-based multilingual Contrastive framework that aligns the internal forward process of other languages toward that of the dominant one. Specifically, it shifts the representations of non-dominant languages into the dominant language subspace, allowing them to access relatively rich information encoded in the model parameters. The enriched representations are then shifted back into their original language subspace before generation. Moreover, we introduce a subspace distance metric to pinpoint the optimal layer area for shifting representations and employ multilingual contrastive learning to further enhance the alignment of representations within this area. Experiments demonstrate that our ShifCon framework significantly enhances the performance of non-dominant languages, particularly for low-resource ones. Further analysis offers extra insights to verify the effectiveness of ShifCon and propel future research.
pdf
bib
abs
MorphMark: Flexible Adaptive Watermarking for Large Language Models
Zongqi Wang
|
Tianle Gu
|
Baoyuan Wu
|
Yujiu Yang
Watermarking by altering token sampling probabilities based on a red-green list is a promising method for tracing the origin of text generated by large language models (LLMs). However, existing watermark methods often struggle with a fundamental dilemma: improving watermark effectiveness (the detectability of the watermark) often comes at the cost of reduced text quality. This trade-off limits their practical application. To address this challenge, we first formalize the problem within a multi-objective trade-off analysis framework. Within this framework, we identify a key factor that influences the dilemma. Unlike existing methods, where watermark strength is typically treated as a fixed hyperparameter, our theoretical insights lead to the development of MorphMark, a method that adaptively adjusts the watermark strength in response to changes in the identified factor, thereby achieving an effective resolution of the dilemma. In addition, MorphMark prioritizes flexibility since it is a model-agnostic and model-free watermark method, thereby offering a practical solution for real-world deployment, particularly in light of the rapid evolution of AI models. Extensive experiments demonstrate that MorphMark achieves a superior resolution of the effectiveness-quality dilemma, while also offering greater flexibility and time and space efficiency.
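As a rough illustration of the red-green list idea the abstract builds on, the sketch below biases a next-token distribution toward a pseudo-random "green" subset of the vocabulary. It shows only the fixed-strength baseline: the names `green_list` and `watermark_probs`, the seeding by previous token id, and the parameters `gamma` (green fraction) and `delta` (watermark strength) are assumptions made for this example. MorphMark's adaptive rule for choosing the strength is not described in the abstract and is not reproduced here.

```python
import math
import random


def green_list(prev_token: int, vocab_size: int, gamma: float = 0.5) -> set:
    """Pseudo-randomly pick a 'green' subset of the vocabulary.

    Assumption: the partition is seeded by the previous token id only.
    """
    rng = random.Random(prev_token)
    ids = list(range(vocab_size))
    rng.shuffle(ids)
    return set(ids[: int(gamma * vocab_size)])


def watermark_probs(logits, prev_token, delta=2.0, gamma=0.5):
    """Add a fixed bias `delta` to green-token logits, then apply softmax.

    Returns the watermarked probabilities and the green set used.
    """
    green = green_list(prev_token, len(logits), gamma)
    biased = [l + delta if i in green else l for i, l in enumerate(logits)]
    m = max(biased)  # subtract the max for numerical stability
    exps = [math.exp(b - m) for b in biased]
    z = sum(exps)
    return [e / z for e in exps], green


if __name__ == "__main__":
    probs, green = watermark_probs([0.0] * 10, prev_token=42)
    green_mass = sum(probs[i] for i in green)
    print(f"probability mass on green tokens: {green_mass:.3f}")
```

A larger `delta` makes the green-token excess easier to detect but distorts the original distribution more, which is the effectiveness-quality trade-off the abstract refers to.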
pdf
bib
abs
A Silver Bullet or a Compromise for Full Attention? A Comprehensive Study of Gist Token-based Context Compression
Chenlong Deng
|
Zhisong Zhang
|
Kelong Mao
|
Shuaiyi Li
|
Xinting Huang
|
Dong Yu
|
Zhicheng Dou
In this work, we provide an empirical investigation of gist-based context compression methods to improve context processing in large language models. We focus on two key questions: (1) How well can these methods replace full attention models? and (2) What potential failure patterns arise due to compression? Through extensive experiments, we show that while gist-based compression can achieve only slight performance loss on tasks like retrieval-augmented generation and long-document QA, it faces challenges in tasks like synthetic recall. Furthermore, we identify three key failure patterns: lost by the boundary, lost if surprise, and lost along the way. To mitigate these issues, we propose two effective strategies: fine-grained autoencoding, which enhances the reconstruction of original token information, and segment-wise token importance estimation, which adjusts optimization based on token dependencies. Our work provides valuable insights into the understanding of gist token-based context compression and offers practical strategies for improving compression capabilities.
pdf
bib
abs
On the Limit of Language Models as Planning Formalizers
Cassie Huang
|
Li Zhang
Large Language Models have been found to create plans that are neither executable nor verifiable in grounded environments. An emerging line of work demonstrates success in using the LLM as a formalizer to generate a formal representation of the planning domain in some language, such as Planning Domain Definition Language (PDDL). This formal representation can be deterministically solved to find a plan. We systematically evaluate this methodology while bridging some major gaps. While previous work only generates a partial PDDL representation, given templated and therefore unrealistic environment descriptions, we generate the complete representation given descriptions of various naturalness levels. Among an array of observations critical to improving LLMs’ formal planning abilities, we note that most large enough models can effectively formalize descriptions as PDDL, outperforming those directly generating plans, while being robust to lexical perturbation. As the descriptions become more natural-sounding, we observe a decrease in performance and provide detailed error analysis.
pdf
bib
abs
Learning to Generate Structured Output with Schema Reinforcement Learning
Yaxi Lu
|
Haolun Li
|
Xin Cong
|
Zhong Zhang
|
Yesai Wu
|
Yankai Lin
|
Zhiyuan Liu
|
Fangming Liu
|
Maosong Sun
This study investigates the structured generation capabilities of large language models (LLMs), focusing on producing valid JSON outputs against a given schema. Despite the widespread use of JSON in integrating language models with programs, there is a lack of comprehensive analysis and benchmarking of these capabilities. We explore various aspects of JSON generation, such as structure understanding, escaping, and natural language description, to determine how to assess and enable LLMs to generate valid responses. Building upon this, we propose SchemaBench, which features around 40K different JSON schemas to obtain and assess models’ abilities in generating valid JSON. We find that the latest LLMs still struggle to generate a valid JSON string. Moreover, we demonstrate that incorporating reinforcement learning with a Fine-grained Schema Validator can further enhance models’ understanding of JSON schema, leading to improved performance. Our models demonstrate significant improvement in both generating JSON outputs and downstream tasks.
pdf
bib
abs
Enhancing Unsupervised Sentence Embeddings via Knowledge-Driven Data Augmentation and Gaussian-Decayed Contrastive Learning
Peichao Lai
|
Zhengfeng Zhang
|
Wentao Zhang
|
Fangcheng Fu
|
Bin Cui
Recently, using large language models (LLMs) for data augmentation has led to considerable improvements in unsupervised sentence embedding models. However, existing methods encounter two primary challenges: limited data diversity and high data noise. Current approaches often neglect fine-grained knowledge, such as entities and quantities, leading to insufficient diversity. Besides, unsupervised data frequently lacks discriminative information, and the generated synthetic samples may introduce noise. In this paper, we propose a pipeline-based data augmentation method via LLMs and introduce the Gaussian-decayed gradient-assisted Contrastive Sentence Embedding (GCSE) model to enhance unsupervised sentence embeddings. To tackle the issue of low data diversity, our pipeline utilizes knowledge graphs (KGs) to extract entities and quantities, enabling LLMs to generate more diverse samples. To address high data noise, the GCSE model uses a Gaussian-decayed function to limit the impact of false hard negative samples, enhancing the model’s discriminative capability. Experimental results show that our approach achieves state-of-the-art performance in semantic textual similarity (STS) tasks, using fewer data samples and smaller LLMs, demonstrating its efficiency and robustness across various models.
pdf
bib
abs
Improve Safety Training of Large Language Models with Safety-Critical Singular Vectors Localization
Peijian Gu
|
Quan Wang
|
Zhendong Mao
The rapid advancement of large language models (LLMs) has brought about increased concerns regarding their safety, especially as adversaries develop jailbreak techniques to bypass LLMs’ safety mechanisms. Although recent work on safety training with modules such as low-rank adaptation (LoRA) to resist jailbreaks shows promise, these approaches can inadvertently degrade a model’s general utility. In this paper, we propose a novel plug-and-play method that mitigates the impact of safety training on model utility by explicitly locating and leveraging safety-critical singular vectors, which only contribute to safety, within the model’s parameter space. We quantify the safety-criticality of each singular vector as the difference between its importance for safety and its importance for utility, each measured by a corresponding low-rank projection. The top-scoring singular vectors are located as safety-critical and are used to initialize the LoRA modules within existing safety training methods in a plug-and-play manner, thereby constraining the training updates within safety-critical parameters. Additionally, we propose a dynamic rank number determination strategy to further reduce parameter overhead. Experiments on HarmBench with multiple jailbreak methods validate the effectiveness of our approach in safety training, while evaluations on several utility benchmarks demonstrate that our method successfully mitigates the adverse impact of safety training on model utility, enhancing the utility performance of the evaluated safety training baselines.
pdf
bib
abs
WarriorCoder: Learning from Expert Battles to Augment Code Large Language Models
Huawen Feng
|
Pu Zhao
|
Qingfeng Sun
|
Can Xu
|
Fangkai Yang
|
Lu Wang
|
Qianli Ma
|
Qingwei Lin
|
Saravan Rajmohan
|
Dongmei Zhang
|
Qi Zhang
Despite recent progress achieved by code large language models (LLMs), their remarkable abilities are largely dependent on fine-tuning on high-quality data, posing challenges for data collection and annotation. To address this, current methods often design various data flywheels to collect complex code instructions, enabling models to handle more intricate tasks. However, these approaches typically rely on off-the-shelf datasets and data augmentation from a limited set of proprietary LLMs (e.g., Claude, GPT4, and so on), which restricts the diversity of the constructed data and makes it prone to systemic biases. In this paper, we propose **WarriorCoder**, a novel paradigm that learns from expert battles to address these limitations. Specifically, we create an arena where leading expert code LLMs challenge each other, with evaluations conducted by impartial judges. This competitive framework generates novel training data from scratch, leveraging the strengths of all participants. Experimental results show that **WarriorCoder** achieves state-of-the-art performance compared to previous models of the same size, even without relying on proprietary LLMs.
pdf
bib
abs
A Triple-View Framework for Fine-Grained Emotion Classification with Clustering-Guided Contrastive Learning
Junqing Gong
|
Binhan Yang
|
Wei Shen
Fine-grained emotion classification (FEC) aims to analyze speakers’ utterances and distinguish dozens of emotions with subtle differences, allowing for a more nuanced understanding of human emotional states. However, compared to traditional coarse-grained emotion classification, two difficulties arise as the granularity of emotions becomes finer, i.e., the presence of closely confusable emotions which are hard to distinguish, and the biased performance caused by long-tailed emotions. Although addressing both difficulties is vital to FEC, previous studies have predominantly focused on dealing with only one of them. In this paper, we propose TACO, a novel triple-view framework that treats FEC as an instance-label (i.e., utterance-emotion) joint embedding learning problem to tackle both difficulties concurrently by considering three complementary views. Specifically, we design a clustering-guided contrastive loss, which incorporates clustering techniques to guide the contrastive learning process and facilitate more discriminative instance embeddings. Additionally, we introduce the emotion label description as a helpful resource to refine label embeddings and mitigate the poor performance towards under-represented (i.e., long-tailed) emotions. Extensive experiments on two widely-used benchmark datasets demonstrate that our proposed TACO achieves substantial and consistent improvements compared to other competitive baseline methods.
pdf
bib
abs
Quantification of Large Language Model Distillation
Sunbowen Lee
|
Junting Zhou
|
Chang Ao
|
Kaige Li
|
Xeron Du
|
Sirui He
|
Haihong Wu
|
Tianci Liu
|
Jiaheng Liu
|
Hamid Alinejad-Rokny
|
Min Yang
|
Yitao Liang
|
Zhoufutu Wen
|
Shiwen Ni
Model distillation is a fundamental technique in building large language models (LLMs), transferring knowledge from a teacher model to a student model. However, distillation can lead to model homogenization, reducing diversity among models and impairing their ability to robustly handle complex or novel tasks. These limitations underscore the need to systematically quantify the distillation process and its impact. In this work, we propose a framework to evaluate and quantify model distillation. Our method addresses two key aspects: (1) Identifying identity cognition contradictions to assess discrepancies in how models perceive and represent identity-related information, and (2) Analyzing multi-granularity response similarities across models to measure the extent of homogenization. Experimental results demonstrate two key insights: (1) Well-known closed-source and open-source LLMs usually exhibit high distillation degrees, except for Claude, Doubao, and Gemini. (2) Base LLMs show higher distillation degrees compared to aligned LLMs. By offering a systematic approach to improve the transparency of LLM data distillation, we call for LLMs with more independent development and more transparent technical reports to improve LLMs’ robustness and safety. The code and data are available at https://github.com/Aegis1863/LLMs-Distillation-Quantification.
pdf
bib
abs
Demons in the Detail: On Implementing Load Balancing Loss for Training Specialized Mixture-of-Expert Models
Zihan Qiu
|
Zeyu Huang
|
Bo Zheng
|
Kaiyue Wen
|
Zekun Wang
|
Rui Men
|
Ivan Titov
|
Dayiheng Liu
|
Jingren Zhou
|
Junyang Lin
This paper revisits the implementation of Load-Balancing-Loss (LBL) when training Mixture-of-Experts (MoEs) models. Specifically, LBL for MoEs is defined as $N_E \sum_{i=1}^{N_E} f_i p_i$, where $N_E$ is the total number of experts, $f_i$ represents the frequency of expert $i$ being selected, and $p_i$ denotes the average gating score of expert $i$. Existing MoE training frameworks usually employ the parallel training strategy so that $f_i$ and the LBL are calculated within a micro-batch and averaged across parallel groups. However, a micro-batch for training billion-scale LLMs typically contains very few sequences, so the micro-batch LBL is almost at the sequence level and the router is pushed to distribute tokens evenly within each sequence. Under this strict constraint, even tokens from a domain-specific sequence (e.g., code) are uniformly routed to all experts, thereby inhibiting expert specialization. In this work, we propose calculating LBL using a global-batch to loosen this constraint. Because a global-batch contains much more diverse sequences than a micro-batch, this encourages load balance at the corpus level. Specifically, we introduce an extra communication step to synchronize $f_i$ across micro-batches and then use it to calculate the LBL. Through experiments on training MoEs-based LLMs (up to 42.8B parameters and 400B tokens), we surprisingly find that the global-batch LBL strategy yields excellent performance gains in both pre-training perplexity and downstream tasks. Our analysis reveals that the global-batch LBL greatly improves the domain specialization of experts. Global-batch LBL is also used in Qwen3-MoEs.
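The difference between the micro-batch and global-batch variants can be made concrete with a small sketch. It is not the authors' implementation: the function names, the top-k routing by highest score, the normalization of the selection frequency by the number of routing slots, and the assumption of equally sized micro-batches (so that synchronizing the frequency reduces to a plain average) are all choices made for this example.

```python
from typing import List, Sequence, Tuple

Batch = Sequence[Sequence[float]]  # batch[t][i]: gating score of token t for expert i


def expert_stats(batch: Batch, top_k: int = 1) -> Tuple[List[float], List[float]]:
    """Return (f, p): selection frequency and mean gating score per expert."""
    n_tokens, n_experts = len(batch), len(batch[0])
    counts = [0] * n_experts
    score_sums = [0.0] * n_experts
    for scores in batch:
        # Assumed routing rule: each token goes to its top_k highest-scoring experts.
        ranked = sorted(range(n_experts), key=lambda i: scores[i], reverse=True)
        for i in ranked[:top_k]:
            counts[i] += 1
        for i, s in enumerate(scores):
            score_sums[i] += s
    f = [c / (n_tokens * top_k) for c in counts]
    p = [s / n_tokens for s in score_sums]
    return f, p


def lbl(f: Sequence[float], p: Sequence[float]) -> float:
    """Load-balancing loss: N_E * sum_i f_i * p_i."""
    return len(f) * sum(fi * pi for fi, pi in zip(f, p))


def micro_batch_lbl(micro_batches: Sequence[Batch], top_k: int = 1) -> float:
    """Compute the LBL inside each micro-batch, then average the losses."""
    losses = [lbl(*expert_stats(mb, top_k)) for mb in micro_batches]
    return sum(losses) / len(losses)


def global_batch_lbl(micro_batches: Sequence[Batch], top_k: int = 1) -> float:
    """Synchronize f across micro-batches first, then compute the LBL."""
    stats = [expert_stats(mb, top_k) for mb in micro_batches]
    n_experts = len(stats[0][0])
    # Stand-in for the extra communication step (equal micro-batch sizes assumed).
    f_global = [sum(f[i] for f, _ in stats) / len(stats) for i in range(n_experts)]
    losses = [lbl(f_global, p) for _, p in stats]
    return sum(losses) / len(losses)


if __name__ == "__main__":
    code_seq = [[0.9, 0.1], [0.9, 0.1]]  # every token prefers expert 0
    text_seq = [[0.1, 0.9], [0.1, 0.9]]  # every token prefers expert 1
    print(micro_batch_lbl([code_seq, text_seq]))
    print(global_batch_lbl([code_seq, text_seq]))
```

In this toy case each sequence is fully specialized, yet the experts are evenly loaded over the whole batch. The micro-batch loss comes out at about 1.8, penalizing the specialization, while the global-batch loss is about 1.0, the value a perfectly balanced router would obtain.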
pdf
bib
abs
Pandora’s Box or Aladdin’s Lamp: A Comprehensive Analysis Revealing the Role of RAG Noise in Large Language Models
Jinyang Wu
|
Shuai Zhang
|
Feihu Che
|
Mingkuan Feng
|
Pengpeng Shao
|
Jianhua Tao
Retrieval-Augmented Generation (RAG) has emerged as a crucial method for addressing hallucinations in large language models (LLMs). While recent research has extended RAG models to complex noisy scenarios, these explorations often confine themselves to limited noise types and presuppose that noise is inherently detrimental to LLMs, potentially deviating from real-world retrieval environments and restricting practical applicability. In this paper, we define seven distinct noise types from a linguistic perspective and establish a Noise RAG Benchmark (NoiserBench), a comprehensive evaluation framework encompassing multiple datasets and reasoning tasks. Through empirical evaluation of eight representative LLMs with diverse architectures and scales, we reveal that these noises can be further categorized into two practical groups: noise that is beneficial to LLMs (aka beneficial noise) and noise that is harmful to LLMs (aka harmful noise). While harmful noise generally impairs performance, beneficial noise may enhance several aspects of model capabilities and overall performance. Our analysis offers insights for developing robust RAG solutions and mitigating hallucinations across diverse retrieval scenarios. Code is available at https://github.com/jinyangwu/NoiserBench.
pdf
bib
abs
Stepwise Reasoning Disruption Attack of LLMs
Jingyu Peng
|
Maolin Wang
|
Xiangyu Zhao
|
Kai Zhang
|
Wanyu Wang
|
Pengyue Jia
|
Qidong Liu
|
Ruocheng Guo
|
Qi Liu
Large language models (LLMs) have made remarkable strides in complex reasoning tasks, but their safety and robustness in reasoning processes remain unexplored, particularly in third-party platforms that facilitate user interactions via APIs. Existing attacks on LLM reasoning are constrained by specific settings or lack of imperceptibility, limiting their feasibility and generalizability. To address these challenges, we propose the Stepwise rEasoning Error Disruption (SEED) attack, which subtly injects errors into prior reasoning steps to mislead the model into producing incorrect subsequent reasoning and final answers. Unlike previous methods, SEED is compatible with zero-shot and few-shot settings, maintains the natural reasoning flow, and ensures covert execution without modifying the instruction. Extensive experiments on four datasets across four different models demonstrate SEED’s effectiveness, revealing the vulnerabilities of LLMs to disruptions in reasoning processes. These findings underscore the need for greater attention to the robustness of LLM reasoning to ensure safety in practical applications. Our code is available at: https://github.com/Applied-Machine-Learning-Lab/SEED-Attack
pdf
bib
abs
Crowd Comparative Reasoning: Unlocking Comprehensive Evaluations for LLM-as-a-Judge
Qiyuan Zhang
|
Yufei Wang
|
Yuxin Jiang
|
Liangyou Li
|
Chuhan Wu
|
Yasheng Wang
|
Xin Jiang
|
Lifeng Shang
|
Ruiming Tang
|
Fuyuan Lyu
|
Chen Ma
LLM-as-a-Judge, which generates chain-of-thought (CoT) judgments, has become a widely adopted auto-evaluation method. However, its reliability is compromised by the CoT reasoning’s inability to capture comprehensive and deeper details, often leading to incomplete outcomes. Existing methods mainly rely on majority voting or criteria expansion, which are insufficient to address this limitation of CoT. We propose Crowd-based Comparative Evaluation, which introduces additional crowd responses to compare with the candidate responses, thereby exposing deeper and more comprehensive details within the candidate responses. This process effectively guides LLM-as-a-Judge to provide a more detailed CoT judgment. Extensive experiments demonstrate that our approach enhances evaluation reliability, achieving an average accuracy gain of 6.7% across five benchmarks. Moreover, our method produces higher-quality CoTs that facilitate judge distillation and exhibit superior performance in rejection sampling for supervised fine-tuning (SFT), referred to as crowd rejection sampling, thereby enabling more efficient SFT. Our analysis confirms that CoTs generated by our method are more comprehensive and of higher quality, and evaluation accuracy improves as inference scales.
pdf
bib
abs
Lost in Multilinguality: Dissecting Cross-lingual Factual Inconsistency in Transformer Language Models
Mingyang Wang
|
Heike Adel
|
Lukas Lange
|
Yihong Liu
|
Ercong Nie
|
Jannik Strötgen
|
Hinrich Schuetze
Multilingual language models (MLMs) store factual knowledge across languages but often struggle to provide consistent responses to semantically equivalent prompts in different languages. While previous studies point out this cross-lingual inconsistency issue, the underlying causes remain unexplored. In this work, we use mechanistic interpretability methods to investigate cross-lingual inconsistencies in MLMs. We find that MLMs encode knowledge in a language-independent concept space through most layers, and only transition to language-specific spaces in the final layers. Failures during the language transition often result in incorrect predictions in the target language, even when the answers are correct in other languages. To mitigate this inconsistency issue, we propose a linear shortcut method that bypasses computations in the final layers, enhancing both prediction accuracy and cross-lingual consistency. Our findings shed light on the internal mechanisms of MLMs and provide a lightweight, effective strategy for producing more consistent factual outputs.
pdf
bib
abs
Optimizing Decomposition for Optimal Claim Verification
Yining Lu
|
Noah Ziems
|
Hy Dang
|
Meng Jiang
Current research on the Decompose-Then-Verify paradigm for evaluating the factuality of long-form text typically treats decomposition and verification in isolation, overlooking their interactions and potential misalignment. We find that existing decomposition policies, typically hand-crafted demonstrations, do not align well with downstream verifiers in terms of atomicity (a novel metric quantifying information density), leading to suboptimal verification results. We formulate finding the optimal decomposition policy for optimal verification as a bilevel optimization problem. To approximate a solution for this strongly NP-hard problem, we propose dynamic decomposition, a reinforcement learning framework that leverages verifier feedback to learn a policy for dynamically decomposing claims to verifier-preferred atomicity. Experimental results show that dynamic decomposition outperforms existing decomposition policies, improving verification confidence by 0.07 and accuracy by 0.12 (on a 0-1 scale) on average across varying verifiers, datasets, and atomicities of input claims.
pdf
bib
abs
GradOT: Training-free Gradient-preserving Offsite-tuning for Large Language Models
Kai Yao
|
Zhaorui Tan
|
Penglei Gao
|
Lichun Li
|
Kaixin Wu
|
Yinggui Wang
|
Yuan Zhao
|
Yixin Ji
|
Jianke Zhu
|
Wei Wang
With the rapid growth of large language models (LLMs), traditional centralized fine-tuning has emerged as a key technique for adapting these models to domain-specific challenges, but it poses privacy risks for both model and data owners. One promising solution, called offsite-tuning (OT), is proposed to address these challenges, where a weaker emulator is compressed from the original model and further fine-tuned with an adapter to enhance privacy. However, the existing OT-based methods require high computational costs and lack theoretical analysis. This paper introduces a novel OT approach based on gradient-preserving compression. By analyzing the OT problem through the lens of optimization, we propose a method that selectively applies compression techniques such as rank compression and channel pruning, preserving the gradients of fine-tuned adapters while ensuring privacy. Extensive experiments demonstrate that our approach surpasses existing OT methods, both in terms of privacy protection and model performance. Our method provides a theoretical foundation for OT and offers a practical, training-free solution for offsite-tuning of large-scale LLMs.
pdf
bib
abs
Knowledge Boundary of Large Language Models: A Survey
Moxin Li
|
Yong Zhao
|
Wenxuan Zhang
|
Shuaiyi Li
|
Wenya Xie
|
See-Kiong Ng
|
Tat-Seng Chua
|
Yang Deng
Although large language models (LLMs) store a vast amount of knowledge in their parameters, they still have limitations in the memorization and utilization of certain knowledge, leading to undesired behaviors such as generating untruthful and inaccurate responses. This highlights the critical need to understand the knowledge boundary of LLMs, a concept that remains inadequately defined in existing research. In this survey, we propose a comprehensive definition of the LLM knowledge boundary and introduce a formalized taxonomy categorizing knowledge into four distinct types. Using this foundation, we systematically review the field through three key lenses: the motivation for studying LLM knowledge boundaries, methods for identifying these boundaries, and strategies for mitigating the challenges they present. Finally, we discuss open challenges and potential research directions in this area. We aim for this survey to offer the community a comprehensive overview, facilitate access to key issues, and inspire further advancements in LLM knowledge research.
pdf
bib
abs
Mitigating Visual Forgetting via Take-along Visual Conditioning for Multi-modal Long CoT Reasoning
Hai-Long Sun
|
Zhun Sun
|
Houwen Peng
|
Han-Jia Ye
Recent advancements in Large Language Models (LLMs) have demonstrated enhanced reasoning capabilities, evolving from Chain-of-Thought (CoT) prompting to advanced, product-oriented solutions like OpenAI o1. During our re-implementation of this model, we noticed that in multimodal tasks requiring visual input (e.g., geometry problems), Multimodal LLMs (MLLMs) struggle to maintain focus on the visual information; in other words, MLLMs suffer from a gradual decline in attention to visual information as reasoning progresses, producing outputs that over-rely on text. To investigate this, we ablate image inputs during long-chain reasoning. Concretely, we truncate the reasoning process midway, then re-complete the reasoning process with the input image removed. We observe only a ~2 accuracy drop on MathVista’s test-hard subset, revealing that the model’s textual outputs dominate the subsequent reasoning process. Motivated by this, we propose Take-along Visual Conditioning (TVC), a strategy that shifts image input to critical reasoning stages and compresses redundant visual tokens via dynamic pruning. This methodology helps the model retain attention to the visual components throughout the reasoning. Our approach achieves state-of-the-art performance on average across five mathematical reasoning benchmarks (+3.4% vs. the previous SOTA), demonstrating the effectiveness of TVC in enhancing multimodal reasoning systems. The project page is available at https://sun-hailong.github.io/projects/TVC.
pdf
bib
abs
MoC: Mixtures of Text Chunking Learners for Retrieval-Augmented Generation System
Jihao Zhao
|
Zhiyuan Ji
|
Zhaoxin Fan
|
Hanyu Wang
|
Simin Niu
|
Bo Tang
|
Feiyu Xiong
|
Zhiyu Li
Retrieval-Augmented Generation (RAG), while serving as a viable complement to large language models (LLMs), often overlooks the crucial aspect of text chunking within its pipeline. This paper first introduces a dual-metric evaluation method, comprising Boundary Clarity and Chunk Stickiness, to enable the direct quantification of chunking quality. Leveraging this assessment method, we highlight the inherent limitations of traditional and semantic chunking in handling complex contextual nuances, thereby substantiating the necessity of integrating LLMs into the chunking process. To address the inherent trade-off between computational efficiency and chunking precision in LLM-based approaches, we devise the granularity-aware Mixture-of-Chunkers (MoC) framework, which consists of a three-stage processing mechanism. Notably, our objective is to guide the chunker towards generating a structured list of chunking regular expressions, which are subsequently employed to extract chunks from the original text. Extensive experiments demonstrate that both our proposed metrics and the MoC framework effectively address the challenges of the chunking task, revealing the chunking kernel while enhancing the performance of the RAG system.
pdf
bib
abs
Mitigating Selection Bias with Node Pruning and Auxiliary Options
Hyeong Kyu Choi
|
Weijie Xu
|
Chi Xue
|
Stephanie Eckman
|
Chandan K. Reddy
Large language models (LLMs) often exhibit systematic preferences for certain answer choices when responding to multiple-choice questions—a behavior known as selection bias. This bias reduces the accuracy and reliability of LLM outputs, limiting their usefulness in decision-critical applications. While prior work has focused on adjusting model inputs or outputs to mitigate this issue, our work takes a fundamentally different approach by identifying and removing the internal sources of bias. We introduce two methods: Bias Node Pruning (BNP), which prunes parameters that contribute to selection bias, and Auxiliary Option Injection (AOI), which introduces an additional answer choice to reduce bias in both white-box and black-box settings. To address the shortcomings of existing evaluation metrics, we propose Choice Kullback-Leibler Divergence (CKLD), a new metric that captures distributional imbalances in model predictions. Experiments on three LLMs across multiple datasets demonstrate that our methods consistently improve answer accuracy while reducing selection bias, providing a robust solution for both open- and closed-source models.
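To make the notion of a distributional imbalance in predictions concrete, the sketch below compares how often each option label is the gold answer with how often a model selects it, using a KL divergence. This is only one plausible reading of a choice-level KL metric: the abstract does not give the exact definition of CKLD, so the direction of the divergence, the smoothing constant `eps`, and the function names are assumptions made for this example.

```python
import math
from collections import Counter
from typing import List, Sequence


def choice_distribution(labels: Sequence[str], options: Sequence[str],
                        eps: float = 1e-9) -> List[float]:
    """Smoothed relative frequency of each option label."""
    counts = Counter(labels)
    total = len(labels) + eps * len(options)
    return [(counts[o] + eps) / total for o in options]


def choice_kl(gold: Sequence[str], predicted: Sequence[str],
              options: Sequence[str]) -> float:
    """KL(gold || predicted) over option labels (assumed direction)."""
    p = choice_distribution(gold, options)
    q = choice_distribution(predicted, options)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


if __name__ == "__main__":
    options = ["A", "B", "C", "D"]
    gold = ["A", "B", "C", "D"] * 25             # balanced answer key
    biased = ["A"] * 70 + ["B", "C", "D"] * 10   # model over-selects "A"
    print(choice_kl(gold, gold, options))
    print(choice_kl(gold, biased, options))
```

A score near zero means the model picks each label about as often as the answer key does. The score grows as predictions concentrate on a favored label, regardless of whether individual answers are correct.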
pdf
bib
abs
Dually Self-Improved Counterfactual Data Augmentation Using Large Language Model
Luhao Zhang
|
Xinyu Zhang
|
Linmei Hu
|
Dandan Song
|
Liqiang Nie
Counterfactual data augmentation, which generates minimally edited tokens to alter labels, has become a key approach to improving model robustness in natural language processing (NLP). It is usually implemented by first identifying the causal terms and then modifying these terms to create counterfactual candidates. The emergence of large language models (LLMs) has effectively facilitated the task of counterfactual data augmentation. However, existing LLM-based approaches still face some challenges in 1) accurately extracting the task-specific causal terms, and 2) the quality of LLM-generated counterfacts. To address the issues, we propose a dually self-improved counterfactual data augmentation method using LLM for the Natural Language Inference (NLI) task. On the one hand, we design a self-improved strategy employing the attention distribution of the task model to identify the task-specific causal terms, which is lightweight and task-specific. On the other hand, a second self-improved strategy based on direct preference optimization is utilized to refine LLM-generated counterfacts, achieving high-quality counterfacts. Finally, a balanced loss preventing over-emphasis on augmented data is proposed to retrain the task model on the fusion of existing data and generated counterfacts. Extensive experiments on NLI benchmarks demonstrate the effectiveness of our proposed method in generating high-quality counterfacts for improving task performance.
pdf
bib
abs
RPO: Retrieval Preference Optimization for Robust Retrieval-Augmented Generation
Shi-Qi Yan
|
Quan Liu
|
Zhen-Hua Ling
While Retrieval-Augmented Generation (RAG) has exhibited promise in utilizing external knowledge, its generation process heavily depends on the quality and accuracy of the retrieved context. Large language models (LLMs) struggle to evaluate the correctness of non-parametric knowledge retrieved externally when it differs from internal memorization, leading to knowledge conflicts during response generation. To this end, we introduce the **R**etrieval **P**reference **O**ptimization (RPO), a lightweight and effective alignment method to adaptively leverage multi-source knowledge based on retrieval relevance. An implicit representation of retrieval relevance is derived and incorporated into the reward model to integrate retrieval evaluation and response generation into a single model, removing the need for the additional procedure that previous methods require to assess retrieval quality. Notably, RPO is a RAG-dedicated alignment approach that quantifies the awareness of retrieval relevance in training, first overcoming mathematical obstacles. Experiments on four datasets demonstrate that RPO outperforms RAG by 4-10% in accuracy without any extra component, exhibiting its robust generalization.
pdf
bib
abs
Learning to Reason from Feedback at Test-Time
Yanyang Li
|
Michael R. Lyu
|
Liwei Wang
Solving complex tasks in a single attempt is challenging for large language models (LLMs). Iterative interaction with the environment and feedback is often required to achieve success, making effective feedback utilization a critical topic. Existing approaches either struggle with length generalization or rely on naive retries without leveraging prior information. In this paper, we introduce FTTT, a novel paradigm that formulates feedback utilization as an optimization problem at test time. Additionally, we propose a learnable test-time optimizer, OpTune, to effectively exploit feedback. Experiments on two LLMs across four reasoning datasets demonstrate that FTTT and OpTune achieve superior scalability and performance.
pdf
bib
abs
L-CiteEval: A Suite for Evaluating Fidelity of Long-context Models
Zecheng Tang
|
Keyan Zhou
|
Juntao Li
|
Baibei Ji
|
Jianye Hou
|
Min Zhang
Long-context models (LCMs) have witnessed remarkable advancements in recent years, facilitating real-world tasks like long-document QA. The success of LCMs is founded on the hypothesis that the model demonstrates strong fidelity, enabling it to respond based on the provided long context rather than relying solely on the intrinsic knowledge acquired during pre-training. Yet, in this paper, we find that open-sourced LCMs are not as faithful as expected. We introduce L-CiteEval, an out-of-the-box suite that can assess both generation quality and fidelity in long-context understanding tasks. It covers 11 tasks with context lengths ranging from 8K to 48K and a corresponding automatic evaluation pipeline. Evaluation of 11 cutting-edge closed-source and open-source LCMs indicates that, while there are minor differences in their generation, open-source models significantly lag behind closed-source counterparts in terms of fidelity. Furthermore, we analyze the benefits of citation generation for LCMs from both the perspective of explicit model output and the internal attention mechanism.
pdf
bib
abs
SECRET: Semi-supervised Clinical Trial Document Similarity Search
Trisha Das
|
Afrah Shafquat
|
Mandis Beigi
|
Jacob Aptekar
|
Jimeng Sun
Clinical trials are vital for evaluation of safety and efficacy of new treatments. However, clinical trials are resource-intensive, time-consuming and expensive to conduct, where errors in trial design, reduced efficacy, and safety events can result in significant delays, financial losses, and damage to reputation. These risks underline the importance of informed and strategic decisions in trial design to mitigate these risks and improve the chances of a successful trial. Identifying similar historical trials is critical as these trials can provide an important reference for potential pitfalls and challenges including serious adverse events, dosage inaccuracies, recruitment difficulties, patient adherence issues, etc. Addressing these challenges in trial design can lead to development of more effective study protocols with optimized patient safety and trial efficiency. In this paper, we present a novel method to identify similar historical trials by summarizing clinical trial protocols and searching for similar trials based on a query trial’s protocol. Our approach significantly outperforms all baselines, achieving up to a 78% improvement in recall@1 and a 53% improvement in precision@1 over the best baseline. We also show that our method outperforms all other baselines in partial trial similarity search and zero-shot patient-trial matching, highlighting its superior utility in these tasks.
pdf
bib
abs
Geometric Signatures of Compositionality Across a Language Model’s Lifetime
Jin Hwa Lee
|
Thomas Jiralerspong
|
Lei Yu
|
Yoshua Bengio
|
Emily Cheng
By virtue of linguistic compositionality, a few syntactic rules and a finite lexicon can generate an unbounded number of sentences. That is, language, though seemingly high-dimensional, can be explained using relatively few degrees of freedom. An open question is whether contemporary language models (LMs) reflect the intrinsic simplicity of language that is enabled by compositionality. We take a geometric view of this problem by relating the degree of compositionality in a dataset to the intrinsic dimension (ID) of its representations under an LM, a measure of feature complexity. We find not only that the degree of dataset compositionality is reflected in representations’ ID, but that the relationship between compositionality and geometric complexity arises due to learned linguistic features over training. Finally, our analyses reveal a striking contrast between nonlinear and linear dimensionality, showing they respectively encode semantic and superficial aspects of linguistic composition.
pdf
bib
abs
Pattern Recognition or Medical Knowledge? The Problem with Multiple-Choice Questions in Medicine
Maxime Griot
|
Jean Vanderdonckt
|
Demet Yuksel
|
Coralie Hemptinne
Large Language Models (LLMs) such as ChatGPT demonstrate significant potential in the medical domain and are often evaluated using multiple-choice questions (MCQs) modeled on exams like the USMLE. However, such benchmarks may overestimate true clinical understanding by rewarding pattern recognition and test-taking heuristics. To investigate this, we created a fictional medical benchmark centered on an imaginary organ, the Glianorex, allowing us to separate memorized knowledge from reasoning ability. We generated textbooks and MCQs in English and French using leading LLMs, then evaluated proprietary, open-source, and domain-specific models in a zero-shot setting. Despite the fictional content, models achieved an average score of 64%, while physicians scored only 27%. Fine-tuned medical models outperformed base models in English but not in French. Ablation and interpretability analyses revealed that models frequently relied on shallow cues, test-taking strategies, and hallucinated reasoning to identify the correct choice. These results suggest that standard MCQ-based evaluations may not effectively measure clinical reasoning and highlight the need for more robust, clinically meaningful assessment methods for LLMs.
pdf
bib
abs
People who frequently use ChatGPT for writing tasks are accurate and robust detectors of AI-generated text
Jenna Russell
|
Marzena Karpinska
|
Mohit Iyyer
In this paper, we study how well humans can detect text generated by commercial LLMs (GPT-4o, Claude, o1). We hire annotators to read 300 non-fiction English articles, label them as either human-written or AI-generated, and provide paragraph-length explanations for their decisions. Our experiments show that annotators who frequently use LLMs for writing tasks excel at detecting AI-generated text, even without any specialized training or feedback. In fact, the majority vote among five such “expert” annotators misclassifies only 1 of 300 articles, significantly outperforming most commercial and open-source detectors we evaluated even in the presence of evasion tactics like paraphrasing and humanization. Qualitative analysis of the experts’ free-form explanations shows that while they rely heavily on specific lexical clues (‘AI vocabulary’), they also pick up on more complex phenomena within the text (e.g., formality, originality, clarity) that are challenging to assess for automatic detectors. We release our annotated dataset and code to spur future research into both human and automated detection of AI-generated text.
pdf
bib
abs
YuLan-Mini: Pushing the Limits of Open Data-efficient Language Model
Hu Yiwen
|
Huatong Song
|
Jie Chen
|
Jia Deng
|
Jiapeng Wang
|
Kun Zhou
|
Yutao Zhu
|
Jinhao Jiang
|
Zican Dong
|
Yang Lu
|
Xu Miao
|
Xin Zhao
|
Ji-Rong Wen
Due to the immense resource demands and the complex techniques involved, it remains challenging to successfully pre-train large language models (LLMs) with state-of-the-art performance. In this paper, we explore the key bottlenecks and designs during pre-training, and make the following contributions: (1) a comprehensive investigation into the factors contributing to training instability; (2) a robust optimization approach designed to mitigate training instability effectively; (3) an elaborate data pipeline that integrates data synthesis, data curriculum, and data selection. By integrating the above techniques, we create a rather low-cost training recipe and use it to pre-train YuLan-Mini, a fully-open base model with 2.4B parameters on 1.08T tokens. Remarkably, YuLan-Mini achieves top-tier performance among models of similar parameter scale, with comparable performance to industry-leading models that require significantly more data. To facilitate reproduction, we release the full details of the training recipe and data composition. Project details can be accessed at the following link: https://anonymous.4open.science/r/YuLan-Mini/README.md.
pdf
bib
abs
Your Model is Overconfident, and Other Lies We Tell Ourselves
Timothee Mickus
|
Aman Sinha
|
Raúl Vázquez
The difficulty intrinsic to a given example, rooted in its inherent ambiguity, is a key yet often overlooked factor in evaluating neural NLP models. We investigate the interplay and divergence among various metrics for assessing intrinsic difficulty, including annotator dissensus, training dynamics, and model confidence. Through a comprehensive analysis using 29 models on three datasets, we reveal that while correlations exist among these metrics, their relationships are neither linear nor monotonic. By disentangling these dimensions of uncertainty, we aim to refine our understanding of data complexity and its implications for evaluating and improving NLP models.
pdf
bib
abs
Bridging the Language Gaps in Large Language Models with Inference-Time Cross-Lingual Intervention
Weixuan Wang
|
Minghao Wu
|
Barry Haddow
|
Alexandra Birch
Large Language Models (LLMs) have shown remarkable capabilities in natural language processing but exhibit significant performance gaps among different languages. Most existing approaches to address these disparities rely on pretraining or fine-tuning, which are resource-intensive. To overcome these limitations without incurring significant costs, we propose Inference-Time Cross-Lingual Intervention (INCLINE), a novel framework that enhances LLM performance on low-performing (source) languages by aligning their internal representations with those of high-performing (target) languages during inference. INCLINE initially learns alignment matrices using parallel sentences from source and target languages through a Least-Squares optimization, and then applies these matrices during inference to transform the low-performing language representations toward the high-performing language space. Extensive experiments on nine benchmarks with five LLMs demonstrate that INCLINE significantly improves performance across diverse tasks and languages, compared to recent strong baselines. Our analysis demonstrates that INCLINE is highly cost-effective and applicable to a wide range of applications. In addition, we release the code to foster research along this line.
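The least-squares step described above can be sketched on toy data as follows. This is a minimal illustration, not INCLINE itself: the abstract does not say which layers or representations are aligned or how the intervention is applied, so the 2-dimensional vectors, the normal-equations solver, and the names `fit_alignment`, `solve`, and `matmul` are assumptions made for this example.

```python
def transpose(m):
    """Return the transpose of a list-of-lists matrix."""
    return [list(col) for col in zip(*m)]


def matmul(a, b):
    """Plain matrix product of two list-of-lists matrices."""
    bt = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in bt] for row in a]


def solve(a, b):
    """Solve a @ w = b for w by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    aug = [list(a[i]) + list(b[i]) for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def fit_alignment(source, target):
    """Least-squares W minimizing ||source @ W - target||^2 via the normal equations."""
    st = transpose(source)
    return solve(matmul(st, source), matmul(st, target))


# Hypothetical representations of four parallel sentences (rows are sentences).
source = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]]
true_w = [[0.0, 2.0], [1.0, 0.0]]      # the mapping the toy data was generated with
target = matmul(source, true_w)

w = fit_alignment(source, target)       # learned once, before inference
aligned = matmul([[3.0, 4.0]], w)       # applied to a new source-language vector
```

Because the matrix is fitted once and then applied as a single matrix product per representation, the cost at inference time stays small, which fits the abstract's emphasis on avoiding pretraining or fine-tuning.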
pdf
bib
abs
Plug-in and Fine-tuning: Bridging the Gap between Small Language Models and Large Language Models
Kyeonghyun Kim
|
Jinhee Jang
|
Juhwan Choi
|
Yoonji Lee
|
Kyohoon Jin
|
YoungBin Kim
Large language models (LLMs) are renowned for their extensive linguistic knowledge and strong generalization capabilities, but their high computational demands make them unsuitable for resource-constrained environments. In contrast, small language models (SLMs) are computationally efficient but often lack the broad generalization capacity of LLMs. To bridge this gap, we propose PiFi, a novel framework that combines the strengths of both LLMs and SLMs to achieve high performance while maintaining efficiency. PiFi integrates a single frozen layer from an LLM into a SLM and fine-tunes the combined model for specific tasks, boosting performance without a significant increase in computational cost. We show that PiFi delivers consistent performance improvements across a range of natural language processing tasks, including both natural language understanding and generation. Moreover, our findings demonstrate PiFi’s ability to effectively leverage LLM knowledge, enhancing generalization to unseen domains and facilitating the transfer of linguistic abilities.
pdf
bib
abs
What is Stigma Attributed to? A Theory-Grounded, Expert-Annotated Interview Corpus for Demystifying Mental-Health Stigma
Han Meng
|
Yancan Chen
|
Yunan Li
|
Yitian Yang
|
Jungup Lee
|
Renwen Zhang
|
Yi-Chieh Lee
Mental-health stigma remains a pervasive social problem that hampers treatment-seeking and recovery. Existing resources for training neural models to finely classify such stigma are limited, relying primarily on social-media or synthetic data without theoretical underpinnings. To remedy this gap, we present an expert-annotated, theory-informed corpus of human-chatbot interviews, comprising 4,141 snippets from 684 participants with documented socio-cultural backgrounds. Our experiments benchmark state-of-the-art neural models and empirically unpack the challenges of stigma detection. This dataset can facilitate research on computationally detecting, neutralizing, and counteracting mental-health stigma. Our corpus is openly available at https://github.com/HanMeng2004/Mental-Health-Stigma-Interview-Corpus.
pdf
bib
abs
ATRI: Mitigating Multilingual Audio Text Retrieval Inconsistencies by Reducing Data Distribution Errors
Yuguo Yin
|
Yuxin Xie
|
Wenyuan Yang
|
Dongchao Yang
|
Jinghan Ru
|
Xianwei Zhuang
|
Liming Liang
|
Yuexian Zou
Multilingual audio-text retrieval (ML-ATR) is a challenging task that aims to retrieve audio clips or multilingual texts from databases. However, existing ML-ATR schemes suffer from inconsistencies in instance similarity matching across languages. To address the inconsistency issue in multilingual audio-text retrieval, we first identify two intuitive factors that contribute to inconsistency: misalignment between audio and multilingual text embeddings, and error propagation in model optimization. By systematically analyzing these factors, we derive theoretical weight error upper bounds for quantifying their effects and find that the main source of inconsistency is the data distribution error during training. This finding motivates our solution to reduce data distribution errors. We propose a consistent ML-ATR scheme using 1-to-k contrastive learning and audio-English co-anchor contrastive learning, aiming to mitigate the negative impact of data distribution error on recall and consistency in ML-ATR. Experimental results on the translated AudioCaps and Clotho datasets show that our scheme achieves state-of-the-art performance on recall and consistency metrics for eight mainstream languages, including English. Our code will be available at https://github.com/ATRI-ACL/ATRI-ACL.
pdf
bib
abs
Enhancing Transformers for Generalizable First-Order Logical Entailment
Tianshi Zheng
|
Jiazheng Wang
|
Zihao Wang
|
Jiaxin Bai
|
Hang Yin
|
Zheye Deng
|
Yangqiu Song
|
Jianxin Li
Transformers, as the fundamental deep learning architecture, have demonstrated great capability in reasoning. This paper studies the generalizable first-order logical reasoning ability of transformers with their *parameterized* knowledge and how to improve it. Transformers’ capability of first-order reasoning is further captured by whether they can conduct first-order logical entailment, which is quantitatively measured by their performance in answering knowledge graph queries. We establish the connections between (1) two types of distribution shifts studied in out-of-distribution generalization and (2) unseen knowledge and query settings discussed in the task of knowledge graph query answering, which makes it possible to characterize the fine-grained generalizability. Results on our comprehensive dataset showed that transformers **outperform** previous methods designed particularly for this task and provided detailed empirical evidence about the impact of the input query syntax, token embedding, and transformer architectures on the reasoning capability of transformers. Interestingly, our results revealed the mismatch of positional encoding and other design choices of transformer architectures in previous practices. Motivated by this, we propose **TEGA**, a logic-aware architecture that significantly improves the performance in generalizable first-order logical entailment.
pdf
bib
abs
Self-Taught Agentic Long Context Understanding
Yufan Zhuang
|
Xiaodong Yu
|
Jialian Wu
|
Ximeng Sun
|
Ze Wang
|
Jiang Liu
|
Yusheng Su
|
Jingbo Shang
|
Zicheng Liu
|
Emad Barsoum
Answering complex, long-context questions remains a major challenge for large language models (LLMs) as it requires effective question clarifications and context retrieval. We propose Agentic Long-Context Understanding (AgenticLU), a framework designed to enhance an LLM’s understanding of such queries by integrating targeted self-clarification with contextual grounding within an agentic workflow. At the core of AgenticLU is Chain-of-Clarifications (CoC), where models refine their understanding through self-generated clarification questions and corresponding contextual groundings. By scaling inference as a tree search where each node represents a CoC step, we achieve 97.8% answer recall on NarrativeQA with a search depth of up to three and a branching factor of eight. To amortize the high cost of this search process to training, we leverage the preference pairs for each step obtained by the CoC workflow and perform two-stage model finetuning: (1) supervised finetuning to learn effective decomposition strategies, and (2) direct preference optimization to enhance reasoning quality. This enables AgenticLU models to generate clarifications and retrieve relevant context effectively and efficiently in a single inference pass. Extensive experiments across seven long-context tasks demonstrate that AgenticLU significantly outperforms state-of-the-art prompting methods and specialized long-context LLMs, achieving robust multi-hop reasoning while sustaining consistent performance as context length grows.
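To give a sense of why the search cost is worth amortizing, the short calculation below counts the Chain-of-Clarifications steps in a fully expanded tree with the stated depth and branching factor. It assumes every node is expanded to the full branching factor with no pruning or early stopping, which the abstract does not state, so the figure is best read as an upper bound; `max_coc_nodes` is a hypothetical helper name.

```python
def max_coc_nodes(depth, branching):
    """Count nodes below the root in a fully expanded tree: b + b^2 + ... + b^depth."""
    return sum(branching ** level for level in range(1, depth + 1))


# Depth up to three, branching factor eight, as quoted in the abstract.
total = max_coc_nodes(depth=3, branching=8)  # 8 + 64 + 512
```

Under these assumptions a single question can require several hundred clarification steps at inference time, compared with the single inference pass of the finetuned model.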
pdf
bib
abs
Hallucination Detox: Sensitivity Dropout (SenD) for Large Language Model Training
Shahrad Mohammadzadeh
|
Juan David Guerra
|
Marco Bonizzato
|
Reihaneh Rabbany
|
Golnoosh Farnadi
As large language models (LLMs) become increasingly prevalent, concerns about their reliability, particularly due to hallucinations - factually inaccurate or irrelevant outputs - have grown. Our research investigates the relationship between the uncertainty in training dynamics and the emergence of hallucinations. Using models from the Pythia suite and several hallucination detection metrics, we analyze hallucination trends and identify significant variance during training. To address this, we propose Sensitivity Dropout (SenD), a novel training protocol designed to reduce hallucination variance during training by deterministically dropping embedding indices with significant variability. In addition, we develop an unsupervised hallucination detection metric, Efficient EigenScore (EES), which approximates the traditional EigenScore at 2x speed. This metric is integrated into our training protocol, allowing SenD to be both computationally scalable and effective at reducing hallucination variance. SenD improves test-time reliability of Pythia and Meta’s Llama models by up to 17% and enhances factual accuracy in Wikipedia, Medical, Legal, and Coding domains without affecting downstream task performance.
pdf
bib
abs
OS-Genesis: Automating GUI Agent Trajectory Construction via Reverse Task Synthesis
Qiushi Sun
|
Kanzhi Cheng
|
Zichen Ding
|
Chuanyang Jin
|
Yian Wang
|
Fangzhi Xu
|
Zhenyu Wu
|
Chengyou Jia
|
Liheng Chen
|
Zhoumianze Liu
|
Ben Kao
|
Guohao Li
|
Junxian He
|
Yu Qiao
|
Zhiyong Wu
Graphical User Interface (GUI) agents powered by Vision-Language Models (VLMs) have demonstrated human-like computer control capability. Despite their utility in advancing digital automation, the development of such agents faces a critical bottleneck: collecting high-quality trajectory data for training. Common practices for collecting such data rely on human supervision or synthetic data generation through executing pre-defined tasks, which are either resource-intensive or unable to guarantee data quality. Further, these approaches exhibit significant gaps between the generated data and online environments, alongside limited data diversity. To address this issue, we introduce OS-Genesis, a novel GUI data synthesis pipeline that overcomes the challenges above. Unlike prior methods that rely on preset tasks, OS-Genesis reverse engineers the GUI trajectory construction process. Agents first perceive environments and perform step-level interactions, then retrospectively derive high-quality tasks to enable trajectory-level exploration. A trajectory reward model is then employed to ensure the quality of the generated trajectories. We demonstrate that training GUI agents with OS-Genesis significantly improves their performance on highly challenging online benchmarks. In-depth analysis further validates OS-Genesis’s cost-effectiveness and its superior data quality and diversity compared to existing synthesis methods.
pdf
bib
abs
CORAL: Learning Consistent Representations across Multi-step Training with Lighter Speculative Drafter
Yepeng Weng
|
Dianwen Mei
|
Huishi Qiu
|
Xujie Chen
|
Li Liu
|
Jiang Tian
|
Zhongchao Shi
Speculative decoding is a powerful technique that accelerates Large Language Model (LLM) inference by leveraging a lightweight speculative draft model. However, existing designs suffer in performance due to misalignment between training and inference. Recent methods have tried to solve this issue by adopting a multi-step training strategy, but the complex inputs of different training steps make it harder for the draft model to converge. To address this, we propose CORAL, a novel framework that improves both accuracy and efficiency in speculative drafting. CORAL introduces Cross-Step Representation Alignment, a method that enhances consistency across multiple training steps, significantly improving speculative drafting performance. Additionally, we identify the LM head as a major bottleneck in the inference speed of the draft model. We introduce a weight-grouping mechanism that selectively activates a subset of LM head parameters during inference, substantially reducing the latency of the draft model. We evaluate CORAL on three LLM families and three benchmark datasets, achieving speedup ratios of 2.50x-4.07x, outperforming state-of-the-art methods such as EAGLE-2 and HASS. Our results demonstrate that CORAL effectively mitigates training-inference misalignment and delivers significant speedup for modern LLMs with large vocabularies.
pdf
bib
abs
ConSim: Measuring Concept-Based Explanations’ Effectiveness with Automated Simulatability
Antonin Poché
|
Alon Jacovi
|
Agustin Martin Picard
|
Victor Boutin
|
Fanny Jourdan
Concept-based explanations work by mapping complex model computations to human-understandable concepts. Evaluating such explanations is very difficult, as it includes not only the quality of the induced space of possible concepts but also how effectively the chosen concepts are communicated to users. Existing evaluation metrics often focus solely on the former, neglecting the latter. We introduce an evaluation framework for measuring concept explanations via automated simulatability: a simulator’s ability to predict the explained model’s outputs based on the provided explanations. This approach accounts for both the concept space and its interpretation in an end-to-end evaluation. Human studies for simulatability are notoriously difficult to enact, particularly at the scale of a wide, comprehensive empirical evaluation (which is the subject of this work). We propose using large language models (LLMs) as simulators to approximate the evaluation and report various analyses to make such approximations reliable. Our method allows for scalable and consistent evaluation across various models and datasets. We report a comprehensive empirical evaluation using this framework and show that LLMs provide consistent rankings of explanation methods. Code available at Anonymous GitHub.
pdf
bib
abs
Decoding Reading Goals from Eye Movements
Omer Shubi
|
Cfir Avraham Hadar
|
Yevgeni Berzak
Readers can have different goals with respect to the text that they are reading. Can these goals be decoded from their eye movements over the text? In this work, we examine for the first time whether it is possible to distinguish between two types of common reading goals: information seeking and ordinary reading for comprehension. Using large-scale eye tracking data, we address this task with a wide range of models that cover different architectural and data representation strategies, and further introduce a new model ensemble. We find that transformer-based models with scanpath representations coupled with language modeling solve it most successfully, and that accurate predictions can be made in real time, shortly after the participant started reading the text. We further introduce a new method for model performance analysis based on mixed effect modeling. Combining this method with rich textual annotations reveals key properties of textual items and participants that contribute to the difficulty of the task, and improves our understanding of the variability in eye movement patterns across the two reading regimes.
pdf
bib
abs
Uncovering Visual-Semantic Psycholinguistic Properties from the Distributional Structure of Text Embedding Space
Si Wu
|
Sebastian Bruch
Imageability (potential of text to evoke a mental image) and concreteness (perceptibility of text) are two psycholinguistic properties that link visual and semantic spaces. It is little surprise that computational methods that estimate them do so using parallel visual and semantic spaces, such as collections of image-caption pairs or multi-modal models. In this paper, we work on the supposition that text itself in an image-caption dataset offers sufficient signals to accurately estimate these properties. We hypothesize, in particular, that the peakedness of the neighborhood of a word in the semantic embedding space reflects its degree of imageability and concreteness. We then propose an unsupervised, distribution-free measure, which we call Neighborhood Stability Measure (NSM), that quantifies the sharpness of peaks. Extensive experiments show that NSM correlates more strongly with ground-truth ratings than existing unsupervised methods, and is a strong predictor of these properties for classification. Our code and data are available on GitHub (https://github.com/Artificial-Memory-Lab/imageability).
pdf
bib
abs
GUI-explorer: Autonomous Exploration and Mining of Transition-aware Knowledge for GUI Agent
Bin Xie
|
Rui Shao
|
Gongwei Chen
|
Kaiwen Zhou
|
Yinchuan Li
|
Jie Liu
|
Min Zhang
|
Liqiang Nie
GUI automation faces critical challenges in dynamic environments. MLLMs suffer from two key issues: misinterpreting UI components and outdated knowledge. Traditional fine-tuning methods are costly for app-specific knowledge updates. We propose GUI-explorer, a training-free GUI agent that incorporates two fundamental mechanisms: (1) Autonomous Exploration of Function-aware Trajectory. To comprehensively cover all application functionalities, we design a Function-aware Task Goal Generator that automatically constructs exploration goals by analyzing GUI structural information (e.g., screenshots and activity hierarchies). This enables systematic exploration to collect diverse trajectories. (2) Unsupervised Mining of Transition-aware Knowledge. To establish precise screen-operation logic, we develop a Transition-aware Knowledge Extractor that extracts effective screen-operation logic through unsupervised analysis of the state transitions in structured interaction triples (observation, action, outcome). This eliminates the need for human involvement in knowledge extraction. With a task success rate of 53.7% on SPA-Bench and 47.4% on AndroidWorld, GUI-explorer shows significant improvements over SOTA agents. It requires no parameter updates for new apps. GUI-explorer is open-sourced and publicly available at https://github.com/JiuTian-VL/GUI-explorer.
pdf
bib
abs
P2 Law: Scaling Law for Post-Training After Model Pruning
Xiaodong Chen
|
Yuxuan Hu
|
Xiaokang Zhang
|
Yanling Wang
|
Cuiping Li
|
Hong Chen
|
Jing Zhang
Pruning has become a widely adopted technique for reducing the hardware requirements of large language models (LLMs). To recover model performance after pruning, post-training is commonly employed to mitigate the resulting performance degradation. While post-training benefits from larger datasets, once the dataset size is already substantial, increasing the training data provides only limited performance gains. To balance post-training cost and model performance, it is necessary to explore the optimal amount of post-training data. Through extensive experiments on the Llama-3 and Qwen-2.5 series models, pruned using various common pruning methods, we uncover the scaling Law for Post-training after model Pruning, referred to as the P2 Law. This law identifies four key factors for predicting the pruned model’s post-training loss: the model size before pruning, the number of post-training tokens, the pruning rate, and the model’s loss before pruning. Moreover, P2 Law can generalize to larger dataset sizes, larger model sizes, and higher pruning rates, offering valuable insights for the post-training of pruned LLMs.
pdf
bib
abs
Making FETCH! Happen: Finding Emergent Dog Whistles Through Common Habitats
Kuleen Sasse
|
Carlos Alejandro Aguirre
|
Isabel Cachola
|
Sharon Levy
|
Mark Dredze
Dog whistles are coded expressions with dual meanings: one intended for the general public (outgroup) and another that conveys a specific message to an intended audience (ingroup). Often, these expressions are used to convey controversial political opinions while maintaining plausible deniability and slip by content moderation filters. Identification of dog whistles relies on curated lexicons, which have trouble keeping up to date. We introduce FETCH!, a task for finding novel dog whistles in massive social media corpora. We find that state-of-the-art systems fail to achieve meaningful results across three distinct social media case studies. We present EarShot, a strong baseline system that combines the strengths of vector databases and Large Language Models (LLMs) to efficiently and effectively identify new dog whistles.
pdf
bib
abs
Lost in the Context: Insufficient and Distracted Attention to Contexts in Preference Modeling
Shihan Dou
|
Jiayi Chen
|
Chenhao Huang
|
Feng Chen
|
Wei Chengzhi
|
Huiyuan Zheng
|
Shichun Liu
|
Yan Liu
|
Chenxiao Liu
|
Chao Xin
|
Lin Yan
|
Zongzhang Zhang
|
Tao Gui
|
Qi Zhang
|
Xuanjing Huang
In Reinforcement Learning from Human Feedback (RLHF), the reward model (RM) evaluates the response quality based on the given context and assigns a reward. It plays a crucial role in aligning RLHF with human preferences. Although the current RM training paradigm concatenates the context and response while amplifying the reward difference between good and bad response pairs, we demonstrate that the RM faces two significant issues: i) it often allocates only a small proportion of attention to the context, and ii) it frequently ignores segments of the context that are relevant for evaluating the response quality. These issues undermine the RM’s effectiveness in modeling human preferences. To further address these challenges, we propose AttnRM, a novel optimization framework that enables the RM to concentrate on crucial segments of the context. Experimental results demonstrate that AttnRM significantly improves preference modeling by increasing attention to relevant information within the context. It also enhances the RM’s generalizability and achieves better performance in aligning with human preferences.
pdf
bib
abs
Entailment-Preserving First-order Logic Representations in Natural Language Entailment
Jinu Lee
|
Qi Liu
|
Runzhi Ma
|
Vincent Han
|
Ziqi Wang
|
Heng Ji
|
Julia Hockenmaier
First-order logic (FOL) is often used to represent logical entailment, but determining natural language (NL) entailment using FOL remains a challenge. To address this, we propose the Entailment-Preserving FOL representations (EPF) task and introduce reference-free evaluation metrics for EPF (Entailment-Preserving Rate (EPR) family). In EPF, one should generate FOL representations from multi-premise NL entailment data (e.g., EntailmentBank) so that the automatic prover’s result preserves the entailment labels. Furthermore, we propose a training method specialized for the task, iterative learning-to-rank, which trains an NL-to-FOL translator by using the natural language entailment labels as verifiable rewards. Our method achieves a 1.8–2.7% improvement in EPR and a 17.4–20.6% increase in EPR@16 compared to diverse baselines in three datasets. Further analyses reveal that iterative learning-to-rank effectively suppresses the arbitrariness of FOL representation by reducing the diversity of predicate signatures, and maintains strong performance across diverse inference types and out-of-domain data.
pdf
bib
abs
Enhancing Multimodal Continual Instruction Tuning with BranchLoRA
Duzhen Zhang
|
Yong Ren
|
Zhong-Zhi Li
|
Yahan Yu
|
Jiahua Dong
|
Chenxing Li
|
Zhilong Ji
|
Jinfeng Bai
Multimodal Continual Instruction Tuning (MCIT) aims to finetune Multimodal Large Language Models (MLLMs) to continually align with human intent across sequential tasks. Existing approaches often rely on the Mixture-of-Experts (MoE) LoRA framework to preserve previous instruction alignments. However, these methods are prone to Catastrophic Forgetting (CF), as they aggregate all LoRA blocks via simple summation, which compromises performance over time. In this paper, we identify a critical parameter inefficiency in the MoELoRA framework within the MCIT context. Based on this insight, we propose BranchLoRA, an asymmetric framework to enhance both efficiency and performance. To mitigate CF, we introduce a flexible tuning-freezing mechanism within BranchLoRA, enabling branches to specialize in intra-task knowledge while fostering inter-task collaboration. Moreover, we incrementally incorporate task-specific routers to ensure an optimal branch distribution over time, rather than favoring the most recent task. To streamline inference, we introduce a task selector that automatically routes test inputs to the appropriate router without requiring task identity. Extensive experiments on the latest MCIT benchmark demonstrate that BranchLoRA significantly outperforms MoELoRA and maintains its superiority across various MLLM sizes.
pdf
bib
abs
Enhancing Automated Interpretability with Output-Centric Feature Descriptions
Yoav Gur-Arieh
|
Roy Mayan
|
Chen Agassy
|
Atticus Geiger
|
Mor Geva
Automated interpretability pipelines generate natural language descriptions for the concepts represented by features in large language models (LLMs), such as “plants” or “the first word in a sentence”. These descriptions are derived using inputs that activate the feature, which may be a dimension or a direction in the model’s representation space. However, identifying activating inputs is costly, and the mechanistic role of a feature in model behavior is determined both by how inputs cause a feature to activate and by how feature activation affects outputs. Using steering evaluations, we reveal that current pipelines provide descriptions that fail to capture the causal effect of the feature on outputs. To fix this, we propose efficient, output-centric methods for automatically generating feature descriptions. These methods use the tokens weighted higher after feature stimulation or the highest weight tokens after applying the vocabulary “unembedding” head directly to the feature. Our output-centric descriptions better capture the causal effect of a feature on model outputs than input-centric descriptions, but combining the two leads to the best performance on both input and output evaluations. Lastly, we show that output-centric descriptions can be used to find inputs that activate features previously thought to be “dead”.
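The second output-centric method mentioned above (applying the vocabulary unembedding head directly to the feature) can be illustrated with a toy example. This is a minimal sketch, not the authors' implementation: the function name `top_tokens_from_unembedding`, the five-word vocabulary, and the 2-dimensional unembedding matrix are all hypothetical values chosen for illustration.

```python
def top_tokens_from_unembedding(feature, unembed, vocab, k=3):
    """Rank vocabulary tokens by the logit obtained when the unembedding
    matrix is applied directly to a feature direction (toy illustration)."""
    # One logit per vocabulary entry: dot product of its unembedding row
    # with the feature direction.
    logits = [sum(w * f for w, f in zip(row, feature)) for row in unembed]
    ranked = sorted(range(len(vocab)), key=lambda i: logits[i], reverse=True)
    return [vocab[i] for i in ranked[:k]]


# Hypothetical toy vocabulary and unembedding matrix (rows = tokens).
vocab = ["tree", "leaf", "car", "road", "flower"]
unembed = [
    [0.9, 0.1],
    [0.8, 0.0],
    [0.0, 0.9],
    [0.1, 0.8],
    [0.7, 0.2],
]
feature = [1.0, 0.0]  # a direction in the (toy) representation space

top = top_tokens_from_unembedding(feature, unembed, vocab)
print(top)
```

In a real pipeline the top-ranked tokens would presumably be handed to a describer model to produce the natural language description; that step is omitted here.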
pdf
bib
abs
Towards Effective and Efficient Continual Pre-training of Large Language Models
Jie Chen
|
Zhipeng Chen
|
Jiapeng Wang
|
Kun Zhou
|
Yutao Zhu
|
Jinhao Jiang
|
Yingqian Min
|
Xin Zhao
|
Zhicheng Dou
|
Jiaxin Mao
|
Yankai Lin
|
Ruihua Song
|
Jun Xu
|
Xu Chen
|
Rui Yan
|
Zhewei Wei
|
Di Hu
|
Wenbing Huang
|
Ji-Rong Wen
Continual pre-training (CPT) has been an important approach for adapting language models to specific domains or tasks. In this paper, we comprehensively study its key designs to balance acquiring new abilities with retaining the original ones, and present an effective CPT method that can greatly improve the Chinese language ability and scientific reasoning ability of LLMs. To achieve this, we design specific data mixture and curriculum strategies based on existing datasets and synthetic high-quality data. Concretely, we synthesize multidisciplinary scientific QA pairs based on related web pages to guarantee the data quality, and also devise a performance tracking and data mixture adjustment strategy to ensure training stability. For the detailed designs, we conduct preliminary studies on a relatively small model, and summarize the findings to help optimize our CPT method. Extensive experiments on a number of evaluation benchmarks show that our approach can largely improve the performance of Llama-3 (8B), including both the general abilities (+8.81 on C-Eval and +6.31 on CMMLU) and the scientific reasoning abilities (+12.00 on MATH and +4.13 on SciEval). Our model, data, and code are available at https://github.com/RUC-GSAI/Llama-3-SynE.
pdf
bib
abs
Efficient Universal Goal Hijacking with Semantics-guided Prompt Organization
Yihao Huang
|
Chong Wang
|
Xiaojun Jia
|
Qing Guo
|
Felix Juefei-Xu
|
Jian Zhang
|
Yang Liu
|
Geguang Pu
Universal goal hijacking is a kind of prompt injection attack that forces LLMs to return a target malicious response for arbitrary normal user prompts. The previous methods achieve high attack performance while being too cumbersome and time-consuming. Also, they have concentrated solely on optimization algorithms, overlooking the crucial role of the prompt. To this end, we propose a method called POUGH that incorporates an efficient optimization algorithm and two semantics-guided prompt organization strategies. Specifically, our method starts with a sampling strategy to select representative prompts from a candidate pool, followed by a ranking strategy that prioritizes them. Given the sequentially ranked prompts, our method employs an iterative optimization algorithm to generate a fixed suffix that can concatenate to arbitrary user prompts for universal goal hijacking. Experiments conducted on four popular LLMs and ten types of target responses verified the effectiveness.
pdf
bib
abs
mPLUG-DocOwl2: High-resolution Compressing for OCR-free Multi-page Document Understanding
Anwen Hu
|
Haiyang Xu
|
Liang Zhang
|
Jiabo Ye
|
Ming Yan
|
Ji Zhang
|
Qin Jin
|
Fei Huang
|
Jingren Zhou
Multimodal Large Language Models (MLLMs) have achieved promising OCR-free Document Understanding performance by increasing the supported resolution of document images. However, this comes at the cost of generating thousands of visual tokens for a single document image, leading to excessive GPU memory and slower inference times, particularly in multi-page document comprehension. In this work, to address these challenges, we propose a High-resolution DocCompressor module to compress each high-resolution document image into 324 tokens, guided by low-resolution global visual features. With this compression module, to strengthen multi-page document comprehension ability and balance both token efficiency and question-answering performance, we develop the DocOwl2 under a three-stage training framework: Single-image Pretraining, Multi-image Continue-pretraining, and Multi-task Finetuning. DocOwl2 sets a new state-of-the-art across multi-page document understanding benchmarks and reduces first token latency by more than 50%. Compared to single-image MLLMs trained on similar data, our DocOwl2 achieves comparable single-page understanding performance with less than 20% of the visual tokens. Our codes, models, and data will be publicly available.
pdf
bib
abs
What Makes a Good Natural Language Prompt?
Do Xuan Long
|
Duy Dinh
|
Ngoc-Hai Nguyen
|
Kenji Kawaguchi
|
Nancy F. Chen
|
Shafiq Joty
|
Min-Yen Kan
As large language models (LLMs) have progressed towards more human-like communication and human–AI communications have become prevalent, prompting has emerged as a decisive component. However, there is limited conceptual consensus on what exactly quantifies natural language prompts. We attempt to address this question by conducting a meta-analysis surveying 150+ prompting-related papers from leading NLP and AI conferences (2022–2024), and blogs. We propose a property- and human-centric framework for evaluating prompt quality, encompassing 21 properties categorized into six dimensions. We then examine how existing studies assess their impact on LLMs, revealing their imbalanced support across models and tasks, and substantial research gaps. Further, we analyze correlations among properties in high-quality natural language prompts, deriving prompting recommendations. Finally, we explore multi-property prompt enhancements in reasoning tasks, observing that single-property enhancements often have the greatest impact. Our findings establish a foundation for property-centric prompt evaluation and optimization, bridging the gaps between human–AI communication and opening new prompting research directions.
pdf
bib
abs
X-TURING: Towards an Enhanced and Efficient Turing Test for Long-Term Dialogue Agents
Weiqi Wu
|
Hongqiu Wu
|
Hai Zhao
The Turing test examines whether AIs exhibit human-like behaviour in natural language conversations. The traditional setting limits each participant to one message at a time and requires constant human participation. This fails to reflect a natural conversational style and hinders the evaluation of dialogue agents based on Large Language Models (LLMs) in complex and prolonged interactions. This paper proposes X-Turing, which enhances the original test with a burst dialogue pattern, allowing more dynamic exchanges using consecutive messages. It further reduces human workload by iteratively generating dialogues that simulate the long-term interaction between the agent and a human to compose the majority of the test process. With the pseudo-dialogue history, the agent then engages in a shorter dialogue with a real human, which is paired with a human-human conversation on the same topic to be judged using questionnaires. We introduce the X-Turn Pass-Rate metric to assess the human likeness of LLMs across varying durations. While LLMs like GPT-4 initially perform well, achieving pass rates of 51.9% and 38.9% during 3 turns and 10 turns of dialogues respectively, their performance drops as the dialogue progresses, which underscores the difficulty in maintaining consistency in the long term.
pdf
bib
abs
Are Rules Meant to be Broken? Understanding Multilingual Moral Reasoning as a Computational Pipeline with UniMoral
Shivani Kumar
|
David Jurgens
Moral reasoning is a complex cognitive process shaped by individual experiences and cultural contexts and presents unique challenges for computational analysis. While natural language processing (NLP) offers promising tools for studying this phenomenon, current research lacks cohesion, employing discordant datasets and tasks that examine isolated aspects of moral reasoning. We bridge this gap with UniMoral, a unified dataset integrating psychologically grounded and social-media-derived moral dilemmas annotated with labels for action choices, ethical principles, contributing factors, and consequences, alongside annotators’ moral and cultural profiles. Recognizing the cultural relativity of moral reasoning, UniMoral spans six languages, Arabic, Chinese, English, Hindi, Russian, and Spanish, capturing diverse socio-cultural contexts. We demonstrate UniMoral’s utility through benchmark evaluations of three large language models (LLMs) across four tasks: action prediction, moral typology classification, factor attribution analysis, and consequence generation. Key findings reveal that while implicitly embedded moral contexts enhance the moral reasoning capability of LLMs, there remains a critical need for increasingly specialized approaches to further advance moral reasoning in these models.
pdf
bib
abs
Modality-Aware Neuron Pruning for Unlearning in Multimodal Large Language Models
Zheyuan Liu
|
Guangyao Dou
|
Xiangchi Yuan
|
Chunhui Zhang
|
Zhaoxuan Tan
|
Meng Jiang
Generative models such as Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs) trained on massive datasets can memorize and inadvertently reveal sensitive information, raising ethical and privacy concerns. While some prior works have explored this issue in the context of LLMs, it presents a unique challenge for MLLMs due to the entangled nature of knowledge across modalities, making comprehensive unlearning more difficult. To address this challenge, we propose Modality Aware Neuron Unlearning (MANU), a novel unlearning framework for MLLMs designed to selectively clip neurons based on their relative importance to the targeted forget data, curated for different modalities. Specifically, MANU consists of two stages: important neuron selection and selective pruning. The first stage identifies and collects the most influential neurons across modalities relative to the targeted forget knowledge, while the second stage is dedicated to pruning those selected neurons. MANU effectively isolates and removes the neurons that contribute most to the forget data within each modality, while preserving the integrity of retained knowledge. Our experiments conducted across various MLLM architectures illustrate that MANU can achieve a more balanced and comprehensive unlearning in each modality without largely affecting the overall model utility.
pdf
bib
abs
NGQA: A Nutritional Graph Question Answering Benchmark for Personalized Health-aware Nutritional Reasoning
Zheyuan Zhang
|
Yiyang Li
|
Nhi Ha Lan Le
|
Zehong Wang
|
Tianyi Ma
|
Vincent Galassi
|
Keerthiram Murugesan
|
Nuno Moniz
|
Werner Geyer
|
Nitesh V Chawla
|
Chuxu Zhang
|
Yanfang Ye
Diet plays a critical role in human health, yet tailoring dietary reasoning to individual health conditions remains a major challenge. Nutrition Question Answering (QA) has emerged as a popular method for addressing this problem. However, current research faces two critical limitations. On one hand, the absence of datasets involving user-specific medical information severely limits personalization. This challenge is further compounded by the wide variability in individual health needs. On the other hand, while large language models (LLMs), a popular solution for this task, demonstrate strong reasoning abilities, they struggle with the domain-specific complexities of personalized healthy dietary reasoning, and existing benchmarks fail to capture these challenges. To address these gaps, we introduce the Nutritional Graph Question Answering (NGQA) benchmark, the first graph question answering dataset designed for personalized nutritional health reasoning. NGQA leverages data from the National Health and Nutrition Examination Survey (NHANES) and the Food and Nutrient Database for Dietary Studies (FNDDS) to evaluate whether a food is healthy for a specific user, supported by explanations of the key contributing nutrients. The benchmark incorporates three question complexity settings and evaluates reasoning across three downstream tasks. Extensive experiments with LLM backbones and baseline models demonstrate that the NGQA benchmark effectively challenges existing models. In sum, NGQA addresses a critical real-world problem while advancing GraphQA research with a novel domain-specific benchmark. Our codebase and dataset are available here.
pdf
bib
abs
ReLearn: Unlearning via Learning for Large Language Models
Haoming Xu
|
Ningyuan Zhao
|
Liming Yang
|
Sendong Zhao
|
Shumin Deng
|
Mengru Wang
|
Bryan Hooi
|
Nay Oo
|
Huajun Chen
|
Ningyu Zhang
Current unlearning methods for large language models usually rely on reverse optimization to reduce target token probabilities. However, this paradigm disrupts subsequent token prediction, degrading model performance and linguistic coherence. Moreover, existing evaluation metrics overemphasize contextual forgetting while inadequately assessing response fluency and relevance. To address these challenges, we propose ReLearn, a data augmentation and fine-tuning pipeline for effective unlearning, along with a comprehensive evaluation framework. This framework introduces Knowledge Forgetting Ratio (KFR) and Knowledge Retention Ratio (KRR) to measure knowledge-level preservation, and Linguistic Score (LS) to evaluate generation quality. Our experiments show that ReLearn successfully achieves targeted forgetting while preserving high-quality outputs. Through mechanistic analysis, we further demonstrate how reverse optimization disrupts coherent text generation, while ReLearn preserves this essential capability.
pdf
bib
abs
Understanding Cross-Domain Adaptation in Low-Resource Topic Modeling
Pritom Saha Akash
|
Kevin Chen-Chuan Chang
Topic modeling plays a vital role in uncovering hidden semantic structures within text corpora, but existing models struggle in low-resource settings where limited target-domain data leads to unstable and incoherent topic inference. We address this challenge by formally introducing domain adaptation for low-resource topic modeling, where a high-resource source domain informs a low-resource target domain without overwhelming it with irrelevant content. We establish a finite-sample generalization bound showing that effective knowledge transfer depends on robust performance in both domains, minimizing latent-space discrepancy, and preventing overfitting to the data. Guided by these insights, we propose DALTA (Domain-Aligned Latent Topic Adaptation), a new framework that employs a shared encoder for domain-invariant features, specialized decoders for domain-specific nuances, and adversarial alignment to selectively transfer relevant information. Experiments on diverse low-resource datasets demonstrate that DALTA consistently outperforms state-of-the-art methods in terms of topic coherence, stability, and transferability.
pdf
bib
abs
UAlign: Leveraging Uncertainty Estimations for Factuality Alignment on Large Language Models
Boyang Xue
|
Fei Mi
|
Qi Zhu
|
Hongru Wang
|
Rui Wang
|
Sheng Wang
|
Erxin Yu
|
Xuming Hu
|
Kam-Fai Wong
Despite demonstrating impressive capabilities, Large Language Models (LLMs) still often struggle to accurately express the factual knowledge they possess, especially in cases where the LLMs’ knowledge boundaries are ambiguous. To improve LLMs’ factual expressions, we propose the UAlign framework, which leverages Uncertainty estimations to represent knowledge boundaries, and then explicitly incorporates these representations as input features into prompts for LLMs to Align with factual knowledge. First, we prepare the dataset on knowledge question-answering (QA) samples by calculating two uncertainty estimations, including confidence score and semantic entropy, to represent the knowledge boundaries for LLMs. Subsequently, using the prepared dataset, we train a reward model that incorporates uncertainty estimations and then employ the Proximal Policy Optimization (PPO) algorithm for factuality alignment on LLMs. Experimental results indicate that, by integrating uncertainty representations in LLM alignment, the proposed UAlign can significantly enhance the LLMs’ capacities to confidently answer known questions and refuse unknown questions on both in-domain and out-of-domain tasks, showing reliability improvements and good generalizability over various prompt- and training-based baselines.
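The two uncertainty estimations named above, confidence score and semantic entropy, can be sketched over a set of sampled answers. This is a simplified illustration under stated assumptions, not the UAlign implementation: answers are grouped by normalized string match as a stand-in for semantic clustering, the confidence score is assumed to be the fraction of samples matching the reference, and the names `normalize` and `confidence_and_semantic_entropy` are hypothetical.

```python
import math
from collections import Counter


def normalize(answer):
    """Crude stand-in for semantic equivalence: lowercase, trim, drop a final period."""
    return " ".join(answer.lower().strip().rstrip(".").split())


def confidence_and_semantic_entropy(samples, reference):
    """Return (confidence, entropy) for a list of sampled answers.

    Assumptions made for this sketch:
      - confidence = share of samples that match the reference answer
      - entropy    = Shannon entropy (natural log) over answer clusters
    """
    clusters = Counter(normalize(s) for s in samples)
    n = len(samples)
    confidence = clusters[normalize(reference)] / n
    entropy = -sum((c / n) * math.log(c / n) for c in clusters.values())
    return confidence, entropy


# Four sampled answers to one question; three agree with the reference.
conf, ent = confidence_and_semantic_entropy(
    ["Paris", "paris.", "Lyon", "Paris"], reference="Paris"
)
print(round(conf, 2), round(ent, 3))
```

High confidence with low entropy suggests a question inside the model's knowledge boundary, while low confidence with high entropy suggests one the model should decline to answer.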
pdf
bib
abs
CoT-Valve: Length-Compressible Chain-of-Thought Tuning
Xinyin Ma
|
Guangnian Wan
|
Runpeng Yu
|
Gongfan Fang
|
Xinchao Wang
Chain-of-Thought significantly enhances a model’s reasoning capability, but it also comes with a considerable increase in inference costs due to long chains. With the observation that the reasoning path can be easily compressed on easy tasks but is difficult to compress on hard tasks, we explore the feasibility of elastically controlling the length of reasoning paths with only one model, thereby reducing the inference overhead of reasoning models dynamically based on task difficulty. We introduce a new tuning and inference strategy named CoT-Valve, designed to allow models to generate reasoning chains of varying lengths. To achieve this, we propose to identify a direction in the parameter space that, when manipulated, can effectively control the length of generated CoT. Moreover, we show that this property is valuable for compressing the reasoning chain. We construct datasets with chains from long to short for the same questions and explore two enhanced strategies for CoT-Valve: (1) a precise length-compressible CoT tuning method, and (2) a progressive chain length compression approach. Our experiments show that CoT-Valve successfully enables controllability and compressibility of the chain and shows better performance than the prompt-based control. We applied this method to QwQ-32B-Preview, reducing reasoning chains on GSM8K from 741 to 225 tokens with a minor performance drop (95.07% to 94.92%) and on AIME from 6827 to 4629 tokens, with only one additional incorrect answer.
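The idea of moving along a single direction in parameter space can be sketched as a scaled parameter update. This is only an illustration under assumptions: the abstract does not say how the direction is parameterized or which scale corresponds to which chain length, so the function `apply_valve`, the coefficient `alpha`, and the toy parameter values below are hypothetical.

```python
def apply_valve(base_params, direction, alpha):
    """Return parameters moved along a fixed direction: theta_base + alpha * delta.

    base_params and direction map parameter names to equally sized lists of
    floats; alpha is the scalar that is varied at inference time.
    """
    return {
        name: [b + alpha * d for b, d in zip(base_params[name], direction[name])]
        for name in base_params
    }


# Toy parameters and a toy direction (hypothetical values).
base = {"w": [1.0, 2.0]}
delta = {"w": [0.5, -1.0]}

unchanged = apply_valve(base, delta, alpha=0.0)  # recovers the base model
shifted = apply_valve(base, delta, alpha=2.0)    # extrapolates along the direction
print(unchanged, shifted)
```

The appeal of such a design is that one set of base weights plus one direction can yield a family of models, rather than a separate fine-tuned model per target length.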
pdf
bib
abs
HoH: A Dynamic Benchmark for Evaluating the Impact of Outdated Information on Retrieval-Augmented Generation
Jie Ouyang
|
Tingyue Pan
|
Mingyue Cheng
|
Ruiran Yan
|
Yucong Luo
|
Jiaying Lin
|
Qi Liu
While Retrieval-Augmented Generation (RAG) has emerged as an effective approach for addressing the knowledge outdating problem in Large Language Models (LLMs), it still faces a critical challenge: the prevalence of outdated information in knowledge bases. Current research primarily focuses on incorporating up-to-date information, yet the impact of outdated information coexisting in retrieval sources remains inadequately addressed. To bridge this gap, we introduce HoH, the first benchmark specifically designed to evaluate the impact of outdated information on RAG. Our benchmark leverages token-level diff algorithms combined with LLM pipelines to efficiently create a large-scale QA dataset that accurately captures the evolution of temporal knowledge in real-world facts. Through comprehensive experiments, we reveal that outdated information significantly degrades RAG performance in two critical ways: (1) it substantially reduces response accuracy by distracting models from correct information, and (2) it can mislead models into generating potentially harmful outputs, even when current information is available. Current RAG approaches struggle with both retrieval and generation aspects when handling outdated information. These findings highlight the urgent need for innovative solutions to address the temporal challenges in RAG.
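A token-level diff between two snapshots of the same passage can be sketched with Python's standard `difflib` module. This is an illustrative stand-in, not the HoH pipeline: whitespace tokenization, the function name `changed_spans`, and the example sentences (with invented names) are assumptions made for the sketch.

```python
import difflib


def changed_spans(old_text, new_text):
    """Return (operation, old_span, new_span) for every non-equal region
    between two whitespace-tokenized versions of a passage."""
    old_tokens, new_tokens = old_text.split(), new_text.split()
    matcher = difflib.SequenceMatcher(a=old_tokens, b=new_tokens, autojunk=False)
    spans = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            spans.append(
                (tag, " ".join(old_tokens[i1:i2]), " ".join(new_tokens[j1:j2]))
            )
    return spans


# Two hypothetical snapshots of one sentence taken at different times.
old = "The club's head coach is Alice Smith"
new = "The club's head coach is Bob Jones"
spans = changed_spans(old, new)
print(spans)
```

Each changed span marks a fact that differs between snapshots, which could then be turned into a question with an outdated answer and a current answer.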
pdf
bib
abs
Uncertainty Propagation on LLM Agent
Qiwei Zhao
|
Dong Li
|
Yanchi Liu
|
Wei Cheng
|
Yiyou Sun
|
Mika Oishi
|
Takao Osaki
|
Katsushi Matsuda
|
Huaxiu Yao
|
Chen Zhao
|
Haifeng Chen
|
Xujiang Zhao
Large language models (LLMs) integrated into multi-step agent systems enable complex decision-making processes across various applications. However, their outputs often lack reliability, making uncertainty estimation crucial. Existing uncertainty estimation methods primarily focus on final-step outputs, which fail to account for cumulative uncertainty over the multi-step decision-making process and the dynamic interactions between agents and their environments. To address these limitations, we propose SAUP (Situation Awareness Uncertainty Propagation), a novel framework that propagates uncertainty through each step of an LLM-based agent’s reasoning process. SAUP incorporates situational awareness by assigning situational weights to each step’s uncertainty during the propagation. Our method, compatible with various one-step uncertainty estimation techniques, provides a comprehensive and accurate uncertainty measure. Extensive experiments on benchmark datasets demonstrate that SAUP significantly outperforms existing state-of-the-art methods, achieving up to 20% improvement in AUROC.
pdf
bib
abs
Beyond Position: the emergence of wavelet-like properties in Transformers
Valeria Ruscio
|
Umberto Nanni
|
Fabrizio Silvestri
This paper studies how Transformer models with Rotary Position Embeddings (RoPE) develop emergent, wavelet-like properties that compensate for the positional encoding’s theoretical limitations. Through an analysis spanning model scales, architectures, and training checkpoints, we show that attention heads evolve to implement multi-resolution processing analogous to wavelet transforms. We demonstrate that this scale-invariant behavior is unique to RoPE, emerges through distinct evolutionary phases during training, and statistically adheres to the fundamental uncertainty principle. Our findings suggest that the effectiveness of modern Transformers stems from their remarkable ability to spontaneously develop optimal, multi-resolution decompositions to address inherent architectural constraints.
pdf
bib
abs
Are the Hidden States Hiding Something? Testing the Limits of Factuality-Encoding Capabilities in LLMs
Giovanni Servedio
|
Alessandro De Bellis
|
Dario Di Palma
|
Vito Walter Anelli
|
Tommaso Di Noia
Factual hallucinations are a major challenge for Large Language Models (LLMs). They undermine reliability and user trust by generating inaccurate or fabricated content. Recent studies suggest that when generating false statements, the internal states of LLMs encode information about truthfulness. However, these studies often rely on synthetic datasets that lack realism, which limits generalization when evaluating the factual accuracy of text generated by the model itself. In this paper, we challenge the findings of previous work by investigating truthfulness encoding capabilities, leading to the generation of a more realistic and challenging dataset. Specifically, we extend previous work by introducing: (1) a strategy for sampling plausible true-false factoid sentences from tabular data and (2) a procedure for generating realistic, LLM-dependent true-false datasets from Question Answering collections. Our analysis of two open-source LLMs reveals that while the findings from previous studies are partially validated, generalization to LLM-generated datasets remains challenging. This study lays the groundwork for future research on factuality in LLMs and offers practical guidelines for more effective evaluation.
pdf
bib
abs
Disentangling Biased Knowledge from Reasoning in Large Language Models via Machine Unlearning
Zheyuan Liu
|
Suraj Maharjan
|
Fanyou Wu
|
Rahil Parikh
|
Belhassen Bayar
|
Srinivasan H. Sengamedu
|
Meng Jiang
The rapid development of Large Language Models (LLMs) has led to their widespread adoption across various domains, leveraging vast pre-training knowledge and impressive generalization capabilities. However, these models often inherit biased knowledge, resulting in unfair decisions in sensitive applications. It is challenging to remove this biased knowledge without compromising reasoning abilities due to the entangled nature of the learned knowledge within LLMs. To solve this problem, existing approaches have attempted to mitigate the bias using techniques such as fine-tuning with unbiased datasets, model merging, and gradient ascent. While these methods have experimentally proven effective, they can still be suboptimal in fully disentangling biases from reasoning. To address this gap, we propose Selective Disentanglement Unlearning (SDU), a novel unlearning framework that selectively removes biased knowledge while preserving reasoning capabilities. SDU operates in three stages: identifying biased parameters using a shadow LLM, fine-tuning with unbiased data, and performing selective parameter updates based on weight saliency. Experimental results across multiple LLMs show that SDU improves fairness accuracy by 14.7% and enhances reasoning performance by 62.6% compared to existing baselines.
pdf
bib
abs
LLaMAs Have Feelings Too: Unveiling Sentiment and Emotion Representations in LLaMA Models Through Probing
Dario Di Palma
|
Alessandro De Bellis
|
Giovanni Servedio
|
Vito Walter Anelli
|
Fedelucio Narducci
|
Tommaso Di Noia
Large Language Models (LLMs) have rapidly become central to NLP, demonstrating their ability to adapt to various tasks through prompting techniques, including sentiment analysis. However, we still have a limited understanding of how these models capture sentiment-related information. This study probes the hidden layers of LLaMA models to pinpoint where sentiment features are most represented and to assess how this affects sentiment analysis. Using probe classifiers, we analyze sentiment encoding across layers and scales, identifying the layers and pooling methods that best capture sentiment signals. Our results show that sentiment information is most concentrated in mid-layers for binary polarity tasks, with detection accuracy increasing up to 14% over prompting techniques. Additionally, we find that in decoder-only models, the last token is not consistently the most informative for sentiment encoding. Finally, this approach enables sentiment tasks to be performed with memory requirements reduced by an average of 57%. These insights contribute to a broader understanding of sentiment in LLMs, suggesting layer-specific probing as an effective approach for sentiment tasks beyond prompting, with potential to enhance model utility and reduce memory requirements.
pdf
bib
abs
CxGGEC: Construction-Guided Grammatical Error Correction
Yayu Cao
|
Tianxiang Wang
|
Lvxiaowei Xu
|
Zhenyao Wang
|
Ming Cai
The grammatical error correction (GEC) task aims to detect and correct grammatical errors in text to enhance its accuracy and readability. Current GEC methods primarily rely on grammatical labels for syntactic information, often overlooking the inherent usage patterns of language. In this work, we explore the potential of construction grammar (CxG) to improve GEC by leveraging constructions to capture underlying language patterns and guide corrections. We first establish a comprehensive construction inventory from corpora. Next, we introduce a construction prediction model that identifies potential constructions in ungrammatical sentences using a noise-tolerant language model. Finally, we train a CxGGEC model on construction-masked parallel data, which performs GEC by decoding construction tokens into their original forms and correcting erroneous tokens. Extensive experiments on English and Chinese GEC benchmarks demonstrate the effectiveness of our approach.
pdf
bib
abs
Beyond Sequences: Two-dimensional Representation and Dependency Encoding for Code Generation
Xiangyu Zhang
|
Yu Zhou
|
Guang Yang
|
Wei Cheng
|
Taolue Chen
The advent of large language models has significantly advanced automatic code generation, transforming the way programmers write code. Inspired by natural language processing, mainstream code generation approaches represent code as a linear sequence of tokens. In this paper, we propose to represent code snippets as two-dimensional entities, where both code lines and tokens within lines are explicitly modeled. This representation allows us to capture the hierarchical and spatial structure of code, especially the dependencies between code lines. Our method CoDE introduces a dependency encoding approach that leverages dictionary learning to perform semantic matching between code lines. As such, it avoids the reliance on strict position indices, leading to better generalization to code with diverse context and lengths. We thoroughly evaluate CoDE based on four categories of tasks. The experimental results showcase its generalizability, context understanding and retrieval, as well as interpretability in code generation.
pdf
bib
abs
HD-NDEs: Neural Differential Equations for Hallucination Detection in LLMs
Qing Li
|
Jiahui Geng
|
Zongxiong Chen
|
Derui Zhu
|
Yuxia Wang
|
Congbo Ma
|
Chenyang Lyu
|
Fakhri Karray
In recent years, large language models (LLMs) have made remarkable advancements, yet hallucination, where models produce inaccurate or non-factual statements, remains a significant challenge for real-world deployment. Although current classification-based methods, such as SAPLMA, are highly efficient in mitigating hallucinations, they struggle when non-factual information arises in the early or mid-sequence of outputs, reducing their reliability. To address these issues, we propose Hallucination Detection-Neural Differential Equations (HD-NDEs), a novel method that systematically assesses the truthfulness of statements by capturing the full dynamics of LLMs within their latent space. Our approach applies neural differential equations (Neural DEs) to model the dynamic system in the latent space of LLMs. Then, the sequence in the latent space is mapped to the classification space for truth assessment. Extensive experiments across five datasets and six widely used LLMs demonstrate the effectiveness of HD-NDEs, notably achieving over 14% improvement in AUC-ROC on the True-False dataset compared to state-of-the-art techniques.
pdf
bib
abs
What Is That Talk About? A Video-to-Text Summarization Dataset for Scientific Presentations
Dongqi Liu
|
Chenxi Whitehouse
|
Xi Yu
|
Louis Mahon
|
Rohit Saxena
|
Zheng Zhao
|
Yifu Qiu
|
Mirella Lapata
|
Vera Demberg
Transforming recorded videos into concise and accurate textual summaries is a growing challenge in multimodal learning. This paper introduces VISTA, a dataset specifically designed for video-to-text summarization in scientific domains. VISTA contains 18,599 recorded AI conference presentations paired with their corresponding paper abstracts. We benchmark the performance of state-of-the-art large models and apply a plan-based framework to better capture the structured nature of abstracts. Both human and automated evaluations confirm that explicit planning enhances summary quality and factual consistency. However, a considerable gap remains between models and human performance, highlighting the challenges of our dataset. This study aims to pave the way for future research on scientific video-to-text summarization.
pdf
bib
abs
NeuSym-RAG: Hybrid Neural Symbolic Retrieval with Multiview Structuring for PDF Question Answering
Ruisheng Cao
|
Hanchong Zhang
|
Tiancheng Huang
|
Zhangyi Kang
|
Yuxin Zhang
|
Liangtai Sun
|
Hanqi Li
|
Yuxun Miao
|
Shuai Fan
|
Lu Chen
|
Kai Yu
The increasing number of academic papers poses significant challenges for researchers to efficiently acquire key details. While retrieval augmented generation (RAG) shows great promise in large language model (LLM) based automated question answering, previous works often isolate neural and symbolic retrieval despite their complementary strengths. Moreover, conventional single-view chunking neglects the rich structure and layout of PDFs, e.g., sections and tables. In this work, we propose NeuSym-RAG, a hybrid neural symbolic retrieval framework which combines both paradigms in an interactive process. By leveraging multi-view chunking and schema-based parsing, NeuSym-RAG organizes semi-structured PDF content into both the relational database and vectorstore, enabling LLM agents to iteratively gather context until sufficient to generate answers. Experiments on three full PDF-based QA datasets, including a self-annotated one AirQA-Real, show that NeuSym-RAG stably defeats both the vector-based RAG and various structured baselines, highlighting its capacity to unify both retrieval schemes and utilize multiple views.
pdf
bib
abs
ProvBench: A Benchmark of Legal Provision Recommendation for Contract Auto-Reviewing
Xiuxuan Shen
|
Zhongyuan Jiang
|
Junsan Zhang
|
Junxiao Han
|
Yao Wan
|
Chengjie Guo
|
Bingcheng Liu
|
Jie Wu
|
Renxiang Li
|
Philip S. Yu
Contract review is a critical process to protect the rights and interests of the parties involved. However, this process is time-consuming, labor-intensive, and costly, especially when a contract faces multiple rounds of review. To accelerate the contract review and promote the completion of transactions, this paper introduces a novel benchmark of legal provision recommendation and conflict detection for contract auto-reviewing (ProvBench), which aims to recommend the legal provisions related to contract clauses and detect possible legal conflicts. Specifically, we construct the first Legal Provision Recommendation Dataset: ProvData, which covers 8 common contract types. In addition, we conduct extensive experiments to evaluate ProvBench on various state-of-the-art models. Experimental results validate the feasibility of ProvBench and demonstrate the effectiveness of ProvData. Finally, we identify potential challenges in the ProvBench and advocate for further investigation.
pdf
bib
abs
F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching
Yushen Chen
|
Zhikang Niu
|
Ziyang Ma
|
Keqi Deng
|
Chunhui Wang
|
JianZhao JianZhao
|
Kai Yu
|
Xie Chen
This paper introduces F5-TTS, a fully non-autoregressive text-to-speech system based on flow matching with Diffusion Transformer (DiT). Without requiring complex designs such as duration model, text encoder, and phoneme alignment, the text input is simply padded with filler tokens to the same length as input speech, and then the denoising is performed for speech generation, which was originally proved feasible by E2 TTS. However, the original design of E2 TTS makes it hard to follow due to its slow convergence and low robustness. To address these issues, we first model the input with ConvNeXt to refine the text representation, making it easy to align with the speech. We further propose an inference-time Sway Sampling strategy, which significantly improves our model’s performance and efficiency. This sampling strategy for flow step can be easily applied to existing flow matching based models without retraining. Our design allows faster training and achieves an inference RTF of 0.15, which is greatly improved compared to state-of-the-art diffusion-based TTS models. Trained on a public 100K hours multilingual dataset, our F5-TTS exhibits highly natural and expressive zero-shot ability, seamless code-switching capability, and speed control efficiency. We have released all codes and checkpoints to promote community development, at https://SWivid.github.io/F5-TTS/.
pdf
bib
abs
AutoMedEval: Harnessing Language Models for Automatic Medical Capability Evaluation
Xiechi Zhang
|
Zetian Ouyang
|
Linlin Wang
|
Gerard De Melo
|
Zhu Cao
|
Xiaoling Wang
|
Ya Zhang
|
Yanfeng Wang
|
Liang He
With the proliferation of large language models (LLMs) in the medical domain, there is increasing demand for improved evaluation techniques to assess their capabilities. However, traditional metrics like F1 and ROUGE, which rely on token overlaps to measure quality, significantly overlook the importance of medical terminology. While human evaluation tends to be more reliable, it can be very costly and may also suffer from inaccuracies due to limits in human expertise and motivation. Although there are some evaluation methods based on LLMs, their usability in the medical field is limited due to their proprietary nature or lack of expertise. To tackle these challenges, we present AutoMedEval, an open-sourced automatic evaluation model with 13B parameters specifically engineered to measure the question-answering proficiency of medical LLMs. The overarching objective of AutoMedEval is to assess the quality of responses produced by diverse models, aspiring to significantly reduce the dependence on human evaluation. Specifically, we propose a hierarchical training method involving curriculum instruction tuning and an iterative knowledge introspection mechanism, enabling AutoMedEval to acquire professional medical assessment capabilities with limited instructional data. Human evaluations indicate that AutoMedEval surpasses other baselines in terms of correlation with human judgments.
pdf
bib
abs
CoT-based Synthesizer: Enhancing LLM Performance through Answer Synthesis
Bohan Zhang
|
Xiaokang Zhang
|
Jing Zhang
|
Jifan Yu
|
Sijia Luo
|
Jie Tang
Current inference scaling methods, such as Self-consistency and Best-of-N, have proven effective in improving the accuracy of LLMs on complex reasoning tasks. However, these methods rely heavily on the quality of candidate responses and are unable to produce correct answers when all candidates are incorrect. In this paper, we propose a novel inference scaling strategy, CoT-based Synthesizer, which leverages CoT reasoning to synthesize superior answers by analyzing complementary information from multiple candidate responses, even when all candidates are flawed. To support a lightweight and cost-effective implementation, we introduce an automated data generation pipeline that creates diverse training data. This enables smaller LLMs trained on this data to improve the inference accuracy of larger models, including API-based LLMs. Experimental results across four benchmark datasets with seven policy models demonstrate that our method significantly enhances performance, with gains of 11.8% for Llama3-8B and 10.3% for GPT-4o on the MATH dataset. The corresponding training data and code are publicly available on the [repository](https://github.com/RUCKBReasoning/CoT-based-Synthesizer).
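The limitation of selection-based inference scaling noted in the abstract can be seen in a small sketch. This is an illustration only, with hypothetical function names and made-up candidate answers and scores. It shows the two baselines (Self-consistency as majority voting, Best-of-N as picking the highest-scored candidate), not the CoT-based Synthesizer itself.

```python
from collections import Counter
from typing import List


def self_consistency(answers: List[str]) -> str:
    """Majority vote: return the most frequent final answer among the samples."""
    return Counter(answers).most_common(1)[0][0]


def best_of_n(answers: List[str], scores: List[float]) -> str:
    """Return the candidate with the highest score (scores are assumed to
    come from some external reward or verifier model)."""
    best_index = max(range(len(answers)), key=lambda i: scores[i])
    return answers[best_index]


# Hypothetical candidates for a question whose correct answer is "42".
candidates = ["41", "43", "41"]
scores = [0.2, 0.9, 0.4]
gold = "42"

voted = self_consistency(candidates)
picked = best_of_n(candidates, scores)

# Both methods can only return an answer that already appears among the
# candidates, so neither can recover the correct answer here.
print(voted, picked, gold in candidates)
```

Because both baselines select from the candidate pool, their output is bounded by the best candidate. A synthesis step that writes a new answer is not bounded in that way.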
pdf
bib
abs
Efficiently Identifying Watermarked Segments in Mixed-Source Texts
Xuandong Zhao
|
Chenwen Liao
|
Yu-Xiang Wang
|
Lei Li
Text watermarks in large language models (LLMs) are increasingly used to detect synthetic text, mitigating misuse cases like fake news and academic dishonesty. While existing watermarking detection techniques primarily focus on classifying entire documents as watermarked or not, they often neglect the common scenario of identifying individual watermark segments within longer, mixed-source documents. Drawing inspiration from plagiarism detection systems, we propose two novel methods for partial watermark detection. First, we develop a geometry cover detection framework aimed at determining whether there is a watermark segment in long text. Second, we introduce an adaptive online learning algorithm to pinpoint the precise location of watermark segments within the text. Evaluated on three popular watermarking techniques (KGW-Watermark, Unigram-Watermark, and Gumbel-Watermark), our approach achieves high accuracy, significantly outperforming baseline methods. Moreover, our framework is adaptable to other watermarking techniques, offering new insights for precise watermark detection. Our code is publicly available at https://github.com/XuandongZhao/llm-watermark-location.
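To make the segment-level setting concrete, the sketch below scans a token sequence with a fixed-size sliding window and applies the standard green-list z-score used for KGW-style watermarks. It is a naive baseline for illustration, not the geometry cover framework or the online learning algorithm from the paper. The per-token green flags, the window size, and the threshold are all assumed inputs chosen for the toy example.

```python
import math
from typing import List, Tuple


def green_z_score(green_flags: List[bool], gamma: float) -> float:
    """One-proportion z-score for the number of green-list tokens, where
    gamma is the expected green fraction in unwatermarked text."""
    total = len(green_flags)
    hits = sum(green_flags)
    return (hits - gamma * total) / math.sqrt(total * gamma * (1.0 - gamma))


def flag_windows(green_flags: List[bool], window: int, gamma: float,
                 threshold: float) -> List[Tuple[int, int]]:
    """Return (start, end) spans whose z-score reaches the threshold."""
    flagged = []
    for start in range(len(green_flags) - window + 1):
        z = green_z_score(green_flags[start:start + window], gamma)
        if z >= threshold:
            flagged.append((start, start + window))
    return flagged


# Toy sequence: 20 unwatermarked tokens (half green), 20 watermarked tokens
# (all green), then 20 unwatermarked tokens again.
flags = [True, False] * 10 + [True] * 20 + [True, False] * 10

spans = flag_windows(flags, window=20, gamma=0.5, threshold=4.0)
print(spans)
```

A whole-document score would dilute the 20 watermarked tokens among 40 unwatermarked ones. Scoring windows separately keeps the local signal visible, at the cost of choosing a window size in advance.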
pdf
bib
abs
Assessing Dialect Fairness and Robustness of Large Language Models in Reasoning Tasks
Fangru Lin
|
Shaoguang Mao
|
Emanuele La Malfa
|
Valentin Hofmann
|
Adrian de Wynter
|
Xun Wang
|
Si-Qing Chen
|
Michael J. Wooldridge
|
Janet B. Pierrehumbert
|
Furu Wei
Language is not monolithic. While benchmarks, including those designed for multiple languages, are often used as proxies to evaluate the performance of Large Language Models (LLMs), they tend to overlook the nuances of within-language variation and thus fail to model the experience of speakers of non-standard dialects. Focusing on African American Vernacular English (AAVE), we present the first study aimed at objectively assessing the fairness and robustness of LLMs in handling dialects across canonical reasoning tasks, including algorithm, math, logic, and integrated reasoning. We introduce **ReDial** (**Re**asoning with **Dial**ect Queries), a benchmark containing 1.2K+ parallel query pairs in Standardized English and AAVE. We hire AAVE speakers, including experts with computer science backgrounds, to rewrite seven popular benchmarks, such as HumanEval and GSM8K. With ReDial, we evaluate widely used LLMs, including GPT, Claude, Llama, Mistral, and the Phi model families. Our findings reveal that almost all of these widely used models show significant brittleness and unfairness to queries in AAVE. Our work establishes a systematic and objective framework for analyzing LLM bias in dialectal queries. Moreover, it highlights how mainstream LLMs provide unfair service to dialect speakers in reasoning tasks, laying a critical foundation for future research.
pdf
bib
abs
Towards a More Generalized Approach in Open Relation Extraction
Qing Wang
|
Yuepei Li
|
Qiao Qiao
|
Kang Zhou
|
Qi Li
Open Relation Extraction (OpenRE) seeks to identify and extract novel relational facts between named entities from unlabeled data without pre-defined relation schemas. Traditional OpenRE methods typically assume that the unlabeled data consists solely of novel relations or is pre-divided into known and novel instances. However, in real-world scenarios, novel relations are arbitrarily distributed. In this paper, we propose a generalized OpenRE setting that considers unlabeled data as a mixture of both known and novel instances. To address this, we propose MixORE, a two-phase framework that integrates relation classification and clustering to jointly learn known and novel relations. Experiments on three benchmark datasets demonstrate that MixORE consistently outperforms competitive baselines in known relation classification and novel relation clustering. Our findings contribute to the advancement of generalized OpenRE research and real-world applications.
pdf
bib
abs
Adaptive Retrieval Without Self-Knowledge? Bringing Uncertainty Back Home
Viktor Moskvoretskii
|
Maria Marina
|
Mikhail Salnikov
|
Nikolay Ivanov
|
Sergey Pletenev
|
Daria Galimzianova
|
Nikita Krayko
|
Vasily Konovalov
|
Irina Nikishina
|
Alexander Panchenko
Retrieval Augmented Generation (RAG) improves the correctness of Question Answering (QA) and addresses hallucinations in Large Language Models (LLMs), yet greatly increases computational costs. Besides, RAG is not always needed, as it may introduce irrelevant information. Recent adaptive retrieval methods integrate LLMs’ intrinsic knowledge with external information by appealing to LLM self-knowledge, but they often neglect efficiency evaluations and comparisons with uncertainty estimation techniques. We bridge this gap by conducting a comprehensive analysis of 35 adaptive retrieval methods, including 8 recent approaches and 27 uncertainty estimation techniques, across 6 datasets using 10 metrics for QA performance, self-knowledge, and efficiency. Our findings show that uncertainty estimation techniques often outperform complex pipelines in terms of efficiency and self-knowledge, while maintaining comparable QA performance.
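An uncertainty-gated retrieval decision can be sketched in a few lines. This is a minimal illustration under stated assumptions: mean token entropy is used as one common uncertainty estimate (the abstract does not list the 27 techniques it compares), the per-token probability distributions are made up, and the threshold and function names are hypothetical.

```python
import math
from typing import List


def mean_token_entropy(token_distributions: List[List[float]]) -> float:
    """Average Shannon entropy (in nats) over per-token probability distributions."""
    total = 0.0
    for dist in token_distributions:
        total += -sum(p * math.log(p) for p in dist if p > 0.0)
    return total / len(token_distributions)


def should_retrieve(token_distributions: List[List[float]], threshold: float) -> bool:
    """Call the retriever only when the model's own answer looks uncertain."""
    return mean_token_entropy(token_distributions) > threshold


# Hypothetical next-token distributions for two draft answers.
confident_answer = [[0.97, 0.02, 0.01], [0.99, 0.005, 0.005]]
uncertain_answer = [[0.40, 0.35, 0.25], [0.34, 0.33, 0.33]]

THRESHOLD = 0.5  # assumed value; in practice it would be tuned on held-out data

print(should_retrieve(confident_answer, THRESHOLD))
print(should_retrieve(uncertain_answer, THRESHOLD))
```

The gate costs one pass over probabilities the model already produced, which is why such estimates compare well on efficiency against pipelines that need extra LLM calls to decide whether to retrieve.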
pdf
bib
abs
Evaluating Language Models as Synthetic Data Generators
Seungone Kim
|
Juyoung Suk
|
Xiang Yue
|
Vijay Viswanathan
|
Seongyun Lee
|
Yizhong Wang
|
Kiril Gashteovski
|
Carolin Lawrence
|
Sean Welleck
|
Graham Neubig
Given the increasing use of synthetic data in language model (LM) post-training, an LM’s ability to generate high-quality data has become nearly as crucial as its ability to solve problems directly. While prior works have focused on developing effective data generation methods, they lack systematic comparison of different LMs as data generators in a unified setting. To address this gap, we propose AgoraBench, a benchmark that provides standardized settings and metrics to evaluate LMs’ data generation abilities. Through synthesizing 1.26 million training instances using 6 LMs and training 99 student models, we uncover key insights about LMs’ data generation capabilities. First, we observe that LMs exhibit distinct strengths. For instance, GPT-4o excels at generating new problems, while Claude-3.5-Sonnet performs better at enhancing existing ones. Furthermore, our analysis reveals that an LM’s data generation ability doesn’t necessarily correlate with its problem-solving ability. Instead, multiple intrinsic features of data quality—including response quality, perplexity, and instruction difficulty—collectively serve as better indicators. Finally, we demonstrate that strategic choices in output format and cost-conscious model selection significantly impact data generation effectiveness. Our code, checkpoints, and data are all publicly available at https://github.com/neulab/data-agora.
pdf
bib
abs
Can Graph Descriptive Order Affect Solving Graph Problems with LLMs?
Yuyao Ge
|
Shenghua Liu
|
Baolong Bi
|
Yiwei Wang
|
Lingrui Mei
|
Wenjie Feng
|
Lizhe Chen
|
Xueqi Cheng
Large language models (LLMs) have achieved significant success in reasoning tasks, including mathematical reasoning and logical deduction. Among these reasoning tasks, graph problems stand out due to their complexity and unique structural characteristics, attracting considerable attention from researchers. Previous studies have explored LLMs’ graph reasoning abilities through various techniques, such as different encoding methods for graph structures and the use of carefully designed prompts. However, a critical factor has been mostly overlooked: the prompt sequential order in which graph descriptions are presented to the models. In this study, we present the first comprehensive analysis of how the order of graph descriptions impacts LLM performance. Specifically, we comprehensively evaluate four graph description orders across six graph problems using six mainstream LLMs. The results reveal that: (1) ordered graph descriptions significantly improve LLMs’ comprehension of graph structures; (2) the robustness of LLMs to graph description order varies across different tasks; and (3) the impact of graph order on performance is closely related to the inherent characteristics of tasks. This study provides a critical advancement in the application of LLMs for solving graph-related problems, paving the way for future research to optimize model performance through strategic graph description ordering.
pdf
bib
abs
Learning to Rewrite: Generalized LLM-Generated Text Detection
Wei Hao
|
Ran Li
|
Weiliang Zhao
|
Junfeng Yang
|
Chengzhi Mao
Detecting text generated by Large Language Models (LLMs) is crucial, yet current detectors often struggle to generalize in open-world settings. We introduce Learning2Rewrite, a novel framework to detect LLM-generated text with exceptional generalization to unseen domains. Capitalizing on the finding that LLMs inherently modify LLM-generated content less than human-written text when rewriting, we train an LLM to amplify this disparity, yielding a more distinguishable and generalizable edit distance across diverse text distributions. Extensive experiments on data from 21 independent domains and four major LLMs (GPT-3.5, GPT-4, Gemini, and Llama-3) demonstrate that our detector outperforms state-of-the-art detection methods by up to 23.04% in AUROC for in-distribution tests, 35.10% for out-of-distribution tests, and 48.66% under adversarial attacks. Our unique training objective ensures better generalizability compared to directly training for classification, even when leveraging the same amount of tunable parameters. Our findings suggest that reinforcing LLMs’ inherent rewriting tendencies offers a robust and scalable solution for detecting LLM-generated text.
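The detection signal described above, how much a text changes when an LLM rewrites it, can be sketched as a normalized edit distance with a threshold. This is an illustration only: the rewriting model is not included (the caller supplies the rewritten text), character-level Levenshtein distance is an assumed choice of edit distance, and the threshold value and function names are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance via the standard dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def rewrite_score(original: str, rewritten: str) -> float:
    """Edit distance normalized by the longer text, in the range [0, 1]."""
    longest = max(len(original), len(rewritten), 1)
    return levenshtein(original, rewritten) / longest


def looks_llm_generated(original: str, rewritten: str, threshold: float = 0.2) -> bool:
    """Flag the text when the rewrite changed little (assumed threshold)."""
    return rewrite_score(original, rewritten) < threshold


# Toy strings standing in for (input text, LLM rewrite) pairs.
print(looks_llm_generated("abcdefghij", "abcdefghiX"))  # lightly edited
print(looks_llm_generated("abcdefghij", "zzzzzfghij"))  # heavily edited
```

A fixed threshold on a raw rewrite distance is the fragile part of this baseline, and that gap is what training the rewriter to amplify the disparity is meant to address.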
pdf
bib
abs
Evaluating Multimodal Large Language Models on Video Captioning via Monte Carlo Tree Search
Linhao Yu
|
Xingguang Ji
|
Yahui Liu
|
Fanheng Kong
|
Chenxi Sun
|
Jingyuan Zhang
|
Hongzhi Zhang
|
V. W.
|
Fuzheng Zhang
|
Deyi Xiong
Video captioning can be used to assess the video understanding capabilities of Multimodal Large Language Models (MLLMs). However, existing benchmarks and evaluation protocols suffer from crucial issues, such as inadequate or homogeneous creation of key points, exorbitant cost of data creation, and limited evaluation scopes. To address these issues, we propose an automatic framework, named AutoCaption, which leverages Monte Carlo Tree Search (MCTS) to construct numerous and diverse descriptive sentences (i.e., key points) that thoroughly represent video content in an iterative way. This iterative captioning strategy enables the continuous enhancement of video details such as actions, objects’ attributes, environment details, etc. We apply AutoCaption to curate MCTS-VCB, a fine-grained video caption benchmark covering video details, thereby enabling a comprehensive evaluation of MLLMs on the video captioning task. We evaluate more than 20 open- and closed-source MLLMs of varying sizes on MCTS-VCB. Results show that MCTS-VCB can effectively and comprehensively evaluate the video captioning capability, with Gemini-1.5-Pro achieving the highest F1 score of 71.2. Interestingly, we fine-tune InternVL2.5-8B with the AutoCaption-generated data, which helps the model achieve an overall improvement of 25.0% on MCTS-VCB and 16.3% on DREAM-1K, further demonstrating the effectiveness of AutoCaption. The code and data are available at https://github.com/tjunlp-lab/MCTS-VCB.
pdf
bib
abs
GIFT-SW: Gaussian noise Injected Fine-Tuning of Salient Weights for LLMs
Maxim Zhelnin
|
Viktor Moskvoretskii
|
Egor Shvetsov
|
Maria Krylova
|
Venediktov Egor
|
Zuev Aleksandr
|
Evgeny Burnaev
Parameter Efficient Fine-Tuning (PEFT) methods have gained popularity and democratized the usage of Large Language Models (LLMs). Recent studies have shown that a small subset of weights significantly impacts performance. Based on this observation, we introduce a novel PEFT method, called Gaussian noise Injected Fine Tuning of Salient Weights (GIFT-SW). Our method updates only salient columns, while injecting Gaussian noise into non-salient ones. To identify these columns, we developed a generalized sensitivity metric that extends and unifies metrics from previous studies. Experiments with LLaMA models demonstrate that GIFT-SW outperforms full fine-tuning and modern PEFT methods under the same computational budget. Moreover, GIFT-SW offers practical advantages for recovering the performance of models subjected to mixed-precision quantization while keeping salient weights in full precision.
pdf
bib
abs
Quaff: Quantized Parameter-Efficient Fine-Tuning under Outlier Spatial Stability Hypothesis
Hong Huang
|
Dapeng Wu
Large language models (LLMs) have made exciting achievements across various domains, yet their deployment on resource-constrained personal devices remains hindered by the prohibitive computational and memory demands of task-specific fine-tuning. While quantization offers a pathway to efficiency, existing methods struggle to balance performance and overhead, either incurring high computational/memory costs or failing to address activation outliers—a critical bottleneck in quantized fine-tuning. To address these challenges, we propose the Outlier Spatial Stability Hypothesis (__OSSH__): _During fine-tuning, certain activation outlier channels retain stable spatial positions across training iterations._ Building on OSSH, we propose __Quaff__, a Quantized parameter-efficient fine-tuning framework for LLMs, optimizing low-precision activation representations through targeted momentum scaling. Quaff dynamically suppresses outliers exclusively in invariant channels using lightweight operations, eliminating full-precision weight storage and global rescaling while reducing quantization errors. Extensive experiments across ten benchmarks validate OSSH and demonstrate Quaff’s efficacy. Specifically, on the GPQA reasoning benchmark, Quaff achieves a 1.73× latency reduction and 30% memory savings over full-precision fine-tuning while improving accuracy by 0.6% on the Phi-3 model, reconciling the triple trade-off between efficiency, performance, and deployability. By enabling consumer-grade GPU fine-tuning (e.g., RTX 2080 Super) without sacrificing model utility, Quaff democratizes personalized LLM deployment. The code is available at https://anonymous.4open.science/r/Quaff-B322/.
pdf
bib
abs
Unsolvable Problem Detection: Robust Understanding Evaluation for Large Multimodal Models
Atsuyuki Miyai
|
Jingkang Yang
|
Jingyang Zhang
|
Yifei Ming
|
Qing Yu
|
Go Irie
|
Yixuan Li
|
Hai Helen Li
|
Ziwei Liu
|
Kiyoharu Aizawa
This paper introduces a novel task to evaluate the robust understanding capability of Large Multimodal Models (LMMs), termed Unsolvable Problem Detection (UPD). Multiple-choice question answering (MCQA) is widely used to assess the understanding capability of LMMs, but it does not guarantee that LMMs truly comprehend the answer. UPD assesses the LMM’s ability to withhold answers when encountering unsolvable problems of MCQA, verifying whether the model truly understands the answer. UPD encompasses three problems: Absent Answer Detection (AAD), Incompatible Answer Set Detection (IASD), and Incompatible Visual Question Detection (IVQD), covering unsolvable cases like answer-lacking or incompatible choices and image-question mismatches. For the evaluation, we introduce the MM-UPD Bench, a benchmark for assessing performance across various ability dimensions. Our experiments reveal that even most LMMs, which demonstrate adequate performance on existing benchmarks, struggle significantly with MM-UPD, underscoring a novel aspect of trustworthiness that current benchmarks have overlooked. A detailed analysis shows that LMMs have different bottlenecks and chain-of-thought and self-reflection improved performance for LMMs with the bottleneck in their LLM capability. We hope our insights will enhance the broader understanding and development of more reliable LMMs.
pdf
bib
abs
AlignMMBench: Evaluating Chinese Multimodal Alignment in Large Vision-Language Models
Yuhang Wu
|
Wenmeng Yu
|
Yean Cheng
|
Yan Wang
|
Xiaohan Zhang
|
Jiazheng Xu
|
Ming Ding
|
Yuxiao Dong
Evaluating the alignment capabilities of large Vision-Language Models (VLMs) is essential for determining their effectiveness as helpful assistants. However, existing benchmarks primarily focus on basic abilities using nonverbal methods, such as yes-no and multiple-choice questions. In this paper, we address this gap by introducing AlignMMBench, which provides more nuanced evaluations of alignment capabilities and is the first benchmark specifically designed for Chinese visual contexts. This benchmark is meticulously curated from real-world scenarios and internet sources, encompassing thirteen specific tasks across three categories, and includes both single-turn and multi-turn dialogue scenarios. Incorporating a prompt rewrite strategy, AlignMMBench encompasses 1,054 images and 4,978 question-answer pairs. To facilitate the evaluation pipeline, we develop CritiqueVLM, a rule-calibrated evaluator that exceeds GPT-4’s evaluation ability. Additionally, we measure the “alignment score”, a quantitative metric designed to assess the robustness and stability of models across diverse prompts. Finally, we evaluate the performance of representative VLMs on AlignMMBench, offering insights into the capabilities and limitations of different VLM architectures. The evaluation code and data are available at https://github.com/THUDM/AlignMMBench.
pdf
bib
abs
Biased LLMs can Influence Political Decision-Making
Jillian Fisher
|
Shangbin Feng
|
Robert Aron
|
Thomas Richardson
|
Yejin Choi
|
Daniel W Fisher
|
Jennifer Pan
|
Yulia Tsvetkov
|
Katharina Reinecke
As modern large language models (LLMs) become integral to everyday tasks, concerns about their inherent biases and their potential impact on human decision-making have emerged. While bias in models is well documented, less is known about how these biases influence human decisions. This paper presents two interactive experiments investigating the effects of partisan bias in LLMs on political opinions and decision-making. Participants interacted freely with either a biased liberal, biased conservative, or unbiased control model while completing these tasks. We found that participants exposed to partisan biased models were significantly more likely to adopt opinions and make decisions which matched the LLM’s bias. Even more surprising, this influence was seen when the model bias and personal political partisanship of the participant were opposite. However, we also discovered that prior knowledge of AI was weakly correlated with a reduction of the impact of the bias, highlighting the possible importance of AI education for robust mitigation of bias effects. Our findings not only highlight the critical effects of interacting with biased LLMs and their ability to impact public discourse and political conduct, but also highlight potential techniques for mitigating these risks in the future.
pdf
bib
abs
LexTempus: Enhancing Temporal Generalizability of Legal Language Models Through Dynamic Mixture of Experts
Santosh T.y.s.s
|
Tuan-Quang Vuong
The rapid evolution of legal concepts over time necessitates that legal language models adapt swiftly, accounting for the temporal dynamics. However, prior works have largely neglected this crucial dimension, treating legal adaptation as a static problem rather than a continuous process. To address this gap, we pioneer LexTempus, a dynamic mixture of experts model that explicitly models the temporal evolution of legal language in a parameter-efficient online learning framework. LexTempus starts with a single lightweight adapter expert and dynamically expands by adding new experts as significant deviations in the data distribution are detected. This self-expansion strategy allows LexTempus to adapt to new information without forgetting past knowledge, thereby improving temporal generalization. We use a non-parametric similarity-based router to merge relevant experts into a unified expert for each test instance, ensuring efficient inference without additional overhead. We validate the effectiveness of LexTempus on ECHR and EU case law datasets, demonstrating its superiority in both perplexity and open-ended text generation quality metrics.
pdf
bib
abs
That is Unacceptable: the Moral Foundations of Canceling
Soda Marem Lo
|
Oscar Araque
|
Rajesh Sharma
|
Marco Antonio Stranisci
Canceling is a morally-driven phenomenon that hinders the development of safe social media platforms and contributes to ideological polarization. To address this issue we present the Canceling Attitudes Detection (CADE) dataset, an annotated corpus of canceling incidents aimed at exploring the factors of disagreements in evaluating people’s canceling attitudes on social media. Specifically, we study the impact of annotators’ morality in their perception of canceling, showing that morality is an independent axis for the explanation of disagreement on this phenomenon. Annotators’ judgments heavily depend on the type of controversial events and involved celebrities. This shows the need to develop more event-centric datasets to better understand how harms are perpetrated in social media and to develop more aware technologies for their detection.
pdf
bib
abs
FloorPlan-LLaMa: Aligning Architects’ Feedback and Domain Knowledge in Architectural Floor Plan Generation
Jun Yin
|
Pengyu Zeng
|
Haoyuan Sun
|
Yuqin Dai
|
Han Zheng
|
Miao Zhang
|
Yachao Zhang
|
Shuai Lu
Floor plans serve as a graphical language through which architects sketch and communicate their design ideas. In the Architecture, Engineering, and Construction (AEC) design stages, generating floor plans is a complex task requiring domain expertise and alignment with user requirements. However, existing evaluation methods for floor plan generation rely mainly on statistical metrics like FID, GED, and PSNR, which often fail to evaluate using domain knowledge. As a result, even high-performing models on these metrics struggle to generate viable floor plans in practice. To address this, (1) we propose ArchiMetricsNet, the first floor plan dataset that includes functionality, flow, and overall evaluation scores, along with detailed textual analyses. We trained FloorPlan-MPS (Multi-dimensional Preference Score) on it. (2) We develop FloorPlan-LLaMa, a floor plan generation model based on an autoregressive framework. To integrate architects’ professional expertise and preferences, FloorPlan-MPS serves as the reward model during the RLHF (Reinforcement Learning from Human Feedback) process, aligning FP-LLaMa with the needs of the architectural community. (3) Comparative experiments demonstrate that our method outperforms baseline models in both text-conditional and class-conditional tasks. Validation by professional architects confirms that our approach yields more rational plans and aligns better with human preferences.
pdf
bib
abs
TheoremExplainAgent: Towards Video-based Multimodal Explanations for LLM Theorem Understanding
Max Ku
|
Cheuk Hei Chong
|
Jonathan Leung
|
Krish Shah
|
Alvin Yu
|
Wenhu Chen
Understanding domain-specific theorems often requires more than just text-based reasoning; effective communication through structured visual explanations is crucial for deeper comprehension. While large language models (LLMs) demonstrate strong performance in text-based theorem reasoning, their ability to generate coherent and pedagogically meaningful visual explanations remains an open challenge. In this work, we introduce TheoremExplainAgent, an agentic approach for generating long-form theorem explanation videos (over 5 minutes) using Manim animations. To systematically evaluate multimodal theorem explanations, we propose TheoremExplainBench, a benchmark covering 240 theorems across multiple STEM disciplines, along with 5 automated evaluation metrics. Our results reveal that agentic planning is essential for generating detailed long-form videos, and the o3-mini agent achieves a success rate of 93.8% and an overall score of 0.77. However, our quantitative and qualitative studies show that most of the videos produced exhibit minor issues with visual element layout. Furthermore, multimodal explanations expose deeper reasoning flaws that text-based explanations fail to reveal, highlighting the importance of multimodal explanations.
pdf
bib
abs
FineReason: Evaluating and Improving LLMs’ Deliberate Reasoning through Reflective Puzzle Solving
Guizhen Chen
|
Weiwen Xu
|
Hao Zhang
|
Hou Pong Chan
|
Chaoqun Liu
|
Lidong Bing
|
Deli Zhao
|
Anh Tuan Luu
|
Yu Rong
Many challenging reasoning tasks require not just rapid, intuitive responses, but a more deliberate, multi-step approach. Recent progress in large language models (LLMs) highlights an important shift from the “System 1” way of quick reactions to the “System 2” style of reflection-and-correction problem solving. However, current benchmarks heavily rely on the final-answer accuracy, leaving much of a model’s intermediate reasoning steps unexamined. This fails to assess the model’s ability to reflect and rectify mistakes within the reasoning process. To bridge this gap, we introduce FINEREASON, a logic-puzzle benchmark for systematic evaluation of LLMs’ reasoning capabilities. Each puzzle can be decomposed into atomic steps, making it ideal for rigorous validation of intermediate correctness. Building on this, we introduce two tasks: state checking and state transition, for a comprehensive evaluation of how models assess the current situation and plan the next move. To support broader research, we also provide a puzzle training set aimed at enhancing general reasoning. We show that models trained on our state checking and transition data demonstrate gains in mathematical reasoning by up to 5.1%.
pdf
bib
abs
The TIP of the Iceberg: Revealing a Hidden Class of Task-in-Prompt Adversarial Attacks on LLMs
Sergey Berezin
|
Reza Farahbakhsh
|
Noel Crespi
We present a novel class of jailbreak adversarial attacks on LLMs, termed Task-in-Prompt (TIP) attacks. Our approach embeds sequence-to-sequence tasks (e.g., cipher decoding, riddles, code execution) into the model’s prompt to indirectly generate prohibited inputs. To systematically assess the effectiveness of these attacks, we introduce the PHRYGE benchmark. We demonstrate that our techniques successfully circumvent safeguards in six state-of-the-art language models, including GPT-4o and LLaMA 3.2. Our findings highlight critical weaknesses in current LLM safety alignment and underscore the urgent need for more sophisticated defence strategies.
pdf
bib
abs
Identifying Reliable Evaluation Metrics for Scientific Text Revision
Leane Jourdan
|
Nicolas Hernandez
|
Florian Boudin
|
Richard Dufour
Evaluating text revision in scientific writing remains a challenge, as traditional metrics such as ROUGE and BERTScore primarily focus on similarity rather than capturing meaningful improvements. In this work, we analyse and identify the limitations of these metrics and explore alternative evaluation methods that better align with human judgments. We first conduct a manual annotation study to assess the quality of different revisions. Then, we investigate reference-free evaluation metrics from related NLP domains. Additionally, we examine LLM-as-a-judge approaches, analysing their ability to assess revisions with and without a gold reference. Our results show that LLMs effectively assess instruction-following but struggle with correctness, while domain-specific metrics provide complementary insights. We find that a hybrid approach combining LLM-as-a-judge evaluation and task-specific metrics offers the most reliable assessment of revision.
pdf
bib
abs
Can Language Models Reason about Individualistic Human Values and Preferences?
Liwei Jiang
|
Taylor Sorensen
|
Sydney Levine
|
Yejin Choi
Recent calls for pluralistic alignment emphasize that AI systems should address the diverse needs of all people. Yet, efforts in this space often require sorting people into fixed buckets of pre-specified diversity-defining dimensions (e.g., demographics), risking smoothing out individualistic variations or even stereotyping. To achieve an authentic representation of diversity that respects individuality, we propose individualistic alignment. While individualistic alignment can take various forms, in this paper, we introduce IndieValueCatalog, a dataset transformed from the influential World Values Survey (WVS), to study language models (LMs) on the specific challenge of individualistic value reasoning. Given a sample of an individual’s value-expressing statements, models are tasked with predicting their value judgments in novel cases. With IndieValueCatalog, we reveal critical limitations in frontier LMs’ abilities to predict individualistic values, with accuracies ranging only between 55% and 65%. Moreover, our results highlight that a precise description of individualistic values cannot be approximated only via demographic information. Finally, we train a series of IndieValueReasoners to reveal new patterns and dynamics in global human values.
pdf
bib
abs
BERT-like Models for Slavic Morpheme Segmentation
Dmitry Morozov
|
Lizaveta Astapenka
|
Anna Glazkova
|
Timur Garipov
|
Olga Lyashevskaya
Automatic morpheme segmentation algorithms are applicable in various tasks, such as building tokenizers and language education. For Slavic languages, the development of such algorithms is complicated by the rich derivational capabilities of these languages. Previous research has shown that, on average, these algorithms have already reached expert-level quality. However, a key unresolved issue is the significant decline in performance when segmenting words containing roots not present in the training data. This problem can be partially addressed by using pre-trained language models to better account for word semantics. In this work, we explored the possibility of fine-tuning BERT-like models for morpheme segmentation using data from Belarusian, Czech, and Russian. We found that for Czech and Russian, our models outperform all previously proposed approaches, achieving word-level accuracy of 92.5-95.1%. For Belarusian, this task was addressed for the first time. The best-performing approach for Belarusian was an ensemble of convolutional neural networks with word-level accuracy of 90.45%.
pdf
bib
abs
Turning Trash into Treasure: Accelerating Inference of Large Language Models with Token Recycling
Xianzhen Luo
|
Yixuan Wang
|
Qingfu Zhu
|
Zhiming Zhang
|
Xuanyu Zhang
|
Qing Yang
|
Dongliang Xu
The rapid growth in the parameters of LLMs has made inference latency a fundamental bottleneck. Speculative decoding represents a lossless approach to accelerate inference through a guess-and-verify paradigm. Some methods rely on additional architectures to guess draft tokens, which need extra training before use. Alternatively, retrieval-based train-free techniques build libraries from pre-existing corpora or by n-gram generation. However, they face challenges like large storage requirements, time-consuming retrieval, and limited adaptability. Observing that candidate tokens generated during the decoding process are likely to reoccur in future sequences, we propose Token Recycling. This approach stores candidate tokens in an adjacency matrix and employs a breadth-first-search (BFS)-like algorithm to construct a draft tree, which is then validated through tree attention. New candidate tokens from the decoding process are then used to update the matrix. Token Recycling requires <2MB of additional storage and achieves approximately 2x speedup across all sizes of LLMs. It significantly outperforms existing train-free methods by 30% and even a training method by 25%.
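As a rough illustration of the store-and-expand idea in this abstract, the following minimal Python sketch keeps, for each token, a short list of candidate next tokens and expands a draft tree breadth-first from the last accepted token. The dictionary stands in for the adjacency matrix, and the names `update_matrix`, `build_draft_tree`, `depth`, and `width` are hypothetical, not taken from the paper. Verification of the tree with tree attention is not shown.

```python
from collections import deque


def update_matrix(matrix, token, candidates):
    """Overwrite the stored candidate next-tokens for `token` (assumed update rule)."""
    matrix[token] = list(candidates)


def build_draft_tree(matrix, root, depth, width):
    """Expand a draft tree breadth-first from `root`.

    Each node's children are the first `width` stored candidates of its token.
    Returns a list of (node_id, parent_id, token); the root has parent_id -1.
    """
    nodes = [(0, -1, root)]
    queue = deque([(0, root, 0)])  # (node_id, token, level)
    while queue:
        node_id, token, level = queue.popleft()
        if level == depth:
            continue  # do not expand below the maximum depth
        for cand in matrix.get(token, [])[:width]:
            child_id = len(nodes)
            nodes.append((child_id, node_id, cand))
            queue.append((child_id, cand, level + 1))
    return nodes


# Toy token ids; in practice these would come from earlier decoding steps.
matrix = {}
update_matrix(matrix, 1, [2, 3])
update_matrix(matrix, 2, [4])
update_matrix(matrix, 3, [5, 6])
tree = build_draft_tree(matrix, root=1, depth=2, width=2)
```

In this sketch the tree costs no model calls to build, which is the property the abstract relies on: candidates that were already computed during decoding are reused as drafts.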
pdf
bib
abs
Unlocking General Long Chain-of-Thought Reasoning Capabilities of Large Language Models via Representation Engineering
Xinyu Tang
|
Xiaolei Wang
|
Zhihao Lv
|
Yingqian Min
|
Xin Zhao
|
Binbin Hu
|
Ziqi Liu
|
Zhiqiang Zhang
Recent advancements in long chain-of-thoughts (long CoTs) have significantly improved the reasoning capabilities of large language models (LLMs). Existing work finds that the capability of long CoT reasoning can be efficiently elicited by tuning on only a few examples and can easily transfer to other tasks. This motivates us to investigate whether long CoT reasoning is a general capability for LLMs. In this work, we conduct an empirical analysis for this question from the perspective of representation. We find that LLMs do encode long CoT reasoning as a general capability, with a clear distinction from vanilla CoTs. Furthermore, domain-specific representations are also required for the effective transfer of long CoT reasoning. Inspired by these findings, we propose GLORE, a novel representation engineering method to unleash the general long CoT reasoning capabilities of LLMs. Extensive experiments demonstrate the effectiveness and efficiency of GLORE in both in-domain and cross-domain scenarios. The code is available at https://github.com/txy77/GLoRE.
pdf
bib
abs
Drift: Enhancing LLM Faithfulness in Rationale Generation via Dual-Reward Probabilistic Inference
Jiazheng Li
|
Hanqi Yan
|
Yulan He
As Large Language Models (LLMs) are increasingly applied to complex reasoning tasks, achieving both accurate task performance and faithful explanations becomes crucial. However, LLMs often generate unfaithful explanations, partly because they do not consistently adhere closely to the provided context. Existing approaches to this problem either rely on superficial calibration methods, such as decomposed Chain-of-Thought prompting, or require costly retraining to improve model faithfulness. In this work, we propose a probabilistic inference paradigm that leverages task-specific and lookahead rewards to ensure that LLM-generated rationales are more faithful to model decisions and align better with input context. These rewards are derived from a domain-specific proposal distribution, allowing for optimized sequential Monte Carlo approximations. Our evaluations across three different reasoning tasks show that this method, which allows for controllable generation during inference, improves both accuracy and faithfulness of LLMs. This method offers a promising path towards making LLMs more reliable for reasoning tasks without sacrificing performance.
pdf
bib
abs
Fairness through Difference Awareness: Measuring Desired Group Discrimination in LLMs
Angelina Wang
|
Michelle Phan
|
Daniel E. Ho
|
Sanmi Koyejo
Algorithmic fairness has conventionally adopted the mathematically convenient perspective of racial color-blindness (i.e., difference unaware treatment). However, we contend that in a range of important settings, group difference awareness matters. For example, differentiating between groups may be necessary in legal contexts (e.g., the U.S. compulsory draft applies to men but not women) and harm assessments (e.g., referring to girls as “terrorists” may be less harmful than referring to Muslim people as such). Thus, in contrast to most fairness work, we study fairness through the perspective of treating people differently — when it is contextually appropriate to. We first introduce an important distinction between descriptive (fact-based), normative (value-based), and correlation (association-based) benchmarks. This distinction is significant because each category requires separate interpretation and mitigation tailored to its specific characteristics. Then, we present a benchmark suite composed of eight different scenarios for a total of 16k questions that enables us to assess difference awareness. Finally, we show results across ten models that demonstrate difference awareness is a distinct dimension to fairness where existing bias mitigation strategies may backfire.
pdf
bib
abs
MergePrint: Merge-Resistant Fingerprints for Robust Black-box Ownership Verification of Large Language Models
Shojiro Yamabe
|
Futa Kai Waseda
|
Tsubasa Takahashi
|
Koki Wataoka
Protecting the intellectual property of Large Language Models (LLMs) has become increasingly critical due to the high cost of training. Model merging, which integrates multiple expert models into a single multi-task model, introduces a novel risk of unauthorized use of LLMs due to its efficient merging process. While fingerprinting techniques have been proposed for verifying model ownership, their resistance to model merging remains unexplored. To address this gap, we propose a novel fingerprinting method, MergePrint, which embeds robust fingerprints capable of surviving model merging. MergePrint enables black-box ownership verification, where owners only need to check if a model produces target outputs for specific fingerprint inputs, without accessing model weights or intermediate outputs. By optimizing against a pseudo-merged model that simulates merged behavior, MergePrint ensures fingerprints that remain detectable after merging. Additionally, to minimize performance degradation, we pre-optimize the fingerprint inputs. MergePrint pioneers a practical solution for black-box ownership verification, protecting LLMs from misappropriation via merging, while also excelling in resistance to broader model theft threats.
pdf
bib
abs
Dynamic Scaling of Unit Tests for Code Reward Modeling
Zeyao Ma
|
Xiaokang Zhang
|
Jing Zhang
|
Jifan Yu
|
Sijia Luo
|
Jie Tang
Current large language models (LLMs) often struggle to produce accurate responses on the first attempt for complex reasoning tasks like code generation. Prior research tackles this challenge by generating multiple candidate solutions and validating them with LLM-generated unit tests. The execution results of unit tests serve as reward signals to identify correct solutions. Because LLMs confidently make mistakes, these unit tests are not reliable, thereby diminishing the quality of reward signals. Motivated by the observation that scaling the number of solutions improves LLM performance, we explore the impact of scaling unit tests to enhance reward signal quality. Our pioneer experiment reveals a positive correlation between the number of unit tests and reward signal quality, with greater benefits observed in more challenging problems. Based on these insights, we propose CodeRM-8B, a lightweight yet effective unit test generator that enables efficient and high-quality unit test scaling. Additionally, we implement a dynamic scaling mechanism that adapts the number of unit tests based on problem difficulty, further improving efficiency. Experimental results show that our approach significantly improves performance across various models on three benchmarks (e.g., with gains of 18.43 for Llama3-8B and 3.42 for GPT-4o-mini on HumanEval Plus). The parameters of CodeRM-8B and corresponding training data will be available upon publication.
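The use of unit-test execution results as a reward signal can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the paper's pipeline: the candidate solutions and tests are hand-written stand-ins for LLM-generated ones, and `reward` and `select_best` are hypothetical helper names. It only shows why adding tests can separate candidates that a smaller test set scores equally.

```python
def reward(solution, tests):
    """Count how many (args, expected) unit tests the candidate passes."""
    passed = 0
    for args, expected in tests:
        try:
            if solution(*args) == expected:
                passed += 1
        except Exception:
            pass  # a crashing candidate simply fails this test
    return passed


def select_best(solutions, tests):
    """Pick the candidate with the highest reward (first one wins ties)."""
    return max(solutions, key=lambda s: reward(s, tests))


# Three hypothetical candidate implementations of integer addition.
def add_ok(a, b):
    return a + b


def add_bug(a, b):
    return a - b


def add_partial(a, b):
    return a + b if a >= 0 else 0


tests = [((1, 2), 3), ((0, 0), 0), ((-1, 1), 0), ((-2, -3), -5)]

# With only the first three tests, add_partial and add_ok are tied.
few = tests[:3]
tied = reward(add_partial, few) == reward(add_ok, few)

# The fourth test breaks the tie in favor of the correct candidate.
best = select_best([add_partial, add_ok, add_bug], tests)
```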
pdf
bib
abs
UniConv: Unifying Retrieval and Response Generation for Large Language Models in Conversations
Fengran Mo
|
Yifan Gao
|
Chuan Meng
|
Xin Liu
|
Zhuofeng Wu
|
Kelong Mao
|
Zhengyang Wang
|
Pei Chen
|
Zheng Li
|
Xian Li
|
Bing Yin
|
Meng Jiang
The rapid advancement of conversational search systems revolutionizes how information is accessed by enabling multi-turn interaction between the user and the system. Existing conversational search systems are usually built with two different models. This separation prevents the system from leveraging the intrinsic knowledge of both models simultaneously, so there is no guarantee that effective retrieval benefits generation. Existing studies on unified models do not fully address the aspects of understanding conversational context, managing retrieval independently, and generating responses. In this paper, we explore how to unify dense retrieval and response generation for large language models in conversation. We conduct joint fine-tuning with different objectives and design two mechanisms to reduce the inconsistency risks while mitigating data discrepancy. The evaluations on five conversational search datasets demonstrate that our unified model can mutually improve both tasks and outperform the existing baselines.
pdf
bib
abs
Tracking Life’s Ups and Downs: Mining Life Events from Social Media Posts for Mental Health Analysis
Minghao Lv
|
Siyuan Chen
|
Haoan Jin
|
Minghao Yuan
|
Qianqian Ju
|
Yujia Peng
|
Kenny Q. Zhu
|
Mengyue Wu
Social media platforms possess considerable potential in the realm of exploring mental health. Previous research has indicated that major life events can greatly impact individuals’ mental health. However, due to the complex and ambiguous nature of life events, shedding light on them in social media data is quite challenging. In this paper, we are dedicated to uncovering life events mentioned in posts on social media. We provide a carefully annotated social media event dataset, PsyEvent, which encompasses 12 major life event categories that are likely to occur in everyday life. This dataset is human-annotated under an iterative procedure and boasts a high level of quality. Furthermore, by applying the life events extracted from posts to downstream tasks such as early risk detection of depression and suicide risk prediction, we have observed a considerable improvement in performance. This suggests that extracting life events from social media can be beneficial for the analysis of individuals’ mental health.
pdf
bib
abs
ControlSpeech: Towards Simultaneous and Independent Zero-shot Speaker Cloning and Zero-shot Language Style Control
Shengpeng Ji
|
Qian Chen
|
Wen Wang
|
Jialong Zuo
|
Minghui Fang
|
Ziyue Jiang
|
Hai Huang
|
Zehan Wang
|
Xize Cheng
|
Siqi Zheng
|
Zhou Zhao
In this paper, we present ControlSpeech, a text-to-speech (TTS) system capable of fully cloning the speaker’s voice and enabling arbitrary control and adjustment of speaking style. Prior zero-shot TTS models only mimic the speaker’s voice without further control and adjustment capabilities while prior controllable TTS models cannot perform speaker-specific voice generation. Therefore, ControlSpeech focuses on a more challenging task—a TTS system with controllable timbre, content, and style at the same time. ControlSpeech takes speech prompts, content prompts, and style prompts as inputs and utilizes bidirectional attention and mask-based parallel decoding to capture codec representations corresponding to timbre, content, and style in a discrete decoupling codec space. Moreover, we analyze the many-to-many issue in textual style control and propose the Style Mixture Semantic Density (SMSD) module, which is based on Gaussian mixture density networks, to resolve this problem. To facilitate empirical validations, we make available a new style controllable dataset called VccmDataset. Our experimental results demonstrate that ControlSpeech exhibits comparable or state-of-the-art (SOTA) performance in terms of controllability, timbre similarity, audio quality, robustness, and generalizability. Codes are available at https://github.com/jishengpeng/ControlSpeech.
pdf
bib
abs
PIC: Unlocking Long-Form Text Generation Capabilities of Large Language Models via Position ID Compression
Haoran Que
|
Wenge Rong
Long-context understanding is crucial for large language models (LLMs) and has become a fundamental capability for most LLMs. However, beyond the focus on “input-long”, the ability to “output-long” is equally significant, yet it remains underexplored. To address this limitation, we propose a simple, efficient, and plug-in approach, Position ID Compression (PIC), to unlock the long-form text generation potential of LLMs. The idea is straightforward: by compressing the position ids of the context, we provoke and guide LLMs to generate coherent and longer output. Specifically, we find that directly reducing the position ids by a fixed ratio significantly impacts the generation quality. To mitigate this, we propose two variants of PIC: NTK-aware PIC and Dynamic PIC. Without additional training, both methods enable LLMs to extend their generation length by approximately 1.5 times without compromising generation quality. Furthermore, by integrating supervised fine-tuning (SFT) with PIC, we propose PIC-SFT, which further improves LLMs’ long-form text generation capabilities, achieving top performance on HelloBench and LongBench-Write. Extensive experiments demonstrate the effectiveness of our approach.
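To make the notion of compressing position ids concrete, here is a minimal Python sketch of the naive fixed-ratio variant that the abstract reports as hurting generation quality. The integer-division mapping, the function name `compress_position_ids`, and the choice to let generated tokens continue from the last compressed id are all assumptions made for illustration; the NTK-aware and Dynamic variants are not reproduced.

```python
def compress_position_ids(context_len, new_len, ratio):
    """Naive fixed-ratio compression (assumed form): context token i gets id i // ratio.

    Generated tokens continue with consecutive ids after the compressed context.
    `ratio` must be a positive integer.
    """
    if ratio < 1:
        raise ValueError("ratio must be a positive integer")
    context_ids = [i // ratio for i in range(context_len)]
    start = context_ids[-1] + 1 if context_ids else 0
    new_ids = list(range(start, start + new_len))
    return context_ids + new_ids


# An 8-token context compressed by 2, followed by 3 generated tokens.
ids = compress_position_ids(context_len=8, new_len=3, ratio=2)
# With ratio=1 the ids are the usual 0, 1, 2, ...
plain = compress_position_ids(context_len=4, new_len=2, ratio=1)
```

Under this mapping the model sees the context as occupying fewer positions than it really does, which is the intuition the abstract gives for provoking longer output.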
pdf
bib
abs
Towards Effective Extraction and Evaluation of Factual Claims
Dasha Metropolitansky
|
Jonathan Larson
A common strategy for fact-checking long-form content generated by Large Language Models (LLMs) is extracting simple claims that can be verified independently. Since inaccurate or incomplete claims compromise fact-checking results, ensuring claim quality is critical. However, the lack of a standardized evaluation framework impedes assessment and comparison of claim extraction methods. To address this gap, we propose a framework for evaluating claim extraction in the context of fact-checking along with automated, scalable, and replicable methods for applying this framework, including novel approaches for measuring coverage and decontextualization. We also introduce Claimify, an LLM-based claim extraction method, and demonstrate that it outperforms existing methods under our evaluation framework. A key feature of Claimify is its ability to handle ambiguity and extract claims only when there is high confidence in the correct interpretation of the source text.
pdf
bib
abs
Beyond Facts: Evaluating Intent Hallucination in Large Language Models
Yijie Hao
|
Haofei Yu
|
Jiaxuan You
When exposed to complex queries containing multiple conditions, today’s large language models (LLMs) tend to produce responses that only partially satisfy the query while neglecting certain conditions. We, therefore, introduce the concept of Intent Hallucination, a phenomenon where LLMs either omit (failing to address certain parts) or misinterpret (responding to invented query parts) elements of the given query, leading to responses misaligned with the original query. To systematically evaluate intent hallucination, we introduce FAITHQA, a novel benchmark for intent hallucination that contains 20,068 problems, covering both query-only and retrieval-augmented generation (RAG) setups with varying topics and difficulty. FAITHQA is the first hallucination benchmark that goes beyond factual verification, tailored to identify the fundamental cause of intent hallucination. By evaluating various LLMs on FAITHQA, we find that (1) intent hallucination is a common issue even for state-of-the-art models, and (2) such a phenomenon stems from omission or misinterpretation of LLMs. To facilitate future research, we introduce an automatic LLM generation evaluation metric, named INTENT CONSTRAINT, for detecting intent hallucination. Human evaluation results demonstrate that INTENT CONSTRAINT is closer to human performance for intent hallucination compared to baselines.
pdf
bib
abs
A Systematic Study of Compositional Syntactic Transformer Language Models
Yida Zhao
|
Hao Xve
|
Xiang Hu
|
Kewei Tu
Syntactic language models (SLMs) enhance Transformers by incorporating syntactic biases through the modeling of linearized syntactic parse trees alongside surface sentences. This paper focuses on compositional SLMs that are based on constituency parse trees and contain explicit bottom-up composition of constituent representations. We identify key aspects of design choices in existing compositional SLMs and propose a unified framework encompassing both existing models and novel variants. We conduct a comprehensive empirical evaluation of all the variants in our framework across language modeling, syntactic generalization, summarization, and inference efficiency. Based on the experimental results, we make multiple recommendations on the design of compositional SLMs. Our code is released at https://github.com/zhaoyd1/compositional_SLMs.
pdf
bib
abs
M-MAD: Multidimensional Multi-Agent Debate for Advanced Machine Translation Evaluation
Zhaopeng Feng
|
Jiayuan Su
|
Jiamei Zheng
|
Jiahan Ren
|
Yan Zhang
|
Jian Wu
|
Hongwei Wang
|
Zuozhu Liu
Recent advancements in large language models (LLMs) have given rise to the LLM-as-a-judge paradigm, showcasing their potential to deliver human-like judgments. However, in the field of machine translation (MT) evaluation, current LLM-as-a-judge methods fall short of learned automatic metrics. In this paper, we propose Multidimensional Multi-Agent Debate (M-MAD), a systematic LLM-based multi-agent framework for advanced LLM-as-a-judge MT evaluation. Our findings demonstrate that M-MAD achieves significant advancements by (1) decoupling heuristic MQM criteria into distinct evaluation dimensions for fine-grained assessments; (2) employing multi-agent debates to harness the collaborative reasoning capabilities of LLMs; and (3) synthesizing dimension-specific results into a final evaluation judgment to ensure robust and reliable outcomes. Comprehensive experiments show that M-MAD not only outperforms all existing LLM-as-a-judge methods but also competes with state-of-the-art reference-based automatic metrics, even when powered by a suboptimal model like GPT-4o mini. Detailed ablations and analysis highlight the superiority of our framework design, offering a fresh perspective on the LLM-as-a-judge paradigm. Our code and data are publicly available at https://github.com/SU-JIAYUAN/M-MAD.
pdf
bib
abs
SongComposer: A Large Language Model for Lyric and Melody Generation in Song Composition
Shuangrui Ding
|
Zihan Liu
|
Xiaoyi Dong
|
Pan Zhang
|
Rui Qian
|
Junhao Huang
|
Conghui He
|
Dahua Lin
|
Jiaqi Wang
Creating lyrics and melodies for the vocal track in a symbolic format, known as song composition, demands expert musical knowledge of melody, an advanced understanding of lyrics, and precise alignment between them. Despite achievements in sub-tasks such as lyric generation, lyric-to-melody, and melody-to-lyric generation, a unified model for song composition has not yet been achieved. In this paper, we introduce SongComposer, a pioneering step towards a unified song composition model that can readily create symbolic lyrics and melodies following instructions. SongComposer is a music-specialized large language model (LLM) that, for the first time, integrates the capability of simultaneously composing lyrics and melodies into LLMs by leveraging three key innovations: 1) a flexible tuple format for word-level alignment of lyrics and melodies, 2) an extended tokenizer vocabulary for song notes, with scalar initialization based on musical knowledge to capture rhythm, and 3) a multi-stage pipeline that captures musical structure, starting with motif-level melody patterns and progressing to phrase-level structure for improved coherence. Extensive experiments demonstrate that SongComposer outperforms advanced LLMs, including GPT-4, in tasks such as lyric-to-melody generation, melody-to-lyric generation, song continuation, and text-to-song creation. Moreover, we will release SongCompose, a large-scale dataset for training, containing paired lyrics and melodies in Chinese and English.
pdf
bib
abs
Personalized Text Generation with Contrastive Activation Steering
Jinghao Zhang
|
Yuting Liu
|
Wenjie Wang
|
Qiang Liu
|
Shu Wu
|
Liang Wang
|
Tat-Seng Chua
Personalized text generation aims to infer users’ writing style preferences from their historical texts and generate outputs that faithfully reflect these stylistic characteristics. Existing solutions primarily adopt two paradigms: retrieval-augmented generation (RAG) and parameter-efficient fine-tuning (PEFT). While these approaches have advanced the field, they suffer from two critical limitations: (1) the entanglement of content semantics and stylistic patterns in historical texts impedes accurate modeling of user-specific writing preferences; and (2) scalability challenges arise from both RAG’s inference latency caused by retrieval operations and PEFT’s parameter storage requirements for per-user models. To overcome these limitations, we propose StyleVector, a training-free framework that disentangles and represents personalized writing style as a vector in the LLM’s activation space, enabling style-steered generation during inference without requiring costly retrieval or parameter storage. Comprehensive experiments demonstrate that our framework achieves a significant 8% relative improvement in personalized generation while reducing storage requirements by 1700× over PEFT methods.
pdf
bib
abs
Gumbel Reranking: Differentiable End-to-End Reranker Optimization
Siyuan Huang
|
Zhiyuan Ma
|
Jintao Du
|
Changhua Meng
|
Weiqiang Wang
|
Jingwen Leng
|
Minyi Guo
|
Zhouhan Lin
RAG systems rely on rerankers to identify relevant documents. However, fine-tuning these models remains challenging due to the scarcity of annotated query-document pairs. Existing distillation-based approaches suffer from training-inference misalignment and fail to capture interdependencies among candidate documents. To overcome these limitations, we reframe the reranking process as an attention-mask problem and propose Gumbel Reranking, an end-to-end training framework for rerankers aimed at minimizing the training-inference gap. In our approach, reranker optimization is reformulated as learning a stochastic, document-wise Top-k attention mask using the Gumbel Trick and Relaxed Top-k Sampling. This formulation enables end-to-end optimization by minimizing the overall language loss. Experiments across various settings consistently demonstrate performance gains, including a 10.4% improvement in recall on HotpotQA for distinguishing indirectly relevant documents.
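The combination of the Gumbel trick with relaxed top-k sampling can be sketched in plain Python as follows. This is a generic rendering of relaxed subset sampling (Gumbel-perturbed scores followed by k successive softmaxes that down-weight already selected items), offered as an assumption about the kind of procedure the abstract refers to, not as the authors' implementation. The names `relaxed_top_k` and `tau` are hypothetical, and no gradients are computed here; an autodiff framework would be needed to train a reranker through the resulting mask.

```python
import math
import random


def sample_gumbel(rng):
    """Draw one standard Gumbel sample via -log(-log(u))."""
    u = max(rng.random(), 1e-12)  # avoid log(0)
    return -math.log(-math.log(u))


def softmax(xs, tau):
    """Numerically stable softmax with temperature tau."""
    m = max(xs)
    exps = [math.exp((x - m) / tau) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]


def relaxed_top_k(scores, k, tau, rng):
    """Return a soft document-wise mask whose entries sum to k.

    Scores are perturbed with Gumbel noise; then k softmax rounds are summed,
    each round suppressing items that earlier rounds already selected.
    """
    keys = [s + sample_gumbel(rng) for s in scores]
    mask = [0.0] * len(scores)
    for _ in range(k):
        probs = softmax(keys, tau)
        mask = [m + p for m, p in zip(mask, probs)]
        # Down-weight items in proportion to how strongly they were just chosen.
        keys = [a + math.log(max(1.0 - p, 1e-12)) for a, p in zip(keys, probs)]
    return mask


rng = random.Random(0)
scores = [2.0, 0.5, 1.0, -1.0]  # toy reranker scores for four documents
mask = relaxed_top_k(scores, k=2, tau=0.5, rng=rng)
```

Lower values of `tau` push the mask toward a hard selection of k documents, while higher values spread weight more evenly across candidates.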
pdf
bib
abs
Hybrid Preferences: Learning to Route Instances for Human vs. AI Feedback
Lester James Validad Miranda
|
Yizhong Wang
|
Yanai Elazar
|
Sachin Kumar
|
Valentina Pyatkin
|
Faeze Brahman
|
Noah A. Smith
|
Hannaneh Hajishirzi
|
Pradeep Dasigi
Learning from human feedback has enabled the alignment of language models (LMs) with human preferences. However, collecting human preferences is expensive and time-consuming, with highly variable annotation quality. An appealing alternative is to distill preferences from LMs as a source of synthetic annotations, which is cost-effective and scalable, albeit susceptible to other biases and errors. In this work, we introduce HyPER, a Hybrid Preference routER that defers an annotation to either humans or LMs, achieving better annotation quality while reducing the cost of human-only annotation. We formulate this as an optimization problem: given a preference dataset and an evaluation metric, we (1) train a performance prediction model (PPM) to predict a reward model’s (RM) performance on an arbitrary combination of human and LM annotations and (2) employ a routing strategy that selects a combination that maximizes predicted performance. We train the PPM on MultiPref, a new preference dataset with 10K instances paired with human and LM labels. We show that the hybrid mixture of synthetic and direct human preferences selected by HyPER achieves 7-13% better RM performance on RewardBench than using either one exclusively, and generalizes across unseen preference datasets and other base models. We also observe the same trend in other benchmarks using Best-of-N reranking, where the hybrid mix has 2-3% better performance. Finally, we analyze features from HyPER and find that prompts with moderate safety concerns or complexity benefit the most from human feedback.
pdf
bib
abs
SEOE: A Scalable and Reliable Semantic Evaluation Framework for Open Domain Event Detection
Yi-Fan Lu
|
Xian-Ling Mao
|
Tian Lan
|
Tong Zhang
|
Yu-Shi Zhu
|
Heyan Huang
Automatic evaluation for Open Domain Event Detection (ODED) is highly challenging because ODED is characterized by a vast diversity of unconstrained output labels from various domains. Nearly all existing evaluation methods for ODED first construct evaluation benchmarks with limited labels and domain coverage, and then evaluate ODED methods using metrics based on token-level label matching rules. However, this kind of evaluation framework faces two issues: (1) the limited evaluation benchmarks are not representative of the real world, making it difficult to accurately reflect the performance of various ODED methods in real-world scenarios; (2) evaluation metrics based on token-level matching rules fail to capture semantic similarity between predictions and gold labels. To address these two problems, we propose a scalable and reliable Semantic-level Evaluation framework for Open domain Event detection (SEOE) by constructing a more representative evaluation benchmark and introducing a semantic evaluation metric. Specifically, our proposed framework first constructs a scalable evaluation benchmark that currently includes 564 event types covering 7 major domains, with a cost-effective supplementary annotation strategy to ensure the benchmark’s representativeness. The strategy also allows new event types and domains to be added in the future. Then, the proposed SEOE leverages large language models (LLMs) as automatic evaluation agents to compute a semantic F1-score, incorporating fine-grained definitions of semantically similar labels to enhance the reliability of the evaluation. Extensive experiments validate the representativeness of the benchmark and the reliability of the semantic evaluation metric. Existing ODED methods are thoroughly evaluated, and the error patterns of predictions are analyzed, revealing several insightful findings.
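To make the contrast between token-level matching and a semantic F1-score concrete, here is a small sketch in which the match decision is delegated to a pluggable judge function. In SEOE the judge would be an LLM; the synonym table used below is only a hypothetical stand-in, and the function names and example labels are assumptions rather than part of the framework.

```python
def semantic_f1(predicted, gold, is_match):
    """Precision, recall, and F1 where `is_match(pred, gold)` decides label equivalence."""
    matched_pred = sum(1 for p in predicted if any(is_match(p, g) for g in gold))
    matched_gold = sum(1 for g in gold if any(is_match(p, g) for p in predicted))
    precision = matched_pred / len(predicted) if predicted else 0.0
    recall = matched_gold / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0, 0.0, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Hypothetical stand-in for an LLM judge: a tiny table of equivalent event labels.
SYNONYMS = {frozenset({"assault", "attack"}), frozenset({"purchase", "buy"})}


def exact_judge(pred, gold):
    """Strict string matching, as in token-level evaluation."""
    return pred == gold


def semantic_judge(pred, gold):
    """Accepts exact matches and label pairs listed as semantically equivalent."""
    return pred == gold or frozenset({pred, gold}) in SYNONYMS


predicted = ["assault", "purchase", "earthquake"]
gold = ["attack", "buy"]

strict_scores = semantic_f1(predicted, gold, exact_judge)
semantic_scores = semantic_f1(predicted, gold, semantic_judge)
```

With strict matching none of the predictions count, while the semantic judge credits two of the three predictions and both gold labels, giving precision 2/3, recall 1, and F1 0.8 on this toy example.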
pdf
bib
abs
The UD-NewsCrawl Treebank: Reflections and Challenges from a Large-scale Tagalog Syntactic Annotation Project
Angelina Aspra Aquino
|
Lester James Validad Miranda
|
Elsie Marie T. Or
This paper presents UD-NewsCrawl, the largest Tagalog treebank to date, containing 15.6k trees manually annotated according to the Universal Dependencies framework. We detail our treebank development process, including data collection, pre-processing, manual annotation, and quality assurance procedures. We provide baseline evaluations using multiple transformer-based models to assess the performance of state-of-the-art dependency parsers on Tagalog. We also highlight challenges in the syntactic analysis of Tagalog given its distinctive grammatical properties, and discuss its implications for the annotation of this treebank. We anticipate that UD-NewsCrawl and our baseline model implementations will serve as valuable resources for advancing computational linguistics research in underrepresented languages like Tagalog.
pdf
bib
abs
DRAG: Distilling RAG for SLMs from LLMs to Transfer Knowledge and Mitigate Hallucination via Evidence and Graph-based Distillation
Jennifer Chen
|
Aidar Myrzakhan
|
Yaxin Luo
|
Hassaan Muhammad Khan
|
Sondos Mahmoud Bsharat
|
Zhiqiang Shen
Retrieval-Augmented Generation (RAG) methods have proven highly effective for tasks requiring factual consistency and robust knowledge retrieval. However, large-scale RAG systems consume significant computational resources and are prone to generating “hallucinated” content from Humans. In this work, we introduce DRAG, a novel framework for distilling RAG knowledge from large-scale Language Models (LLMs) into small LMs (SLMs). Our approach leverages evidence- and knowledge graph–based distillation, ensuring that the distilled model retains critical factual knowledge while significantly reducing model size and computational cost. By aligning the smaller model’s predictions with a structured knowledge graph and ranked evidence, DRAG effectively mitigates hallucinations and improves factual accuracy. We further present a case demonstrating how our framework mitigates user privacy risks and introduce a corresponding benchmark. Experimental evaluations on multiple benchmarks demonstrate that our method outperforms the prior competitive RAG methods like MiniRAG for SLMs by up to 27.7% using the same models, preserving high-level efficiency and reliability. With DRAG, we provide a practical and resource-efficient roadmap to deploying enhanced retrieval and generation capabilities in small-size LLMs. Code is available at https://github.com/VILA-Lab/DRAG.
pdf
bib
abs
G-Safeguard: A Topology-Guided Security Lens and Treatment on LLM-based Multi-agent Systems
Shilong Wang
|
Guibin Zhang
|
Miao Yu
|
Guancheng Wan
|
Fanci Meng
|
Chongye Guo
|
Kun Wang
|
Yang Wang
Large Language Model (LLM)-based Multi-agent Systems (MAS) have demonstrated remarkable capabilities in various complex tasks, ranging from collaborative problem-solving to autonomous decision-making. However, as these systems become increasingly integrated into critical applications, their vulnerability to adversarial attacks, misinformation propagation, and unintended behaviors has raised significant concerns. To address this challenge, we introduce G-Safeguard, a topology-guided security lens and treatment for robust LLM-MAS, which leverages graph neural networks to detect anomalies on the multi-agent utterance graph and employs topological intervention for attack remediation. Extensive experiments demonstrate that G-Safeguard: (I) exhibits significant effectiveness under various attack strategies, recovering over 40% of the performance for prompt injection; (II) is highly adaptable to diverse LLM backbones and large-scale MAS; (III) can seamlessly combine with mainstream MAS with security guarantees.
pdf
bib
abs
Deontological Keyword Bias: The Impact of Modal Expressions on Normative Judgments of Language Models
Bumjin Park
|
Leejinsil Leejinsil
|
Jaesik Choi
Large language models (LLMs) are increasingly engaging in moral and ethical reasoning, where criteria for judgment are often unclear, even for humans. While LLM alignment studies cover many areas, one important yet underexplored area is how LLMs make judgments about obligations. This work reveals a strong tendency in LLMs to judge non-obligatory contexts as obligations when prompts are augmented with modal expressions such as must or ought to. We introduce this phenomenon as Deontological Keyword Bias (DKB). We find that LLMs judge over 90% of commonsense scenarios as obligations when modal expressions are present. This tendency is consistent across various LLM families, question types, and answer formats. To mitigate DKB, we propose a judgment strategy that integrates few-shot examples with reasoning prompts. This study sheds light on how modal expressions, as a form of linguistic framing, influence the normative decisions of LLMs and underscores the importance of addressing such biases to ensure judgment alignment.
pdf
bib
abs
LegalReasoner: Step-wised Verification-Correction for Legal Judgment Reasoning
Weijie Shi
|
Han Zhu
|
Jiaming Ji
|
Mengze Li
|
Jipeng Zhang
|
Ruiyuan Zhang
|
Jia Zhu
|
Jiajie Xu
|
Sirui Han
|
Yike Guo
Legal judgment prediction (LJP) aims to function as a judge by making final rulings based on case claims and facts, which plays a vital role in the judicial domain for supporting court decision-making and improving judicial efficiency. However, existing methods often struggle with logical errors when conducting complex legal reasoning. We propose LegalReasoner, which enhances LJP reliability through step-wise verification and correction of the reasoning process. Specifically, it first identifies dispute points to decompose complex cases, and then conducts step-wise reasoning while employing a process verifier to validate each step’s logic from correctness, progressiveness, and potential perspectives. When errors are detected, expert-designed attribution and resolution strategies are applied for correction. To fine-tune LegalReasoner, we release the LegalHK dataset, containing 58,130 Hong Kong court cases with detailed annotations of dispute points, step-by-step reasoning chains, and process verification labels. Experiments demonstrate that LegalReasoner significantly improves concordance with court decisions from 72.37 to 80.27 on LLAMA-3.1-70B. The data is available at https://huggingface.co/datasets/weijiezz/LegalHK.
pdf
bib
abs
Rolling the DICE on Idiomaticity: How LLMs Fail to Grasp Context
Maggie Mi
|
Aline Villavicencio
|
Nafise Sadat Moosavi
Human processing of idioms heavily depends on interpreting the surrounding context in which they appear. While large language models (LLMs) have achieved impressive performance on idiomaticity detection benchmarks, this success may be driven by reasoning shortcuts present in existing datasets. To address this, we introduce a novel, controlled contrastive dataset (DICE) specifically designed to assess whether LLMs can effectively leverage context to disambiguate idiomatic meanings. Furthermore, we investigate the influence of collocational frequency and sentence probability—proxies for human processing known to affect idiom resolution—on model performance. Our results show that LLMs frequently fail to resolve idiomaticity when it depends on contextual understanding, performing better on sentences deemed more likely by the model. Additionally, idiom frequency influences performance but does not guarantee accurate interpretation. Our findings emphasize the limitations of current models in grasping contextual meaning and highlight the need for more context-sensitive evaluation.
pdf
bib
abs
ChartCoder: Advancing Multimodal Large Language Model for Chart-to-Code Generation
Xuanle Zhao
|
Xianzhen Luo
|
Qi Shi
|
Chi Chen
|
Shuo Wang
|
Zhiyuan Liu
|
Maosong Sun
Multimodal Large Language Models (MLLMs) have demonstrated remarkable capabilities in chart understanding tasks. However, interpreting charts with textual descriptions often leads to information loss, as it fails to fully capture the dense information embedded in charts. In contrast, parsing charts into code provides lossless representations that can effectively contain all critical details. Although existing open-source MLLMs have achieved success in chart understanding tasks, they still face two major challenges when applied to chart-to-code tasks: (1) low executability and poor restoration of chart details in the generated code and (2) a lack of large-scale and diverse training data. To address these challenges, we propose ChartCoder, the first dedicated chart-to-code MLLM, which leverages Code LLMs as the language backbone to enhance the executability of the generated code. Furthermore, we introduce Chart2Code-160k, the first large-scale and diverse dataset for chart-to-code generation, and propose the Snippet-of-Thought (SoT) method, which transforms direct chart-to-code generation data into step-by-step generation. Experiments demonstrate that ChartCoder, with only 7B parameters, surpasses existing open-source MLLMs on chart-to-code benchmarks, achieving superior chart restoration and code executability. Our code is available at https://github.com/thunlp/ChartCoder.
pdf
bib
abs
The Cross-linguistic Role of Animacy in Grammar Structures
Nina Gregorio
|
Matteo Gay
|
Sharon Goldwater
|
Edoardo Ponti
Animacy is a semantic feature of nominals and follows a hierarchy: personal pronouns > human > animate > inanimate. In several languages, animacy imposes hard constraints on grammar. While it has been argued that these constraints may emerge from universal soft tendencies, it has been difficult to provide empirical evidence for this conjecture due to the lack of data annotated with animacy classes. In this work, we first propose a method to reliably classify animacy classes of nominals in 11 languages from 5 families, leveraging multilingual large language models (LLMs) and word sense disambiguation datasets. Then, through this newly acquired data, we verify that animacy displays consistent cross-linguistic tendencies in terms of preferred morphosyntactic constructions, although not always in line with received wisdom: animacy in nouns correlates with the alignment role of agent, early positions in a clause, and syntactic pivot (e.g., for relativisation), but not necessarily with grammatical subjecthood. Furthermore, the behaviour of personal pronouns in the hierarchy is idiosyncratic as they are rarely plural and relativised, contrary to high-animacy nouns.
pdf
bib
abs
LexGen: Domain-aware Multilingual Lexicon Generation
Ayush Maheshwari
|
Atul Kumar Singh
|
N J Karthika
|
Krishnakant Bhatt
|
Preethi Jyothi
|
Ganesh Ramakrishnan
Lexicon or dictionary generation across domains has the potential for societal impact, as it can enhance information accessibility for a diverse user base while preserving language identity. Prior work in the field primarily focuses on bilingual lexical induction, which deals with word alignments using mapping-based or corpora-based approaches. However, these approaches do not cater to domain-specific lexicon generation that consists of domain-specific terminology. This task becomes particularly important in specialized medical, engineering, and other technical domains, owing to the highly infrequent usage of the terms and the scarcity of data involving domain-specific terms, especially for low-resource languages. We propose a new model to generate dictionary words for 6 Indian languages in the multi-domain setting. Our model consists of domain-specific and domain-generic layers that encode information, and these layers are invoked via a learnable routing technique. We also release a new benchmark dataset consisting of >75K translation pairs across 6 Indian languages spanning 8 diverse domains. We conduct both zero-shot and few-shot experiments across multiple domains to show the efficacy of our proposed model in generalizing to unseen domains and unseen languages. Additionally, we perform a human post-hoc evaluation on unseen languages. The source code and dataset are available at https://github.com/Atulkmrsingh/lexgen.
pdf
bib
abs
How to Train Long-Context Language Models (Effectively)
Tianyu Gao
|
Alexander Wettig
|
Howard Yen
|
Danqi Chen
We study continued training and supervised fine-tuning (SFT) of a language model (LM) to make effective use of long-context information. We first establish a reliable evaluation protocol to guide model development—instead of perplexity or simple needle-in-a-haystack (NIAH) tests, we use a broad set of long-context downstream tasks, and we evaluate models after SFT as this better reveals long-context abilities. Supported by our robust evaluations, we run thorough experiments to decide the data mix for continued pre-training, the instruction tuning dataset, and many other design choices such as position extrapolation. We find that (1) code repositories and books are excellent sources of long data, but it is crucial to combine them with high-quality short-context data; (2) training with a sequence length beyond the evaluation length boosts long-context performance; (3) for SFT, using only short instruction datasets yields strong performance on long-context tasks. Our final model, ProLong-8B, which is initialized from Llama-3 and trained on 40B tokens, demonstrates state-of-the-art long-context performance among similarly sized models at a length of 128K. ProLong outperforms Llama-3.1-8B-Instruct on the majority of long-context tasks despite using only 5% as many tokens during long-context training. Additionally, ProLong can effectively process up to 512K tokens, one of the longest context windows of publicly available LMs.
pdf
bib
abs
MathFusion: Enhancing Mathematical Problem-solving of LLM through Instruction Fusion
Qizhi Pei
|
Lijun Wu
|
Zhuoshi Pan
|
Yu Li
|
Honglin Lin
|
Chenlin Ming
|
Xin Gao
|
Conghui He
|
Rui Yan
Large Language Models (LLMs) have shown impressive progress in mathematical reasoning. While data augmentation is promising to enhance mathematical problem-solving ability, current approaches are predominantly limited to instance-level modifications—such as rephrasing or generating syntactic variations—which fail to capture and leverage the intrinsic relational structures inherent in mathematical knowledge. Inspired by human learning processes, where mathematical proficiency develops through systematic exposure to interconnected concepts, we introduce MathFusion, a novel framework that enhances mathematical reasoning through cross-problem instruction synthesis. MathFusion implements this through three fusion strategies: (1) sequential fusion, which chains related problems to model solution dependencies; (2) parallel fusion, which combines analogous problems to reinforce conceptual understanding; and (3) conditional fusion, which creates context-aware selective problems to enhance reasoning flexibility. By applying these strategies, we generate a new dataset, MathFusionQA, followed by fine-tuning models (DeepSeekMath-7B, Mistral-7B, Llama3-8B) on it. Experimental results demonstrate that MathFusion achieves substantial improvements in mathematical reasoning while maintaining high data efficiency, boosting performance by 18.0 points in accuracy across diverse benchmarks while requiring only 45K additional synthetic instructions, representing a substantial improvement over traditional single-instruction approaches.
pdf
bib
abs
Mining Complex Patterns of Argumentative Reasoning in Natural Language Dialogue
Ramon Ruiz-Dolz
|
Zlata Kikteva
|
John Lawrence
Argumentation scheme mining is the task of automatically identifying reasoning mechanisms behind argument inferences. These mechanisms provide insights into underlying argument structures and guide the assessment of natural language arguments. Research on argumentation scheme mining, however, has always been limited by the scarcity of large enough publicly available corpora containing scheme annotations. In this paper, we present the first state-of-the-art results for mining argumentation schemes in natural language dialogue. For this purpose, we create QT-Schemes, a new corpus of 441 arguments annotated with 24 argumentation schemes. Using this corpus, we leverage the capabilities of LLMs and Transformer-based models, pre-training them on a large corpus containing textbook-like argumentation schemes and validating their applicability in real-world scenarios.
pdf
bib
abs
OS Agents: A Survey on MLLM-based Agents for Computer, Phone and Browser Use
Xueyu Hu
|
Tao Xiong
|
Biao Yi
|
Zishu Wei
|
Ruixuan Xiao
|
Yurun Chen
|
Jiasheng Ye
|
Meiling Tao
|
Xiangxin Zhou
|
Ziyu Zhao
|
Yuhuai Li
|
Shengze Xu
|
Shenzhi Wang
|
Xinchen Xu
|
Shuofei Qiao
|
Zhaokai Wang
|
Kun Kuang
|
Tieyong Zeng
|
Liang Wang
|
Jiwei Li
|
Yuchen Eleanor Jiang
|
Wangchunshu Zhou
|
Guoyin Wang
|
Keting Yin
|
Zhou Zhao
|
Hongxia Yang
|
Fan Wu
|
Shengyu Zhang
|
Fei Wu
The dream to create AI assistants as capable and versatile as the fictional J.A.R.V.I.S from Iron Man has long captivated imaginations. With the evolution of multi-modal large language models ((M)LLMs), this dream is closer to reality, as (M)LLM-based Agents using computers, mobile phones and web browsers by operating within the environments and interfaces (e.g., Graphical User Interface (GUI) and Command Line Interface (CLI)) provided by operating systems (OS) to automate tasks have significantly advanced. This paper presents a comprehensive survey on these advanced agents, designated as OS Agents. We begin by elucidating the fundamentals of OS Agents, exploring their key components and capabilities. We then examine methodologies for constructing OS Agents, focusing on domain-specific foundation models and agent frameworks. A detailed review of evaluation metrics and benchmarks highlights how OS Agents are assessed across diverse platforms and tasks. Finally, we discuss current challenges and identify promising directions for future research. An open-source GitHub repository is maintained as a dynamic resource to foster further innovation in this field.
pdf
bib
abs
Data Quality Issues in Multilingual Speech Datasets: The Need for Sociolinguistic Awareness and Proactive Language Planning
Mingfei Lau
|
Qian Chen
|
Yeming Fang
|
Tingting Xu
|
Tongzhou Chen
|
Pavel Golik
Our quality audit for three widely used public multilingual speech datasets Mozilla Common Voice 17.0, FLEURS, and VoxPopuli shows that in some languages, these datasets suffer from significant quality issues. We believe addressing these issues will make these datasets more useful as evaluation sets, and improve downstream models. We divide these quality issues into two categories: micro-level and macro-level. We find that macro-level issues are more prevalent in less institutionalized, often under-resourced languages. We provide a case analysis of Taiwanese Southern Min (nan_tw) that highlights the need for proactive language planning (e.g. orthography prescriptions, dialect boundary definition) and enhanced data quality control in the process of Automatic Speech Recognition (ASR) dataset creation. We conclude by proposing guidelines and recommendations to mitigate these issues in future dataset development, emphasizing the importance of sociolinguistic awareness in creating robust and reliable speech data resources.
pdf
bib
abs
LLM as a Broken Telephone: Iterative Generation Distorts Information
Amr Mohamed
|
Mingmeng Geng
|
Michalis Vazirgiannis
|
Guokan Shang
As large language models are increasingly responsible for online content, concerns arise about the impact of repeatedly processing their own outputs. Inspired by the “broken telephone” effect in chained human communication, this study investigates whether LLMs similarly distort information through iterative generation. Through translation-based experiments, we find that distortion accumulates over time, influenced by language choice and chain complexity. While degradation is inevitable, it can be mitigated through strategic prompting techniques. These findings contribute to discussions on the long-term effects of AI-mediated information propagation, raising important questions about the reliability of LLM-generated content in iterative workflows.
pdf
bib
abs
VLM2-Bench: A Closer Look at How Well VLMs Implicitly Link Explicit Matching Visual Cues
Jianshu Zhang
|
Dongyu Yao
|
Renjie Pi
|
Paul Pu Liang
|
Yi R. Fung
Visually linking matching cues is a crucial ability in daily life, such as identifying the same person in multiple photos based on their cues, even without knowing who they are. Despite the extensive knowledge that vision-language models (VLMs) possess, it remains largely unexplored whether they are capable of performing this fundamental task. To address this, we introduce VLM2-Bench, a benchmark designed to assess whether VLMs can Visually Link Matching cues, with 9 subtasks and over 3,000 test cases. Comprehensive evaluation across twelve VLMs, along with further analysis of various language-side and vision-side prompting methods, leads to a total of eight key findings. We identify critical challenges in models’ ability to link visual cues, highlighting a significant performance gap. Based on these insights, we advocate for (i) enhancing core visual capabilities to improve adaptability and reduce reliance on prior knowledge, (ii) establishing clearer principles for integrating language-based reasoning in vision-centric tasks to prevent unnecessary biases, and (iii) shifting vision-text training paradigms toward fostering models’ ability to independently structure and infer relationships among visual cues.
pdf
bib
abs
Alleviating Distribution Shift in Synthetic Data for Machine Translation Quality Estimation
Xiang Geng
|
Zhejian Lai
|
Jiajun Chen
|
Hao Yang
|
Shujian Huang
Quality Estimation (QE) models evaluate the quality of machine translations without reference translations, serving as the reward models for the translation task. Due to data scarcity, synthetic data generation has emerged as a promising solution. However, synthetic QE data often suffers from distribution shift, which can manifest as discrepancies between pseudo and real translations, or in pseudo labels that do not align with human preferences. To tackle this issue, we introduce DCSQE, a novel framework for alleviating distribution shift in synthetic QE data. To reduce the difference between pseudo and real translations, we employ the constrained beam search algorithm and enhance translation diversity through the use of distinct generation models. DCSQE uses references (i.e., translation supervision signals) to guide both the generation and annotation processes, enhancing the quality of token-level labels. DCSQE further identifies the shortest phrase covering consecutive error tokens, mimicking human annotation behavior, to assign the final phrase-level labels. Specifically, we underscore that a translation model cannot accurately annotate its own translations. Extensive experiments demonstrate that DCSQE outperforms SOTA baselines like CometKiwi in both supervised and unsupervised settings. Further analysis offers insights into synthetic data generation that could benefit reward models for other tasks. The code is available at https://github.com/NJUNLP/njuqe.
pdf
bib
abs
Evaluation Agent: Efficient and Promptable Evaluation Framework for Visual Generative Models
Fan Zhang
|
Shulin Tian
|
Ziqi Huang
|
Yu Qiao
|
Ziwei Liu
Recent advancements in visual generative models have enabled high-quality image and video generation, opening diverse applications. However, evaluating these models often demands sampling hundreds or thousands of images or videos, making the process computationally expensive, especially for diffusion-based models with inherently slow sampling. Moreover, existing evaluation methods rely on rigid pipelines that overlook specific user needs and provide numerical results without clear explanations. In contrast, humans can quickly form impressions of a model’s capabilities by observing only a few samples. To mimic this, we propose the Evaluation Agent framework, which employs human-like strategies for efficient, dynamic, multi-round evaluations using only a few samples per round, while offering detailed, user-tailored analyses. It offers four key advantages: 1) efficiency, 2) promptable evaluation tailored to diverse user needs, 3) explainability beyond single numerical scores, and 4) scalability across various models and tools. Experiments show that Evaluation Agent reduces evaluation time to 10% of traditional methods while delivering comparable results. The Evaluation Agent framework is fully open-sourced to advance research in visual generative models and their efficient evaluation.
pdf
bib
abs
Large Language Models Struggle to Describe the Haystack without Human Help: A Social Science-Inspired Evaluation of Topic Models
Zongxia Li
|
Lorena Calvo-Bartolomé
|
Alexander Miserlis Hoyle
|
Paiheng Xu
|
Daniel Kofi Stephens
|
Juan Francisco Fung
|
Alden Dima
|
Jordan Lee Boyd-Graber
A common use of NLP is to facilitate the understanding of large document collections, with models based on Large Language Models (LLMs) replacing probabilistic topic models. Yet the effectiveness of LLM-based approaches in real-world applications remains underexplored. This study measures the knowledge users acquire with topic models (including traditional, unsupervised, and supervised LLM-based approaches) on two datasets. While LLM-based methods generate more human-readable topics and show higher average win probabilities than traditional models for data exploration, they produce overly generic topics for domain-specific datasets that do not easily allow users to learn much about the documents. Adding human supervision to LLM-based topic models improves data exploration by addressing hallucination and genericity but requires more human effort. In contrast, traditional models like Latent Dirichlet Allocation (LDA) remain effective for exploration but are less user-friendly. This paper provides best practices (there is no one right model; the choice of model is situation-specific) and suggests potential improvements for scalable LLM-based topic models.
pdf
bib
abs
ActiView: Evaluating Active Perception Ability for Multimodal Large Language Models
Ziyue Wang
|
Chi Chen
|
Fuwen Luo
|
Yurui Dong
|
Yuanchi Zhang
|
Yuzhuang Xu
|
Xiaolong Wang
|
Peng Li
|
Yang Liu
Active perception, a crucial human capability, involves setting a goal based on the current understanding of the environment and performing actions to achieve that goal. Despite significant efforts in evaluating Multimodal Large Language Models (MLLMs), active perception has been largely overlooked. To address this gap, we propose a novel benchmark named ActiView to evaluate active perception in MLLMs. We focus on a specialized form of Visual Question Answering (VQA) that eases and quantifies the evaluation yet remains challenging for existing MLLMs. Meanwhile, intermediate reasoning behaviors of models are also discussed. Given an image, we restrict the perceptual field of a model, requiring it to actively zoom or shift its perceptual field based on reasoning to answer the question successfully. We conduct extensive evaluation over 30 models, including proprietary and open-source models, and observe that restricted perceptual fields play a significant role in enabling active perception. Results reveal a significant gap in the active perception capability of MLLMs, indicating that this area deserves more attention. We hope that ActiView could help develop methods for MLLMs to understand multimodal inputs in more natural and holistic ways.
pdf
bib
abs
Enough Coin Flips Can Make LLMs Act Bayesian
Ritwik Gupta
|
Rodolfo Corona
|
Jiaxin Ge
|
Eric Wang
|
Dan Klein
|
Trevor Darrell
|
David M. Chan
Large language models (LLMs) exhibit the ability to generalize given few-shot examples in their input prompt, an emergent capability known as in-context learning (ICL). We investigate whether LLMs use ICL to perform structured reasoning in ways that are consistent with a Bayesian framework or rely on pattern matching. Using a controlled setting of biased coin flips, we find that: (1) LLMs often possess biased priors, causing initial divergence in zero-shot settings, (2) in-context evidence outweighs explicit bias instructions, (3) LLMs broadly follow Bayesian posterior updates, with deviations primarily due to miscalibrated priors rather than flawed updates, and (4) attention magnitude has negligible effect on Bayesian inference. With sufficient demonstrations of biased coin flips via ICL, LLMs update their priors in a Bayesian manner. Code and visualizations are available on the project page at https://ai-climate.berkeley.edu/llm-coin-flips/.
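For readers who want the Bayesian reference point in concrete form, the sketch below shows the standard Beta-Bernoulli posterior update for a biased coin. It is textbook conjugate updating, not code from the paper; the function names, the two example priors, and the 60/40 flip counts are assumptions chosen for illustration.

```python
def beta_update(alpha, beta, flips):
    """Conjugate update of a Beta(alpha, beta) prior given flips (1 = heads, 0 = tails)."""
    heads = sum(flips)
    tails = len(flips) - heads
    return alpha + heads, beta + tails


def posterior_mean(alpha, beta):
    """Expected probability of heads under Beta(alpha, beta)."""
    return alpha / (alpha + beta)


# 100 observed flips from a coin that lands heads 60% of the time (toy data).
flips = [1] * 60 + [0] * 40

# A flat prior versus a miscalibrated prior that strongly favors heads.
flat = beta_update(1, 1, flips)
skewed = beta_update(8, 2, flips)

gap_before = abs(posterior_mean(1, 1) - posterior_mean(8, 2))
gap_after = abs(posterior_mean(*flat) - posterior_mean(*skewed))
```

The two priors disagree by 0.3 before any evidence is seen and by only about 0.02 after 100 flips, which mirrors the abstract's point that a miscalibrated prior matters most when few demonstrations are available.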
pdf
bib
abs
GAMEBoT: Transparent Assessment of LLM Reasoning in Games
Wenye Lin
|
Jonathan Roberts
|
Yunhan Yang
|
Samuel Albanie
|
Zongqing Lu
|
Kai Han
Large Language Models (LLMs) are increasingly deployed in real-world applications that demand complex reasoning. To track progress, robust benchmarks are required to evaluate their capabilities beyond superficial pattern recognition. However, current LLM reasoning benchmarks often face challenges such as insufficient interpretability, performance saturation or data contamination. To address these challenges, we introduce GAMEBoT, a gaming arena designed for rigorous and transparent assessment of LLM reasoning capabilities. GAMEBoT decomposes complex reasoning in games into predefined modular subproblems. This decomposition allows us to design a suite of Chain-of-Thought (CoT) prompts infused with domain knowledge to guide LLMs in addressing these subproblems before action selection. Furthermore, we develop a suite of rule-based algorithms to generate ground truth for these subproblems, enabling rigorous validation of the LLMs’ intermediate reasoning steps. This approach facilitates evaluation of both the quality of final actions and the accuracy of the underlying reasoning process. GAMEBoT also naturally alleviates the risk of data contamination through dynamic games and head-to-head LLM competitions. We benchmark 17 prominent LLMs across eight games, encompassing various strategic abilities and game characteristics. Our results suggest that GAMEBoT presents a significant challenge, even when LLMs are provided with detailed CoT prompts.
pdf
bib
abs
A Text is Worth Several Tokens: Text Embedding from LLMs Secretly Aligns Well with The Key Tokens
Zhijie Nie
|
Richong Zhang
|
Zhanyu Wu
Text embeddings from large language models (LLMs) have achieved excellent results in tasks such as information retrieval and semantic textual similarity. In this work, we show an interesting finding: when a text is fed into an LLM-based embedder, the resulting text embedding can be aligned with the key tokens in the input text. We first fully analyze this phenomenon on eight LLM-based embedders and show that it is universal and is not affected by model architecture, training strategy, or embedding method. With a deeper analysis, we find that the main change in embedding space between these embedders and their LLM backbones is in the first principal component. By adjusting the first principal component, we can align text embedding with the key tokens. Finally, we give several examples to demonstrate the vast application potential of this finding: (1) we propose a simple and practical sparse retrieval method based on the aligned tokens, which can achieve 80% of the dense retrieval effect of the same model while reducing the computation significantly; (2) we show that our findings provide a novel perspective to help understand novel technologies (e.g., instruction-following embedding) and fuzzy concepts (e.g., semantic relatedness vs. similarity) in this field.
pdf
bib
abs
Commonsense Reasoning in Arab Culture
Abdelrahman Sadallah
|
Junior Cedric Tonga
|
Khalid Almubarak
|
Saeed Almheiri
|
Farah Atif
|
Chatrine Qwaider
|
Karima Kadaoui
|
Sara Shatnawi
|
Yaser Alesh
|
Fajri Koto
Despite progress in Arabic large language models, such as Jais and AceGPT, their evaluation on commonsense reasoning has largely relied on machine-translated datasets, which lack cultural depth and may introduce Anglocentric biases. Commonsense reasoning is shaped by geographical and cultural contexts, and existing English datasets fail to capture the diversity of the Arab world. To address this, we introduce a commonsense reasoning dataset in Modern Standard Arabic (MSA), covering cultures of 13 countries across the Gulf, Levant, North Africa, and the Nile Valley. The dataset was built from scratch by engaging native speakers to write and validate culturally relevant questions for their respective countries. It spans 12 daily life domains with 54 fine-grained subtopics, reflecting various aspects of social norms, traditions, and everyday experiences. Zero-shot evaluations show that open-weight language models with up to 32B parameters struggle to comprehend diverse Arab cultures, with performance varying across regions. These findings highlight the need for more culturally aware models and datasets tailored to the Arabic-speaking world.
pdf
bib
abs
AXIS: Efficient Human-Agent-Computer Interaction with API-First LLM-Based Agents
Junting Lu
|
Zhiyang Zhang
|
Fangkai Yang
|
Jue Zhang
|
Lu Wang
|
Chao Du
|
Qingwei Lin
|
Saravan Rajmohan
|
Dongmei Zhang
|
Qi Zhang
Multimodal large language models (MLLMs) have enabled LLM-based agents to directly interact with application user interfaces (UIs), enhancing agents’ performance in complex tasks. However, these agents often suffer from high latency and low reliability due to the extensive sequential UI interactions. To address this issue, we propose AXIS, a novel LLM-based agent framework that prioritizes actions through application programming interfaces (APIs) over UI actions. This framework also facilitates the creation and expansion of APIs through automated exploration of applications. Our experiments on Microsoft Word demonstrate that AXIS reduces task completion time by 65%-70% and cognitive workload by 38%-53%, while maintaining accuracy of 97%-98% compared to humans. Our work contributes to a new human-agent-computer interaction (HACI) framework and explores a fresh UI design principle for application providers to turn applications into agents in the era of LLMs, paving the way towards an agent-centric operating system (Agent OS). The code and dataset will be available at https://aka.ms/haci_axis.
pdf
bib
abs
Translation and Fusion Improves Cross-lingual Information Extraction
Yang Chen
|
Vedaant Shah
|
Alan Ritter
Large language models (LLMs) combined with instruction tuning have shown significant progress in information extraction (IE) tasks, exhibiting strong generalization capabilities to unseen datasets by following annotation guidelines. However, their applicability to low-resource languages remains limited due to the lack of both labeled data for fine-tuning and unlabeled text for pre-training. In this paper, we propose TransFusion, a framework in which models are fine-tuned to use English translations of low-resource language data, enabling more precise predictions through annotation fusion. Based on TransFusion, we introduce GoLLIE-TF, a cross-lingual instruction-tuned LLM for IE tasks, designed to close the performance gap between high and low-resource languages. Our experiments across twelve multilingual IE datasets spanning 50 languages demonstrate that GoLLIE-TF achieves better cross-lingual transfer than the base model. In addition, we show that TransFusion significantly improves low-resource language named entity recognition when applied to proprietary models such as GPT-4 (+5 F1) with a prompting approach, or when fine-tuning different language models including decoder-only (+14 F1) and encoder-only (+13 F1) architectures.
pdf
bib
abs
Conditional Dichotomy Quantification via Geometric Embedding
Shaobo Cui
|
Wenqing Liu
|
Yiyang Feng
|
Jiawei Zhou
|
Boi Faltings
Conditional dichotomy, the contrast between two outputs conditioned on the same context, is vital for applications such as debate, defeasible inference, and causal reasoning. Existing methods that rely on semantic similarity often fail to capture the nuanced oppositional dynamics essential for these applications. Motivated by these limitations, we introduce a novel task, Conditional Dichotomy Quantification (ConDQ), which formalizes the direct measurement of conditional dichotomy and provides carefully constructed datasets covering debate, defeasible natural language inference, and causal reasoning scenarios. To address this task, we develop the Dichotomy-oriented Geometric Embedding (DoGE) framework, which leverages complex-valued embeddings and a dichotomous objective to model and quantify these oppositional relationships effectively. Extensive experiments validate the effectiveness and versatility of DoGE, demonstrating its potential in understanding and quantifying conditional dichotomy across diverse NLP applications. Our code and datasets are available at https://github.com/cui-shaobo/conditional-dichotomy-quantification.
pdf
bib
abs
Aligning Large Language Models with Implicit Preferences from User-Generated Content
Zhaoxuan Tan
|
Zheng Li
|
Tianyi Liu
|
Haodong Wang
|
Hyokun Yun
|
Ming Zeng
|
Pei Chen
|
Zhihan Zhang
|
Yifan Gao
|
Ruijie Wang
|
Priyanka Nigam
|
Bing Yin
|
Meng Jiang
Learning from preference feedback is essential for aligning large language models (LLMs) with human values and improving the quality of generated responses. However, existing preference learning methods rely heavily on curated data from humans or advanced LLMs, which is costly and difficult to scale. In this work, we present PUGC, a novel framework that leverages implicit human Preferences in unlabeled User-Generated Content (UGC) to generate preference data. Although UGC is not explicitly created to guide LLMs in generating human-preferred responses, it often reflects valuable insights and implicit preferences from its creators that have the potential to address readers’ questions. PUGC transforms UGC into user queries and generates responses from the policy model. The UGC is then leveraged as a reference text for response scoring, aligning the model with these implicit preferences. This approach improves the quality of preference data while enabling scalable, domain-specific alignment. Experimental results on Alpaca Eval 2 show that models trained with DPO and PUGC achieve a 9.37% performance improvement over traditional methods, setting a 35.93% state-of-the-art length-controlled win rate using Mistral-7B-Instruct. Further studies highlight gains in reward quality, domain-specific alignment effectiveness, robustness against UGC quality, and theory of mind capabilities. Our code and dataset are available at https://zhaoxuan.info/PUGC.github.io/.
pdf
bib
abs
VQAGuider: Guiding Multimodal Large Language Models to Answer Complex Video Questions
Yuyan Chen
|
Jiyuan Jia
|
Jiaxin Lu
|
Siyue Li
|
Yu Guan
|
Ming Yang
|
Qingpei Guo
Complex video question-answering (VQA) requires in-depth understanding of video contents, including object and action recognition as well as video classification and summarization, and exhibits great potential in emerging applications such as education and entertainment. Multimodal large language models (MLLMs) may accomplish this task by grasping the intention of a question and decomposing it into a series of visual recognition sub-tasks to find the answer with the help of an agent. To tackle this task, we first collect a new dedicated Complex VQA dataset named CVQA and then propose VQAGuider, an innovative framework that plans a few atomic visual recognition tools by video-related API matching. VQAGuider facilitates a deep engagement with video content and precise responses to complex video-related questions by MLLMs, which goes beyond aligning visual and language features for simple VQA tasks. Our experiments demonstrate that VQAGuider enables MLLMs to navigate complex VQA tasks and improves the accuracy by 29.6% and 17.2% on CVQA and the existing VQA datasets, respectively, highlighting its potential in advancing MLLMs’ capabilities in video understanding.
pdf
bib
abs
Large Language Models are Good Relational Learners
Fang Wu
|
Vijay Prakash Dwivedi
|
Jure Leskovec
Large language models (LLMs) have demonstrated remarkable capabilities across various domains, yet their application to relational deep learning (RDL) remains underexplored. Existing approaches adapt LLMs by traversing relational links between entities in a database and converting the structured data into flat text documents, but this text-based serialization disregards critical relational structures, introduces redundancy, and often exceeds standard LLM context lengths. We introduce Rel-LLM, a novel architecture that employs a graph neural network (GNN) based encoder to create structured relational prompts for LLMs within a retrieval-augmented generation (RAG) framework. Unlike traditional text-based serialization approaches, our method preserves the inherent relational structure of databases while enabling LLMs to effectively process and reason over complex entity relationships. Specifically, the GNN encoder extracts a local subgraph around an entity to build feature representations that contain relevant entity relationships and temporal dependencies. These representations are transformed into structured prompts using a denormalization process, effectively allowing the LLM to reason over relational structures. Through extensive experiments, we demonstrate that Rel-LLM outperforms existing methods on key RDL tasks, offering a scalable and efficient approach to integrating LLMs with structured data sources. Code is available at https://github.com/smiles724/Rel-LLM.
pdf
bib
abs
SpaRE: Enhancing Spatial Reasoning in Vision-Language Models with Synthetic Data
Michael Ogezi
|
Freda Shi
Vision-language models (VLMs) work well in tasks ranging from image captioning to visual question answering (VQA), yet they struggle with spatial reasoning, a key skill for understanding our physical world that humans excel at. We find that spatial relations are generally rare in widely used VL datasets, with only a few being well represented, while most form a long tail of underrepresented relations. This gap leaves VLMs ill-equipped to handle diverse spatial relationships. To bridge it, we construct a synthetic VQA dataset focused on spatial reasoning generated from hyper-detailed image descriptions in Localized Narratives, DOCCI, and PixMo-Cap. Our dataset consists of 455k samples containing 3.4 million QA pairs. Trained on this dataset, our Spatial-Reasoning Enhanced (SpaRE) VLMs show strong improvements on spatial reasoning benchmarks, achieving up to a 49% performance gain on the What’s Up benchmark, while maintaining strong results on general tasks. Our work narrows the gap between human and VLM spatial reasoning and makes VLMs more capable in real-world tasks such as robotics and navigation. We plan to share our code and dataset in due course.
pdf
bib
abs
Distilling an End-to-End Voice Assistant Without Instruction Training Data
William Barr Held
|
Yanzhe Zhang
|
Weiyan Shi
|
Minzhi Li
|
Michael J Ryan
|
Diyi Yang
Voice assistants, such as Siri and Google Assistant, typically model audio and text separately, resulting in lost speech information and increased complexity. Recent efforts to address this with end-to-end Speech Large Language Models (speech-in, text-out) trained with supervised finetuning (SFT) have led to models “forgetting” capabilities from text-only LLMs. Our work proposes an alternative paradigm for training Speech LLMs without instruction data, using the response of a text-only LLM to transcripts as self-supervision. Importantly, this process can be performed without annotated responses. We show that our Distilled Voice Assistant (DiVA) generalizes to Spoken Question Answering, Classification, and Translation. Furthermore, DiVA better matches user preferences, achieving a 72% win rate compared with state-of-the-art models like Qwen 2 Audio, despite using >100x less training compute.
pdf
bib
abs
CoMet: Metaphor-Driven Covert Communication for Multi-Agent Language Games
Shuhang Xu
|
Fangwei Zhong
Metaphors are a crucial way for humans to express complex or subtle ideas by comparing one concept to another, often from a different domain. However, many large language models (LLMs) struggle to interpret and apply metaphors in multi-agent language games, hindering their ability to engage in covert communication and semantic evasion, which are crucial for strategic communication. To address this challenge, we introduce CoMet, a framework that enables LLM-based agents to engage in metaphor processing. CoMet combines a hypothesis-based metaphor reasoner with a metaphor generator that improves through self-reflection and knowledge integration. This enhances the agents’ ability to interpret and apply metaphors, improving the strategic and nuanced quality of their interactions. We evaluate CoMet on two multi-agent language games—Undercover and Adversarial Taboo—which emphasize “covert communication” and “semantic evasion”. Experimental results demonstrate that CoMet significantly enhances the agents’ ability to communicate strategically using metaphors.
pdf
bib
abs
CER: Confidence Enhanced Reasoning in LLMs
Ali Razghandi
|
Seyed Mohammad Hadi Hosseini
|
Mahdieh Soleymani Baghshah
Ensuring the reliability of Large Language Models (LLMs) in complex reasoning tasks remains a formidable challenge, particularly in scenarios that demand precise mathematical calculations and knowledge-intensive open-domain generation. In this work, we introduce an uncertainty-aware framework designed to enhance the accuracy of LLM responses by systematically incorporating model confidence at critical decision points. We propose an approach that encourages multi-step reasoning in LLMs and quantifies the confidence of intermediate answers, such as numerical results in mathematical reasoning and proper nouns in open-domain generation. Then, the overall confidence of each reasoning chain is evaluated based on the confidence of these critical intermediate steps. Finally, we aggregate the answers of the generated response paths in a way that reflects the reliability of each generated chain (as opposed to self-consistency, in which each generated chain contributes equally to majority voting). We conducted extensive experiments on five datasets, three mathematical datasets and two open-domain datasets, using four LLMs. The results consistently validate the effectiveness of our novel confidence-aggregation method, leading to an accuracy improvement of up to 7.4% and 5.8% over baseline approaches in math and open-domain generation tasks, respectively. Code is publicly available at https://github.com/sharif-ml-lab/CER.
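The contrast this abstract draws between equal-weight majority voting and confidence-weighted aggregation can be sketched as follows. This is only an illustrative sketch: scoring a chain by the product of its step confidences, the function names, and the example chains are assumptions, not the paper's actual scoring or aggregation rule.

```python
from collections import Counter, defaultdict
from math import prod


def majority_vote(chains):
    """Self-consistency style: every chain counts equally."""
    return Counter(answer for answer, _ in chains).most_common(1)[0][0]


def confidence_weighted_vote(chains):
    """Each chain votes with a weight derived from its step confidences.

    Assumption for illustration: the chain weight is the product of the
    confidences of its intermediate answers.
    """
    scores = defaultdict(float)
    for answer, step_confidences in chains:
        scores[answer] += prod(step_confidences)
    return max(scores, key=scores.get)


# Hypothetical sampled chains: (final answer, confidences of intermediate steps).
chains = [
    ("42", [0.9, 0.95]),  # one confident chain
    ("17", [0.4, 0.5]),   # two low-confidence chains that agree
    ("17", [0.5, 0.3]),
]

by_count = majority_vote(chains)
by_confidence = confidence_weighted_vote(chains)
```

In this toy case the two low-confidence chains win the unweighted vote, while the single confident chain wins once chain confidence is taken into account.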
pdf
bib
abs
Watermarking Large Language Models: An Unbiased and Low-risk Method
Minjia Mao
|
Dongjun Wei
|
Zeyu Chen
|
Xiao Fang
|
Michael Chau
Recent advancements in large language models (LLMs) have highlighted the risk of misusing them, raising the need for accurate detection of LLM-generated content. In response, a viable solution is to inject imperceptible identifiers into LLMs, known as watermarks. Our research extends the existing watermarking methods by proposing the novel Sampling One Then Accepting (STA-1) method. STA-1 is an unbiased watermark that preserves the original token distribution in expectation and has a lower risk of producing unsatisfactory outputs in low-entropy scenarios compared to existing unbiased watermarks. In watermark detection, STA-1 does not require prompts or a white-box LLM, provides statistical guarantees, demonstrates high efficiency in detection time, and remains robust against various watermarking attacks. Experimental results on low-entropy and high-entropy datasets demonstrate that STA-1 achieves the above properties simultaneously, making it a desirable solution for watermarking LLMs. Implementation codes for this study are available online.
pdf
bib
abs
On Synthetic Data Strategies for Domain-Specific Generative Retrieval
Haoyang Wen
|
Jiang Guo
|
Yi Zhang
|
Jiarong Jiang
|
Zhiguo Wang
This paper investigates synthetic data generation strategies in developing generative retrieval models for domain-specific corpora, thereby addressing the scalability challenges inherent in manually annotating in-domain queries. We study the data strategies for a two-stage training framework: in the first stage, which focuses on learning to decode document identifiers from queries, we investigate LLM-generated queries across multiple granularities (e.g., chunks, sentences) and domain-relevant search constraints that can better capture nuanced relevancy signals. In the second stage, which aims to refine document ranking through preference learning, we explore strategies for mining hard negatives based on the initial model’s predictions. Experiments on public datasets over diverse domains demonstrate the effectiveness of our synthetic data generation and hard negative sampling approach.
pdf
bib
abs
LLM Braces: Straightening Out LLM Predictions with Relevant Sub-Updates
Ying Shen
|
Lifu Huang
Recent findings reveal that much of the knowledge in a Transformer-based Large Language Model (LLM) is encoded in its feed-forward (FFN) layers, where each FFN layer can be interpreted as the summation of sub-updates, each corresponding to a weighted column vector from the FFN’s value parameter matrix that often encodes human-interpretable concepts. In light of this, we hypothesize that model performance and behaviors can be further enhanced and controlled by modulating the contributions of these sub-updates based on their relevance to the input or target output style, and propose LLMBraces, a novel and efficient method that computes relevance scores associated with value vectors in FFN layers and leverages these scores to dynamically adjust the contribution of sub-updates. By optimizing sub-update contributions, LLMBraces refines the prediction process, leading to more accurate and reliable outputs, much like a ‘brace’ providing support and stability. Moreover, LLMBraces can be extended to support conditional control over generation characteristics, such as sentiment, thereby offering fine-grained steering of LLM outputs. Extensive experiments on various LLMs, including Qwen2.5-1.5B, Llama2-7B, and Llama3-8B, demonstrate that LLMBraces outperforms baseline approaches in both fine-tuning and zero-shot settings while requiring significantly fewer tunable parameters, up to 75% fewer compared to LoRA. Furthermore, LLMBraces excels in sentiment-controlled generation and toxicity reduction, highlighting its potential for flexible, controlled text generation across applications.
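The view of an FFN layer as a sum of sub-updates, and the idea of rescaling each sub-update by a relevance score, can be sketched in a few lines. This is a toy illustration under stated assumptions: the function `ffn_output`, the hand-picked coefficients, value vectors, and relevance scores are all hypothetical, and how the method actually computes relevance scores is not specified in the abstract.

```python
def ffn_output(coefficients, value_vectors, relevance=None):
    """Sum of sub-updates, each a coefficient times a value vector.

    If `relevance` is given, sub-update i is additionally scaled by
    relevance[i]; otherwise every sub-update keeps weight 1.0.
    """
    if relevance is None:
        relevance = [1.0] * len(coefficients)
    dim = len(value_vectors[0])
    out = [0.0] * dim
    for a, v, r in zip(coefficients, value_vectors, relevance):
        for j in range(dim):
            out[j] += r * a * v[j]
    return out


# Hypothetical layer with three value vectors in a 2-dimensional space.
coefficients = [0.5, 2.0, 1.0]  # activation weights of the sub-updates
value_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Unmodified layer: 0.5*[1,0] + 2.0*[0,1] + 1.0*[1,1].
plain = ffn_output(coefficients, value_vectors)

# Suppress the second sub-update with a relevance score of 0.
steered = ffn_output(coefficients, value_vectors, relevance=[1.0, 0.0, 1.0])
```

Setting a relevance score to zero removes that value vector's contribution entirely; intermediate scores would scale it smoothly.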
pdf
bib
abs
CONFETTI: Conversational Function-Calling Evaluation Through Turn-Level Interactions
Tamer Alkhouli
|
Katerina Margatina
|
James Gung
|
Raphael Shu
|
Claudia Zaghi
|
Monica Sunkara
|
Yi Zhang
We introduce Conversational Function-Calling Evaluation Through Turn-Level Interactions (CONFETTI), a conversational benchmark designed to evaluate the function-calling capabilities and response quality of large language models (LLMs). Current benchmarks lack comprehensive assessment of LLMs in complex conversational scenarios. CONFETTI addresses this gap through 109 human-simulated conversations, comprising 313 user turns and covering 86 APIs. These conversations explicitly target various conversational complexities, such as follow-ups, goal correction and switching, and ambiguous and implicit goals. We perform off-policy turn-level evaluation using this benchmark targeting function-calling. Our benchmark also incorporates dialog act annotations to assess agent responses. We evaluate a series of state-of-the-art LLMs and analyze their performance with respect to the number of available APIs, conversation lengths, and chained function calling. Our results reveal that while some models are able to handle long conversations and leverage more than 20 APIs successfully, other models struggle with longer context or when the number of APIs increases. We also report that the performance on chained function calls is severely limited across the models. Overall, the top performing models on CONFETTI are Nova Pro (40.01%), Claude Sonnet v3.5 (35.46%) and Llama 3.1 405B (33.19%), followed by command-r-plus (31.18%) and Mistral-Large-2407 (30.07%).
pdf
bib
abs
Evaluating Theory of (an uncertain) Mind: Predicting the Uncertain Beliefs of Others from Conversational Cues
Anthony Sicilia
|
Malihe Alikhani
Typically, when evaluating Theory of Mind, we consider the beliefs of others to be binary: held or not held. But what if someone is unsure about their own beliefs? How can we quantify this uncertainty? We propose a new suite of tasks, challenging language models (LMs) to model the uncertainty of participants in a dialogue. We design these tasks around conversation forecasting, where the goal is to predict the probability of an unobserved conversation outcome. Uniquely, we view conversation agents themselves as forecasters, asking an LM to predict the uncertainty of an individual from their language use. We experiment with scaling methods, bagging, and demographic context for this regression task, conducting experiments on three dialogue corpora (social, negotiation, task-oriented) with eight LMs. While LMs can explain up to 7% variance in the uncertainty of others, we highlight the difficulty of the tasks and room for future work, especially in tasks that require explicit shifts in perspective.
pdf
bib
abs
Uncertainty in Causality: A New Frontier
Shaobo Cui
|
Luca Mouchel
|
Boi Faltings
Understanding uncertainty in causality is vital in various domains, including core NLP tasks like event causality extraction, commonsense reasoning, and counterfactual text generation. However, existing literature lacks a comprehensive examination of this area. This survey aims to fill this gap by thoroughly reviewing uncertainty in causality. We first introduce a novel trichotomy, categorizing causal uncertainty into aleatoric (inherent randomness in causal data), epistemic (causal model limitations), and ontological (existence of causal links) uncertainty. We then survey methods for quantifying uncertainty in causal analysis and highlight the complementary relationship between causal uncertainty and causal strength. Furthermore, we examine the challenges that large language models (LLMs) face in handling causal uncertainty, such as hallucinations and inconsistencies, and propose key traits for an optimal causal LLM. Our paper reviews current approaches and outlines future research directions, aiming to serve as a practical guide for researchers and practitioners in this emerging field.
pdf
bib
abs
SynthesizeMe! Inducing Persona-Guided Prompts for Personalized Reward Models in LLMs
Michael J Ryan
|
Omar Shaikh
|
Aditri Bhagirath
|
Daniel Frees
|
William Barr Held
|
Diyi Yang
Recent calls for pluralistic alignment of Large Language Models (LLMs) encourage adapting models to diverse user preferences. However, most prior work on personalized reward models relies heavily on additional identity information, such as demographic details or a predefined set of preference categories. To this end, we introduce SynthesizeMe, an approach to inducing synthetic user personas from user interactions for personalized reward modeling. SynthesizeMe first generates and verifies reasoning to explain user preferences, then induces synthetic user personas from that reasoning, and finally filters to informative prior user interactions in order to build personalized prompts for a particular user. We show that using SynthesizeMe-induced prompts improves personalized LLM-as-a-judge accuracy by 4.4% on Chatbot Arena. Combining SynthesizeMe-derived prompts with a reward model achieves top performance on PersonalRewardBench: a new curation of user-stratified interactions with chatbots collected from 854 users of Chatbot Arena and PRISM.
pdf
bib
abs
When People are Floods: Analyzing Dehumanizing Metaphors in Immigration Discourse with Large Language Models
Julia Mendelsohn
|
Ceren Budak
Metaphor, discussing one concept in terms of another, is abundant in politics and can shape how people understand important issues. We develop a computational approach to measure metaphorical language, focusing on immigration discourse on social media. Grounded in qualitative social science research, we identify seven concepts evoked in immigration discourse (e.g. water or vermin). We propose and evaluate a novel technique that leverages both word-level and document-level signals to measure metaphor with respect to these concepts. We then study the relationship between metaphor, political ideology, and user engagement in 400K US tweets about immigration. While conservatives tend to use dehumanizing metaphors more than liberals, this effect varies widely across concepts. Moreover, creature-related metaphor is associated with more retweets, especially for liberal authors. Our work highlights the potential for computational methods to complement qualitative approaches in understanding subtle and implicit language in political discourse.
pdf
bib
abs
AGrail: A Lifelong Agent Guardrail with Effective and Adaptive Safety Detection
Weidi Luo
|
Shenghong Dai
|
Xiaogeng Liu
|
Suman Banerjee
|
Huan Sun
|
Muhao Chen
|
Chaowei Xiao
The rapid advancements in Large Language Models (LLMs) have enabled their deployment as autonomous agents for handling complex tasks in dynamic environments. These LLMs demonstrate strong problem-solving capabilities and adaptability to multifaceted scenarios. However, their use as agents also introduces significant risks, including task-specific risks, which are identified by the agent administrator based on the specific task requirements and constraints, and systemic risks, which stem from vulnerabilities in their design or interactions, potentially compromising confidentiality, integrity, or availability (CIA) of information and triggering security risks. Existing defense agencies fail to adaptively and effectively mitigate these risks. In this paper, we propose AGrail, a lifelong agent guardrail to enhance LLM agent safety, which features adaptive safety check generation, effective safety check optimization, and tool compatibility & flexibility. Extensive experiments demonstrate that AGrail not only achieves strong performance against task-specific and systemic risks but also exhibits transferability across different LLM agents’ tasks.
pdf
bib
abs
Improving Model Factuality with Fine-grained Critique-based Evaluator
Yiqing Xie
|
Wenxuan Zhou
|
Pradyot Prakash
|
Di Jin
|
Yuning Mao
|
Quintin Fettes
|
Arya Talebzadeh
|
Sinong Wang
|
Han Fang
|
Carolyn Rose
|
Daniel Fried
|
Hejia Zhang
Factuality evaluation aims to detect factual errors produced by language models (LMs) and hence guide the development of more factual models. Towards this goal, we train a factuality evaluator, FenCE, that provides LM generators with claim-level factuality feedback. In particular, we train FenCE to (1) generate textual critiques along with scores and (2) make claim-level judgment based on diverse source documents obtained by various tools, via data augmentation on a combination of public judgment datasets. We then present a framework that leverages FenCE to improve the factuality of LM generators by constructing training data. Specifically, we generate a set of candidate responses, ask FenCE to revise and score each response without introducing lesser-known facts, and train the generator by preferring highly scored revised responses. Experiments show that our data augmentation methods improve the evaluator’s accuracy by 2.9% on LLM-AggreFact. With FenCE, we improve Llama2-7B-chat/Llama3-8B-chat’s factuality rate by 16.86%/14.45% on FActScore, outperforming state-of-the-art factuality finetuning methods by 8.83%/6.96%.
pdf
bib
abs
Building a Long Text Privacy Policy Corpus with Multi-Class Labels
Florencia Marotta-Wurgler
|
David Stein
Legal text poses distinctive challenges for natural language processing. The legal import of a term may depend on omissions, cross-references, or silence. Further, legal text is often susceptible to multiple valid, conflicting interpretations; as the saying goes: a good lawyer’s answer to any question is “it depends.” This work introduces a new, hand-coded dataset for the interpretation of privacy policies. It includes privacy policies from 149 firms, including materials incorporated by reference. The policies are annotated across 64 dimensions that reflect the applicable legal rules and contested terms from EU and US privacy regulation and litigation. Our annotation methodology is designed to capture the core challenges peculiar to legal language, including indeterminacy, interdependence between clauses, meaningful silence, and the implications of legal defaults. We present a set of baseline results for the dataset using current large language models.
pdf
bib
abs
R2-MultiOmnia: Leading Multilingual Multimodal Reasoning via Self-Training
Leonardo Ranaldi
|
Federico Ranaldi
|
Giulia Pucci
Reasoning is an intricate process that transcends both language and vision; yet, despite its inherently modality-agnostic nature, developing effective multilingual and multimodal reasoning capabilities remains a substantial challenge for Multimodal Large Language Models (MLLMs). They struggle to activate complex reasoning behaviours, such as delivering step-wise explanation, questioning and reflection, particularly in multilingual settings where high-quality supervision across languages is lacking. Recent works have introduced eclectic strategies to enhance MLLMs’ reasoning; however, they remain limited to a single language. To align MLLMs’ reasoning capabilities across languages and improve performance across modalities, we propose R2-MultiOmnia, a modular approach that instructs the models to abstract key elements of the reasoning process and then refine reasoning trajectories via self-correction. Specifically, we instruct the models to produce multimodal synthetic resources by bridging modalities and then to self-improve their capabilities. To stabilise learning and the structure of the reasoning processes, we propose Curriculum Learning Reasoning Stabilisation with structured output rewards to gradually refine the models’ capabilities to learn and deliver robust reasoning processes. Experiments show that R2-MultiOmnia improves multimodal reasoning and achieves aligned performance across languages, approaching that of strong models.
pdf
bib
abs
When the LM misunderstood the human chuckled: Analyzing garden path effects in humans and language models
Samuel Joseph Amouyal
|
Aya Meltzer-Asscher
|
Jonathan Berant
Modern Large Language Models (LLMs) have shown human-like abilities in many language tasks, sparking interest in comparing LLMs’ and humans’ language processing. In this paper, we try to answer two questions: 1. What makes garden-path sentences hard to understand for humans? 2. Do the same reasons make garden-path sentences hard for LLMs as well? Based on psycholinguistic research, we formulate hypotheses on why garden-path sentences are hard, and test these hypotheses on human participants and a large suite of LLMs using comprehension questions. Our findings reveal that both LLMs and humans struggle with specific syntactic complexities, with some models showing high correlation with human comprehension. To complement our findings, we test LLM comprehension of garden-path constructions with paraphrasing and text-to-image generation tasks, and find that the results mirror the sentence comprehension question results, further validating our findings on LLM understanding of these constructions.
pdf
bib
abs
Cross-Lingual Pitfalls: Automatic Probing Cross-Lingual Weakness of Multilingual Large Language Models
Zixiang Xu
|
Yanbo Wang
|
Yue Huang
|
Xiuying Chen
|
Jieyu Zhao
|
Meng Jiang
|
Xiangliang Zhang
Large Language Models (LLMs) have achieved remarkable success in Natural Language Processing (NLP), yet their cross-lingual consistency remains a significant challenge. This paper introduces a novel methodology for efficiently identifying inherent cross-lingual weaknesses in LLMs. Our approach leverages beam search and LLM-based simulation to generate bilingual question pairs that expose performance discrepancies between English and target languages. We construct a new dataset of over 6,000 bilingual pairs across 16 languages using this methodology, demonstrating its effectiveness in revealing weaknesses even in state-of-the-art models. The extensive experiments demonstrate that our method precisely and cost-effectively pinpoints cross-lingual weaknesses, consistently revealing over 50% accuracy drops in target languages across a wide range of models. Moreover, further experiments investigate the relationship between linguistic similarity and cross-lingual weaknesses, revealing that linguistically related languages share similar performance patterns and benefit from targeted post-training. Code is available at https://github.com/xzx34/Cross-Lingual-Pitfalls.
pdf
bib
abs
VLSBench: Unveiling Visual Leakage in Multimodal Safety
Xuhao Hu
|
Dongrui Liu
|
Hao Li
|
Xuanjing Huang
|
Jing Shao
Safety concerns of Multimodal Large Language Models (MLLMs) have gradually become an important problem in various applications. Surprisingly, previous works indicate a counterintuitive phenomenon that using textual unlearning to align MLLMs achieves comparable safety performances with MLLMs aligned with image-text pairs. To explain such a phenomenon, we discover a Visual Safety Information Leakage (VSIL) problem in existing multimodal safety benchmarks, i.e., the potentially risky content in the image has been revealed in the textual query. Thus, MLLMs can easily refuse these sensitive image-text pairs according to textual queries only, leading to unreliable cross-modality safety evaluation of MLLMs. We also conduct a further comparison experiment between textual alignment and multimodal alignment to highlight this drawback. To this end, we construct Visual Leakless Safety Bench (VLSBench) with 2.2k image-text pairs through an automated data pipeline. Experimental results indicate that VLSBench poses a significant challenge to both open-source and closed-source MLLMs, i.e., LLaVA, Qwen2-VL and GPT-4o. Besides, we empirically compare textual and multimodal alignment methods on VLSBench and find that textual alignment is effective enough for multimodal safety scenarios with VSIL, while multimodal alignment is preferable for safety scenarios without VSIL.
pdf
bib
abs
Browsing Lost Unformed Recollections: A Benchmark for Tip-of-the-Tongue Search and Reasoning
Sky CH-Wang
|
Darshan Girish Deshpande
|
Smaranda Muresan
|
Anand Kannappan
|
Rebecca Qian
We introduce Browsing Lost Unformed Recollections (BLUR), a tip-of-the-tongue known-item search and reasoning benchmark for general AI assistants. BLUR introduces a set of 573 real-world validated questions that demand searching and reasoning across multimodal and multilingual inputs, as well as proficient tool use. Humans easily ace these questions (scoring on average 98%), while the best-performing system scores around 56%. To facilitate progress toward addressing this challenging and aspirational use case for general AI assistants, we release 350 questions through a public leaderboard, retain the answers to 250 of them, and have the rest as a private test set.
pdf
bib
abs
Data Laundering: Artificially Boosting Benchmark Results through Knowledge Distillation
Jonibek Mansurov
|
Akhmed Sakip
|
Alham Fikri Aji
In this paper, we show that knowledge distillation can be subverted to manipulate language model benchmark scores, revealing a critical vulnerability in current evaluation practices. We introduce “Data Laundering,” a process that enables the covert transfer of benchmark-specific knowledge through seemingly legitimate intermediate training steps. Through extensive experiments with a 2-layer BERT student model, we show how this approach can achieve substantial improvements in benchmark accuracy (up to 75% on GPQA) without developing genuine reasoning capabilities. Notably, this method can be exploited intentionally or even unintentionally, as researchers may inadvertently adopt this method and inflate scores without realising the implications. While our findings demonstrate the effectiveness of this technique, we present them as a cautionary tale highlighting the urgent need for more robust evaluation methods in AI. This work aims to contribute to the ongoing discussion about evaluation integrity in AI development and the need for benchmarks that more accurately reflect true model capabilities. The code is available at https://github.com/mbzuai-nlp/data_laundering.
pdf
bib
abs
Conspiracy Theories and Where to Find Them on TikTok
Francesco Corso
|
Francesco Pierri
|
Gianmarco De Francisci Morales
TikTok has skyrocketed in popularity over recent years, especially among younger audiences. However, there are public concerns about the potential of this platform to promote and amplify harmful content. This study presents the first systematic analysis of conspiracy theories on TikTok. By leveraging the official TikTok Research API we collect a longitudinal dataset of 1.5M videos shared in the U.S. over three years. We estimate a lower bound on the prevalence of conspiratorial videos (up to 1000 new videos per month) and evaluate the effects of TikTok’s Creativity Program for monetization, observing an overall increase in video duration regardless of content. Lastly, we evaluate the capabilities of state-of-the-art open-weight Large Language Models to identify conspiracy theories from audio transcriptions of videos. While these models achieve high precision in detecting harmful content (up to 96%), their overall performance remains comparable to fine-tuned traditional models such as RoBERTa. Our findings suggest that Large Language Models can serve as an effective tool for supporting content moderation strategies aimed at reducing the spread of harmful content on TikTok.
pdf
bib
abs
Growing Through Experience: Scaling Episodic Grounding in Language Models
Chunhui Zhang
|
Sirui Wang
|
Zhongyu Ouyang
|
Xiangchi Yuan
|
Soroush Vosoughi
Language models (LMs) require effective episodic grounding, the ability to learn from and apply past experiences, to perform well at physical planning tasks. Current approaches struggle with the scalability and integration of episodic memory, which is particularly limited for medium-sized LMs (7B parameters), whereas larger LMs (70-405B) offer untapped potential through their hierarchical representations and extensive pre-trained knowledge. Therefore, to unlock larger LMs’ potential for grounding, we present a scalable weak-to-strong episodic learning framework that efficiently transfers episodic behaviors from smaller to larger LMs. It uses Monte Carlo tree search for structured experience collection with a novel distillation method that preserves LM capabilities while incorporating episodic memory. This enables larger LMs to leverage their inherent advantages for improved physical planning. Experiments show our solution outperforms top proprietary LMs by 3.45% across diverse planning and question-answering tasks. Layer-wise probing reveals systematic improvements in task alignment, particularly in later LM layers. Our solution also generalizes stably to unseen scenarios even as planning steps increase, whereas baselines deteriorate sharply beyond a complexity threshold of four planning steps.
pdf
bib
abs
Exploiting the Shadows: Unveiling Privacy Leaks through Lower-Ranked Tokens in Large Language Models
Yuan Zhou
|
Zhuo Zhang
|
Xiangyu Zhang
Large language models (LLMs) play a crucial role in modern applications but face vulnerabilities related to the extraction of sensitive information. This includes unauthorized access to internal prompts and retrieval of personally identifiable information (PII) (e.g., in Retrieval-Augmented Generation based agentic applications). We examine these vulnerabilities in a question-answering (QA) setting where LLMs use retrieved documents or training knowledge as few-shot prompts. Although these documents remain confidential under normal use, adversaries can manipulate input queries to extract private content. In this paper, we propose a novel attack method that exploits the model’s lower-ranked output tokens to leak sensitive information. We systematically evaluate our method, demonstrating its effectiveness in both the agentic application privacy extraction setting and the direct training data extraction setting. These findings reveal critical privacy risks in LLMs and emphasize the urgent need for enhanced safeguards against information leakage.
pdf
bib
abs
Attacking Vision-Language Computer Agents via Pop-ups
Yanzhe Zhang
|
Tao Yu
|
Diyi Yang
Autonomous agents powered by large vision and language models (VLMs) have demonstrated significant potential in completing daily computer tasks, such as browsing the web to book travel and operating desktop software, which requires agents to understand these interfaces. Although such visual inputs are becoming more integrated into agentic applications, the types of risks and attacks that exist around them remain unclear. In this work, we demonstrate that VLM agents can be easily attacked by a set of carefully designed adversarial pop-ups, which human users would typically recognize and ignore. This distraction leads agents to click these pop-ups instead of performing their tasks as usual. Integrating these pop-ups into existing agent testing environments like OSWorld and VisualWebArena leads to an attack success rate (the frequency of the agent clicking the pop-ups) of 86% on average and decreases the task success rate by 47%. Basic defense techniques, such as asking the agent to ignore pop-ups or including an advertisement notice, are ineffective against the attack. Code is available at https://github.com/SALT-NLP/PopupAttack.
pdf
bib
abs
Explicit and Implicit Data Augmentation for Social Event Detection
Congbo Ma
|
Yuxia Wang
|
Jia Wu
|
Jian Yang
|
Jing Du
|
Zitai Qiu
|
Qing Li
|
Hu Wang
|
Preslav Nakov
Social event detection involves identifying and categorizing important events from social media. It relies on labeled data, but annotation is costly and labor-intensive. To address this problem, we propose an Augmentation framework for Social Event Detection (SED-Aug), a plug-and-play dual augmentation framework that combines explicit text-based and implicit feature-space augmentation to enhance data diversity and model robustness. The explicit augmentation utilizes LLMs to enhance textual information through five diverse generation strategies. For implicit augmentation, we design five novel perturbation techniques that operate in the feature space on structural fused embeddings. These perturbations are crafted to keep the semantic and relational properties of the embeddings and make them more diverse. Specifically, SED-Aug outperforms the best baseline model by approximately 17.67% on the Twitter2012 dataset and by about 15.57% on the Twitter2018 dataset in terms of the average F1 score.
pdf
bib
abs
In Prospect and Retrospect: Reflective Memory Management for Long-term Personalized Dialogue Agents
Zhen Tan
|
Jun Yan
|
I-Hung Hsu
|
Rujun Han
|
Zifeng Wang
|
Long Le
|
Yiwen Song
|
Yanfei Chen
|
Hamid Palangi
|
George Lee
|
Anand Rajan Iyer
|
Tianlong Chen
|
Huan Liu
|
Chen-Yu Lee
|
Tomas Pfister
Large Language Models (LLMs) have made significant progress in open-ended dialogue, yet their inability to retain and retrieve relevant information from long-term interactions limits their effectiveness in applications requiring sustained personalization. External memory mechanisms have been proposed to address this limitation, enabling LLMs to maintain conversational continuity. However, existing approaches struggle with two key challenges. First, rigid memory granularity fails to capture the natural semantic structure of conversations, leading to fragmented and incomplete representations. Second, fixed retrieval mechanisms cannot adapt to diverse dialogue contexts and user interaction patterns. In this work, we propose Reflective Memory Management (RMM), a novel mechanism for long-term dialogue agents, integrating forward- and backward-looking reflections: (1) Prospective Reflection, which dynamically summarizes interactions across granularities—utterances, turns, and sessions—into a personalized memory bank for effective future retrieval, and (2) Retrospective Reflection, which iteratively refines the retrieval in an online reinforcement learning (RL) manner based on LLMs’ cited evidence. Experiments show that RMM demonstrates consistent improvement across various metrics and benchmarks. For example, RMM shows more than 10% accuracy improvement over the baseline without memory management on the LongMemEval dataset.
pdf
bib
abs
Revisiting Classical Chinese Event Extraction with Ancient Literature Information
Xiaoyi Bao
|
Zhongqing Wang
|
Jinghang Gu
|
Chu-Ren Huang
Research on classical Chinese event extraction tends to directly graft complex modeling from English or modern Chinese works, neglecting the unique characteristics of this language. We argue that, compared with grafting sophisticated methods from other languages, focusing on classical Chinese’s inimitable source of Ancient Literature could provide us with extra and comprehensive semantics in event extraction. Motivated by this, we propose a Literary Vision-Language Model (VLM) for classical Chinese event extraction, integrating literature annotations, historical background and character glyphs to capture the inner- and outer-context information from the sequence. Extensive experiments establish new state-of-the-art performance on the GuwenEE and CHED datasets, which underscores the effectiveness of our proposed VLM, and more importantly, these unique features can be obtained precisely at nearly zero cost.
pdf
bib
abs
Unanswerability Evaluation for Retrieval Augmented Generation
Xiangyu Peng
|
Prafulla Kumar Choubey
|
Caiming Xiong
|
Chien-Sheng Wu
Existing evaluation frameworks for retrieval-augmented generation (RAG) systems focus on answerable queries, but they overlook the importance of appropriately rejecting unanswerable requests. In this paper, we introduce UAEval4RAG, a comprehensive evaluation framework designed to evaluate whether RAG systems effectively handle unanswerable queries specific to a given knowledge base. We first define a taxonomy with six unanswerable categories; UAEval4RAG then automatically synthesizes diverse and challenging queries for any given knowledge base and evaluates the RAG systems with unanswered ratio and acceptable ratio metrics. We also conduct experiments with various RAG components and prompting strategies across four datasets, which reveal that, due to varying knowledge distribution across datasets, no single configuration consistently delivers optimal performance on both answerable and unanswerable requests across different knowledge bases. Our findings highlight the critical role of component selection and prompt design in optimizing RAG systems to balance the accuracy of answerable queries with high rejection rates of unanswerable ones. UAEval4RAG provides valuable insights and tools for developing more robust and reliable RAG systems.
pdf
bib
abs
SCALE: Towards Collaborative Content Analysis in Social Science with Large Language Model Agents and Human Intervention
Chengshuai Zhao
|
Zhen Tan
|
Chau-Wai Wong
|
Xinyan Zhao
|
Tianlong Chen
|
Huan Liu
Content analysis breaks down complex and unstructured texts into theory-informed numerical categories. Particularly, in social science, this process usually relies on multiple rounds of manual annotation, domain expert discussion, and rule-based refinement. In this paper, we introduce SCALE, a novel multi-agent framework that effectively Simulates Content Analysis via Large language model (LLM) agEnts. SCALE imitates key phases of content analysis, including text coding, collaborative discussion, and dynamic codebook evolution, capturing the reflective depth and adaptive discussions of human researchers. Furthermore, by integrating diverse modes of human intervention, SCALE is augmented with expert input to further enhance its performance. Extensive evaluations on real-world datasets demonstrate that SCALE achieves human-approximated performance across various complex content analysis tasks, offering an innovative potential for future social science research.
pdf
bib
abs
Self-Error-Instruct: Generalizing from Errors for LLMs Mathematical Reasoning
Erxin Yu
|
Jing Li
|
Ming Liao
|
Qi Zhu
|
Boyang Xue
|
Minghui Xu
|
Baojun Wang
|
Lanqing Hong
|
Fei Mi
|
Lifeng Shang
Although large language models demonstrate strong performance across various domains, they still struggle with numerous bad cases in mathematical reasoning. Previous approaches to learning from errors synthesize training data by solely extrapolating from isolated bad cases, thereby failing to generalize the extensive patterns inherent within these cases. This paper presents Self-Error-Instruct (SEI), a framework that addresses these model weaknesses and synthesizes more generalized targeted training data. Specifically, we explore a target model on two mathematical datasets, GSM8K and MATH, to pinpoint bad cases. Then, we generate error keyphrases for these cases based on the instructor model’s (GPT-4o) analysis and identify error types by clustering these keyphrases. Next, we sample a few bad cases during each generation for each identified error type and input them into the instructor model, which synthesizes additional training data using a self-instruct approach. This new data is refined through a one-shot learning process to ensure that only the most effective examples are kept. Finally, we use these curated data to fine-tune the target model, iteratively repeating the process to enhance performance. We apply our framework to various models and observe improvements in their reasoning abilities across both in-domain and out-of-domain mathematics datasets. These results demonstrate the effectiveness of self-error instruction in improving LLMs’ mathematical reasoning through error generalization.
pdf
bib
abs
RAGEval: Scenario Specific RAG Evaluation Dataset Generation Framework
Kunlun Zhu
|
Yifan Luo
|
Dingling Xu
|
Yukun Yan
|
Zhenghao Liu
|
Shi Yu
|
Ruobing Wang
|
Shuo Wang
|
Yishan Li
|
Nan Zhang
|
Xu Han
|
Zhiyuan Liu
|
Maosong Sun
Retrieval-Augmented Generation (RAG) is a powerful approach that enables large language models (LLMs) to incorporate external knowledge. However, evaluating the effectiveness of RAG systems in specialized scenarios remains challenging due to the high costs of data construction and the lack of suitable evaluation metrics. This paper introduces RAGEval, a framework designed to assess RAG systems across diverse scenarios by generating high-quality documents, questions, answers, and references through a schema-based pipeline. With a focus on factual accuracy, we propose three novel metrics—Completeness, Hallucination, and Irrelevance—to evaluate LLM-generated responses rigorously. Experimental results show that RAGEval outperforms zero-shot and one-shot methods in terms of clarity, safety, conformity, and richness of generated samples. Furthermore, the use of LLMs for scoring the proposed metrics demonstrates a high level of consistency with human evaluations. RAGEval establishes a new paradigm for evaluating RAG systems in real-world applications. The code and dataset are released at https://github.com/OpenBMB/RAGEval.
pdf
bib
abs
A Survey on Patent Analysis: From NLP to Multimodal AI
Homaira Huda Shomee
|
Zhu Wang
|
Sathya N. Ravi
|
Sourav Medya
Recent advances in Pretrained Language Models (PLMs) and Large Language Models (LLMs) have demonstrated transformative capabilities across diverse domains. The field of patent analysis and innovation is no exception: natural language processing (NLP) techniques present opportunities to streamline and enhance important tasks in the patent cycle, such as patent classification and patent retrieval. This not only improves the efficiency of patent researchers and applicants, but also opens new avenues for technological innovation and discovery. Our survey provides a comprehensive summary of recent NLP-based methods, including multimodal ones, in patent analysis. We also introduce a novel taxonomy for categorization based on tasks in the patent life cycle, as well as the specifics of the methods. This interdisciplinary survey aims to serve as a comprehensive resource for researchers and practitioners who work at the intersection of NLP, Multimodal AI, and patent analysis, as well as for patent offices seeking to build efficient patent systems.
pdf
bib
abs
SciVer: Evaluating Foundation Models for Multimodal Scientific Claim Verification
Chengye Wang
|
Yifei Shen
|
Zexi Kuang
|
Arman Cohan
|
Yilun Zhao
We introduce SciVer, the first benchmark specifically designed to evaluate the ability of foundation models to verify claims within a multimodal scientific context. SciVer consists of 3,000 expert-annotated examples over 1,113 scientific papers, covering four subsets, each representing a common reasoning type in multimodal scientific claim verification. To enable fine-grained evaluation, each example includes expert-annotated supporting evidence. We assess the performance of 21 state-of-the-art multimodal foundation models, including o4-mini, Gemini-2.5-Flash, Llama-3.2-Vision, and Qwen2.5-VL. Our experiment reveals a substantial performance gap between these models and human experts on SciVer. Through an in-depth analysis of retrieval-augmented generation (RAG) and human-conducted error evaluations, we identify critical limitations in current open-source models, offering key insights to advance models’ comprehension and reasoning in multimodal scientific literature tasks.
pdf
bib
abs
MultiAgentBench: Evaluating the Collaboration and Competition of LLM agents
Kunlun Zhu
|
Hongyi Du
|
Zhaochen Hong
|
Xiaocheng Yang
|
Shuyi Guo
|
Zhe Wang
|
Zhenhailong Wang
|
Cheng Qian
|
Robert Tang
|
Heng Ji
|
Jiaxuan You
Large Language Models (LLMs) have shown remarkable capabilities as autonomous agents; yet existing benchmarks either focus on single-agent tasks or are confined to narrow domains, failing to capture the dynamics of multi-agent coordination and competition. In this paper, we introduce MultiAgentBench, a comprehensive benchmark designed to evaluate LLM-based multi-agent systems across diverse, interactive scenarios. Our framework measures not only task completion but also the quality of collaboration and competition using novel, milestone-based key performance indicators. Moreover, we evaluate various coordination protocols (including star, chain, tree, and graph topologies) and innovative strategies such as group discussion and cognitive planning. Notably, cognitive planning improves milestone achievement rates by 3%. Code and datasets are publicly available at https://github.com/ulab-uiuc/MARBLE
pdf
bib
abs
Sinhala Encoder-only Language Models and Evaluation
Tharindu Ranasinghe
|
Hansi Hettiarachchi
|
Nadeesha Chathurangi Naradde Vidana Pathirana
|
Damith Premasiri
|
Lasitha Uyangodage
|
Isuri Nanomi Arachchige
|
Alistair Plum
|
Paul Rayson
|
Ruslan Mitkov
Recently, language models (LMs) have produced excellent results in many natural language processing (NLP) tasks. However, their effectiveness is highly dependent on available pre-training resources, which is particularly challenging for low-resource languages such as Sinhala. Furthermore, the scarcity of benchmarks to evaluate LMs is also a major concern for low-resource languages. In this paper, we address these two challenges for Sinhala by (i) collecting the largest monolingual corpus for Sinhala, (ii) training multiple LMs on this corpus and (iii) compiling the first Sinhala NLP benchmark (Sinhala-GLUE) and evaluating LMs on it. We show that the Sinhala LMs trained in this paper outperform both popular multilingual LMs, such as XLM-R, and existing Sinhala LMs in downstream NLP tasks. All the trained LMs are publicly available. We also make Sinhala-GLUE publicly available as a public leaderboard, and we hope that it will enable further advancements in developing and evaluating LMs for Sinhala.
pdf
bib
abs
LLMs can Perform Multi-Dimensional Analytic Writing Assessments: A Case Study of L2 Graduate-Level Academic English Writing
Zhengxiang Wang
|
Veronika Makarova
|
Zhi Li
|
Jordan Kodner
|
Owen Rambow
The paper explores the performance of LLMs in the context of multi-dimensional analytic writing assessments, i.e. their ability to provide both scores and comments based on multiple assessment criteria. Using a corpus of literature reviews written by L2 graduate students and assessed by human experts against 9 analytic criteria, we prompt several popular LLMs to perform the same task under various conditions. To evaluate the quality of feedback comments, we apply a novel feedback comment quality evaluation framework. This framework is interpretable, cost-efficient, scalable, and reproducible, compared to existing methods that rely on manual judgments. We find that LLMs can generate reasonably good and generally reliable multi-dimensional analytic assessments. We release our corpus and code for reproducibility.
pdf
bib
abs
SEUF: Is Unlearning One Expert Enough for Mixture-of-Experts LLMs?
Haomin Zhuang
|
Yihua Zhang
|
Kehan Guo
|
Jinghan Jia
|
Gaowen Liu
|
Sijia Liu
|
Xiangliang Zhang
Recent advancements in LLM unlearning have shown remarkable success in removing unwanted data-model influences while preserving the model’s utility for legitimate knowledge. Despite these strides, sparse Mixture-of-Experts (MoE) LLMs, a key subset of the LLM family, have remained unexplored in the context of unlearning. As MoE LLMs are celebrated for their exceptional performance, we ask: How can unlearning be performed effectively and efficiently on MoE LLMs? Our pilot study shows that the dynamic routing nature of MoE LLMs introduces unique challenges, leading to excessive forgetting, uncontrolled knowledge erasure and substantial utility drops when existing unlearning methods are applied. To address this, we propose a novel Selected-Expert Unlearning Framework (SEUF). Through expert attribution, unlearning is concentrated on the most actively engaged experts for the specified knowledge. Concurrently, an anchor loss is applied to the router to stabilize the active state of this targeted expert, ensuring focused and controlled unlearning. SEUF is compatible with various standard unlearning algorithms. Extensive experiments demonstrate that SEUF enhances both forget quality (by up to 5%) and model utility (by 35%) on MoE LLMs across various benchmarks and LLM architectures (compared to standard unlearning algorithms), while only unlearning 0.06% of the model parameters.
pdf
bib
abs
Pragmatics in the Era of Large Language Models: A Survey on Datasets, Evaluation, Opportunities and Challenges
Bolei Ma
|
Yuting Li
|
Wei Zhou
|
Ziwei Gong
|
Yang Janet Liu
|
Katja Jasinskaja
|
Annemarie Friedrich
|
Julia Hirschberg
|
Frauke Kreuter
|
Barbara Plank
Understanding pragmatics—the use of language in context—is crucial for developing NLP systems capable of interpreting nuanced language use. Despite recent advances in language technologies, including large language models, evaluating their ability to handle pragmatic phenomena such as implicatures and references remains challenging. To advance pragmatic abilities in models, it is essential to understand current evaluation trends and identify existing limitations. In this survey, we provide a comprehensive review of resources designed for evaluating pragmatic capabilities in NLP, categorizing datasets by the pragmatic phenomena they address. We analyze task designs, data collection methods, evaluation approaches, and their relevance to real-world applications. By examining these resources in the context of modern language models, we highlight emerging trends, challenges, and gaps in existing benchmarks. Our survey aims to clarify the landscape of pragmatic evaluation and guide the development of more comprehensive and targeted benchmarks, ultimately contributing to more nuanced and context-aware NLP models.
pdf
bib
abs
LocAgent: Graph-Guided LLM Agents for Code Localization
Zhaoling Chen
|
Robert Tang
|
Gangda Deng
|
Fang Wu
|
Jialong Wu
|
Zhiwei Jiang
|
Viktor Prasanna
|
Arman Cohan
|
Xingyao Wang
Code localization–identifying precisely where in a codebase changes need to be made–is a fundamental yet challenging task in software maintenance. Existing approaches struggle to efficiently navigate complex codebases when identifying relevant code snippets. The challenge lies in bridging natural language problem descriptions with the target code elements, often requiring reasoning across hierarchical structures and multiple dependencies. We introduce LocAgent, a framework that addresses code localization through a graph-guided agent. By parsing codebases into directed heterogeneous graphs, LocAgent creates a lightweight representation that captures code structures and their dependencies, enabling LLM agents to effectively search and locate relevant entities through powerful multi-hop reasoning. Experimental results on real-world benchmarks demonstrate that our approach significantly enhances accuracy in code localization. Notably, our method with the fine-tuned Qwen-2.5-Coder-Instruct-32B model achieves comparable results to SOTA proprietary models at greatly reduced cost (approximately 86% reduction), reaching up to 92.7% accuracy on file-level localization while improving downstream GitHub issue resolution success rates by 12% for multiple attempts (Pass@10). Our code is available at https://github.com/gersteinlab/LocAgent.
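The graph-guided idea in the abstract above can be illustrated with a minimal sketch. It assumes a toy codebase graph whose node types, relation names, and helper functions (`neighbors`, `multi_hop`) are all hypothetical; none of them are taken from the LocAgent repository, and the real system may represent and traverse the graph differently.

```python
from collections import deque

# Toy directed heterogeneous graph: every node has a type, every edge a relation.
# All names are hypothetical and for illustration only.
NODES = {
    "pkg/": "directory",
    "pkg/io.py": "file",
    "pkg/io.py:Reader": "class",
    "pkg/io.py:Reader.read": "function",
    "pkg/util.py": "file",
    "pkg/util.py:decode": "function",
}
EDGES = [
    ("pkg/", "contains", "pkg/io.py"),
    ("pkg/", "contains", "pkg/util.py"),
    ("pkg/io.py", "contains", "pkg/io.py:Reader"),
    ("pkg/io.py:Reader", "contains", "pkg/io.py:Reader.read"),
    ("pkg/util.py", "contains", "pkg/util.py:decode"),
    ("pkg/io.py:Reader.read", "invokes", "pkg/util.py:decode"),
    ("pkg/io.py", "imports", "pkg/util.py"),
]


def neighbors(node, relations):
    """Return targets of outgoing edges from `node` whose relation is allowed."""
    return [dst for src, rel, dst in EDGES if src == node and rel in relations]


def multi_hop(start, relations, max_hops):
    """Breadth-first search from `start`; returns {node: hop distance}."""
    seen = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if seen[node] == max_hops:
            continue  # do not expand beyond the hop budget
        for nxt in neighbors(node, relations):
            if nxt not in seen:
                seen[nxt] = seen[node] + 1
                queue.append(nxt)
    return seen


# Entities reachable from the Reader class within two hops,
# following only containment and call edges.
reached = multi_hop("pkg/io.py:Reader", {"contains", "invokes"}, max_hops=2)
```

Restricting the traversal to selected relation types is what lets a search start at one entity (here a class) and reach a dependency in another file without scanning every file.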
pdf
bib
abs
COSMMIC: Comment-Sensitive Multimodal Multilingual Indian Corpus for Summarization and Headline Generation
Raghvendra Kumar
|
Mohammed Salman S A
|
Aryan Sahu
|
Tridib Nandi
|
Pragathi Y P
|
Sriparna Saha
|
Jose G Moreno
Despite progress in comment-aware multimodal and multilingual summarization for English and Chinese, research in Indian languages remains limited. This study addresses this gap by introducing COSMMIC, a pioneering comment-sensitive multimodal, multilingual dataset featuring nine major Indian languages. COSMMIC comprises 4,959 article-image pairs and 24,484 reader comments, with ground-truth summaries available in all included languages. Our approach enhances summaries by integrating reader insights and feedback. We explore summarization and headline generation across four configurations: (1) using article text alone, (2) incorporating user comments, (3) utilizing images, and (4) combining text, comments, and images. To assess the dataset’s effectiveness, we employ state-of-the-art language models such as LLama3 and GPT-4. We conduct a comprehensive study to evaluate different component combinations, including identifying supportive comments, filtering out noise with a dedicated IndicBERT-based comment classifier, and extracting valuable insights from images with a multilingual CLIP-based classifier. This helps determine the most effective configurations for natural language generation (NLG) tasks. Unlike many existing datasets that are either text-only or lack user comments in multimodal settings, COSMMIC uniquely integrates text, images, and user feedback. This holistic approach bridges gaps in Indian language resources, advancing NLP research and fostering inclusivity.
pdf
bib
abs
Mind the Gap: Static and Interactive Evaluations of Large Audio Models
Minzhi Li
|
William Barr Held
|
Michael J Ryan
|
Kunat Pipatanakul
|
Potsawee Manakul
|
Hao Zhu
|
Diyi Yang
As AI chatbots become ubiquitous, voice interaction presents a compelling way to enable rapid, high-bandwidth communication for both semantic and social signals. This has driven research into Large Audio Models (LAMs) to power voice-native experiences. However, aligning LAM development with user goals requires a clear understanding of user needs and preferences to establish reliable progress metrics. This study addresses these challenges by introducing an interactive approach to evaluate LAMs and collecting 7,500 LAM interactions from 484 participants. Through topic modeling of user queries, we identify primary use cases for audio interfaces. We then analyze user preference rankings and qualitative feedback to determine which models best align with user needs. Finally, we evaluate how static benchmarks predict interactive performance: our analysis reveals that no individual benchmark strongly correlates with interactive results (τ ≤ 0.33 for all benchmarks). While combining multiple coarse-grained features yields modest predictive power (R² = 0.30), only two out of twenty datasets on spoken question answering and age prediction show significantly positive correlations. This suggests a clear need to develop LAM evaluations that better correlate with user preferences.
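The rank correlation reported in the abstract above can be sketched as follows. This is a minimal illustration, assuming the τ in question is Kendall's tau and using the tie-free tau-a variant; the abstract does not say which variant the authors used, and the scores below are invented placeholders, not data from the paper.

```python
from itertools import combinations


def kendall_tau_a(xs, ys):
    """Kendall's tau-a: (concordant - discordant) / total number of pairs."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equally long sequences with at least 2 items")
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        sign = (x1 - x2) * (y1 - y2)
        if sign > 0:
            concordant += 1   # the pair is ordered the same way in both lists
        elif sign < 0:
            discordant += 1   # the pair is ordered oppositely
    n_pairs = len(xs) * (len(xs) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical per-model scores: one static benchmark vs. interactive preference.
static_benchmark = [0.61, 0.55, 0.70, 0.48]
interactive_pref = [0.52, 0.58, 0.66, 0.40]

tau = kendall_tau_a(static_benchmark, interactive_pref)
```

A value near 1 would mean the benchmark ranks models the same way users do; the abstract reports that no benchmark exceeds 0.33 on the real data.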
pdf
bib
abs
Understanding In-Context Machine Translation for Low-Resource Languages: A Case Study on Manchu
Renhao Pei
|
Yihong Liu
|
Peiqin Lin
|
François Yvon
|
Hinrich Schuetze
In-context machine translation (MT) with large language models (LLMs) is a promising approach for low-resource MT, as it can readily take advantage of linguistic resources such as grammar books and dictionaries. Such resources are usually selectively integrated into the prompt so that LLMs can directly perform translation without any specific training, via their in-context learning capability (ICL). However, the relative importance of each type of resource, e.g., dictionary, grammar book, and retrieved parallel examples, is not entirely clear. To address this gap, this study systematically investigates how each resource and its quality affect the translation performance, with the Manchu language as our case study. To remove any prior knowledge of Manchu encoded in the LLM parameters and single out the effect of ICL, we also experiment with an enciphered version of Manchu texts. Our results indicate that high-quality dictionaries and good parallel examples are very helpful, while grammars hardly help. In a follow-up study, we showcase a promising application of in-context MT: parallel data augmentation as a way to bootstrap a conventional MT model. When monolingual data abound, generating synthetic parallel data through in-context MT offers a pathway to mitigate data scarcity and build effective and efficient low-resource neural MT systems.
pdf
bib
abs
CKnowEdit: A New Chinese Knowledge Editing Dataset for Linguistics, Facts, and Logic Error Correction in LLMs
Jizhan Fang
|
Tianhe Lu
|
Yunzhi Yao
|
Ziyan Jiang
|
Xin Xu
|
Huajun Chen
|
Ningyu Zhang
Chinese, as a linguistic system rich in depth and complexity, is characterized by distinctive elements such as ancient poetry, proverbs, idioms, and other cultural constructs. However, current Large Language Models (LLMs) face limitations in these specialized domains, highlighting the need for the development of comprehensive datasets that can assess, continuously update, and progressively improve these culturally-grounded linguistic competencies through targeted training optimizations. To address this gap, we introduce CKnowEdit, the first-ever Chinese knowledge editing dataset designed to correct linguistic, factual, and logical errors in LLMs. We collect seven types of knowledge from a wide range of sources, including classical texts, idioms, and content from Baidu Tieba Ruozhiba, taking into account the unique polyphony, antithesis, and logical structures inherent in the Chinese language. By analyzing this dataset, we highlight the challenges current LLMs face in mastering Chinese. Furthermore, our evaluation of state-of-the-art knowledge editing techniques reveals opportunities to advance the correction of Chinese knowledge.
pdf
bib
abs
TripleFact: Defending Data Contamination in the Evaluation of LLM-driven Fake News Detection
Cheng Xu
|
Nan Yan
The proliferation of large language models (LLMs) has introduced unprecedented challenges in fake news detection due to benchmark data contamination (BDC), where evaluation benchmarks are inadvertently memorized during pre-training, leading to inflated performance metrics. Traditional evaluation paradigms, reliant on static datasets and closed-world assumptions, fail to account for the BDC risk in large-scale pre-training of current LLMs. This paper introduces TripleFact, a novel evaluation framework for the fake news detection task, which is designed to mitigate BDC risk while prioritizing real-world applicability. TripleFact integrates three components: (1) Human-Adversarial Preference Testing (HAPT) to assess robustness against human-crafted misinformation, (2) Real-Time Web Agent with Asynchronous Validation (RTW-AV) to evaluate temporal generalization using dynamically sourced claims, and (3) Entity-Controlled Virtual Environment (ECVE) to eliminate entity-specific biases. Through experiments on 17 state-of-the-art LLMs, including GPT, LLaMA, and DeepSeek variants, TripleFact demonstrates superior contamination resistance compared to traditional benchmarks. Results reveal that BDC artificially inflates performance by up to 23% in conventional evaluations, while the TripleFact Score (TFS) remains stable within 4% absolute error under controlled contamination. The framework’s ability to disentangle genuine detection capabilities from memorization artifacts underscores its potential as a fake news detection benchmark for the LLM era.
pdf
bib
abs
Meaning Beyond Truth Conditions: Evaluating Discourse Level Understanding via Anaphora Accessibility
Xiaomeng Zhu
|
Zhenghao Zhou
|
Simon Charlow
|
Robert Frank
We present a hierarchy of natural language understanding abilities and argue for the importance of moving beyond assessments of understanding at the lexical and sentence levels to the discourse level. We propose the task of anaphora accessibility as a diagnostic for assessing discourse understanding, and to this end, present an evaluation dataset inspired by theoretical research in dynamic semantics. We evaluate human and LLM performance on our dataset and find that LLMs and humans align on some tasks and diverge on others. Such divergence can be explained by LLMs’ reliance on specific lexical items during language comprehension, in contrast to human sensitivity to structural abstractions.
pdf
bib
abs
Large Language and Reasoning Models are Shallow Disjunctive Reasoners
Irtaza Khalid
|
Amir Masoud Nourollah
|
Steven Schockaert
Large Language Models (LLMs) have been found to struggle with systematic reasoning. Even on tasks where they appear to perform well, their performance often depends on shortcuts, rather than on genuine reasoning abilities, leading them to collapse on out-of-distribution (OOD) examples. Post-training strategies based on reinforcement learning and chain-of-thought prompting have recently been hailed as a step change. However, little is known about the potential of the resulting “Large Reasoning Models” (LRMs) beyond maths and programming-based problem solving, where genuine OOD problems can be sparse. In this paper, we focus on tasks that require systematic relational composition for qualitative spatial and temporal reasoning. The setting allows fine control over problem difficulty to precisely measure OOD generalization. We find that zero-shot LRMs generally outperform their LLM counterparts in single-path reasoning tasks but struggle in the multi-path setting. Whilst showing comparatively better results, fine-tuned LLMs are also not capable of multi-path generalization. We also provide evidence for a behavioral interpretation of this, i.e., that LRMs are shallow disjunctive reasoners.
pdf
bib
abs
Warmup Generations: A Task-Agnostic Approach for Guiding Sequence-to-Sequence Learning with Unsupervised Initial State Generation
Senyu Li
|
Zipeng Sun
|
Jiayi Wang
|
Xue Liu
|
Pontus Stenetorp
|
Siva Reddy
|
David Ifeoluwa Adelani
Traditional supervised fine-tuning (SFT) strategies for sequence-to-sequence tasks often train models to directly generate the target output. Recent work has shown that guiding models with intermediate steps—such as keywords, outlines, or reasoning chains—can significantly improve performance, coherence, and interpretability. However, these methods often depend on predefined intermediate formats and annotated data, limiting their scalability and generalizability. In this work, we introduce a task-agnostic framework that enables models to generate intermediate “warmup” sequences. These warmup sequences, serving as an initial state for subsequent generation, are optimized to enhance the probability of generating the target sequence without relying on external supervision or human-designed structures. Drawing inspiration from reinforcement learning principles, our method iteratively refines these intermediate steps to maximize their contribution to the final output, similar to reward-driven optimization in reinforcement learning with human feedback. Experimental results across tasks such as translation, summarization, and multi-choice question answering for logical reasoning show that our approach outperforms traditional SFT methods, and offers a scalable and flexible solution for sequence-to-sequence tasks.
pdf
bib
abs
Building Better: Avoiding Pitfalls in Developing Language Resources when Data is Scarce
Nedjma Ousidhoum
|
Meriem Beloucif
|
Saif M. Mohammad
Language is a form of symbolic capital that affects people’s lives in many ways (Bourdieu, 1977, 1991). As a powerful means of communication, it reflects identities, cultures, traditions, and societies more broadly. Therefore, data in a given language should be regarded as more than just a collection of tokens. Rigorous data collection and labeling practices are essential for developing more human-centered and socially aware technologies. Although there has been growing interest in under-resourced languages within the NLP community, work in this area faces unique challenges, such as data scarcity and limited access to qualified annotators. In this paper, we collect feedback from individuals directly involved in and impacted by NLP artefacts for medium- and low-resource languages. We conduct both quantitative and qualitative analyses of their responses and highlight key issues related to: (1) data quality, including linguistic and cultural appropriateness; and (2) the ethics of common annotation practices, such as the misuse of participatory research. Based on these findings, we make several recommendations for creating high-quality language artefacts that reflect the cultural milieu of their speakers, while also respecting the dignity and labor of data workers.
pdf
bib
abs
BRIGHTER: BRIdging the Gap in Human-Annotated Textual Emotion Recognition Datasets for 28 Languages
Shamsuddeen Hassan Muhammad
|
Nedjma Ousidhoum
|
Idris Abdulmumin
|
Jan Philip Wahle
|
Terry Ruas
|
Meriem Beloucif
|
Christine de Kock
|
Nirmal Surange
|
Daniela Teodorescu
|
Ibrahim Said Ahmad
|
David Ifeoluwa Adelani
|
Alham Fikri Aji
|
Felermino D. M. A. Ali
|
Ilseyar Alimova
|
Vladimir Araujo
|
Nikolay Babakov
|
Naomi Baes
|
Ana-Maria Bucur
|
Andiswa Bukula
|
Guanqun Cao
|
Rodrigo Tufiño
|
Rendi Chevi
|
Chiamaka Ijeoma Chukwuneke
|
Alexandra Ciobotaru
|
Daryna Dementieva
|
Murja Sani Gadanya
|
Robert Geislinger
|
Bela Gipp
|
Oumaima Hourrane
|
Oana Ignat
|
Falalu Ibrahim Lawan
|
Rooweither Mabuya
|
Rahmad Mahendra
|
Vukosi Marivate
|
Alexander Panchenko
|
Andrew Piper
|
Charles Henrique Porto Ferreira
|
Vitaly Protasov
|
Samuel Rutunda
|
Manish Shrivastava
|
Aura Cristina Udrea
|
Lilian Diana Awuor Wanzare
|
Sophie Wu
|
Florian Valentin Wunderlich
|
Hanif Muhammad Zhafran
|
Tianhui Zhang
|
Yi Zhou
|
Saif M. Mohammad
People worldwide use language in subtle and complex ways to express emotions. Although emotion recognition–an umbrella term for several NLP tasks–impacts various applications within NLP and beyond, most work in this area has focused on high-resource languages. This has led to significant disparities in research efforts and proposed solutions, particularly for under-resourced languages, which often lack high-quality annotated datasets. In this paper, we present BRIGHTER–a collection of multi-labeled, emotion-annotated datasets in 28 different languages and across several domains. BRIGHTER primarily covers low-resource languages from Africa, Asia, Eastern Europe, and Latin America, with instances labeled by fluent speakers. We highlight the challenges related to the data collection and annotation processes, and then report experimental results for monolingual and crosslingual multi-label emotion identification, as well as emotion intensity recognition. We analyse the variability in performance across languages and text domains, both with and without the use of LLMs, and show that the BRIGHTER datasets represent a meaningful step towards addressing the gap in text-based emotion recognition.
pdf
bib
abs
SkillVerse: Assessing and Enhancing LLMs with Tree Evaluation
Yufei Tian
|
Jiao Sun
|
Nanyun Peng
|
Zizhao Zhang
As language models evolve to tackle complex, multifaceted tasks, their evaluation must adapt to capture this intricacy. A granular, skill-specific understanding of model capabilities can empower researchers to make informed model development plans. In this paper, we introduce SkillVerse, an unsupervised tree-structured diagnosis framework for understanding model proficiency in specific abilities. With an LLM as a judge, SkillVerse first critiques the model responses, and then organizes them into a hierarchical structure termed a dendrogram. Given proficiency at arbitrary levels of granularity, SkillVerse can flexibly produce insights into the behaviors of modern large models. We also demonstrate its efficacy in two downstream tasks: 1) improving model in-context learning by 25% using a tree-search algorithm to select more informative few-shot demonstrations, and 2) accurately predicting new model weaknesses with a 55% success rate, 22% higher than without SkillVerse.
pdf
bib
abs
CypherBench: Towards Precise Retrieval over Full-scale Modern Knowledge Graphs in the LLM Era
Yanlin Feng
|
Simone Papicchio
|
Sajjadur Rahman
Retrieval from graph data is crucial for augmenting large language models (LLMs) with both open-domain knowledge and private enterprise data, and it is also a key component in the recent GraphRAG system (CITATION). Despite decades of research on knowledge graphs and knowledge base question answering, leading LLM frameworks (Langchain and LlamaIndex) have only minimal support for retrieval from modern encyclopedic knowledge graphs like Wikidata. In this paper, we analyze the root cause and suggest that modern RDF knowledge graphs (Wikidata, Freebase) are less efficient for LLMs due to overly large schemas that far exceed the typical LLM context window, use of resource identifiers, overlapping and ambiguous relation types, and lack of normalization. As a solution, we propose property graph views on top of the underlying RDF graph that can be efficiently queried by LLMs using Cypher. We instantiated this idea on Wikidata and introduced CypherBench, the first benchmark with 11 large-scale, multi-domain property graphs with 7.8 million entities and over 10,000 questions. To achieve this, we tackled several key challenges, including developing an RDF-to-property graph conversion engine, creating a systematic pipeline for text-to-Cypher task generation, and designing new evaluation metrics.
pdf
bib
abs
Empathy Prediction from Diverse Perspectives
Francine Chen
|
Scott Carter
|
Tatiana Lau
|
Nayeli Suseth Bravo
|
Sumanta Bhattacharyya
|
Kate Sieck
|
Charlene C. Wu
A person’s perspective on a topic can influence their empathy towards a story. To investigate the use of personal perspective in empathy prediction, we collected a dataset, EmpathyFromPerspectives, where a user rates their empathy towards a story by a person with a different perspective on a prompted topic. We observed in the dataset that user perspective can be important for empathy prediction and developed a model, PPEP, that uses a rater’s perspective as context for predicting the rater’s empathy towards a story. Experiments comparing PPEP with baseline models show that use of personal perspective significantly improves performance. A user study indicated that human empathy ratings of stories generally agreed with PPEP’s relative empathy rankings.
pdf
bib
abs
Are LLMs effective psychological assessors? Leveraging adaptive RAG for interpretable mental health screening through psychometric practice
Federico Ravenda
|
Seyed Ali Bahrainian
|
Andrea Raballo
|
Antonietta Mira
|
Noriko Kando
In psychological practice, standardized questionnaires serve as essential tools for assessing mental health through structured, clinically-validated questions (i.e., items). While social media platforms offer rich data for mental health screening, computational approaches often bypass these established clinical assessment tools in favor of black-box classification. We propose a novel questionnaire-guided screening framework that bridges psychological practice and computational methods through adaptive Retrieval-Augmented Generation (aRAG). Our approach links unstructured social media content and standardized clinical assessments by retrieving relevant posts for each questionnaire item and using Large Language Models (LLMs) to complete validated psychological instruments. Our findings demonstrate two key advantages of questionnaire-guided screening: First, when completing the Beck Depression Inventory-II (BDI-II), our approach matches or outperforms state-of-the-art performance on Reddit-based benchmarks without requiring training data. Second, we show that guiding LLMs through standardized questionnaires yields superior results compared to directly prompting them for depression screening. Additionally, we show as a proof-of-concept how our questionnaire-based methodology successfully extends to self-harm screening.
pdf
bib
abs
INTERACT: Enabling Interactive, Question-Driven Learning in Large Language Models
Aum Kendapadi
|
Kerem Zaman
|
Rakesh R Menon
|
Shashank Srivastava
Large language models (LLMs) excel at answering questions but remain passive learners—absorbing static data without the ability to question and refine knowledge. This paper explores how LLMs can transition to interactive, question-driven learning through student-teacher dialogues. We introduce INTERACT (INTERactive learning for Adaptive Concept Transfer), a framework in which a “student” LLM engages a “teacher” LLM through iterative inquiries to acquire knowledge across 1,347 contexts, including song lyrics, news articles, movie plots, academic papers, and images. Our experiments show that across a wide range of scenarios and LLM architectures, interactive learning consistently enhances performance, achieving up to a 25% improvement, with ‘cold-start’ student models matching static learning baselines in as few as five dialogue turns. Interactive setups can also mitigate the disadvantages of weaker teachers, showcasing the robustness of question-driven learning.
pdf
bib
abs
Circuit Stability Characterizes Language Model Generalization
Alan Sun
Extensively evaluating the capabilities of (large) language models is difficult. Rapid development of state-of-the-art models induces benchmark saturation, while creating more challenging datasets is labor-intensive. Inspired by the recent developments in mechanistic interpretability, we introduce circuit stability as a new way to assess model performance. Circuit stability refers to a model’s ability to apply a consistent reasoning process–its circuit–across various inputs. We mathematically formalize circuit stability and circuit equivalence. Then, through three case studies, we empirically show that circuit stability and the lack thereof can characterize and predict different aspects of generalization. Our proposed methods offer a step towards rigorously relating the generality of models to their interpretability.
pdf
bib
abs
Comparing LLM-generated and human-authored news text using formal syntactic theory
Olga Zamaraeva
|
Dan Flickinger
|
Francis Bond
|
Carlos Gómez-Rodríguez
This study provides the first comprehensive comparison of New York Times-style text generated by six large language models against real, human-authored NYT writing. The comparison is based on a formal syntactic theory. We use Head-driven Phrase Structure Grammar (HPSG) to analyze the grammatical structure of the texts. We then investigate and illustrate the differences in the distributions of HPSG grammar types, revealing systematic distinctions between human and LLM-generated writing. These findings contribute to a deeper understanding of the syntactic behavior of LLMs as well as humans, within the NYT genre.
pdf
bib
abs
Improving Preference Extraction In LLMs By Identifying Latent Knowledge Through Classifying Probes
Sharan Maiya
|
Yinhong Liu
|
Ramit Debnath
|
Anna Korhonen
Large Language Models (LLMs) are often used as automated judges to evaluate text, but their effectiveness can be hindered by various unintentional biases. We propose using linear classifying probes, trained by leveraging differences between contrasting pairs of prompts, to directly access LLMs’ latent knowledge and extract more accurate preferences. Through extensive experiments using models of varying size from four different families and six diverse datasets assessing text quality evaluation and common sense reasoning, we demonstrate that both supervised and unsupervised probing approaches consistently outperform traditional generation-based judgement while maintaining similar computational costs. These probes generalise under domain shifts and can even outperform finetuned evaluators with the same training data size. Our results suggest linear probing offers an accurate, robust and computationally efficient approach for LLM-as-judge tasks while providing interpretable insights into how models encode judgement-relevant knowledge. Our data and code will be openly released in the future.
pdf
bib
abs
White Men Lead, Black Women Help? Benchmarking and Mitigating Language Agency Social Biases in LLMs
Yixin Wan
|
Kai-Wei Chang
Social biases can manifest in language agency. However, very limited research has investigated such biases in Large Language Model (LLM)-generated content. In addition, previous works often rely on string-matching techniques to identify agentic and communal words within texts, falling short of accurately classifying language agency. We introduce the Language Agency Bias Evaluation (LABE) benchmark, which comprehensively evaluates biases in LLMs by analyzing agency levels attributed to different demographic groups in model generations. LABE tests for gender, racial, and intersectional language agency biases in LLMs on 3 text generation tasks: biographies, professor reviews, and reference letters. Using LABE, we unveil language agency social biases in 3 recent LLMs: ChatGPT, Llama3, and Mistral. We observe that: (1) LLM generations tend to demonstrate greater gender bias than human-written texts; (2) models demonstrate remarkably higher levels of intersectional bias than the other bias aspects; and (3) prompt-based mitigation is unstable and frequently leads to bias exacerbation. Based on our observations, we propose Mitigation via Selective Rewrite (MSR), a novel bias mitigation strategy that leverages an agency classifier to identify and selectively revise parts of generated texts that demonstrate communal traits. Empirical results show MSR to be more effective and reliable than prompt-based mitigation methods, indicating a promising research direction.
pdf
bib
abs
AIMSCheck: Leveraging LLMs for AI-Assisted Review of Modern Slavery Statements Across Jurisdictions
Adriana Eufrosina Bora
|
Akshatha Arodi
|
Duoyi Zhang
|
Jordan Bannister
|
Mirko Bronzi
|
Arsene Fansi Tchango
|
Md Abul Bashar
|
Richi Nayak
|
Kerrie Mengersen
Modern Slavery Acts mandate that corporations disclose their efforts to combat modern slavery, aiming to enhance transparency and strengthen practices for its eradication. However, verifying these statements remains challenging due to their complex, diversified language and the sheer number of statements that must be reviewed. The development of NLP tools to assist in this task is also difficult due to a scarcity of annotated data. Furthermore, as modern slavery transparency legislation has been introduced in several countries, the generalizability of such tools across legal jurisdictions must be studied. To address these challenges, we work with domain experts to make two key contributions. First, we present AIMS.uk and AIMS.ca, newly annotated datasets from the UK and Canada to enable cross-jurisdictional evaluation. Second, we introduce AIMSCheck, an end-to-end framework for compliance validation. AIMSCheck decomposes the compliance assessment task into three levels, enhancing interpretability and practical applicability. Our experiments show that models trained on an Australian dataset generalize well across UK and Canadian jurisdictions, demonstrating the potential for broader application in compliance monitoring. We release the benchmark datasets and AIMSCheck to the public to advance AI adoption in compliance assessment and drive further research in this field.
pdf
bib
abs
Collapse of Dense Retrievers: Short, Early, and Literal Biases Outranking Factual Evidence
Mohsen Fayyaz
|
Ali Modarressi
|
Hinrich Schuetze
|
Nanyun Peng
Dense retrieval models are commonly used in Information Retrieval (IR) applications, such as Retrieval-Augmented Generation (RAG). Since they often serve as the first step in these systems, their robustness is critical to avoid downstream failures. In this work, we repurpose a relation extraction dataset (e.g., Re-DocRED) to design controlled experiments that quantify the impact of heuristic biases, such as a preference for shorter documents, on retrievers like Dragon+ and Contriever. We uncover major vulnerabilities, showing retrievers favor shorter documents, early positions, repeated entities, and literal matches, all while ignoring the answer’s presence! Notably, when multiple biases combine, models exhibit catastrophic performance degradation, selecting the answer-containing document in less than 10% of cases over a synthetic biased document without the answer. Furthermore, we show that these biases have direct consequences for downstream applications like RAG, where retrieval-preferred documents can mislead LLMs, resulting in a 34% performance drop compared to providing no documents at all. https://huggingface.co/datasets/mohsenfayyaz/ColDeR
pdf
bib
abs
SelfElicit: Your Language Model Secretly Knows Where is the Relevant Evidence
Zhining Liu
|
Rana Ali Amjad
|
Ravinarayana Adkathimar
|
Tianxin Wei
|
Hanghang Tong
Providing Language Models (LMs) with relevant evidence in the context (either via retrieval or user-provided) can significantly improve their ability to provide better-grounded responses. However, recent studies have found that LMs often struggle to fully comprehend and utilize key evidence from the context, especially when it contains noise and irrelevant information, an issue common in real-world scenarios. To address this, we propose SelfElicit, an inference-time approach that helps LMs focus on key contextual evidence through self-guided explicit highlighting. By leveraging the inherent evidence-finding capabilities of LMs using the attention scores of deeper layers, our method automatically identifies and emphasizes key evidence within the input context, facilitating more accurate and grounded responses without additional training or iterative prompting. We demonstrate that SelfElicit brings consistent and significant improvement on multiple evidence-based QA tasks for various LM families while maintaining computational efficiency. Our code and documentation are available at https://github.com/ZhiningLiu1998/SelfElicit.
pdf
bib
abs
The Male CEO and the Female Assistant: Evaluation and Mitigation of Gender Biases in Text-To-Image Generation of Dual Subjects
Yixin Wan
|
Kai-Wei Chang
Recent large-scale T2I models like DALLE-3 have made progress in reducing gender stereotypes when generating single-person images. However, significant biases remain when generating images with more than one person. To systematically evaluate this, we propose the Paired Stereotype Test (PST) framework, which queries T2I models to depict two individuals assigned male-stereotyped and female-stereotyped social identities, respectively (e.g. “a CEO” and “an Assistant”). This contrastive setting often triggers T2I models to generate gender-stereotyped images. Using PST, we evaluate two aspects of gender biases: the well-known bias in gendered occupation and a novel aspect, bias in organizational power. Experiments show that over 74% of images generated by DALLE-3 display gender-occupational biases. Additionally, compared to single-person settings, DALLE-3 is more likely to perpetuate male-associated stereotypes under PST. We further propose FairCritic, a novel and interpretable framework that leverages an LLM-based critic model to i) detect bias in generated images, and ii) adaptively provide feedback to T2I models for improving fairness. FairCritic achieves near-perfect fairness on PST, overcoming the limitations of previous prompt-based intervention approaches.
pdf
bib
abs
Mitigating Shortcut Learning with InterpoLated Learning
Michalis Korakakis
|
Andreas Vlachos
|
Adrian Weller
Empirical risk minimization (ERM) incentivizes models to exploit shortcuts, i.e., spurious correlations between input attributes and labels that are prevalent in the majority of the training data but unrelated to the task at hand. This reliance hinders generalization on minority examples, where such correlations do not hold. Existing shortcut mitigation approaches are model-specific, difficult to tune, computationally expensive, and fail to improve learned representations. To address these issues, we propose InterpoLated Learning (InterpoLL), which interpolates the representations of majority examples to include features from intra-class minority examples with shortcut-mitigating patterns. This weakens shortcut influence, enabling models to acquire features predictive across both minority and majority examples. Experimental results on multiple natural language understanding tasks demonstrate that InterpoLL improves minority generalization over both ERM and state-of-the-art mitigation methods, without compromising accuracy on majority examples. Notably, these gains persist across encoder, encoder-decoder, and decoder-only architectures, demonstrating the method’s broad applicability.
pdf
bib
abs
Toward Automatic Discovery of a Canine Phonetic Alphabet
Theron S. Wang
|
Xingyuan Li
|
Hridayesh Lekhak
|
Tuan Minh Dang
|
Mengyue Wu
|
Kenny Q. Zhu
Dogs communicate intelligently, but little is known about the phonetic properties of their vocal communication. For the first time, this paper presents an iterative algorithm inspired by human phonetic discovery, which is based on minimal pairs that determine phonemes by distinguishing different words in human language, and is able to produce a complete alphabet of distinct canine phoneme-like units. In addition, the algorithm produces a number of canine repeated acoustic units, which may correspond to specific environments and activities of a dog, composed exclusively of the canine phoneme-like units in the alphabet. The framework outlined in this paper is expected to function not only on canines but also on other animal species.
pdf
bib
abs
DavIR: Data Selection via Implicit Reward for Large Language Models
Haotian Zhou
|
Tingkai Liu
|
Qianli Ma
|
Yufeng Zhang
|
Jianbo Yuan
|
Pengfei Liu
|
Yang You
|
Hongxia Yang
We introduce DavIR, a model-based data selection method for post-training Large Language Models. DavIR generalizes Reducible Holdout Loss to the core-set selection problem of causal language modeling, and quantifies the learnability of a given datum with respect to a pre-trained LLM based on the relative reduction in loss during fine-tuning, a metric we show to be closely related to the implicit reward model described in Direct Preference Optimization (DPO). We show that 6% of the Alpaca dataset selected with DavIR can steer both the LLaMA and Gemma model families to produce superior performance compared to the same models trained on the full 52K dataset. We also show that the Alpaca dataset compressed with DavIR can be combined with the GSM8K dataset to effectively balance open-domain freeform QA and mathematical reasoning capabilities. Finally, we apply the DavIR objective to DPO and develop a normalized DavIR-DPO objective which improves the alignment performance of the Zephyr-7B-SFT model by 8% (relative) on AlpacaEval, compared against training with the vanilla DPO objective.
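As a rough illustration of selecting data by relative loss reduction, the sketch below scores each example by how much its loss drops after fine-tuning, relative to its loss under the base model, and keeps the top fraction. The exact normalization and the helper names `relative_loss_reduction` and `select_top_fraction` are assumptions for illustration, not the paper's definitions.

```python
def relative_loss_reduction(base_loss, tuned_loss):
    """Fraction of the base-model loss removed by fine-tuning (assumed form)."""
    if base_loss <= 0:
        raise ValueError("base_loss must be positive")
    return (base_loss - tuned_loss) / base_loss


def select_top_fraction(base_losses, tuned_losses, fraction):
    """Return the indices of the highest-scoring examples, in ascending order."""
    if len(base_losses) != len(tuned_losses):
        raise ValueError("loss lists must have the same length")
    scores = [relative_loss_reduction(b, t)
              for b, t in zip(base_losses, tuned_losses)]
    keep = max(1, round(fraction * len(scores)))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:keep])


# Four toy examples: per-example loss before and after fine-tuning.
selected = select_top_fraction([2.0, 1.0, 4.0, 3.0], [1.0, 0.9, 1.0, 3.0], 0.5)
```

Here the examples whose loss falls the most in relative terms (indices 0 and 2) are kept, while the example whose loss does not change is dropped.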
pdf
bib
abs
Byte Latent Transformer: Patches Scale Better Than Tokens
Artidoro Pagnoni
|
Ramakanth Pasunuru
|
Pedro Rodriguez
|
John Nguyen
|
Benjamin Muller
|
Margaret Li
|
Chunting Zhou
|
Lili Yu
|
Jason E Weston
|
Luke Zettlemoyer
|
Gargi Ghosh
|
Mike Lewis
|
Ari Holtzman
|
Srini Iyer
We introduce the Byte Latent Transformer (BLT), a new byte-level LLM architecture that, for the first time, matches tokenization-based LLM performance at scale with significant improvements in inference efficiency and robustness. BLT encodes bytes into dynamically sized patches, which serve as the primary units of computation. Patches are segmented based on the entropy of the next byte, allocating more compute and model capacity where increased data complexity demands it. We present the first FLOP-controlled scaling study of byte-level models – up to 8B parameters and 4T training bytes – demonstrating the feasibility of scaling models trained on raw bytes without a fixed vocabulary. Both training and inference efficiency improve due to dynamically selecting long patches when data is predictable, along with qualitative improvements on reasoning and long-tail generalization. For fixed inference costs, BLT shows significantly better scaling than tokenization-based models, by simultaneously growing both patch and model size.
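A minimal sketch of entropy-based patching follows. It assumes that per-position next-byte entropies are already available (for example from some small byte-level model, not shown) and that a new patch starts wherever the entropy exceeds a fixed threshold. The thresholding rule and the name `segment_patches` are illustrative assumptions; the abstract does not specify the exact boundary criterion.

```python
def segment_patches(data, entropies, threshold):
    """Split `data` into patches, starting a new patch at high-entropy bytes.

    entropies[i] is assumed to be the entropy of predicting byte i given
    the bytes before it. Predictable runs therefore form long patches.
    """
    if len(data) != len(entropies):
        raise ValueError("need one entropy value per byte")
    patches = []
    start = 0
    for i in range(1, len(data)):
        if entropies[i] > threshold:
            patches.append(data[start:i])
            start = i
    if data:
        patches.append(data[start:])
    return patches


patches = segment_patches(b"abcdef", [3.0, 0.1, 0.2, 2.5, 0.1, 0.3], 1.0)
```

In this toy input only position 3 is hard to predict, so the six bytes fall into two patches of three bytes each.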
pdf
bib
abs
DiffuseDef: Improved Robustness to Adversarial Attacks via Iterative Denoising
Zhenhao Li
|
Huichi Zhou
|
Marek Rei
|
Lucia Specia
Pretrained language models have significantly advanced performance across various natural language processing tasks. However, adversarial attacks continue to pose a critical challenge to systems built using these models, as they can be exploited with carefully crafted adversarial texts. Inspired by the ability of diffusion models to predict and reduce noise in computer vision, we propose a novel and flexible adversarial defense method for language classification tasks, DiffuseDef, which incorporates a diffusion layer as a denoiser between the encoder and the classifier. The diffusion layer is trained on top of the existing classifier, ensuring seamless integration with any model in a plug-and-play manner. During inference, the adversarial hidden state is first combined with sampled noise, then denoised iteratively and finally ensembled to produce a robust text representation. By integrating adversarial training, denoising, and ensembling techniques, we show that DiffuseDef improves over existing adversarial defense methods and achieves state-of-the-art performance against common black-box and white-box adversarial attacks.
pdf
bib
abs
Identifying Cellular Niches in Spatial Transcriptomics: An Investigation into the Capabilities of Large Language Models
Huanhuan Wei
|
Xiao Luo
|
Hongyi Yu
|
Jinping Liang
|
Luning Yang
|
Lixing Lin
|
Alexandra Popa
|
Xiting Yan
Spatial transcriptomic technologies enable measuring gene expression profiles and spatial information of cells in tissues simultaneously. Clustering of captured cells/spots in the spatial transcriptomic data is crucial for understanding tissue niches and uncovering disease-related changes. Current methods to cluster spatial transcriptomic data encounter obstacles, including inefficiency in handling multi-replicate data, lack of prior knowledge incorporation, and producing uninterpretable cluster labels. We introduce a novel approach, LLMiniST, to identify spatial niches using zero-shot large language models (LLMs) by transforming spatial transcriptomic data into spatial context prompts, leveraging gene expression of neighboring cells/spots, cell type composition, tissue information, and external knowledge. The model was further enhanced using a two-stage fine-tuning strategy for improved generalizability. We also develop a user-friendly annotation tool to accelerate the creation of well-annotated spatial datasets for fine-tuning. Comprehensive method performance evaluations showed that both zero-shot and fine-tuned LLMiniST outperformed current non-LLM methods in many circumstances. Notably, the two-stage fine-tuning strategy facilitated substantial cross-subject generalizability. The results demonstrate the feasibility of LLMs for tissue niche identification using spatial transcriptomic data and the potential of LLMs as a scalable solution to efficiently integrate minimal human guidance for improved performance in large-scale datasets.
pdf
bib
abs
Culture Matters in Toxic Language Detection in Persian
Zahra Bokaei
|
Walid Magdy
|
Bonnie Webber
Toxic language detection is crucial for creating safer online environments and limiting the spread of harmful content. While toxic language detection has been under-explored in Persian, the current work compares different methods for this task, including fine-tuning, data enrichment, zero-shot and few-shot learning, and cross-lingual transfer learning. What is especially compelling is the impact of cultural context on transfer learning for this task: We show that the language of a country with cultural similarities to Persian yields better results in transfer learning. Conversely, the improvement is lower when the language comes from a culturally distinct country.
pdf
bib
abs
Bitnet.cpp: Efficient Edge Inference for Ternary LLMs
Jinheng Wang
|
Hansong Zhou
|
Ting Song
|
Shijie Cao
|
Yan Xia
|
Ting Cao
|
Jianyu Wei
|
Shuming Ma
|
Hongyu Wang
|
Furu Wei
The advent of 1-bit large language models (LLMs), led by BitNet b1.58, has spurred interest in ternary LLMs. Despite this, research and practical applications focusing on efficient edge inference for ternary LLMs remain scarce. To bridge this gap, we introduce Bitnet.cpp, an inference system optimized for BitNet b1.58 and ternary LLMs. Given that mixed-precision matrix multiplication (mpGEMM) constitutes the bulk of inference time in ternary LLMs, Bitnet.cpp incorporates a novel mpGEMM library to facilitate sub-2-bits-per-weight, efficient and lossless inference. The library features two core solutions: Ternary Lookup Table (TL), which addresses spatial inefficiencies of previous bit-wise methods, and Int2 with a Scale (I2_S), which ensures lossless edge inference, both enabling high-speed inference. Our experiments show that Bitnet.cpp achieves up to a 6.25x increase in speed over full-precision baselines and up to 2.32x over low-bit baselines, setting new benchmarks in the field. Additionally, we expand TL to element-wise lookup table (ELUT) for low-bit LLMs in the appendix, presenting both theoretical and empirical evidence of its considerable potential. Bitnet.cpp is publicly available at https://github.com/microsoft/BitNet/tree/paper, offering a sophisticated solution for the efficient and practical deployment of edge LLMs.
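To make the idea of storing ternary weights as 2-bit integers with a scale concrete, here is a small sketch that packs four ternary weights into one byte and recovers them exactly. It uses exactly 2 bits per weight. The bit layout, the function names, and the single shared `scale` are assumptions for illustration and are not Bitnet.cpp's actual storage format or API.

```python
def pack_ternary(weights):
    """Pack ternary weights (-1, 0, 1) into bytes, four weights per byte."""
    packed = bytearray()
    for start in range(0, len(weights), 4):
        byte = 0
        for offset, w in enumerate(weights[start:start + 4]):
            if w not in (-1, 0, 1):
                raise ValueError("weights must be ternary")
            # Map -1, 0, 1 to the 2-bit codes 0, 1, 2.
            byte |= (w + 1) << (2 * offset)
        packed.append(byte)
    return bytes(packed)


def unpack_ternary(packed, count):
    """Recover the first `count` ternary weights from packed bytes."""
    weights = []
    for i in range(count):
        code = (packed[i // 4] >> (2 * (i % 4))) & 0b11
        weights.append(code - 1)
    return weights


def dequantize(packed, count, scale):
    """Multiply the unpacked ternary weights by one shared scale."""
    return [scale * w for w in unpack_ternary(packed, count)]


original = [-1, 0, 1, 1, 0]
packed = pack_ternary(original)
restored = unpack_ternary(packed, len(original))
```

Because every ternary value has its own 2-bit code, the round trip is lossless; only the scale carries floating-point information.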
pdf
bib
abs
Instance-Selection-Inspired Undersampling Strategies for Bias Reduction in Small and Large Language Models for Binary Text Classification
Guilherme Fonseca
|
Washington Cunha
|
Gabriel Prenassi
|
Marcos André Gonçalves
|
Leonardo Chaves Dutra Da Rocha
Skewness in imbalanced datasets affects Automatic Text Classification (ATC), leading to classifier bias toward the majority classes. This work examines undersampling methods to mitigate such bias in Small and Large Language Model (SLMs and LLMs) classifiers. Based on the limitations found in existing solutions, we propose two novel undersampling methods inspired by state-of-the-art Instance Selection techniques, relying on calibrated confidences and semantic difficulty estimates. We compare them against 19 baselines across 13 datasets, evaluating: (i) effectiveness, (ii) class imbalance bias, (iii) efficiency, (iv) scalability, and (v) consistency. Results show our methods uniquely reduce classifier bias (up to 56%) across all datasets without effectiveness loss while improving efficiency (1.6x speedup), scalability and reducing carbon emissions (up to 50%).
pdf
bib
abs
Forward Knows Efficient Backward Path: Saliency-Guided Memory-Efficient Fine-tuning of Large Language Models
Yeachan Kim
|
SangKeun Lee
Fine-tuning is widely recognized as a crucial process for aligning large language models (LLMs) with human intentions. However, the substantial memory requirements associated with fine-tuning pose a significant barrier to extending the applicability of LLMs. While parameter-efficient fine-tuning can be a promising approach by reducing trainable parameters, intermediate activations still need to be cached to compute gradients during the backward pass, thereby limiting overall memory efficiency. In this work, we propose Saliency-Guided Gradient Flow (SAGE), a memory-efficient fine-tuning method designed to minimize the memory specifically associated with cached intermediate activations. The key strategy is to selectively cache activations based on their saliency during the forward pass and then use these activations for the backward pass. This process transforms the dense backward pass into a sparse one, thereby enhancing memory efficiency. To verify whether SAGE can serve as an efficient alternative for fine-tuning, we conduct comprehensive experiments across diverse fine-tuning scenarios and setups. The experimental results show that SAGE substantially improves memory efficiency without a significant loss in accuracy, highlighting its broad value in real-world applications.
pdf
bib
abs
Focus on What Matters: Enhancing Medical Vision-Language Models with Automatic Attention Alignment Tuning
Aofei Chang
|
Le Huang
|
Alex James Boyd
|
Parminder Bhatia
|
Taha Kass-Hout
|
Cao Xiao
|
Fenglong Ma
Medical Large Vision-Language Models (Med-LVLMs) often exhibit suboptimal attention distribution on visual inputs, leading to hallucinated or inaccurate outputs. Existing methods primarily rely on inference-time interventions, which are limited in attention adaptation or require additional supervision. To address this, we propose A3Tune, a novel fine-tuning framework for Automatic Attention Alignment Tuning. A3Tune leverages zero-shot weak labels from SAM, refines them into prompt-aware labels using BioMedCLIP, and then selectively modifies visually-critical attention heads to improve alignment while minimizing interference. Additionally, we introduce an A3MoE module, enabling adaptive parameter selection for attention tuning across diverse prompts and images. Extensive experiments on medical VQA and report generation benchmarks show that A3Tune outperforms state-of-the-art baselines, achieving enhanced attention distributions and performance in Med-LVLMs.
pdf
bib
abs
LLMs + Persona-Plug = Personalized LLMs
Jiongnan Liu
|
Yutao Zhu
|
Shuting Wang
|
Xiaochi Wei
|
Erxue Min
|
Yu Lu
|
Shuaiqiang Wang
|
Dawei Yin
|
Zhicheng Dou
Personalization plays a critical role in numerous language tasks and applications, since users with the same requirements may prefer diverse outputs based on their interests. This has led to the development of various personalized approaches aimed at adapting large language models (LLMs) to generate customized outputs aligned with user preferences. Some of them involve fine-tuning a unique personalized LLM for each user, which is too expensive for widespread application. Alternative approaches introduce personalization information in a plug-and-play manner by retrieving the user’s relevant historical texts as demonstrations. However, this retrieval-based strategy may break the continuity of the user history and fail to capture the user’s overall styles and patterns, hence leading to sub-optimal performance. To address these challenges, we propose a novel personalized LLM model, PPlug. It constructs a user-specific embedding for each individual by modeling all her historical contexts through a lightweight plug-in user embedder module. By attaching this embedding to the task input, LLMs can better understand and capture user habits and preferences, thereby producing more personalized outputs without tuning their parameters. Extensive experiments on various tasks in the language model personalization (LaMP) benchmark demonstrate that the proposed model significantly outperforms existing personalized LLM approaches.
pdf
bib
abs
Developmentally-plausible Working Memory Shapes a Critical Period for Language Acquisition
Masato Mita
|
Ryo Yoshida
|
Yohei Oseki
Large language models possess general linguistic abilities but acquire language less efficiently than humans. This study proposes a method for integrating the developmental characteristics of working memory during the critical period, a stage when human language acquisition is particularly efficient, into the training process of language models. The proposed method introduces a mechanism that initially constrains working memory during the early stages of training and gradually relaxes this constraint in an exponential manner as learning progresses. Targeted syntactic evaluation shows that the proposed method outperforms conventional methods without memory constraints or with static memory constraints. These findings not only provide new directions for designing data-efficient language models but also offer indirect evidence supporting the role of the developmental characteristics of working memory as the underlying mechanism of the critical period in language acquisition.
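The exponentially relaxed constraint described above can be pictured as a training schedule. The sketch below assumes the constraint strength decays exponentially with training progress; the function name `memory_constraint`, the default `decay_rate`, and how the strength would be applied inside the model are all hypothetical and not specified in the abstract.

```python
import math


def memory_constraint(step, total_steps, initial=1.0, decay_rate=5.0):
    """Constraint strength at a training step under exponential relaxation.

    Starts at `initial` and shrinks toward zero as training progresses.
    Both the functional form and the default rate are assumptions.
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    progress = min(max(step / total_steps, 0.0), 1.0)
    return initial * math.exp(-decay_rate * progress)


schedule = [memory_constraint(s, 1000) for s in (0, 250, 500, 1000)]
```

A static-constraint baseline would correspond to `decay_rate=0.0`, and an unconstrained one to `initial=0.0`.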
pdf
bib
abs
IRIS: An Iterative and Integrated Framework for Verifiable Causal Discovery in the Absence of Tabular Data
Tao Feng
|
Lizhen Qu
|
Niket Tandon
|
Gholamreza Haffari
Causal discovery is fundamental to scientific research, yet traditional statistical algorithms face significant challenges, including expensive data collection, redundant computation for known relations, and unrealistic assumptions. While recent LLM-based methods excel at identifying commonly known causal relations, they fail to uncover novel relations. We introduce IRIS (Iterative Retrieval and Integrated System for Real-Time Causal Discovery), a novel framework that addresses these limitations. Starting with a set of initial variables, IRIS automatically collects relevant documents, extracts variables, and uncovers causal relations. Our hybrid causal discovery method combines statistical algorithms and LLM-based methods to discover known and novel causal relations. In addition to causal discovery on initial variables, the missing variable proposal component of IRIS identifies and incorporates missing variables to expand the causal graphs. Our approach enables real-time causal discovery from only a set of initial variables without requiring pre-existing datasets.
pdf
bib
abs
INJONGO: A Multicultural Intent Detection and Slot-filling Dataset for 16 African Languages
Hao Yu
|
Jesujoba Oluwadara Alabi
|
Andiswa Bukula
|
Jian Yun Zhuang
|
En-Shiun Annie Lee
|
Tadesse Kebede Guge
|
Israel Abebe Azime
|
Happy Buzaaba
|
Blessing Kudzaishe Sibanda
|
Godson Koffi Kalipe
|
Jonathan Mukiibi
|
Salomon Kabongo Kabenamualu
|
Mmasibidi Setaka
|
Lolwethu Ndolela
|
Nkiruka Odu
|
Rooweither Mabuya
|
Shamsuddeen Hassan Muhammad
|
Salomey Osei
|
Sokhar Samb
|
Dietrich Klakow
|
David Ifeoluwa Adelani
Slot-filling and intent detection are well-established tasks in Conversational AI. However, current large-scale benchmarks for these tasks often exclude evaluations of low-resource languages and rely on translations from English benchmarks, thereby predominantly reflecting Western-centric concepts. In this paper, we introduce “INJONGO” - a multicultural, open-source benchmark dataset for 16 African languages with utterances generated by native speakers across diverse domains, including banking, travel, home, and dining. Through extensive experiments, we benchmark fine-tuning multilingual transformer models and prompting large language models (LLMs), and show the advantage of leveraging African-cultural utterances over Western-centric utterances for improving cross-lingual transfer from the English language. Experimental results reveal that current LLMs struggle with the slot-filling task, with GPT-4o achieving an average performance of 26 F1. In contrast, intent detection performance is notably better, with an average accuracy of 70.6%, though it still falls short of fine-tuning baselines. When compared to the English language, GPT-4o and fine-tuning baselines perform similarly on intent detection, achieving an accuracy of approximately 81%. Our findings suggest that LLM performance is still behind for many low-resource African languages, and more work is needed to further improve their downstream performance.
pdf
bib
abs
Boosting Long-Context Information Seeking via Query-Guided Activation Refilling
Hongjin Qian
|
Zheng Liu
|
Peitian Zhang
|
Zhicheng Dou
|
Defu Lian
Processing long contexts poses a significant challenge for large language models (LLMs) due to their inherent context window limitations and the computational burden of extensive key-value (KV) activations, which severely impact efficiency. For information-seeking tasks, full context perception is often unnecessary, as a query’s information needs can dynamically range from localized details to a global perspective, depending on its complexity. However, existing methods struggle to adapt effectively to these dynamic information needs. In this paper, we propose a method for processing long-context information-seeking tasks via query-guided ACtivation REfilling (ACRE). ACRE constructs a Bi-layer KV Cache for long contexts, where the layer-1 (L1) cache compactly captures global information, and the layer-2 (L2) cache provides detailed, localized information. ACRE establishes a proxying relationship between the two caches, allowing the input query to attend to the L1 cache and dynamically refill it with relevant entries from the L2 cache. This mechanism integrates global understanding with query-specific local details, thereby enhancing answer decoding. Experiments on a variety of long-context information-seeking datasets demonstrate ACRE’s effectiveness, achieving significant improvements in both performance and efficiency.
pdf
bib
abs
Efficient Pretraining Data Selection for Language Models via Multi-Actor Collaboration
Tianyi Bai
|
Ling Yang
|
Zhen Hao Wong
|
Fupeng Sun
|
Xinlin Zhuang
|
Jiahui Peng
|
Chi Zhang
|
Lijun Wu
|
Qiu Jiantao
|
Wentao Zhang
|
Binhang Yuan
|
Conghui He
Efficient data selection is crucial to accelerate the pretraining of language models (LMs). While various methods have been proposed to enhance data efficiency, limited research has addressed the inherent conflicts between these approaches to achieve optimal data selection for LM pretraining. To tackle this problem, we propose a multi-actor collaborative data selection mechanism. Each data selection method independently prioritizes data based on its specific criterion and updates its prioritization rules using the current state of the model, functioning as an independent actor for data selection. Additionally, a console is designed to adjust the impacts of different actors at various stages and dynamically integrate information from all actors throughout the LM pretraining process. We conduct extensive empirical studies to evaluate our multi-actor framework. The experimental results demonstrate that our approach significantly improves data efficiency, accelerates convergence in LM pretraining, and achieves an average relative performance gain of up to 10.5% across multiple language model benchmarks compared to the state-of-the-art methods.
pdf
bib
abs
AdaDHP: Fine-Grained Fine-Tuning via Dual Hadamard Product and Adaptive Parameter Selection
Han Liu
|
Changya Li
|
Xiaotong Zhang
|
Feng Zhang
|
Fenglong Ma
|
Wei Wang
|
Hong Yu
As model parameter counts continue to grow, efficiently adapting large language models to downstream tasks is crucial in resource-limited conditions. Many parameter-efficient fine-tuning methods have emerged to address this challenge. However, they lack flexibility: for example, LoRA requires manually selecting trainable parameters and rank size, while (IA)³ can only scale the activations along columns, yielding inferior results due to less precise fine-tuning. To address these issues, we propose a novel method named AdaDHP with fewer parameters and finer granularity, which can adaptively select important parameters for each task. Specifically, we introduce two trainable vectors for each parameter and fine-tune the parameters through Hadamard product along both rows and columns. This significantly reduces the number of trainable parameters, with our parameter count capped at the lower limit of LoRA. Moreover, we design an adaptive parameter selection strategy to select important parameters for downstream tasks dynamically. This allows our method to flexibly remove unimportant parameters for downstream tasks. Finally, we demonstrate the superiority of our method on the T5-base model across 17 NLU tasks and on complex mathematical tasks with the Llama series models.
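One way to read "Hadamard product along both rows and columns" is that each weight matrix is rescaled element-wise by one trainable vector per row and one per column. The sketch below implements that reading on plain nested lists; the exact parameterization used by AdaDHP may differ, and the name `dual_hadamard` is hypothetical.

```python
def dual_hadamard(weight, row_scale, col_scale):
    """Scale weight[i][j] by row_scale[i] * col_scale[j].

    Equivalent to the Hadamard product of `weight` with the outer product
    of the two vectors. This form is an assumption made for illustration.
    """
    if len(row_scale) != len(weight):
        raise ValueError("need one row scale per row")
    if any(len(row) != len(col_scale) for row in weight):
        raise ValueError("need one column scale per column")
    return [[w * r * c for w, c in zip(row, col_scale)]
            for row, r in zip(weight, row_scale)]


frozen = [[1, 2], [3, 4]]
adapted = dual_hadamard(frozen, row_scale=[1, 2], col_scale=[10, 100])

# Trainable parameters for an m x n matrix under this reading: m + n.
trainable = len(frozen) + len(frozen[0])
```

Under this reading the trainable count for an m x n matrix is m + n, which equals the parameter count of a rank-1 LoRA update for the same matrix. Initializing both vectors to ones leaves the frozen weights unchanged.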
pdf
bib
abs
KG-Agent: An Efficient Autonomous Agent Framework for Complex Reasoning over Knowledge Graph
Jinhao Jiang
|
Kun Zhou
|
Xin Zhao
|
Yang Song
|
Chen Zhu
|
Hengshu Zhu
|
Ji-Rong Wen
In this paper, we aim to improve the reasoning ability of large language models (LLMs) over knowledge graphs (KGs) to answer complex questions. Inspired by existing methods that design the interaction strategy between LLMs and KGs, we propose an autonomous LLM-based agent framework, called KG-Agent, which enables a small LLM to actively make decisions until finishing the reasoning process over KGs. In KG-Agent, we integrate the LLM, a multifunctional toolbox, a KG-based executor, and a knowledge memory, and develop an iteration mechanism that autonomously selects the tool and then updates the memory for reasoning over the KG. To guarantee effectiveness, we leverage a programming language to formulate the multi-hop reasoning process over the KG and synthesize a code-based instruction dataset to fine-tune the base LLM. Extensive experiments demonstrate that using only 10K samples for tuning LLaMA2-7B can outperform competitive methods using larger LLMs or more data, on both in-domain and out-of-domain datasets. Our code and data will be publicly released.
pdf
bib
abs
Curriculum Debiasing: Toward Robust Parameter-Efficient Fine-Tuning Against Dataset Biases
Mingyu Lee
|
Yeachan Kim
|
Wing-Lam Mok
|
SangKeun Lee
Parameter-efficient fine-tuning (PEFT) addresses the memory footprint issue of full fine-tuning by modifying only a subset of model parameters. However, on datasets exhibiting spurious correlations, we observed that PEFT slows down the model’s convergence on unbiased examples, while the convergence on biased examples remains fast. This leads to the model’s overfitting on biased examples, causing significant performance degradation in out-of-distribution (OOD) scenarios. Traditional debiasing methods mitigate this issue by emphasizing unbiased examples during training but often come at the cost of in-distribution (ID) performance drops. To address this trade-off issue, we propose a curriculum debiasing framework that presents examples in a biased-to-unbiased order. Our framework initially limits the model’s exposure to unbiased examples, which are harder to learn, allowing it to first establish a foundation on easier-to-converge biased examples. As training progresses, we gradually increase the proportion of unbiased examples in the training set, guiding the model away from reliance on spurious correlations. Compared to the original PEFT methods, our method accelerates convergence on unbiased examples by approximately twofold and improves ID and OOD performance by 1.2% and 8.0%, respectively.
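The biased-to-unbiased ordering can be sketched as a schedule that raises the share of unbiased examples per batch as training proceeds. The linear ramp, the `start` and `end` proportions, and the function names below are assumptions chosen for illustration; the abstract only states that the proportion increases gradually.

```python
def unbiased_fraction(step, total_steps, start=0.1, end=0.5):
    """Share of unbiased examples at a given step (assumed linear ramp)."""
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    progress = min(max(step / total_steps, 0.0), 1.0)
    return start + (end - start) * progress


def batch_composition(step, total_steps, batch_size, start=0.1, end=0.5):
    """Return (number of biased, number of unbiased) examples for one batch."""
    fraction = unbiased_fraction(step, total_steps, start, end)
    n_unbiased = round(fraction * batch_size)
    return batch_size - n_unbiased, n_unbiased


early = batch_composition(0, 100, batch_size=20)
late = batch_composition(100, 100, batch_size=20)
```

Early batches are dominated by the easier biased examples, and later batches draw more heavily on unbiased ones.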
pdf
bib
abs
Does Context Matter? ContextualJudgeBench for Evaluating LLM-based Judges in Contextual Settings
Austin Xu
|
Srijan Bansal
|
Yifei Ming
|
Semih Yavuz
|
Shafiq Joty
The large language model (LLM)-as-judge paradigm has been used to meet the demand for a cheap, reliable, and fast evaluation of model outputs during AI system development and post-deployment monitoring. While judge models—LLMs finetuned to specialize in assessing and critiquing model outputs—have been touted as general purpose evaluators, they are typically evaluated only on non-contextual scenarios, such as instruction following. The omission of contextual settings—those where external information is used as context to generate an output—is surprising given the increasing prevalence of retrieval-augmented generation (RAG) and summarization use cases. Contextual assessment is uniquely challenging, as evaluation often depends on practitioner priorities, leading to conditional evaluation criteria (e.g., comparing responses based on factuality and then considering completeness if they are equally factual). To address the gap, we propose ContextualJudgeBench, a judge benchmark with 2,000 challenging response pairs across eight splits inspired by real-world contextual evaluation scenarios. We build our benchmark with a multi-pronged data construction pipeline that leverages both existing human annotations and model-based perturbations. Our comprehensive study across 11 judge models and 7 general purpose models reveals that the contextual information and assessment criteria present a significant challenge to even state-of-the-art models. For example, o1, the best-performing model, barely reaches 55% consistent accuracy.
pdf
bib
abs
On the Reliability of Large Language Models for Causal Discovery
Tao Feng
|
Lizhen Qu
|
Niket Tandon
|
Zhuang Li
|
Xiaoxi Kang
|
Gholamreza Haffari
This study investigates the efficacy of Large Language Models (LLMs) in causal discovery. Using newly available open-source LLMs, OLMo and BLOOM, which provide access to their pre-training corpora, we investigate how LLMs address causal discovery through three research questions. We examine: (i) the impact of memorization for accurate causal relation prediction, (ii) the influence of incorrect causal relations in pre-training data, and (iii) the contextual nuances that influence LLMs’ understanding of causal relations. Our findings indicate that while LLMs are effective in recognizing causal relations that occur frequently in pre-training data, their ability to generalize to new or rare causal relations is limited. Moreover, the presence of incorrect causal relations significantly undermines the confidence of LLMs in corresponding correct causal relations, and the contextual information critically affects the outcomes of LLMs to discern causal connections between random variables.
pdf
bib
abs
Value-Spectrum: Quantifying Preferences of Vision-Language Models via Value Decomposition in Social Media Contexts
Jingxuan Li
|
Yuning Yang
|
Shengqi Yang
|
Linfan Zhang
|
Ying Nian Wu
The recent progress in Vision-Language Models (VLMs) has broadened the scope of multimodal applications. However, evaluations often remain limited to functional tasks, neglecting abstract dimensions such as personality traits and human values. To address this gap, we introduce Value-Spectrum, a novel Visual Question Answering (VQA) benchmark aimed at assessing VLMs based on Schwartz’s value dimensions that capture core human values guiding people’s preferences and actions. We design a VLM agent pipeline to simulate video browsing and construct a vector database comprising over 50,000 short videos from TikTok, YouTube Shorts, and Instagram Reels. These videos span multiple months and cover diverse topics, including family, health, hobbies, society, technology, etc. Benchmarking on Value-Spectrum highlights notable variations in how VLMs handle value-oriented content. Beyond identifying VLMs’ intrinsic preferences, we also explore the ability of VLM agents to adopt specific personas when explicitly prompted, revealing insights into the adaptability of the model in role-playing scenarios. These findings highlight the potential of Value-Spectrum as a comprehensive evaluation set for tracking VLM preferences in value-based tasks and abilities to simulate diverse personas. The complete code and data are available at https://github.com/Jeremyyny/Value-Spectrum.
pdf
bib
abs
TeRDy: Temporal Relation Dynamics through Frequency Decomposition for Temporal Knowledge Graph Completion
Ziyang Liu
|
Chaokun Wang
Temporal knowledge graph completion aims to predict missing facts in a knowledge graph by leveraging temporal information. Existing methods often struggle to capture both the long-term changes and short-term variability of relations, which are crucial for accurate prediction. In this paper, we propose a novel method called TeRDy for temporal knowledge graph completion. TeRDy captures temporal relational dynamics by utilizing time-invariant embeddings, along with long-term temporally dynamic embeddings (e.g., enduring political alliances) and short-term temporally dynamic embeddings (e.g., transient political events). These two types of embeddings are derived from low- and high-frequency components via frequency decomposition. Also, we design temporal smoothing and temporal gradient to seamlessly incorporate timestamp embeddings into relation embeddings. Extensive experiments on benchmark datasets demonstrate that TeRDy outperforms state-of-the-art temporal knowledge graph embedding methods.
pdf
bib
abs
Incorporating Domain Knowledge into Materials Tokenization
Yerim Oh
|
Jun-Hyung Park
|
Junho Kim
|
SungHo Kim
|
SangKeun Lee
While language models are increasingly utilized in materials science, typical models rely on frequency-centric tokenization methods originally developed for natural language processing. However, these methods frequently produce excessive fragmentation and semantic loss, failing to maintain the structural and semantic integrity of material concepts. To address this issue, we propose MATTER, a novel tokenization approach that integrates material knowledge into tokenization. Based on MatDetector, which is trained on our materials knowledge base, and a re-ranking method that prioritizes material terms in token merging, MATTER maintains the structural integrity of identified material concepts and prevents fragmentation during tokenization, ensuring their semantic meaning remains intact. The experimental results demonstrate that MATTER outperforms existing tokenization methods, achieving an average performance gain of 4% and 2% in the generation and classification tasks, respectively. These results underscore the importance of domain knowledge for tokenization strategies in scientific text processing.
pdf
bib
abs
PIG: Privacy Jailbreak Attack on LLMs via Gradient-based Iterative In-Context Optimization
Yidan Wang
|
Yanan Cao
|
Yubing Ren
|
Fang Fang
|
Zheng Lin
|
Binxing Fang
Large Language Models (LLMs) excel in various domains but pose inherent privacy risks. Existing methods to evaluate privacy leakage in LLMs often use memorized prefixes or simple instructions to extract data, both of which well-aligned models can easily block. Meanwhile, jailbreak attacks bypass LLM safety mechanisms to generate harmful content, but their role in privacy scenarios remains underexplored. In this paper, we examine the effectiveness of jailbreak attacks in extracting sensitive information, bridging privacy leakage and jailbreak attacks in LLMs. Moreover, we propose PIG, a novel framework targeting Personally Identifiable Information (PII) and addressing the limitations of current jailbreak methods. Specifically, PIG identifies PII entities and their types in privacy queries, uses in-context learning to build a privacy context, and iteratively updates it with three gradient-based strategies to elicit target PII. We evaluate PIG and existing jailbreak methods using two privacy-related datasets. Experiments on four white-box and two black-box LLMs show that PIG outperforms baseline methods and achieves state-of-the-art (SoTA) results. The results underscore significant privacy risks in LLMs, emphasizing the need for stronger safeguards.
pdf
bib
abs
Agents Under Siege: Breaking Pragmatic Multi-Agent LLM Systems with Optimized Prompt Attacks
Rana Shahroz
|
Zhen Tan
|
Sukwon Yun
|
Charles Fleming
|
Tianlong Chen
Most discussions about Large Language Model (LLM) safety have focused on single-agent settings, but multi-agent LLM systems now create novel adversarial risks because their behavior depends on communication between agents and decentralized reasoning. In this work, we focus on attacking pragmatic systems that have constraints such as limited token bandwidth, latency between message delivery, and defense mechanisms. We design a permutation-invariant adversarial attack that optimizes prompt distribution across latency- and bandwidth-constrained network topologies to bypass distributed safety mechanisms within the system. Formulating the attack path as a maximum-flow minimum-cost problem, coupled with the novel Permutation-Invariant Evasion Loss (PIEL), we leverage graph-based optimization to maximize attack success rate while minimizing detection risk. Evaluating across models including Llama, Mistral, Gemma, DeepSeek and other variants on various datasets like JailBreakBench and AdversarialBench, our method outperforms conventional attacks by up to 7×, exposing critical vulnerabilities in multi-agent systems. Moreover, we demonstrate that existing defenses, including variants of Llama-Guard and PromptGuard, fail to prohibit our attack, emphasizing the urgent need for multi-agent specific safety mechanisms.
pdf
bib
abs
Semantic-Eval : A Semantic Comprehension Evaluation Framework for Large Language Models Generation without Training
Shusheng Li
|
Jiale Li
|
Yifei Qu
|
Xinwei Shi
|
Yanliang Guo
|
Ziyi He
|
Yubo Wang
|
Wenjun Tan
With the increasing prominence of large language models (LLMs), evaluating their text-generation capabilities has become an essential research challenge. Although LLM-based evaluation methods exhibit robust performance, the inherent stochastic nature of the LLM generation process introduces a degree of uncertainty in alignment with human preferences. To address this limitation, we propose Semantic-Eval, the first training-free framework designed to assess LLM-generated text based on semantic understanding. This framework computes semantic similarity between pairwise texts to evaluate the interdependence of semantic units, integrating a graph-based weighting mechanism to account for the differential contributions of individual sentences. A pre-trained natural language inference (NLI) model is also incorporated to mitigate potential semantic relationship biases. We evaluate Semantic-Eval across eight datasets that encompass four common NLP tasks. The experimental results indicate that Semantic-Eval surpasses traditional N-gram and BERT-based evaluation metrics, aligning more closely with human judgments and demonstrating a higher correlation than smaller LLMs. However, it slightly lags behind GPT-4. Finally, we demonstrate the effectiveness of Semantic-Eval in evaluating the generation quality of 13 large language models. The code is publicly available at https://github.com/LssTry/Semantic-Eval.
pdf
bib
abs
Between Circuits and Chomsky: Pre-pretraining on Formal Languages Imparts Linguistic Biases
Michael Y. Hu
|
Jackson Petty
|
Chuan Shi
|
William Merrill
|
Tal Linzen
Pretraining language models on formal language can improve their acquisition of natural language. Which features of the formal language impart an inductive bias that leads to effective transfer? Drawing on insights from linguistics and complexity theory, we hypothesize that effective transfer occurs when two conditions are met: the formal language should capture the dependency structures present in natural language, and it should remain within the computational limitations of the model architecture. We experiment with pre-pretraining (training on formal language before natural languages) on transformers and find that formal languages capturing hierarchical dependencies indeed enable language models to achieve lower loss on natural language and better linguistic generalization compared to other formal languages. We also find modest support for the hypothesis that the formal language should fall within the computational limitations of the architecture. Strikingly, pre-pretraining reduces loss more efficiently than training on a matched amount of natural language. For a 1B-parameter language model trained on roughly 1.6B tokens of natural language, pre-pretraining achieves the same loss and better linguistic generalization with a 33% smaller token budget. Finally, we also give mechanistic evidence of transfer from formal to natural language: attention heads acquired during pre-pretraining remain crucial for the model’s performance on syntactic evaluations.
pdf
bib
abs
When to Speak, When to Abstain: Contrastive Decoding with Abstention
Hyuhng Joon Kim
|
Youna Kim
|
Sang-goo Lee
|
Taeuk Kim
Large Language Models (LLMs) demonstrate exceptional performance across diverse tasks by leveraging pre-trained (i.e., parametric) and external (i.e., contextual) knowledge. While substantial efforts have been made to enhance the utilization of both forms of knowledge, situations in which models lack relevant information remain underexplored. To investigate this challenge, we first present a controlled testbed featuring four distinct knowledge access scenarios, including the aforementioned edge case, revealing that conventional LLM usage exhibits insufficient robustness in handling all instances. Addressing this limitation, we propose Contrastive Decoding with Abstention (CDA), a novel training-free decoding method that allows LLMs to generate responses when relevant knowledge is available and to abstain otherwise. CDA estimates the relevance of both knowledge sources for a given input, adaptively deciding which type of information to prioritize and which to exclude. Through extensive experiments, we demonstrate that CDA can effectively perform accurate generation and abstention simultaneously, enhancing reliability and preserving user trust.
pdf
bib
abs
On the Risk of Evidence Pollution for Malicious Social Text Detection in the Era of LLMs
Herun Wan
|
Minnan Luo
|
Zhixiong Su
|
Guang Dai
|
Xiang Zhao
Evidence-enhanced detectors present remarkable abilities in identifying malicious social text. However, the rise of large language models (LLMs) brings potential risks of evidence pollution to confuse detectors. This paper explores potential manipulation scenarios including basic pollution, and rephrasing or generating evidence by LLMs. To mitigate the negative impact, we propose three defense strategies from the data and model sides, including machine-generated text detection, a mixture of experts, and parameter updating. Extensive experiments on four malicious social text detection tasks with ten datasets illustrate that evidence pollution significantly compromises detectors, where the generating strategy causes up to a 14.4% performance drop. Meanwhile, the defense strategies can mitigate evidence pollution, but they face limitations in practical deployment. Further analysis illustrates that polluted evidence (i) is of high quality, evaluated by metrics and humans; (ii) would compromise the model calibration, increasing expected calibration error up to 21.6%; and (iii) could be integrated to amplify the negative impact, especially for encoder-based LMs, where the accuracy drops by 21.8%.
pdf
bib
abs
Investigating and Extending Homans’ Social Exchange Theory with Large Language Model based Agents
Lei Wang
|
Zheqing Zhang
|
Xu Chen
Homans’ Social Exchange Theory (SET) is widely recognized as a basic framework for understanding the formation and emergence of human civilizations and social structures. In social science, this theory is typically studied based on simple simulation experiments or real-world human studies, both of which either lack realism or are too expensive to control. In artificial intelligence, recent advances in large language models (LLMs) have shown promising capabilities in simulating human behaviors. Inspired by these insights, we adopt an interdisciplinary research perspective and propose using LLM-based agents to study Homans’ SET. Specifically, we construct a virtual society composed of three LLM agents and have them engage in a social exchange game to observe their behaviors. Through extensive experiments, we found that Homans’ SET is well validated in our agent society, demonstrating the consistency between the agent and human behaviors. Building on this foundation, we intentionally alter the settings of the agent society to extend the traditional Homans’ SET, making it more comprehensive and detailed. To the best of our knowledge, this paper marks the first step in studying Homans’ SET with LLM-based agents. More importantly, it introduces a novel and feasible research paradigm that bridges the fields of social science and computer science through LLM-based agents. Code is available at https://github.com/Paitesanshi/SET.
pdf
bib
abs
A Drop-In Solution for On-the-Fly Adaptation of Speculative Decoding in Large Language Models
Jiesong Liu
|
Brian Park
|
Xipeng Shen
Large Language Models (LLMs) are cutting-edge generative AI models built on transformer architecture, which tend to be highly memory-intensive when performing real-time inference. Various strategies have been developed to enhance the end-to-end inference speed for LLMs, one of which is speculative decoding. This technique involves running a smaller LLM (draft model) for inference over a defined window size, denoted as 𝛾, while its output is simultaneously validated by the larger LLM (target model). Choosing the optimal 𝛾 value and the draft model is essential for unlocking the potential of speculative decoding, but the choice is difficult because it depends on many interacting factors, including the nature of the task, the hardware in use, and the combination of the large and small models. This paper introduces *on-the-fly adaptation of speculative decoding*, a solution that dynamically adapts these choices to maximize the efficiency of speculative decoding for LLM inference. As a drop-in solution, it needs no offline benchmarking or training. Experiments show that the solution can lead to a 3.55-16.48% speed improvement over standard speculative decoding, and 1.2-3.4× over the default LLMs.
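The draft-and-verify loop behind the window size 𝛾 can be sketched as follows. This is a minimal illustration of standard speculative decoding in its greedy-verification form, not the paper's adaptive method: 𝛾 is fixed here, and `target_next`, `draft_next`, and the toy models are hypothetical stand-ins for real LLM calls.

```python
def speculative_decode(target_next, draft_next, prompt, gamma, max_new_tokens):
    """Greedy speculative decoding with a fixed draft window `gamma`.

    `target_next` and `draft_next` map a token list to the next token.
    A real system verifies all drafted tokens in one parallel target pass;
    this sketch calls the target sequentially for clarity.
    """
    tokens = list(prompt)
    goal = len(prompt) + max_new_tokens
    while len(tokens) < goal:
        # Draft phase: the small model proposes gamma tokens.
        proposal = []
        for _ in range(gamma):
            proposal.append(draft_next(tokens + proposal))
        # Verify phase: keep the longest prefix the target agrees with.
        accepted = []
        for tok in proposal:
            expected = target_next(tokens + accepted)
            if tok != expected:
                accepted.append(expected)  # the target's token replaces the first mismatch
                break
            accepted.append(tok)
        else:
            # Every drafted token passed, so the target contributes one extra token.
            accepted.append(target_next(tokens + accepted))
        tokens.extend(accepted)
    return tokens[:goal]


# Toy models (assumptions): the target counts upward modulo 10,
# and the draft agrees except right after the token 4.
def toy_target(seq):
    return (seq[-1] + 1) % 10


def toy_draft(seq):
    return 0 if seq[-1] == 4 else (seq[-1] + 1) % 10


output = speculative_decode(toy_target, toy_draft, prompt=[0], gamma=3, max_new_tokens=8)
```

With greedy verification the result matches what the target model would produce on its own; the draft model only changes how many target steps can be batched together, which is why the choice of 𝛾 and draft model affects speed rather than output.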
pdf
bib
abs
If Attention Serves as a Cognitive Model of Human Memory Retrieval, What is the Plausible Memory Representation?
Ryo Yoshida
|
Shinnosuke Isono
|
Kohei Kajikawa
|
Taiga Someya
|
Yushi Sugimoto
|
Yohei Oseki
Recent work in computational psycholinguistics has revealed intriguing parallels between attention mechanisms and human memory retrieval, focusing primarily on vanilla Transformers that operate on token-level representations. However, computational psycholinguistic research has also established that syntactic structures provide compelling explanations for human sentence processing that token-level factors cannot fully account for. In this paper, we investigate whether the attention mechanism of Transformer Grammar (TG), which uniquely operates on syntactic structures as representational units, can serve as a cognitive model of human memory retrieval, using Normalized Attention Entropy (NAE) as a linking hypothesis between models and humans. Our experiments demonstrate that TG’s attention achieves superior predictive power for self-paced reading times compared to vanilla Transformer’s, with further analyses revealing independent contributions from both models. These findings suggest that human sentence processing involves dual memory representations—one based on syntactic structures and another on token sequences—with attention serving as the general memory retrieval algorithm, while highlighting the importance of incorporating syntactic structures as representational units.
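A rough sketch of an entropy-based measure like NAE is given below. The abstract does not state the formula, so this assumes the Shannon entropy of one attention distribution divided by its maximum possible value, log n, for n attended units; the function name is illustrative.

```python
import math


def normalized_attention_entropy(weights):
    """Entropy of an attention distribution scaled to [0, 1] (assumed definition).

    `weights` are non-negative attention scores over the attended units;
    they are renormalized here so that they sum to 1.
    """
    n = len(weights)
    if n < 2:
        return 0.0  # a single attended unit carries no uncertainty
    total = sum(weights)
    probs = [w / total for w in weights]
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    return entropy / math.log(n)


diffuse = normalized_attention_entropy([0.25, 0.25, 0.25, 0.25])  # attention spread evenly
focused = normalized_attention_entropy([0.97, 0.01, 0.01, 0.01])  # attention on one unit
```

Under this assumed definition, a value near 1 means attention is spread across many candidates and a value near 0 means it is concentrated on one; the normalization makes values comparable across contexts with different numbers of attended units.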
pdf
bib
abs
Aligning VLM Assistants with Personalized Situated Cognition
Yongqi Li
|
Shen Zhou
|
Xiaohu Li
|
Xin Miao
|
Jintao Wen
|
Mayi Xu
|
Jianhao Chen
|
Birong Pan
|
Hankun Kang
|
Yuanyuan Zhu
|
Ming Zhong
|
Tieyun Qian
Vision-language models (VLMs) aligned with general human objectives, such as being harmless and hallucination-free, have become valuable assistants of humans in managing visual tasks. However, people with diverse backgrounds have different cognition even in the same situation. Consequently, they may have personalized expectations for VLM assistants. This highlights the urgent need to align VLM assistants with personalized situated cognition for real-world assistance. To study this problem, we first simplify it by characterizing individuals based on the sociological concept of Role-Set. Then, we propose to evaluate the individuals’ actions to examine whether the personalized alignment is achieved. Further, we construct a benchmark named PCogAlignBench, which includes 18k instances and 20 individuals with different Role-Sets. Finally, we present a framework called PCogAlign, which constructs a cognition-aware and action-based reward model for personalized alignment. Experimental results and human evaluations demonstrate the reliability of the PCogAlignBench and the effectiveness of our proposed PCogAlign. We will open-source the constructed benchmark and code upon acceptance.
pdf
bib
abs
Attention Entropy is a Key Factor: An Analysis of Parallel Context Encoding with Full-attention-based Pre-trained Language Models
Zhisong Zhang
|
Yan Wang
|
Xinting Huang
|
Tianqing Fang
|
Hongming Zhang
|
Chenlong Deng
|
Shuaiyi Li
|
Dong Yu
Large language models have shown remarkable performance across a wide range of language tasks, owing to their exceptional capabilities in context modeling. The most commonly used method of context modeling is full self-attention, as seen in standard decoder-only Transformers. Although powerful, this method can be inefficient for long sequences and may overlook inherent input structures. To address these problems, an alternative approach is parallel context encoding, which splits the context into sub-pieces and encodes them in parallel. Because parallel patterns are not encountered during training, naively applying parallel encoding leads to performance degradation. However, the underlying reasons and potential mitigations are unclear. In this work, we provide a detailed analysis of this issue and identify that unusually high attention entropy can be a key factor. Furthermore, we adopt two straightforward methods to reduce attention entropy by incorporating attention sinks and selective mechanisms. Experiments on various tasks reveal that these methods effectively lower irregular attention entropy and narrow performance gaps. We hope this study can illuminate ways to enhance context modeling mechanisms.
pdf
bib
abs
Faster Speculative Decoding via Effective Draft Decoder with Pruned Candidate Tree
Huanran Zheng
|
Xiaoling Wang
Speculative Decoding (SD) is a promising method for reducing the inference latency of large language models (LLMs). A well-designed draft model and an effective draft candidate tree construction method are key to enhancing the acceleration effect of SD. In this paper, we first propose the Effective Draft Decoder (EDD), which treats the LLM as a powerful encoder and generates more accurate draft tokens by leveraging the encoding results as soft prompts. Furthermore, we use KL divergence instead of the standard cross-entropy loss to better align the draft model’s output with the LLM. Next, we introduce the Pruned Candidate Tree (PCT) algorithm to construct a more efficient candidate tree. Specifically, we found that the confidence scores predicted by the draft model are well-calibrated with the acceptance probability of draft tokens. Therefore, PCT estimates the expected time gain for each node in the candidate tree based on confidence scores and retains only the nodes that contribute to acceleration, pruning away redundant nodes. We conducted extensive experiments with various LLMs across four datasets. The experimental results verify the effectiveness of our proposed method, which significantly improves the performance of SD and reduces the inference latency of LLMs.
pdf
bib
abs
Selecting and Merging: Towards Adaptable and Scalable Named Entity Recognition with Large Language Models
Zhuojun Ding
|
Wei Wei
|
Chenghao Fan
Supervised fine-tuning (SFT) is widely used to align large language models (LLMs) with information extraction (IE) tasks, such as named entity recognition (NER). However, annotating such fine-grained labels and training domain-specific models is costly. Existing works typically train a unified model across multiple domains, but such approaches lack adaptation and scalability since not all training data benefits target domains and scaling trained models remains challenging. We propose the SaM framework, which dynamically Selects and Merges expert models at inference time. Specifically, for a target domain, we select domain-specific experts pre-trained on existing domains based on (i) domain similarity to the target domain and (ii) performance on sampled instances, respectively. The experts are then merged to create task-specific models optimized for the target domain. By dynamically merging experts beneficial to target domains, we improve generalization across various domains without extra training. Additionally, experts can be added or removed conveniently, leading to great scalability. Extensive experiments on multiple benchmarks demonstrate our framework’s effectiveness, which outperforms the unified model by an average of 10%. We further provide insights into potential improvements, practical experience, and extensions of our framework.
pdf
bib
abs
Embracing Imperfection: Simulating Students with Diverse Cognitive Levels Using LLM-based Agents
Tao Wu
|
Jingyuan Chen
|
Wang Lin
|
Mengze Li
|
Yumeng Zhu
|
Ang Li
|
Kun Kuang
|
Fei Wu
Large language models (LLMs) are revolutionizing education, with LLM-based agents playing a key role in simulating student behavior. A major challenge in student simulation is modeling the diverse learning patterns of students at various cognitive levels. However, current LLMs, typically trained as “helpful assistants”, aim to generate perfect responses. As a result, they struggle to simulate students with diverse cognitive abilities, as they often produce overly advanced answers, missing the natural imperfections that characterize student learning and resulting in unrealistic simulations. To address this issue, we propose a training-free framework for student simulation. We begin by constructing a cognitive prototype for each student using a knowledge graph, which captures their understanding of concepts from past learning records. This prototype is then mapped to new tasks to predict student performance. Next, we simulate student solutions based on these predictions and iteratively refine them using a beam search method to better replicate realistic mistakes. To validate our approach, we construct the Student_100 dataset, consisting of 100 students working on Python programming and 5,000 learning records. Experimental results show that our method consistently outperforms baseline models, achieving 100% improvement in simulation accuracy and realism.
pdf
bib
abs
CADReview: Automatically Reviewing CAD Programs with Error Detection and Correction
Jiali Chen
|
Xusen Hei
|
HongFei Liu
|
Yuancheng Wei
|
Zikun Deng
|
Jiayuan Xie
|
Yi Cai
|
Li Qing
Computer-aided design (CAD) is crucial in prototyping 3D objects through geometric instructions (i.e., CAD programs). In practical design workflows, designers often engage in time-consuming reviews and refinements of these prototypes by comparing them with reference images. To bridge this gap, we introduce the CAD review task to automatically detect and correct potential errors, ensuring consistency between the constructed 3D objects and reference images. However, recent advanced multimodal large language models (MLLMs) struggle to recognize multiple geometric components and perform spatial geometric operations within the CAD program, leading to inaccurate reviews. In this paper, we propose the CAD program repairer (ReCAD) framework to effectively detect program errors and provide helpful feedback on error correction. Additionally, we create a dataset, CADReview, consisting of over 20K program-image pairs, with diverse errors for the CAD review task. Extensive experiments demonstrate that our ReCAD significantly outperforms existing MLLMs, which shows great potential in design applications.
pdf
bib
abs
Think&Cite: Improving Attributed Text Generation with Self-Guided Tree Search and Progress Reward Modeling
Junyi Li
|
Hwee Tou Ng
Despite their outstanding capabilities, large language models (LLMs) are prone to hallucination and producing factually incorrect information. This challenge has spurred efforts in attributed text generation, which prompts LLMs to generate content with supporting evidence. In this paper, we propose a novel framework, called Think&Cite, and formulate attributed text generation as a multi-step reasoning problem integrated with search. Specifically, we propose Self-Guided Monte Carlo Tree Search (SG-MCTS), which capitalizes on the self-reflection capability of LLMs to reason about the intermediate states of MCTS for guiding the tree expansion process. To provide reliable and comprehensive feedback, we introduce Progress Reward Modeling to measure the progress of tree search from the root to the current state from two aspects, i.e., generation and attribution progress. We conduct extensive experiments on three datasets and the results show that our approach significantly outperforms baseline approaches.
pdf
bib
abs
The Lawyer That Never Thinks: Consistency and Fairness as Keys to Reliable AI
Dana R Alsagheer
|
Abdulrahman Kamal
|
Mohammad Kamal
|
Cosmo Yang Wu
|
Weidong Shi
Large Language Models (LLMs) are increasingly used in high-stakes domains like law and research, yet their inconsistencies and response instability raise concerns about trustworthiness. This study evaluates six leading LLMs—GPT-3.5, GPT-4, Claude, Gemini, Mistral, and LLaMA 2—on rationality, stability, and ethical fairness through reasoning tests, legal challenges, and bias-sensitive scenarios. Results reveal significant inconsistencies, highlighting trade-offs between model scale, architecture, and logical coherence. These findings underscore the risks of deploying LLMs in legal and policy settings, emphasizing the need for AI systems that prioritize transparency, fairness, and ethical robustness.
pdf
bib
abs
Polishing Every Facet of the GEM: Testing Linguistic Competence of LLMs and Humans in Korean
SungHo Kim
|
Nayeon Kim
|
Taehee Jeon
|
SangKeun Lee
We introduce the Korean Grammar Evaluation BenchMark (KoGEM), designed to assess the linguistic competence of LLMs and humans in Korean. KoGEM consists of 1.5k multiple-choice QA pairs covering five main categories and 16 subcategories. The zero-shot evaluation of 27 LLMs of various sizes and types reveals that while LLMs perform remarkably well on straightforward tasks requiring primarily definitional knowledge, they struggle with tasks that demand the integration of real-world experiential knowledge, such as phonological rules and pronunciation. Furthermore, our in-depth analysis suggests that incorporating such experiential knowledge could enhance the linguistic competence of LLMs. With KoGEM, we not only highlight the limitations of current LLMs in linguistic competence but also uncover hidden facets of LLMs in linguistic competence, paving the way for enhancing comprehensive language understanding. Our code and dataset are available at: https://github.com/SungHo3268/KoGEM.
pdf
bib
abs
SpeechFake: A Large-Scale Multilingual Speech Deepfake Dataset Incorporating Cutting-Edge Generation Methods
Wen Huang
|
Yanmei Gu
|
Zhiming Wang
|
Huijia Zhu
|
Yanmin Qian
As speech generation technology advances, the risk of misuse through deepfake audio has become a pressing concern, which underscores the critical need for robust detection systems. However, many existing speech deepfake datasets are limited in scale and diversity, making it challenging to train models that can generalize well to unseen deepfakes. To address these gaps, we introduce SpeechFake, a large-scale dataset designed specifically for speech deepfake detection. SpeechFake includes over 3 million deepfake samples, totaling more than 3,000 hours of audio, generated using 40 different speech synthesis tools. The dataset encompasses a wide range of generation techniques, including text-to-speech, voice conversion, and neural vocoder, incorporating the latest cutting-edge methods. It also provides multilingual support, spanning 46 languages. In this paper, we offer a detailed overview of the dataset’s creation, composition, and statistics. We also present baseline results by training detection models on SpeechFake, demonstrating strong performance on both its own test sets and various unseen test sets. Additionally, we conduct experiments to rigorously explore how generation methods, language diversity, and speaker variation affect detection performance. We believe SpeechFake will be a valuable resource for advancing speech deepfake detection and developing more robust models for evolving generation techniques.
pdf
bib
abs
ReflectionCoder: Learning from Reflection Sequence for Enhanced One-off Code Generation
Houxing Ren
|
Mingjie Zhan
|
Zhongyuan Wu
|
Aojun Zhou
|
Junting Pan
|
Hongsheng Li
Code generation plays a crucial role in various tasks, such as code auto-completion and mathematical reasoning. Previous work has proposed numerous methods to enhance code generation performance, including integrating feedback from the compiler. Inspired by this, we present ReflectionCoder, a novel approach that effectively leverages reflection sequences constructed by integrating compiler feedback to improve one-off code generation performance. Furthermore, we propose reflection self-distillation and dynamically masked distillation to effectively utilize these reflection sequences. Extensive experiments on three benchmarks, i.e., HumanEval (+), MBPP (+), and MultiPL-E, demonstrate that models fine-tuned with our method achieve state-of-the-art performance. Beyond the code domain, we believe this approach can benefit other domains that focus on final results and require long reasoning paths. Code and data are available at https://github.com/SenseLLM/ReflectionCoder.
pdf
bib
abs
InvestAlign: Overcoming Data Scarcity in Aligning Large Language Models with Investor Decision-Making Processes Under Herd Behavior
Huisheng Wang
|
Zhuoshi Pan
|
Hangjing Zhang
|
Mingxiao Liu
|
Hanqing Gao
|
H. Vicky Zhao
Aligning Large Language Models (LLMs) with investor decision-making processes under herd behavior is a critical challenge in behavioral finance, which grapples with a fundamental limitation: the scarcity of real-user data needed for Supervised Fine-Tuning (SFT). While SFT can bridge the gap between LLM outputs and human behavioral patterns, its reliance on massive authentic data imposes substantial collection costs and privacy risks. We propose **InvestAlign**, a novel framework that constructs high-quality SFT datasets by leveraging theoretical solutions to similar and simple optimal investment problems rather than the complex scenarios. Our theoretical analysis demonstrates that training LLMs with **InvestAlign**-generated data achieves faster parameter convergence than using real-user data, suggesting superior learning efficiency. Furthermore, we develop **InvestAgent**, an LLM agent fine-tuned with **InvestAlign**, which shows significantly closer alignment to real-user data than pre-SFT models in both simple and complex investment problems. This highlights our proposed **InvestAlign** as a promising approach with the potential to address complex optimal investment problems and align LLMs with investor decision-making processes under herd behavior. Our code is publicly available at https://github.com/thu-social-network-research-group/InvestAlign.
pdf
bib
abs
Enhancing Neural Machine Translation Through Target Language Data: A kNN-LM Approach for Domain Adaptation
Abudurexiti Reheman
|
Hongyu Liu
|
Junhao Ruan
|
Abudukeyumu Abudula
|
Yingfeng Luo
|
Tong Xiao
|
JingBo Zhu
Neural machine translation (NMT) has advanced significantly, yet challenges remain in adapting to new domains. In scenarios where bilingual data is limited, this issue is further exacerbated. To address this, we propose kNN-LM-NMT, a method that leverages semantically similar target language sentences in the kNN framework. Our approach generates a probability distribution over these sentences during decoding, and this distribution is then interpolated with the NMT model’s distribution. Additionally, we introduce an n-gram-based approach to focus on similar fragments, enabling the model to avoid the noise introduced by the non-similar parts. To enhance accuracy, we further incorporate cross-lingual retrieval similarity to refine the kNN probability distribution. Extensive experiments on multi-domain datasets demonstrate significant performance improvements in both high-resource and low-resource scenarios. Our approach effectively extracts translation knowledge from limited target domain data and benefits substantially from large-scale monolingual data for robust context representation.
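The interpolation step can be sketched in the usual kNN-LM form, p(y) = λ · p_kNN(y) + (1 − λ) · p_NMT(y). The abstract does not specify how the kNN distribution is built, so the softmax over negative distances, the `temperature` parameter, the weight `lam`, and all example values below are assumptions for illustration.

```python
import math
from collections import defaultdict


def knn_distribution(neighbors, temperature=1.0):
    """Turn retrieved (token, distance) pairs into a probability distribution.

    Closer neighbors get more mass via a softmax over negative distances
    (an assumed weighting; the paper may use a different one).
    """
    scores = [math.exp(-dist / temperature) for _, dist in neighbors]
    normalizer = sum(scores)
    probs = defaultdict(float)
    for (token, _), score in zip(neighbors, scores):
        probs[token] += score / normalizer  # neighbors sharing a token pool their mass
    return dict(probs)


def interpolate(p_model, p_knn, lam):
    """Mix the two distributions: lam * p_knn + (1 - lam) * p_model."""
    vocab = set(p_model) | set(p_knn)
    return {tok: lam * p_knn.get(tok, 0.0) + (1 - lam) * p_model.get(tok, 0.0)
            for tok in vocab}


# Hypothetical next-token probabilities from the NMT model and retrieved neighbors.
p_model = {"dose": 0.2, "dosage": 0.5, "amount": 0.3}
neighbors = [("dose", 0.1), ("dose", 0.3), ("amount", 2.0)]
p_knn = knn_distribution(neighbors)
mixed = interpolate(p_model, p_knn, lam=0.5)
```

In this toy case the retrieved in-domain neighbors shift probability toward "dose", while tokens absent from the neighbors keep only their down-weighted model probability.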
pdf
bib
abs
Multi-level Relevance Document Identifier Learning for Generative Retrieval
Fuwei Zhang
|
Xiaoyu Liu
|
Xinyu Jia
|
Yingfei Zhang
|
Shuai Zhang
|
Xiang Li
|
Fuzhen Zhuang
|
Wei Lin
|
Zhao Zhang
Generative Retrieval (GR) introduces a new information retrieval paradigm that directly generates unique document identifiers (DocIDs). The key challenge of GR lies in creating effective yet discrete DocIDs that preserve semantic relevance for similar documents while differentiating dissimilar ones. However, existing methods generate DocIDs solely based on the textual content of documents, which may result in DocIDs with weak semantic connections for similar documents due to variations in expression. Therefore, we propose using queries as a bridge to connect documents with varying relevance levels for learning improved DocIDs. In this paper, we propose **M**ulti-l**E**vel **R**elevance document identifier learning for **G**enerative r**E**trieval (MERGE), a novel approach that utilizes multi-level document relevance to learn high-quality DocIDs. MERGE incorporates three modules: a multi-relevance query-document alignment module to effectively align document representations with related queries, an outer-level contrastive learning module to capture binary-level relevance, and an inner-level multi-level relevance learning module to distinguish documents with different relevance levels. Our approach encodes rich hierarchical semantic information and maintains uniqueness across documents. Experimental results on real-world multilingual e-commerce search datasets demonstrate that MERGE significantly outperforms existing methods, underscoring its effectiveness. The source code is available at <https://github.com/zhangfw123/MERGE>.
pdf
bib
abs
EfficientQAT: Efficient Quantization-Aware Training for Large Language Models
Mengzhao Chen
|
Wenqi Shao
|
Peng Xu
|
Jiahao Wang
|
Peng Gao
|
Kaipeng Zhang
|
Ping Luo
Large language models (LLMs) are crucial in modern natural language processing and artificial intelligence. However, they face challenges in managing their significant memory requirements. Although quantization-aware training (QAT) offers a solution by reducing memory consumption through low-bit representations with minimal accuracy loss, it is impractical due to substantial training resources. To address this, we propose Efficient Quantization-Aware Training (EfficientQAT), a more feasible QAT algorithm. EfficientQAT involves two consecutive phases: Block-wise training of all parameters (Block-AP) and end-to-end training of quantization parameters (E2E-QP). To the best of our knowledge, Block-AP is the first method to enable direct training of all parameters in a block-wise manner, reducing accuracy loss in low-bit scenarios by enhancing the solution space during optimization. E2E-QP then trains only the quantization parameters (step sizes) end-to-end, further improving the performance of quantized models by considering interactions among all sub-modules. Extensive experiments demonstrate that EfficientQAT outperforms previous quantization methods across a range of models, including base LLMs, instruction-tuned LLMs, and multimodal LLMs, with scales from 7B to 70B parameters at various quantization bits. For instance, EfficientQAT obtains a 2-bit Llama-2-70B model on a single A100-80GB GPU in 41 hours, with less than 3 points accuracy degradation compared to the full precision (69.48 vs. 72.41). Code is available at https://github.com/OpenGVLab/EfficientQAT.
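To make the role of a quantization step size concrete, the sketch below applies plain uniform quantization to a handful of weights. It assumes a symmetric signed integer range and a single hand-picked step size; these choices, the function name, and the example values are illustrative assumptions and do not reproduce the EfficientQAT training procedure.

```python
def fake_quantize(weights, step, bits=2):
    """Uniformly quantize and dequantize a list of weights.

    Each weight is divided by `step`, rounded to the nearest integer,
    clamped to the signed range for `bits`, and scaled back by `step`.
    The symmetric signed range is an assumption made for simplicity.
    """
    qmin = -(2 ** (bits - 1))
    qmax = 2 ** (bits - 1) - 1
    out = []
    for w in weights:
        q = round(w / step)            # nearest integer level
        q = max(qmin, min(qmax, q))    # clamp to the representable range
        out.append(q * step)           # map back to the real-valued grid
    return out


# Hypothetical weights and step size, for illustration only.
weights = [0.31, -0.74, 0.05, 1.90]
quantized = fake_quantize(weights, step=0.5, bits=2)
```

With 2 bits only four levels are available, so the outlier `1.90` is clamped to the largest representable value; this is why the choice of step size matters so much at low bit widths.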
pdf
bib
abs
Exploring How Generative MLLMs Perceive More Than CLIP with the Same Vision Encoder
Siting Li
|
Pang Wei Koh
|
Simon Shaolei Du
Recent research has shown that CLIP models struggle with visual reasoning tasks that require grounding compositionality, understanding spatial relationships, or capturing fine-grained details. One natural hypothesis is that the CLIP vision encoder does not embed essential information for these tasks. However, we find that this is not always the case: The encoder gathers query-relevant visual information, while CLIP fails to extract it. In particular, we show that another branch of Vision-Language Models (VLMs), Generative Multimodal Large Language Models (MLLMs), achieve significantly higher accuracy than CLIP in many of these tasks using the *same* vision encoder and weights, indicating that these Generative MLLMs *perceive more*—as they extract and utilize visual information more effectively. We conduct a series of controlled experiments and reveal that their success is attributed to multiple key design choices, including patch tokens, position embeddings, and prompt-based weighting. On the other hand, enhancing the training data alone or applying a stronger text encoder does not suffice to solve the task, and additional text tokens offer little benefit. Interestingly, we find that fine-grained visual reasoning is not exclusive to generative models trained by an autoregressive loss: When converted into CLIP-like encoders by contrastive finetuning, these MLLMs still outperform CLIP under the same cosine similarity-based evaluation protocol. Our study highlights the importance of VLM architectural choices and suggests directions for improving the performance of CLIP-like contrastive VLMs.
pdf
bib
abs
NexusSum: Hierarchical LLM Agents for Long-Form Narrative Summarization
Hyuntak Kim
|
Byung-Hak Kim
Summarizing long-form narratives—such as books, movies, and TV scripts—requires capturing intricate plotlines, character interactions, and thematic coherence, a task that remains challenging for existing LLMs. We introduce NexusSum, a multi-agent LLM framework for narrative summarization that processes long-form text through a structured, sequential pipeline—without requiring fine-tuning. Our approach introduces two key innovations: **(1) Dialogue-to-Description Transformation**: A narrative-specific preprocessing method that standardizes character dialogue and descriptive text into a unified format, improving coherence. **(2) Hierarchical Multi-LLM Summarization**: A structured summarization pipeline that optimizes chunk processing and controls output length for accurate, high-quality summaries. Our method establishes a new state-of-the-art in narrative summarization, achieving up to **a 30.0% improvement in BERTScore (F1)** across books, movies, and TV scripts. These results demonstrate the effectiveness of multi-agent LLMs in handling long-form content, offering a scalable approach for structured summarization in diverse storytelling domains.
pdf
bib
abs
HAIC: Improving Human Action Understanding and Generation with Better Captions for Multi-modal Large Language Models
Xiao Wang
|
Jingyun Hua
|
Weihong Lin
|
Yuanxing Zhang
|
Fuzheng Zhang
|
Jianlong Wu
|
Di Zhang
|
Liqiang Nie
Recent Multi-modal Large Language Models (MLLMs) have made great progress in video understanding. However, their performance on videos involving human actions is still limited by the lack of high-quality data. To address this, we introduce a two-stage data annotation pipeline. First, we design strategies to accumulate videos featuring clear human actions from the Internet. Second, videos are annotated in a standardized caption format that uses human attributes to distinguish individuals and chronologically details their actions and interactions. Through this pipeline, we curate two datasets, namely HAICTrain and HAICBench. **HAICTrain** comprises 126K video-caption pairs generated by Gemini-Pro and verified for training purposes. Meanwhile, **HAICBench** includes 412 manually annotated video-caption pairs and 2,000 QA pairs, for a comprehensive evaluation of human action understanding. Experimental results demonstrate that training with HAICTrain not only significantly enhances human understanding abilities across 4 benchmarks, but can also improve text-to-video generation results. Both the HAICTrain and HAICBench will be made open-source to facilitate further research.
pdf
bib
abs
Uni-Retrieval: A Multi-Style Retrieval Framework for STEM’s Education
Yanhao Jia
|
Xinyi Wu
|
Li Hao
|
Qinglin Zhang
|
Yuxiao Hu
|
Shuai Zhao
|
Wenqi Fan
In AI-facilitated teaching, leveraging various query styles to interpret abstract text descriptions is crucial for ensuring high-quality teaching. However, current retrieval models primarily focus on natural text-image retrieval, making them insufficiently tailored to educational scenarios due to the ambiguities in the retrieval process. In this paper, we propose a diverse expression retrieval task tailored to educational scenarios, supporting retrieval based on multiple query styles and expressions. We introduce the STEM Education Retrieval Dataset (SER), which contains over 24,000 query pairs of different styles, and Uni-Retrieval, an efficient and style-diversified retrieval vision-language model based on prompt tuning. Uni-Retrieval extracts query style features as prototypes and builds a continuously updated Prompt Bank containing prompt tokens for diverse queries. This bank can be updated during test time to represent domain-specific knowledge for different subject retrieval scenarios. Our framework demonstrates scalability and robustness by dynamically retrieving prompt tokens based on prototype similarity, effectively facilitating learning for unknown queries. Experimental results indicate that Uni-Retrieval outperforms existing retrieval models in most retrieval tasks.
pdf
bib
abs
DenseLoRA: Dense Low-Rank Adaptation of Large Language Models
Lin Mu
|
Xiaoyu Wang
|
Li Ni
|
Yang Li
|
Zhize Wu
|
Peiquan Jin
|
Yiwen Zhang
Low-rank adaptation (LoRA) has been developed as an efficient approach for adapting large language models (LLMs) by fine-tuning two low-rank matrices, thereby reducing the number of trainable parameters. However, prior research indicates that many of the weights in these matrices are redundant, leading to inefficiencies in parameter utilization. To address this limitation, we introduce Dense Low-Rank Adaptation (DenseLoRA), a novel approach that enhances parameter efficiency while achieving superior performance compared to LoRA. DenseLoRA builds upon the concept of representation fine-tuning, incorporating a single Encoder-Decoder to refine and compress hidden representations across all adaptation layers before applying adaptation. Instead of relying on two redundant low-rank matrices as in LoRA, DenseLoRA adapts LLMs through a dense low-rank matrix, improving parameter utilization and adaptation efficiency. We evaluate DenseLoRA on various benchmarks, showing that it achieves 83.8% accuracy with only 0.01% of trainable parameters, compared to LoRA’s 80.8% accuracy with 0.70% of trainable parameters on LLaMA3-8B. Additionally, we conduct extensive experiments to systematically assess the impact of DenseLoRA’s components on overall model performance.
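The parameter saving of standard LoRA, which the abstract above takes as its baseline, follows from simple arithmetic: a full update to a `d_out × d_in` weight matrix is replaced by two factors of shapes `d_out × r` and `r × d_in`. The sketch below only counts parameters for that standard formulation; the layer size and rank are hypothetical, and it does not model DenseLoRA itself.

```python
def lora_param_counts(d_in, d_out, rank):
    """Return (full, lora) trainable-parameter counts for one linear layer.

    full: parameters in a dense d_out x d_in update.
    lora: parameters in the two low-rank factors B (d_out x rank)
          and A (rank x d_in) used by standard LoRA.
    """
    full = d_in * d_out
    lora = rank * (d_in + d_out)
    return full, lora


# Hypothetical layer size and rank, chosen only for illustration.
full, lora = lora_param_counts(d_in=4096, d_out=4096, rank=16)
ratio = lora / full
print(f"full={full}, lora={lora}, ratio={ratio:.4%}")
```

For this assumed configuration the low-rank factors hold under one percent of the parameters of a dense update, which is the kind of budget the percentages in the abstract refer to.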
pdf
bib
abs
Exploring the Potential of LLMs as Personalized Assistants: Dataset, Evaluation, and Analysis
Jisoo Mok
|
Ik-hwan Kim
|
Sangkwon Park
|
Sungroh Yoon
Personalized AI assistants, a hallmark of the human-like capabilities of Large Language Models (LLMs), are a challenging application that intertwines multiple problems in LLM research. Despite the growing interest in the development of personalized assistants, the lack of an open-source conversational dataset tailored for personalization remains a significant obstacle for researchers in the field. To address this research gap, we introduce HiCUPID, a new benchmark to probe and unleash the potential of LLMs to deliver personalized responses. Alongside a conversational dataset, HiCUPID provides a Llama-3.2-based automated evaluation model whose assessment closely mirrors human preferences. We release our dataset, evaluation model, and code at https://github.com/12kimih/HiCUPID.
pdf
bib
abs
Cracking Factual Knowledge: A Comprehensive Analysis of Degenerate Knowledge Neurons in Large Language Models
Yuheng Chen
|
Pengfei Cao
|
Yubo Chen
|
Yining Wang
|
Shengping Liu
|
Kang Liu
|
Jun Zhao
Knowledge neuron theory provides a key approach to understanding the mechanisms of factual knowledge in Large Language Models (LLMs), which suggests that facts are stored within multi-layer perceptron neurons. This paper further explores **Degenerate Knowledge Neurons** (DKNs), where distinct sets of neurons can store identical facts, but unlike simple redundancy, they also participate in storing other different facts. Despite the novelty and unique properties of this concept, it has not been rigorously defined and systematically studied. Our contributions are: (1) We pioneer the study of structures in knowledge neurons by analyzing weight connection patterns, providing a comprehensive definition of DKNs from both functional and structural aspects. (2) Based on this definition, we develop the **Neuronal Topology Clustering** method, leading to a more accurate DKN identification. (3) We demonstrate the practical applications of DKNs in two aspects: guiding LLMs to learn new knowledge and relating to LLMs’ robustness against input errors.
pdf
bib
abs
Towards Context-Robust LLMs: A Gated Representation Fine-tuning Approach
Shenglai Zeng
|
Pengfei He
|
Kai Guo
|
Tianqi Zheng
|
Hanqing Lu
|
Yue Xing
|
Hui Liu
Large Language Models (LLMs) enhanced with external contexts, such as through retrieval-augmented generation (RAG), often face challenges in handling imperfect evidence. They tend to over-rely on external knowledge, making them vulnerable to misleading and unhelpful contexts. To address this, we propose the concept of context-robust LLMs, which can effectively balance internal knowledge with external context, similar to human cognitive processes. Specifically, context-robust LLMs should rely on external context only when lacking internal knowledge, identify contradictions between internal and external knowledge, and disregard unhelpful contexts. To achieve this goal, we introduce Grft, a lightweight and plug-and-play gated representation fine-tuning approach. Grft consists of two key components: a gating mechanism to detect and filter problematic inputs, and low-rank representation adapters to adjust hidden representations. By training a lightweight intervention function with only 0.0004% of model size on fewer than 200 examples, Grft can effectively adapt LLMs towards context-robust behaviors.
pdf
bib
abs
On Support Samples of Next Word Prediction
Yuqian Li
|
Yupei Du
|
Yufang Liu
|
Feifei Feng
|
Mou Xiao Feng
|
Yuanbin Wu
Language models excel in various tasks by making complex decisions, yet understanding the rationale behind these decisions remains a challenge. This paper investigates data-centric interpretability in language models, focusing on the next-word prediction task. Using the representer theorem, we identify two types of support samples: those that promote specific predictions and those that deter them. Our findings reveal that being a support sample is an intrinsic property, predictable even before training begins. Additionally, while non-support samples are less influential in direct predictions, they play a critical role in preventing overfitting and shaping generalization and representation learning. Notably, the importance of non-support samples increases in deeper layers, suggesting their significant role in intermediate representation formation. These insights shed light on the interplay between data and model decisions, offering a new dimension to understanding language model behavior and interpretability.
pdf
bib
abs
WebWalker: Benchmarking LLMs in Web Traversal
Jialong Wu
|
Wenbiao Yin
|
Yong Jiang
|
Zhenglin Wang
|
Zekun Xi
|
Runnan Fang
|
Linhai Zhang
|
Yulan He
|
Deyu Zhou
|
Pengjun Xie
|
Fei Huang
Retrieval-augmented generation (RAG) demonstrates remarkable performance across tasks in open-domain question-answering. However, traditional search engines may retrieve shallow content, limiting the ability of LLMs to handle complex, multi-layered information. To address this, we introduce WebWalkerQA, a benchmark designed to assess the ability of LLMs to perform web traversal. It evaluates the capacity of LLMs to traverse a website’s subpages to extract high-quality data systematically. We propose WebWalker, a multi-agent framework that mimics human-like web navigation through an explore-critic paradigm. Extensive experimental results show that WebWalkerQA is challenging and demonstrate the effectiveness of RAG combined with WebWalker through horizontal and vertical integration in real-world scenarios.
pdf
bib
abs
From Trade-off to Synergy: A Versatile Symbiotic Watermarking Framework for Large Language Models
Yidan Wang
|
Yubing Ren
|
Yanan Cao
|
Binxing Fang
The rise of Large Language Models (LLMs) has heightened concerns about the misuse of AI-generated text, making watermarking a promising solution. Mainstream watermarking schemes for LLMs fall into two categories: logits-based and sampling-based. However, current schemes entail trade-offs among robustness, text quality, and security. To mitigate this, we integrate logits-based and sampling-based schemes, harnessing their respective strengths to achieve synergy. In this paper, we propose a versatile symbiotic watermarking framework with three strategies: serial, parallel, and hybrid. The hybrid framework adaptively embeds watermarks using token entropy and semantic entropy, optimizing the balance between detectability, robustness, text quality, and security. Furthermore, we validate our approach through comprehensive experiments on various datasets and models. Experimental results indicate that our method outperforms existing baselines and achieves state-of-the-art (SOTA) performance. We believe this framework provides novel insights into diverse watermarking paradigms.
pdf
bib
abs
AutoGUI: Scaling GUI Grounding with Automatic Functionality Annotations from LLMs
Hongxin Li
|
Jingfan Chen
|
Jingran Su
|
Yuntao Chen
|
Li Qing
|
Zhaoxiang Zhang
User interface understanding with vision-language models (VLMs) has received much attention due to its potential for enhancing software automation. However, existing datasets used to build UI-VLMs either only contain large-scale context-free element annotations or contextualized functional descriptions for elements at a small scale. In this work, we propose the AutoGUI pipeline for automatically annotating UI elements with detailed functionality descriptions at scale. Specifically, we leverage large language models (LLMs) to infer element functionality by comparing UI state changes before and after simulated interactions. To improve annotation quality, we propose LLM-aided rejection and verification, eliminating invalid annotations without human labor. We construct a high-quality AutoGUI-704k dataset using the proposed pipeline, featuring diverse and detailed functionality annotations that are hardly provided by previous datasets. Human evaluation shows that we achieve annotation correctness comparable to a trained human annotator. Extensive experiments show that our dataset remarkably enhances VLM’s UI grounding capabilities and exhibits significant scaling effects. We also show the interesting potential use of our dataset in UI agent tasks. Please view our project at https://autogui-project.github.io/.
pdf
bib
abs
Introducing Graph Context into Language Models through Parameter-Efficient Fine-Tuning for Lexical Relation Mining
Jingwen Sun
|
Zhiyi Tian
|
Yu He
|
Jingwei Sun
|
Guangzhong Sun
Lexical relation refers to the way words are related within a language. Prior work has demonstrated that pretrained language models (PLMs) can effectively mine lexical relations between word pairs. However, they overlook the potential of graph structures composed of lexical relations, which can be integrated with the semantic knowledge of PLMs. In this work, we propose a parameter-efficient fine-tuning method through graph context, which integrates graph features and semantic representations for lexical relation classification (LRC) and lexical entailment (LE) tasks. Our experiments show that graph features can help PLMs better understand more complex lexical relations, establishing a new state-of-the-art for LRC and LE. Finally, we perform an error analysis, identifying the bottlenecks of language models in lexical relation mining tasks and providing insights for future improvements.
pdf
bib
abs
S-RAG: A Novel Audit Framework for Detecting Unauthorized Use of Personal Data in RAG Systems
Zhirui Zeng
|
Jiamou Liu
|
Meng-Fen Chiang
|
Jialing He
|
Zijian Zhang
Retrieval-Augmented Generation (RAG) systems combine external data retrieval with text generation and have become essential in applications requiring accurate and context-specific responses. However, their reliance on external data raises critical concerns about unauthorized collection and usage of personal information. To ensure compliance with data protection regulations like GDPR and detect improper use of data, we propose the Shadow RAG Auditing Data Provenance (S-RAG) framework. S-RAG enables users to determine whether their textual data has been utilized in RAG systems, even in black-box settings with no prior system knowledge. It is effective across open-source and closed-source RAG systems and resilient to defense strategies. Experiments demonstrate that S-RAG achieves an improvement in Accuracy by 19.9% (compared to the best baseline), while maintaining strong performance under adversarial defenses. Furthermore, we analyze how the auditor’s knowledge of the target system affects performance, offering practical insights for privacy-preserving AI systems. Our code is open-sourced online.
pdf
bib
abs
Praetor: A Fine-Grained Generative LLM Evaluator with Instance-Level Customizable Evaluation Criteria
Yongqi Leng
|
Renren Jin
|
Yue Chen
|
Zhuowen Han
|
Ling Shi
|
Jianxiang Peng
|
Lei Yang
|
Juesi Xiao
|
Deyi Xiong
With the increasing capability of large language models (LLMs), LLM-as-a-judge has emerged as a new evaluation paradigm. Compared with traditional automatic and manual evaluation, LLM evaluators exhibit better interpretability and efficiency. Despite this, existing LLM evaluators suffer from limited use scenarios and poor flexibility. To mitigate these issues, we propose Praetor, a fine-grained generative LLM evaluator with instance-level customizable evaluation criteria. To train Praetor, we curate a large-scale dataset guided by a hierarchical guideline covering a wide range of tasks and instance-level evaluation criteria. We train Praetor on this dataset in a multi-task learning fashion, which enables it to evaluate LLMs through either pointwise grading or pairwise comparison and to support two languages simultaneously, with high flexibility in setting evaluation criteria. Extensive experiments demonstrate that Praetor outperforms previous LLM evaluators and instruction-tuned LLMs on multiple benchmarks, setting new SOTA results. It also exhibits the potential for generating critiques as scalable feedback to further improve LLMs. Our model and related resources are released at https://github.com/tjunlp-lab/Praetor.
pdf
bib
abs
Mitigating Confounding in Speech-Based Dementia Detection through Weight Masking
Zhecheng Sheng
|
Xiruo Ding
|
Brian Hur
|
Changye Li
|
Trevor Cohen
|
Serguei V. S. Pakhomov
Deep transformer models have been used to detect linguistic anomalies in patient transcripts for early Alzheimer’s disease (AD) screening. While pre-trained neural language models (LMs) fine-tuned on AD transcripts perform well, little research has explored the effects of the gender of the speakers represented by these transcripts. This work addresses gender confounding in dementia detection and proposes two methods: the Extended Confounding Filter and the Dual Filter, which isolate and ablate weights associated with gender. We evaluate these methods on dementia datasets with first-person narratives from patients with cognitive impairment and healthy controls. Our results show transformer models tend to overfit to training data distributions. Disrupting gender-related weights results in a deconfounded dementia classifier, with the trade-off of slightly reduced dementia detection performance.
pdf
bib
abs
MCS-Bench: A Comprehensive Benchmark for Evaluating Multimodal Large Language Models in Chinese Classical Studies
Yang Liu
|
Jiahuan Cao
|
Hiuyi Cheng
|
Yongxin Shi
|
Kai Ding
|
Lianwen Jin
With the rapid development of Multimodal Large Language Models (MLLMs), their potential in Chinese Classical Studies (CCS), a field which plays a vital role in preserving and promoting China’s rich cultural heritage, remains largely unexplored due to the absence of specialized benchmarks. To bridge this gap, we propose MCS-Bench, the first-of-its-kind multimodal benchmark specifically designed for CCS across multiple subdomains. MCS-Bench spans seven core subdomains (Ancient Chinese Text, Calligraphy, Painting, Oracle Bone Script, Seal, Cultural Relic, and Illustration), with a total of 45 meticulously designed tasks. Through extensive evaluation of 37 representative MLLMs, we observe that even the top-performing model (InternVL2.5-78B) achieves an average score below 50, indicating substantial room for improvement. Our analysis reveals significant performance variations across different tasks and identifies critical challenges in areas such as Optical Character Recognition (OCR) and cultural context interpretation. MCS-Bench not only establishes a standardized baseline for CCS-focused MLLM research but also provides valuable insights for advancing cultural heritage preservation and innovation in the Artificial General Intelligence (AGI) era. Data and code will be publicly available.
pdf
bib
abs
The Knowledge Microscope: Features as Better Analytical Lenses than Neurons
Yuheng Chen
|
Pengfei Cao
|
Kang Liu
|
Jun Zhao
We demonstrate that features, rather than neurons, serve as superior analytical units for understanding the mechanisms of factual knowledge in Language Models (LMs). Previous studies primarily utilize MLP neurons as units of analysis; however, neurons suffer from polysemanticity, leading to limited knowledge expression and poor interpretability. We first conduct preliminary experiments to validate that sparse autoencoders (SAEs) can effectively decompose neurons into features. With this established, our core findings reveal three key advantages of features over neurons: (1) Features exhibit stronger influence on knowledge expression and superior interpretability. (2) Features demonstrate enhanced monosemanticity, showing distinct activation patterns between related and unrelated facts. (3) Feature-based methods demonstrate superior performance over neuron-based approaches in erasing privacy-sensitive information from LMs. Additionally, we propose FeatureEdit, the first feature-based editing method. Code and dataset will be available.
pdf
bib
abs
From Real to Synthetic: Synthesizing Millions of Diversified and Complicated User Instructions with Attributed Grounding
Chiwei Zhu
|
Benfeng Xu
|
Xiaorui Wang
|
Zhendong Mao
The pursuit of diverse, complex, and large-scale instruction data is crucial for automatically aligning large language models (LLMs). While there are methods capable of generating synthetic instructions at scale, they either suffer from limited grounding sources, leading to a narrow distribution, or rely on trivial extensions that fail to produce meaningful trajectories in terms of complexity. In contrast, instructions that benefit efficient alignment are typically crafted with cognitive insights and grounded in real-world use cases. In this paper, we synthesize such instructions using attributed grounding, which involves 1) a top-down attribution process that grounds a selective set of real instructions to situated users, and 2) a bottom-up synthesis process that leverages web documents to first generate a situation, then a meaningful instruction. This framework allows us to harvest diverse and complex instructions at scale, utilizing the vast range of web documents. Specifically, we construct a dataset of 1 million instructions, called SynthQuestions, and demonstrate that models trained on it achieve leading performance on several common benchmarks, with improvements that continually scale with more web corpora.
pdf
bib
abs
PrivaCI-Bench: Evaluating Privacy with Contextual Integrity and Legal Compliance
Haoran Li
|
Wenbin Hu
|
Huihao Jing
|
Yulin Chen
|
Qi Hu
|
Sirui Han
|
Tianshu Chu
|
Peizhao Hu
|
Yangqiu Song
Recent advancements in generative large language models (LLMs) have enabled wider applicability, accessibility, and flexibility. However, their reliability and trustworthiness are still in doubt, especially with regard to individuals’ data privacy. Great efforts have been made on privacy by building various evaluation benchmarks to study LLMs’ privacy awareness and robustness, from their generated outputs to their hidden representations. Unfortunately, most of these works adopt a narrow formulation of privacy and only investigate personally identifiable information (PII). In this paper, we build on Contextual Integrity (CI) theory, which posits that privacy evaluation should not only cover the transmitted attributes but also encompass the whole relevant social context through private information flows. We present PrivaCI-Bench, a comprehensive contextual privacy evaluation benchmark targeted at legal compliance to cover well-annotated privacy and safety regulations, real court cases, privacy policies, and synthetic data built from the official toolkit to study LLMs’ privacy and safety compliance. We evaluate the latest LLMs, including the recent reasoner models QwQ-32B and Deepseek R1. Our experimental results suggest that though LLMs can effectively capture key CI parameters inside a given context, they still require further advancements for privacy compliance.
pdf
bib
abs
Unveiling Environmental Impacts of Large Language Model Serving: A Functional Unit View
Yanran Wu
|
Inez Hua
|
Yi Ding
Large language models (LLMs) offer powerful capabilities but come with significant environmental impact, particularly in carbon emissions. Existing studies benchmark carbon emissions but lack a standardized basis for comparison across different model configurations. To address this, we introduce the concept of functional unit (FU) as a standardized basis and develop FUEL, the first FU-based framework for evaluating LLM serving’s environmental impact. Through three case studies, we uncover key insights and trade-offs in reducing carbon emissions by optimizing model size, quantization strategy, and hardware choice, paving the way for more sustainable LLM serving. The code is available at https://github.com/jojacola/FUEL.
pdf
bib
abs
ExpeTrans: LLMs Are Experiential Transfer Learners
Jinglong Gao
|
Xiao Ding
|
Lingxiao Zou
|
Bibo Cai
|
Bing Qin
|
Ting Liu
Recent studies provide large language models (LLMs) with textual task-solving experiences via prompts to improve their performance. However, previous methods rely on substantial human labor or time to gather such experiences for each task, which is impractical given the growing variety of task types in user queries to LLMs. To address this issue, we design an autonomous experience transfer framework to explore whether LLMs can mimic human cognitive intelligence to autonomously transfer experience from existing source tasks to newly encountered target tasks. This not only allows the acquisition of experience without extensive costs of previous methods, but also offers a novel path for the generalization of LLMs. Experimental results on 13 datasets demonstrate that our framework effectively improves the performance of LLMs. Furthermore, we provide a detailed analysis of each module in the framework.
pdf
bib
abs
Cool-Fusion: Fuse Large Language Models without Training
Cong Liu
|
Xiaojun Quan
|
Yan Pan
|
Weigang Wu
|
Xu Chen
|
Liang Lin
We focus on the problem of fusing two or more heterogeneous large language models (LLMs) to leverage their complementary strengths. One of the challenges of model fusion is high computational load, specifically in fine-tuning or aligning vocabularies. To address this, we propose Cool-Fusion, a simple yet effective approach that fuses the knowledge of source LLMs without requiring training. Unlike ensemble methods, Cool-Fusion is applicable to any set of source LLMs that have different vocabularies. To overcome the vocabulary discrepancies among LLMs, we ensemble LLMs at the text level, allowing them to rerank each other’s generated texts at different granularities. Extensive experiments have been conducted across a variety of benchmark datasets. On GSM8K, Cool-Fusion increases accuracy from three strong source LLMs by a significant margin of 17.4%.
pdf
bib
abs
DAPE V2: Process Attention Score as Feature Map for Length Extrapolation
Chuanyang Zheng
|
Yihang Gao
|
Han Shi
|
Jing Xiong
|
Jiankai Sun
|
Jingyao Li
|
Minbin Huang
|
Xiaozhe Ren
|
Michael Ng
|
Xin Jiang
|
Zhenguo Li
|
Yu Li
The attention mechanism is a fundamental component of the Transformer model, contributing to interactions among distinct tokens. In general, the attention scores are determined simply by the key-query products. However, this work’s occasional trial (combining DAPE and NoPE) of including additional MLPs on attention scores without position encoding indicates that the classical key-query multiplication may limit the performance of Transformers. In this work, we conceptualize attention as a feature map and apply the convolution operator (for neighboring attention scores across different heads) to mimic the processing methods in computer vision. Specifically, **the main contribution of this paper is identifying and interpreting the Transformer length extrapolation problem as a result of the limited expressiveness of the naive query and key dot product, and we successfully translate the length extrapolation issue into a well-understood feature map processing problem**, which is called Convolutional Data-Adaptive Position Encoding (CDAPE). The novel insight, which can be adapted to various attention-related models, reveals that the current Transformer architecture has the potential for further evolution. Extensive experiments demonstrate that treating attention as a feature map and applying convolution as a processing method significantly enhances Transformer performance.
pdf
bib
abs
MuSC: Improving Complex Instruction Following with Multi-granularity Self-Contrastive Training
Hui Huang
|
Jiaheng Liu
|
Yancheng He
|
Shilong Li
|
Bing Xu
|
Conghui Zhu
|
Muyun Yang
|
Tiejun Zhao
Complex instruction-following with elaborate constraints is imperative for Large Language Models (LLMs). While existing methods have constructed data for complex instruction alignment, they all rely on a more advanced model, especially GPT-4, limiting their application. In this paper, we propose a Multi-granularity Self-Contrastive Training (MuSC) framework to improve complex instruction alignment without relying on a stronger model. Our method operates at both coarse and fine granularity. At coarse granularity, we construct constraint-aware preference data based on instruction decomposition and recombination. At fine granularity, we perform token-aware preference optimization with dynamic token-level supervision. Our method is evaluated on open-source models, and experimental results show that it achieves significant improvement on both complex and general instruction-following benchmarks, surpassing previous self-alignment methods.
pdf
bib
abs
LongReD: Mitigating Short-Text Degradation of Long-Context Large Language Models via Restoration Distillation
Zican Dong
|
Junyi Li
|
Jinhao Jiang
|
Mingyu Xu
|
Xin Zhao
|
Bingning Wang
|
Weipeng Chen
Large language models (LLMs) have gained extended context windows through scaling positional encodings and lightweight continual pre-training. However, this often leads to degraded performance on short-text tasks, and the reasons for this degradation remain insufficiently explored. In this work, we identify two primary factors contributing to this issue: distribution drift in hidden states and attention scores, and catastrophic forgetting during continual pre-training. To address these challenges, we propose Long Context Pre-training with Restoration Distillation (LongReD), a novel approach designed to mitigate short-text performance degradation by minimizing the distribution discrepancy between the extended and original models. Besides training on long texts, LongReD distills the hidden states of selected layers from the original model on short texts. Additionally, LongReD introduces a short-to-long distillation, aligning the output distribution on short texts with that on long texts by leveraging skipped positional indices. Experiments on common benchmarks demonstrate that LongReD effectively preserves the model’s short-text performance while maintaining or even enhancing its long-context abilities.
pdf
bib
abs
APB: Accelerating Distributed Long-Context Inference by Passing Compressed Context Blocks across GPUs
Yuxiang Huang
|
Mingye Li
|
Xu Han
|
Chaojun Xiao
|
Weilin Zhao
|
Sun Ao
|
Hao Zhou
|
Jie Zhou
|
Zhiyuan Liu
|
Maosong Sun
While long-context inference is crucial for advancing large language model (LLM) applications, its prefill speed remains a significant bottleneck. Current approaches, including sequence parallelism strategies and compute reduction through approximate attention mechanisms, still fall short of delivering optimal inference efficiency. This hinders scaling the inputs to longer sequences and processing long-context queries in a timely manner. To address this, we introduce APB, an efficient long-context inference framework that leverages multi-host approximate attention to enhance prefill speed by reducing compute and enhancing parallelism simultaneously. APB introduces a communication mechanism for essential key-value pairs within a sequence parallelism framework, enabling a faster inference speed while maintaining task performance. We implement APB by incorporating a tailored FlashAttn kernel alongside optimized distribution strategies, supporting diverse models and parallelism configurations. APB achieves speedups of up to 9.2×, 4.2×, and 1.6× compared with FlashAttn, RingAttn, and StarAttn, respectively, without any observable task performance degradation.
pdf
bib
abs
PPT: A Minor Language News Recommendation Model via Cross-Lingual Preference Pattern Transfer
Yiyang Zhang
|
Nan Chen
Rich user-item interactions are essential for building reliable recommender systems, as they reflect user preference patterns. However, minor language news recommendation platforms suffer from limited interactions due to a small user base. A natural solution is to apply well-established English recommender systems to minor language news recommendation, but the linguistic gap can lead to inaccurate modeling of minor language news content. Therefore, enabling few-shot minor language news recommender systems to capture both content information and preference patterns remains a challenge. Based on the observation that preference patterns are similar across languages, we propose a minor language news recommendation model by cross-lingual preference pattern transfer, named PPT. Our model adopts the widely used two-tower architecture and employs the large language model as the backbone of the news encoder. Through cross-lingual alignment, the strong English capability of the news encoder is extended to minor languages, thus enhancing news content representations. Additionally, through cross-lingual news augmentation, PPT simulates interactions of minor language news in the English domain, which facilitates the transfer of preference patterns from the many-shot English domain to the few-shot minor language domain. Extensive experiments on two real-world datasets across 15 minor languages demonstrate the superiority and generalization of our proposed PPT in addressing minor language news recommendation.
pdf
bib
abs
GainRAG: Preference Alignment in Retrieval-Augmented Generation through Gain Signal Synthesis
Yi Jiang
|
Sendong Zhao
|
Jianbo Li
|
Haochun Wang
|
Bing Qin
The Retrieval-Augmented Generation (RAG) framework introduces a retrieval module to dynamically inject retrieved information into the input context of large language models (LLMs), and has demonstrated significant success in various NLP tasks. However, the current study points out that there is a preference gap between retrievers and LLMs in the RAG framework, which limits the further improvement of system performance. Some highly relevant passages may interfere with LLM reasoning because they contain complex or contradictory information, while some indirectly related or even inaccurate content may help the LLM generate more accurate answers by providing suggestive information or logical clues. To solve this, we propose **GainRAG**, a novel approach that aligns the retriever’s and LLM’s preferences by defining a new metric, “gain”, which measures how well an input passage contributes to correct outputs. We then propose a method to estimate these gain signals and train a middleware that aligns the preferences of the retriever and the LLM using only limited data. In addition, we introduce a pseudo-passage strategy to mitigate degradation. The experimental results on 6 datasets verify the effectiveness of GainRAG.
pdf
bib
abs
Top-n𝜎: Eliminating Noise in Logit Space for Robust Token Sampling of LLM
Chenxia Tang
|
Jianchun Liu
|
Hongli Xu
|
Liusheng Huang
Large language models (LLMs) rely heavily on sampling methods to generate diverse and high-quality text. While existing sampling methods like top-p and min-p have identified the detrimental effects of low-probability tails in LLMs’ outputs, they still fail to effectively distinguish between diversity and noise. This limitation stems from their reliance on probability-based metrics that are inherently sensitive to temperature scaling. Through empirical and theoretical analysis, we make two key discoveries: (1) the pre-softmax logits exhibit a clear statistical separation between informative tokens and noise, and (2) we prove the mathematical equivalence of min-p and top-(1-p) under uniform distribution over logits. These findings motivate the design of top-n𝜎, a novel sampling method that identifies informative tokens by eliminating noise directly in logit space. Unlike existing methods that become unstable at high temperatures, top-n𝜎 achieves temperature-invariant token selection while preserving output diversity. Extensive experiments across reasoning and creative writing tasks demonstrate that our method consistently outperforms existing approaches, with particularly significant improvements in high-temperature settings.
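As a rough illustration of filtering in logit space, the sketch below keeps only tokens whose logit lies within n standard deviations of the maximum logit. The abstract does not state the exact rule, so this threshold form, the function names, and the toy logits are all assumptions made for illustration, not the authors' implementation.

```python
import math
import statistics


def top_n_sigma_keep(logits, n=1.0):
    """Indices of tokens whose logit is within n std-devs of the max logit.

    Assumed rule (not given in the abstract): keep z_i >= max(z) - n * std(z).
    """
    threshold = max(logits) - n * statistics.pstdev(logits)
    return [i for i, z in enumerate(logits) if z >= threshold]


def sample_distribution(logits, n=1.0, temperature=1.0):
    """Softmax over the kept tokens only, after temperature scaling."""
    scaled = [z / temperature for z in logits]
    keep = top_n_sigma_keep(scaled, n)
    peak = max(scaled[i] for i in keep)  # subtract the max for numerical stability
    weights = {i: math.exp(scaled[i] - peak) for i in keep}
    total = sum(weights.values())
    return {i: w / total for i, w in weights.items()}


# Toy logits: two plausible tokens and a long noisy tail.
toy_logits = [10.0, 9.5, 2.0, 1.0, 0.0, -1.0]
for t in (0.5, 1.0, 3.0):
    print(t, sorted(sample_distribution(toy_logits, n=1.0, temperature=t)))
```

Under this assumed rule, dividing the logits by a temperature rescales the maximum and the standard deviation by the same factor, so the set of kept tokens does not change with temperature; only the probabilities among the kept tokens do.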
pdf
bib
abs
SCOPE: Optimizing Key-Value Cache Compression in Long-context Generation
Jialong Wu
|
Zhenglin Wang
|
Linhai Zhang
|
Yilong Lai
|
Yulan He
|
Deyu Zhou
Key-Value (KV) cache has become a bottleneck of LLMs for long-context generation. Despite the numerous efforts in this area, the optimization for the decoding phase is generally ignored. However, we believe such optimization is crucial, especially for long-output generation tasks, based on the following two observations: (i) excessive compression during the prefill phase, which requires specific full context, impairs the comprehension of the reasoning task; (ii) deviation of heavy hitters occurs in the reasoning tasks with long outputs. Therefore, SCOPE, a simple yet efficient framework that separately performs KV cache optimization during the prefill and decoding phases, is introduced. Specifically, the KV cache during the prefill phase is preserved to maintain the essential information, while a novel strategy based on sliding is proposed to select essential heavy hitters for the decoding phase. Memory usage and memory transfer are further optimized using adaptive and discontinuous strategies. Extensive experiments on LongGenBench show the effectiveness and generalization of SCOPE and its compatibility as a plug-in to other prefill-only KV compression methods.
pdf
bib
abs
Mitigating Non-Representative Prototypes and Representation Bias in Few-Shot Continual Relation Extraction
Thanh Duc Pham
|
Nam Le Hai
|
Linh Ngo Van
|
Nguyen Thi Ngoc Diep
|
Sang Dinh
|
Thien Huu Nguyen
To address the phenomenon of similar classes, existing methods in few-shot continual relation extraction (FCRE) face two main challenges: non-representative prototypes and representation bias, especially when the number of available samples is limited. In our work, we propose Minion to address these challenges. Firstly, we leverage the General Orthogonal Frame (GOF) structure, based on the concept of Neural Collapse, to create robust class prototypes with clear separation, even between analogous classes. Secondly, we utilize label description representations as global class representatives within the fast-slow contrastive learning paradigm. These representations consistently encapsulate the essential attributes of each relation, acting as global information that helps mitigate overfitting and reduce representation bias caused by the limited local few-shot examples within a class. Extensive experiments on well-known FCRE benchmarks show that our method outperforms state-of-the-art approaches, demonstrating its effectiveness for advancing RE systems.
pdf
bib
abs
MoQAE: Mixed-Precision Quantization for Long-Context LLM Inference via Mixture of Quantization-Aware Experts
Wei Tao
|
Haocheng Lu
|
Xiaoyang Qu
|
Bin Zhang
|
Kai Lu
|
Jiguang Wan
|
Jianzong Wang
One of the primary challenges in optimizing large language models (LLMs) for long-context inference lies in the high memory consumption of the Key-Value (KV) cache. Existing approaches, such as quantization, have demonstrated promising results in reducing memory usage. However, current quantization methods cannot take both effectiveness and efficiency into account. In this paper, we propose MoQAE, a novel mixed-precision quantization method via mixture of quantization-aware experts. First, we view different quantization bit-width configurations as experts and use the traditional mixture of experts (MoE) method to select the optimal configuration. To avoid the inefficiency caused by inputting tokens one by one into the router in the traditional MoE method, we input the tokens into the router chunk by chunk. Second, we design a lightweight router-only fine-tuning process to train MoQAE with a comprehensive loss to learn the trade-off between model accuracy and memory usage. Finally, we introduce a routing freezing (RF) and a routing sharing (RS) mechanism to further reduce the inference overhead. Extensive experiments on multiple benchmark datasets demonstrate that our method outperforms state-of-the-art KV cache quantization approaches in both efficiency and effectiveness.
pdf
bib
abs
PrivacyRestore: Privacy-Preserving Inference in Large Language Models via Privacy Removal and Restoration
Ziqian Zeng
|
Jianwei Wang
|
Junyao Yang
|
Zhengdong Lu
|
Haoran Li
|
Huiping Zhuang
|
Cen Chen
The widespread usage of online Large Language Model (LLM) inference services has raised significant privacy concerns about the potential exposure of private information in user inputs. Existing privacy protection methods for LLMs suffer from either insufficient privacy protection with performance degradation, or large inference time overhead. To address these limitations, we propose PrivacyRestore, a plug-and-play method to protect the privacy of user inputs during LLM inference for the client-server scenario. The server first trains restoration vectors for each privacy span type offline and then releases them to the clients. During inference, the client aggregates restoration vectors of all privacy spans in the user query into a meta restoration vector, which is later sent to the server to restore information. Before transmission, the client removes all privacy spans in the user query and applies a d𝜒-privacy mechanism to the meta vector for privacy protection. We prove that our method can inherently prevent the linear growth of the privacy budget. We conduct extensive experiments covering the medical and legal domains, and demonstrate that PrivacyRestore effectively protects private information and maintains acceptable levels of performance and inference efficiency.
pdf
bib
abs
Meta-rater: A Multi-dimensional Data Selection Method for Pre-training Language Models
Xinlin Zhuang
|
Jiahui Peng
|
Ren Ma
|
Yinfan Wang
|
Tianyi Bai
|
Xingjian Wei
|
Qiu Jiantao
|
Chi Zhang
|
Ying Qian
|
Conghui He
The composition of pre-training datasets for large language models (LLMs) remains largely undisclosed, hindering transparency and efforts to optimize data quality, a critical driver of model performance. Current data selection methods, such as natural language quality assessments, diversity-based filters, and classifier-based approaches, are limited by single-dimensional evaluation or redundancy-focused strategies. To address these gaps, we propose four dimensions to evaluate data quality: professionalism, readability, reasoning, and cleanliness. We further introduce Meta-rater, a multi-dimensional data selection method that integrates these dimensions with existing quality metrics through learned optimal weightings. Meta-rater employs proxy models to train a regression model that predicts validation loss, enabling the identification of optimal combinations of quality scores. Experiments demonstrate that Meta-rater doubles convergence speed for 1.3B parameter models and improves downstream task performance by 3.23%, with advantages that scale to models as large as 7.2B parameters. Our work establishes that holistic, multi-dimensional quality integration significantly outperforms conventional single-dimension approaches, offering a scalable paradigm for enhancing pre-training efficiency and model capability. To advance future research, we release scripts, data, and models at https://github.com/opendatalab/Meta-rater.
pdf
bib
abs
GuessArena: Guess Who I Am? A Self-Adaptive Framework for Evaluating LLMs in Domain-Specific Knowledge and Reasoning
Qingchen Yu
|
Zifan Zheng
|
Ding Chen
|
Simin Niu
|
Bo Tang
|
Feiyu Xiong
|
Zhiyu Li
The evaluation of large language models (LLMs) has traditionally relied on static benchmarks, a paradigm that poses two major limitations: (1) predefined test sets lack adaptability to diverse application domains, and (2) standardized evaluation protocols often fail to capture fine-grained assessments of domain-specific knowledge and contextual reasoning abilities. To overcome these challenges, we propose GuessArena, an adaptive evaluation framework grounded in adversarial game-based interactions. Inspired by the interactive structure of the Guess Who I Am? game, our framework seamlessly integrates dynamic domain knowledge modeling with progressive reasoning assessment to improve evaluation fidelity. Empirical studies across five vertical domains (finance, healthcare, manufacturing, information technology, and education) demonstrate that GuessArena effectively distinguishes LLMs in terms of domain knowledge coverage and reasoning chain completeness. Compared to conventional benchmarks, our method provides substantial advantages in interpretability, scalability, and scenario adaptability.
pdf
bib
abs
Sample-Efficient Human Evaluation of Large Language Models via Maximum Discrepancy Competition
Kehua Feng
|
Keyan Ding
|
Tan Hongzhi
|
Kede Ma
|
Zhihua Wang
|
Shuangquan Guo
|
Cheng Yuzhou
|
Ge Sun
|
Guozhou Zheng
|
Qiang Zhang
|
Huajun Chen
The past years have witnessed a proliferation of large language models (LLMs). Yet, reliable evaluation of LLMs is challenging due to the inaccuracy of standard metrics in reflecting human perception of text quality and the inefficiency in sampling informative test examples for human evaluation. This paper presents a sample-efficient human evaluation method for LLMs based on the principle of MAximum Discrepancy (MAD) competition. MAD automatically selects a small set of informative input instructions, each of which maximizes the discrepancy of two LLMs’ responses, which are subsequently subject to three-alternative forced choice by human subjects. The pairwise comparison results of multiple LLMs are then aggregated into a global ranking using the Elo rating system. We compare eight representative LLMs in terms of four skills: knowledge understanding, mathematical reasoning, writing, and coding. Experimental results show that the proposed method reliably achieves the “golden” ranking of LLMs with a minimum set of input instructions, which in turn reveals their relative strengths and weaknesses and offers valuable insights for further LLM advancement.
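The aggregation step mentioned above can be sketched with the standard Elo update. The sketch assumes each human judgment is encoded as a score for the first model (1.0 for a win, 0.5 for a tie, 0.0 for a loss, matching a three-alternative forced choice); the K-factor of 32, the initial rating of 1000, and the model names are illustrative assumptions, not values taken from the paper.

```python
def expected_score(rating_a, rating_b):
    """Standard Elo expected score of A against B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def elo_rank(comparisons, k=32.0, initial=1000.0):
    """Aggregate pairwise outcomes into a ranking.

    comparisons: iterable of (model_a, model_b, score_a) with score_a in
    {1.0, 0.5, 0.0}. k and initial are assumed defaults.
    """
    ratings = {}
    for model_a, model_b, score_a in comparisons:
        r_a = ratings.setdefault(model_a, initial)
        r_b = ratings.setdefault(model_b, initial)
        e_a = expected_score(r_a, r_b)
        # Zero-sum update: what A gains, B loses.
        ratings[model_a] = r_a + k * (score_a - e_a)
        ratings[model_b] = r_b - k * (score_a - e_a)
    return sorted(ratings.items(), key=lambda item: item[1], reverse=True)


# Hypothetical judgments on three selected instructions.
judgments = [("model_x", "model_y", 1.0),
             ("model_y", "model_z", 1.0),
             ("model_x", "model_z", 0.5)]
for name, rating in elo_rank(judgments):
    print(f"{name}: {rating:.1f}")
```

Sequential Elo updates depend on the order of the comparisons, so a practical pipeline might average over several random orderings; that refinement is left out of this sketch.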
pdf
bib
abs
DTCRS: Dynamic Tree Construction for Recursive Summarization
Guanran Luo
|
Zhongquan Jian
|
Wentao Qiu
|
Meihong Wang
|
Qingqiang Wu
Retrieval-Augmented Generation (RAG) mitigates the hallucination problem of Large Language Models (LLMs) by incorporating external knowledge. Recursive summarization constructs a hierarchical summary tree by clustering text chunks, integrating information from multiple parts of a document to provide evidence for abstractive questions involving multi-step reasoning. However, summary trees often contain a large number of redundant summary nodes, which not only increase construction time but may also negatively impact question answering. Moreover, recursive summarization is not suitable for all types of questions. We introduce DTCRS, a method that dynamically generates summary trees based on document structure and query semantics. DTCRS determines whether a summary tree is necessary by analyzing the question type. It then decomposes the question and uses the embeddings of sub-questions as initial cluster centers, reducing redundant summaries while improving the relevance between summaries and the question. Our approach significantly reduces summary tree construction time and achieves substantial improvements across three QA tasks. Additionally, we investigate the applicability of recursive summarization to different question types, providing valuable insights for future research.
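The idea of using sub-question embeddings as initial cluster centers can be sketched as a seeded k-means over chunk embeddings. The abstract does not name the clustering algorithm, so k-means is an assumption here, and the 2-D vectors stand in for real embeddings purely for illustration.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def seeded_kmeans(chunk_embeddings, seed_centers, iterations=10):
    """k-means whose initial centers are supplied (e.g. sub-question embeddings).

    Returns (assignment, centers), where assignment[i] is the cluster index
    of chunk i. Clusters that end up empty keep their previous center.
    """
    centers = [list(c) for c in seed_centers]
    assignment = [0] * len(chunk_embeddings)
    for _ in range(iterations):
        # Assignment step: each chunk goes to its nearest center.
        for i, point in enumerate(chunk_embeddings):
            assignment[i] = min(range(len(centers)),
                                key=lambda c: squared_distance(point, centers[c]))
        # Update step: move each center to the mean of its members.
        for c in range(len(centers)):
            members = [p for p, a in zip(chunk_embeddings, assignment) if a == c]
            if members:
                centers[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return assignment, centers


# Toy data: four chunk embeddings and two sub-question embeddings as seeds.
chunks = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
sub_questions = [(1.0, 1.0), (9.0, 9.0)]
print(seeded_kmeans(chunks, sub_questions))
```

Seeding the centers this way fixes the number of clusters to the number of sub-questions and ties each cluster to one sub-question from the start, which is one plausible reading of how the relevance between summaries and the question would improve.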
pdf
bib
abs
A Generative Adaptive Replay Continual Learning Model for Temporal Knowledge Graph Reasoning
Zhiyu Zhang
|
Wei Chen
|
Youfang Lin
|
Huaiyu Wan
Recent Continual Learning (CL)-based Temporal Knowledge Graph Reasoning (TKGR) methods focus on significantly reducing computational cost and mitigating catastrophic forgetting caused by fine-tuning models with new data. However, existing CL-based TKGR methods still face two key limitations: (1) They usually one-sidedly reorganize individual historical facts, while overlooking the historical context essential for accurately understanding the historical semantics of these facts; (2) They preserve historical knowledge by simply replaying historical facts, while ignoring the potential conflicts between historical and emerging facts. In this paper, we propose a Deep Generative Adaptive Replay (DGAR) method, which can generate and adaptively replay historical entity distribution representations from the whole historical context. To address the first challenge, historical context prompts as sampling units are built to preserve the whole historical context information. To overcome the second challenge, a pre-trained diffusion model is adopted to generate the historical distribution. During the generation process, the common features between the historical and current distributions are enhanced under the guidance of the TKGR model. In addition, a layer-by-layer adaptive replay mechanism is designed to effectively integrate historical and current distributions. Experimental results demonstrate that DGAR significantly outperforms baselines in reasoning and mitigating forgetting.
pdf
bib
abs
ARise: Towards Knowledge-Augmented Reasoning via Risk-Adaptive Search
Yize Zhang
|
Tianshu Wang
|
Sirui Chen
|
Kun Wang
|
Xingyu Zeng
|
Hongyu Lin
|
Xianpei Han
|
Le Sun
|
Chaochao Lu
Large language models (LLMs) have demonstrated impressive capabilities, and enhancing their reasoning through scaling test-time compute is receiving increasing attention. However, their application in open-ended, knowledge-intensive, complex reasoning scenarios is still limited. Reasoning-oriented methods struggle to generalize to open-ended scenarios due to implicit assumptions of complete world knowledge. Meanwhile, knowledge-augmented reasoning (KAR) methods fail to address two core challenges: 1) error propagation, where errors in early steps cascade through the chain, and 2) verification bottleneck, where the explore–exploit trade-off arises in multi-branch decision processes. To overcome these limitations, we introduce ARise, a novel framework that integrates risk assessment of intermediate reasoning states with dynamic retrieval-augmented generation (RAG) within a Monte Carlo tree search paradigm. This approach enables effective construction and optimization of reasoning plans across multiple maintained hypothesis branches. Experimental results show that ARise significantly outperforms the state-of-the-art KAR methods by up to 23.10%, and the latest RAG-equipped large reasoning models by up to 25.37%. Our project page is at https://opencausalab.github.io/ARise.
pdf
bib
abs
PKAG-DDI: Pairwise Knowledge-Augmented Language Model for Drug-Drug Interaction Event Text Generation
Ziyan Wang
|
Zhankun Xiong
|
Feng Huang
|
Wen Zhang
Drug-drug interactions (DDIs) arise when multiple drugs are administered concurrently. Accurately predicting the specific mechanisms underlying DDIs (named DDI events or DDIEs) is critical for the safe clinical use of drugs. DDIEs are typically represented as textual descriptions. However, most computational methods focus on predicting the DDIE class label rather than generating human-readable natural language, which increases clinicians’ interpretation costs. Furthermore, current methods overlook the fact that each drug assumes distinct biological functions in a DDI, which, when used as input context, can enhance the understanding of the DDIE process and benefit DDIE generation by the language model (LM). In this work, we propose a novel pairwise knowledge-augmented generative method (termed PKAG-DDI) for DDIE text generation. It consists of a pairwise knowledge selector, which efficiently injects structural information between drugs bidirectionally and simultaneously to select pairwise biological functions from the knowledge set, and a pairwise knowledge integration strategy that matches and integrates the selected biological functions into the LM. Experiments on two professional datasets show that PKAG-DDI outperforms existing methods in DDIE text generation, especially in challenging inductive scenarios, indicating its practicality and generalization.
pdf
bib
abs
Knowledge-Augmented Multimodal Clinical Rationale Generation for Disease Diagnosis with Small Language Models
Shuai Niu
|
Jing Ma
|
Hongzhan Lin
|
Liang Bai
|
Zhihua Wang
|
Richard Yi Da Xu
|
Yunya Song
|
Xian Yang
Interpretation is critical for disease diagnosis, but existing models struggle to balance predictive accuracy with human-understandable rationales. While large language models (LLMs) offer strong reasoning abilities, their clinical use is limited by high computational costs and restricted multimodal reasoning ability. Small language models (SLMs) are efficient but lack advanced reasoning for integrating multimodal medical data. In addition, both LLMs and SLMs lack domain knowledge for trustworthy reasoning. Therefore, we propose ClinRaGen, enhancing SLMs by leveraging LLM-derived reasoning ability via rationale distillation and domain knowledge injection for trustworthy multimodal rationale generation. Key innovations include a sequential rationale distillation framework that equips SLMs with LLM-comparable multimodal reasoning abilities, and a knowledge-augmented attention mechanism that jointly unifies multimodal representation from time series and textual data in the same encoding space, enabling it to be naturally interpreted by SLMs while incorporating domain knowledge for reliable rationale generation. Experiments on real-world medical datasets show that ClinRaGen achieves state-of-the-art performance in disease diagnosis and rationale generation, demonstrating the effectiveness of combining LLM-driven reasoning with knowledge augmentation for improved interpretability.
pdf
bib
abs
TWIST: Text-encoder Weight-editing for Inserting Secret Trojans in Text-to-Image Models
Xindi Li
|
Zhe Liu
|
Tong Zhang
|
Jiahao Chen
|
Qingming Li
|
Jinbao Li
|
Shouling Ji
Text-to-image (T2I) models excel at generating high-quality images from text via powerful text encoders, but training these encoders demands substantial computational resources. Consequently, many users seek pre-trained text encoders from model plugin-sharing platforms like Civitai and Hugging Face, which introduces an underexplored threat: the potential for adversaries to embed Trojans within these plugins. Existing Trojan attacks often require extensive training data and suffer from poor generalization across different triggers, limiting their effectiveness and scalability. To the best of our knowledge, this paper introduces the first **T**ext-encoder **W**eight-editing method for **I**nserting **S**ecret **T**rojans (**TWIST**). By identifying the *bottleneck MLP layer* (the critical point where minimal edits can dominantly control cross-modal alignment), TWIST achieves training-free and data-free Trojan insertion, which makes it highly efficient and practical. The experimental results across various triggers demonstrate that TWIST attains an average attack success rate of 91%, a 78% improvement over the state-of-the-art (SOTA) method proposed in 2024, highlighting its excellent generalization capability. Moreover, TWIST reduces modified parameters by 8-fold and cuts injection time to 25 seconds. Our findings underscore the security risks associated with text encoders in real-world applications and emphasize the need for more robust defense mechanisms.
pdf
bib
abs
Frictional Agent Alignment Framework: Slow Down and Don’t Break Things
Abhijnan Nath
|
Carine Graff
|
Andrei Bachinin
|
Nikhil Krishnaswamy
AI support of collaborative interactions entails mediating potential misalignment between interlocutor beliefs. Common preference alignment methods like DPO excel in static settings, but struggle in dynamic collaborative tasks where the explicit signals of interlocutor beliefs are sparse and skewed. We propose the Frictional Agent Alignment Framework (FAAF) to generate precise, context-aware “friction” that prompts for deliberation and re-examination of existing evidence. FAAF’s two-player objective decouples from data skew: a frictive-state policy identifies belief misalignments, while an intervention policy crafts collaborator-preferred responses. We derive an analytical solution to this objective, enabling training a single policy via a simple supervised loss. Experiments on three benchmarks show FAAF outperforms competitors in producing concise, interpretable friction and in OOD generalization. By aligning LLMs to act as adaptive “thought partners” rather than passive responders, FAAF advances scalable, dynamic human-AI collaboration. Our code and data can be found at https://github.com/csu-signal/FAAF_ACL.
pdf
bib
abs
Powerformer: Efficient and High-Accuracy Privacy-Preserving Language Model with Homomorphic Encryption
Dongjin Park
|
Eunsang Lee
|
Joon-Woo Lee
We propose Powerformer, an efficient homomorphic encryption (HE)-based privacy-preserving language model (PPLM) designed to reduce computation overhead while maintaining model performance. Powerformer incorporates three key techniques to optimize encrypted computations: (1) a novel distillation technique that replaces softmax and layer normalization (LN) with computationally efficient power and linear functions, ensuring no performance degradation while enabling seamless encrypted computation; (2) a pseudo-sign composite approximation method that accurately approximates GELU and tanh functions with minimal computational overhead; and (3) a homomorphic matrix multiplication algorithm specifically optimized for Transformer models, enhancing efficiency in encrypted environments. By integrating these techniques, Powerformer based on the BERT-base model achieves a 45% reduction in computation time compared to the state-of-the-art HE-based PPLM without any loss in accuracy.
pdf
bib
abs
Beware of Your Po! Measuring and Mitigating AI Safety Risks in Role-Play Fine-Tuning of LLMs
Weixiang Zhao
|
Yulin Hu
|
Yang Deng
|
Jiahe Guo
|
Xingyu Sui
|
Xinyang Han
|
An Zhang
|
Yanyan Zhao
|
Bing Qin
|
Tat-Seng Chua
|
Ting Liu
Role-playing enables large language models (LLMs) to engage users in immersive and personalized interactions, but it also introduces significant safety risks. Existing role-play fine-tuning techniques improve role adaptability but may degrade safety performance, particularly for villainous characters. In this work, we conduct the first comprehensive assessment of role-play fine-tuning risks by training 95 role-specific LLMs using RoleBench. Our experiments reveal that role-play fine-tuning leads to a noticeable decline in safety performance, with safety risks varying based on character traits. To tackle this challenge, we propose Safety-Aware Role-Play Fine-Tuning (SaRFT), a novel method designed to balance role-playing capabilities and safety. Extensive experiments on LLaMA-3-8B-Instruct, Gemma-2-9B-it, and Qwen2.5-7B-Instruct demonstrate that SaRFT consistently outperforms state-of-the-art baselines under both LoRA and full-parameter fine-tuning settings. Our findings highlight the necessity of role-adaptive safety measures and provide insights into mitigating role-specific safety risks in role-playing LLMs.
pdf
bib
abs
Can Graph Neural Networks Learn Language with Extremely Weak Text Supervision?
Zihao Li
|
Lecheng Zheng
|
Bowen Jin
|
Dongqi Fu
|
Baoyu Jing
|
Yikun Ban
|
Jingrui He
|
Jiawei Han
While great success has been achieved in building vision models with Contrastive Language-Image Pre-training (CLIP) over Internet-scale image-text pairs, building transferable Graph Neural Networks (GNNs) with a CLIP pipeline is challenging because of the scarcity of labeled data and text supervision, different levels of downstream tasks, and the conceptual gaps between domains. In this work, to address these issues, we propose a multi-modal prompt learning paradigm to effectively adapt pre-trained GNNs to downstream tasks and data, given only a few semantically labeled samples, each with extremely weak text supervision. Our new paradigm embeds the graphs directly in the same space as the Large Language Models (LLMs) by learning both graph prompts and text prompts simultaneously. We demonstrate the superior performance of our paradigm in few-shot, multi-task-level, and cross-domain settings. Moreover, we build the first CLIP-style zero-shot classification prototype that can generalize GNNs to unseen classes with extremely weak text supervision.
pdf
bib
abs
Towards Enhanced Immersion and Agency for LLM-based Interactive Drama
Hongqiu Wu
|
Weiqi Wu
|
Tianyang Xu
|
Jiameng Zhang
|
Hai Zhao
LLM-based Interactive Drama is a novel AI-based dialogue scenario, where the user (i.e. the player) plays the role of a character in the story, has conversations with characters played by LLM agents, and experiences an unfolding story. This paper begins with understanding interactive drama from two aspects: Immersion—the player’s feeling of being present in the story—and Agency—the player’s ability to influence the story world. Both are crucial to creating an enjoyable interactive experience, while they have been underexplored in previous work. To enhance these two aspects, we first propose Playwriting-guided Generation, a novel method that helps LLMs craft dramatic stories with substantially improved structures and narrative quality. Additionally, we introduce Plot-based Reflection for LLM agents to refine their reactions to align with the player’s intentions. Our evaluation relies on human judgment to assess the gains of our methods in terms of immersion and agency.
pdf
bib
abs
Disambiguating Reference in Visually Grounded Dialogues through Joint Modeling of Textual and Multimodal Semantic Structures
Shun Inadumi
|
Nobuhiro Ueda
|
Koichiro Yoshino
Multimodal reference resolution, including phrase grounding, aims to understand the semantic relations between mentions and real-world objects. Phrase grounding between images and their captions is a well-established task. In contrast, for real-world applications, it is essential to integrate textual and multimodal reference resolution to unravel the reference relations within dialogue, especially in handling ambiguities caused by pronouns and ellipses. This paper presents a framework that unifies textual and multimodal reference resolution by mapping mention embeddings to object embeddings and selecting mentions or objects based on their similarity. Our experiments show that learning textual reference resolution, such as coreference resolution and predicate-argument structure analysis, positively affects performance in multimodal reference resolution. In particular, our model with coreference resolution performs better in pronoun phrase grounding than representative models for this task, MDETR and GLIP. Our qualitative analysis demonstrates that incorporating textual reference relations strengthens the confidence scores between mentions, including pronouns and predicates, and objects, which can reduce the ambiguities that arise in visually grounded dialogues.
pdf
bib
abs
Improving Factuality with Explicit Working Memory
Mingda Chen
|
Yang Li
|
Karthik Padthe
|
Rulin Shao
|
Alicia Yi Sun
|
Luke Zettlemoyer
|
Gargi Ghosh
|
Wen-tau Yih
Large language models can generate factually inaccurate content, a problem known as hallucination. Recent works have built upon retrieval-augmented generation to improve factuality through iterative prompting, but these methods are limited by the traditional RAG design. To address these challenges, we introduce Ewe (Explicit Working Memory), a novel approach that enhances factuality in long-form text generation by integrating a working memory that receives real-time feedback from external resources. The memory is refreshed based on online fact-checking and retrieval feedback, allowing Ewe to rectify false claims during the generation process and ensure more accurate and reliable outputs. Our experiments demonstrate that Ewe outperforms strong baselines on four fact-seeking long-form generation datasets, increasing the factuality metric, VeriScore, by 2 to 6 points absolute without sacrificing the helpfulness of the responses. Further analysis reveals that the design of rules for memory updates, configurations of memory units, and the quality of the retrieval datastore are crucial factors for influencing model performance.
pdf
bib
abs
Gradient-Adaptive Policy Optimization: Towards Multi-Objective Alignment of Large Language Models
Chengao Li
|
Hanyu Zhang
|
Yunkun Xu
|
Hongyan Xue
|
Xiang Ao
|
Qing He
Reinforcement Learning from Human Feedback (RLHF) has emerged as a powerful technique for aligning large language models (LLMs) with human preferences. However, effectively aligning LLMs with diverse human preferences remains a significant challenge, particularly when they conflict. To address this issue, we frame human value alignment as a multi-objective optimization problem, aiming to maximize a set of potentially conflicting objectives. We introduce Gradient-Adaptive Policy Optimization (GAPO), a novel fine-tuning paradigm that employs multiple-gradient descent to align LLMs with diverse preference distributions. GAPO adaptively rescales the gradients for each objective to determine an update direction that optimally balances the trade-offs between objectives. Additionally, we introduce P-GAPO, which incorporates user preferences across different objectives and achieves Pareto solutions that better align with the user’s specific needs.
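As a rough illustration of the multiple-gradient descent idea mentioned in this abstract, the sketch below computes the classic minimum-norm convex combination of two objective gradients. It is not GAPO itself: the adaptive gradient rescaling and the P-GAPO preference weighting are not reproduced, and the function names (`dot`, `min_norm_direction`) and the toy gradients are assumptions made for this example only.

```python
from typing import List


def dot(a: List[float], b: List[float]) -> float:
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def min_norm_direction(g1: List[float], g2: List[float]) -> List[float]:
    """Return the minimum-norm point on the segment between g1 and g2.

    This is the two-objective closed form used in multiple-gradient
    descent: minimize ||a*g1 + (1-a)*g2||^2 over a in [0, 1].
    Hypothetical helper for illustration, not the authors' code.
    """
    diff = [x - y for x, y in zip(g1, g2)]
    denom = dot(diff, diff)
    if denom == 0.0:
        # Identical gradients: any convex combination equals g1.
        return list(g1)
    # Unconstrained minimizer, then clipped to the valid range [0, 1].
    alpha = -dot(g2, diff) / denom
    alpha = max(0.0, min(1.0, alpha))
    return [alpha * x + (1.0 - alpha) * y for x, y in zip(g1, g2)]


if __name__ == "__main__":
    # Toy gradients for two conflicting objectives (made-up numbers).
    d = min_norm_direction([1.0, 0.0], [0.0, 1.0])
    print(d)  # [0.5, 0.5]
```

When the result is non-zero, it has a non-negative inner product with both input gradients, which is why this combination is a common starting point for balancing conflicting objectives.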
pdf
bib
abs
Dynamic Parallel Tree Search for Efficient LLM Reasoning
Yifu Ding
|
Wentao Jiang
|
Shunyu Liu
|
Yongcheng Jing
|
Jinyang Guo
|
Yingjie Wang
|
Jing Zhang
|
Zengmao Wang
|
Ziwei Liu
|
Bo Du
|
Xianglong Liu
|
Dacheng Tao
Tree of Thoughts (ToT) enhances Large Language Model (LLM) reasoning by structuring problem-solving as a spanning tree. However, recent methods focus on search accuracy while overlooking computational efficiency. The challenges of accelerating ToT lie in the frequent switching of reasoning focus and the redundant exploration of suboptimal solutions. To alleviate this dilemma, we propose Dynamic Parallel Tree Search (DPTS), a novel parallelism framework that aims to dynamically optimize the reasoning path in inference. It includes the Parallelism Streamline in the generation phase to build up a flexible and adaptive parallelism with arbitrary paths by cache management and alignment. Meanwhile, the Search and Transition Mechanism filters potential candidates to dynamically maintain the reasoning focus on more promising solutions with less redundancy. Experiments on Qwen-2.5 and Llama-3 on math and code datasets show that DPTS significantly improves efficiency by 2-4× on average while maintaining or even surpassing existing reasoning algorithms in accuracy, making ToT-based reasoning more scalable and computationally efficient. Code is released at: https://github.com/yifu-ding/DPTS.
pdf
bib
abs
Pre3: Enabling Deterministic Pushdown Automata for Faster Structured LLM Generation
Junyi Chen
|
Shihao Bai
|
Zaijun Wang
|
Siyu Wu
|
Chuheng Du
|
Hailong Yang
|
Ruihao Gong
|
Shengzhong Liu
|
Fan Wu
|
Guihai Chen
Extensive LLM applications demand efficient structured generations, particularly for LR(1) grammars, to produce outputs in specified formats (e.g., JSON). Existing methods primarily parse LR(1) grammars into a pushdown automaton (PDA), leading to runtime execution overhead for context-dependent token processing, which is especially inefficient under large inference batches. To address these issues, we propose Pre3, which exploits deterministic pushdown automata (DPDA) to optimize the constrained LLM decoding efficiency. First, by **pre**computing **pre**fix-conditioned edges during the **pre**processing, Pre3 enables ahead-of-time edge analysis and thus makes parallel transition processing possible. Further, leveraging the prefix-conditioned edges, Pre3 introduces a novel approach that transforms LR(1) transition graphs into DPDA, eliminating the need for runtime path exploration and achieving edge transitions with minimal overhead. Pre3 can be seamlessly integrated into standard LLM inference frameworks, improving time per output token (TPOT) by up to 40% and throughput by up to 36% in our experiments. Our code is available at https://github.com/ModelTC/lightllm.
pdf
bib
abs
SHARE: An SLM-based Hierarchical Action CorREction Assistant for Text-to-SQL
Ge Qu
|
Jinyang Li
|
Bowen Qin
|
Xiaolong Li
|
Nan Huo
|
Chenhao Ma
|
Reynold Cheng
Current self-correction approaches in text-to-SQL face two critical limitations: 1) Conventional self-correction methods rely on recursive self-calls of LLMs, resulting in multiplicative computational overhead, and 2) LLMs struggle to implement effective error detection and correction for monolithic SQL queries, as they fail to demonstrate the underlying reasoning path. In this work, we propose **SHARE**, an **S**LM-based **H**ierarchical **A**ction cor**RE**ction assistant that enables LLMs to perform more precise error localization and efficient correction. SHARE orchestrates three specialized Small Language Models (SLMs) in a sequential pipeline, where it first transforms monolithic SQL queries into stepwise action trajectories that reveal underlying reasoning, followed by a two-phase granular refinement. We further propose a novel hierarchical self-evolution strategy for data-efficient training. Our experimental results demonstrate that SHARE effectively enhances self-correction capabilities while proving robust across various LLMs. Furthermore, our comprehensive analysis shows that SHARE maintains strong performance even in low-resource training settings, which is particularly valuable for text-to-SQL applications with data privacy constraints.
pdf
bib
abs
GenderAlign: An Alignment Dataset for Mitigating Gender Bias in Large Language Models
Tao Zhang
|
Ziqian Zeng
|
YuxiangXiao YuxiangXiao
|
Huiping Zhuang
|
Cen Chen
|
James R. Foulds
|
Shimei Pan
Large Language Models (LLMs) are prone to generating content that exhibits gender biases, raising significant ethical concerns. Alignment, the process of fine-tuning LLMs to better align with desired behaviors, is recognized as an effective approach to mitigate gender biases. Although proprietary LLMs have made significant strides in mitigating gender bias, their alignment datasets are not publicly available. The commonly used and publicly available alignment dataset, HH-RLHF, still exhibits gender bias to some extent. There is a lack of publicly available alignment datasets specifically designed to address gender bias. Hence, we developed a new dataset named GenderAlign, aiming at mitigating a comprehensive set of gender biases in LLMs. This dataset comprises 8k single-turn dialogues, each paired with a “chosen” and a “rejected” response. Compared to the “rejected” responses, the “chosen” responses demonstrate lower levels of gender bias and higher quality. Furthermore, we categorized the gender biases in the “rejected” responses of GenderAlign into 4 principal categories. The experimental results show the effectiveness of GenderAlign in reducing gender bias in LLMs.
pdf
bib
abs
Large Language and Protein Assistant for Protein-Protein Interactions Prediction
Peng Zhou
|
Pengsen Ma
|
Jianmin Wang
|
Xibao Cai
|
Haitao Huang
|
Wei Liu
|
Longyue Wang
|
Lai Hou Tim
|
Xiangxiang Zeng
Predicting the types and affinities of protein-protein interactions (PPIs) is crucial for understanding biological processes and developing novel therapeutic approaches. While encoding proteins themselves is essential, PPI networks can also provide rich prior knowledge for these predictive tasks. However, existing methods oversimplify the problem of PPI prediction in a semi-supervised manner when utilizing PPI networks, limiting their practical application. Furthermore, how to effectively use the rich prior knowledge of PPI networks for novel proteins not present in the network remains an unexplored issue. Additionally, due to inflexible architectures, most existing methods cannot handle complexes containing a flexible number of proteins. To overcome these limitations, we introduce LLaPA (Large Language and Protein Assistant), a multimodal large language model that integrates proteins and PPI networks. LLaPA offers a more rational approach to utilizing PPI networks for PPI prediction and can fully exploit the information of PPI networks for unseen proteins. Through natural language instructions, LLaPA can accept a flexible number of protein sequences and has the potential to perform various protein tasks. Experiments show that LLaPA achieves state-of-the-art performance in multi-label PPI (mPPI) type prediction and is capable of predicting the binding affinity between multiple interacting proteins based on sequence data.
pdf
bib
abs
An Empirical Study of Many-to-Many Summarization with Large Language Models
Jiaan Wang
|
Fandong Meng
|
Zengkui Sun
|
Yunlong Liang
|
Yuxuan Cao
|
Jiarong Xu
|
Haoxiang Shi
|
Jie Zhou
Many-to-many summarization (M2MS) aims to process documents in any language and generate the corresponding summaries also in any language. Recently, large language models (LLMs) have shown strong multi-lingual abilities, giving them the potential to perform M2MS in real applications. This work presents a systematic empirical study on LLMs’ M2MS ability. Specifically, we first reorganize M2MS data based on eight previous domain-specific datasets. The reorganized data contains 47.8K samples spanning five domains and six languages, which could be used to train and evaluate LLMs. Then, we benchmark 18 LLMs in a zero-shot manner and an instruction-tuning manner. Fine-tuned traditional models (e.g., mBART) are also evaluated for comparison. Our experiments reveal that zero-shot LLMs achieve competitive results with fine-tuned traditional models. After instruction tuning, open-source LLMs can significantly improve their M2MS ability, and outperform zero-shot LLMs (including GPT-4) in terms of automatic evaluations. In addition, we demonstrate this task-specific improvement does not sacrifice the LLMs’ general task-solving abilities. However, as revealed by our human evaluation, LLMs still face the factuality issue, and the instruction tuning might intensify the issue. Thus, how to control factual errors becomes the key when building LLM summarizers in real applications, and deserves attention in future research.
pdf
bib
abs
Locate-and-Focus: Enhancing Terminology Translation in Speech Language Models
Suhang Wu
|
Jialong Tang
|
Chengyi Yang
|
Pei Zhang
|
Baosong Yang
|
Junhui Li
|
Junfeng Yao
|
Min Zhang
|
Jinsong Su
Direct speech translation (ST) has garnered increasing attention, yet the accurate translation of terminology within utterances remains a great challenge. In this regard, current studies mainly concentrate on incorporating various translation knowledge into ST models. However, these methods often struggle with interference from irrelevant noise and cannot fully utilize the translation knowledge. To address these issues, in this paper, we propose a novel Locate-and-Focus method for terminology translation. It first effectively locates the speech clips containing terminologies within the utterance to construct translation knowledge, minimizing irrelevant information for the ST model. Subsequently, it associates the translation knowledge with the utterance and hypothesis from both audio and textual modalities, allowing the ST model to better focus on translation knowledge during translation. Experimental results across various datasets demonstrate that our method effectively locates terminologies within utterances and enhances the success rate of terminology translation, while maintaining robust general translation performance.
pdf
bib
abs
GuideBench: Benchmarking Domain-Oriented Guideline Following for LLM Agents
Lingxiao Diao
|
Xinyue Xu
|
Wanxuan Sun
|
Cheng Yang
|
Zhuosheng Zhang
Large language models (LLMs) have been widely deployed as autonomous agents capable of following user instructions and making decisions in real-world applications. Previous studies have made notable progress in benchmarking the instruction following capabilities of LLMs in general domains, with a primary focus on their inherent commonsense knowledge. Recently, LLMs have been increasingly deployed as domain-oriented agents, which rely on domain-oriented guidelines that may conflict with their commonsense knowledge. These guidelines exhibit two key characteristics: they consist of a wide range of domain-oriented rules and are subject to frequent updates. Despite these challenges, the absence of comprehensive benchmarks for evaluating the domain-oriented guideline following capabilities of LLMs presents a significant obstacle to their effective assessment and further development. In this paper, we introduce GuideBench, a comprehensive benchmark designed to evaluate guideline following performance of LLMs. GuideBench evaluates LLMs on three critical aspects: (i) adherence to diverse rules, (ii) robustness to rule updates, and (iii) alignment with human preferences. Experimental results on a range of LLMs indicate substantial opportunities for improving their ability to follow domain-oriented guidelines. Data and code are available at Anonymous.
pdf
bib
abs
TC–RAG: Turing–Complete RAG’s Case study on Medical LLM Systems
Xinke Jiang
|
Yue Fang
|
Rihong Qiu
|
Haoyu Zhang
|
Yongxin Xu
|
Hao Chen
|
Wentao Zhang
|
Ruizhe Zhang
|
Yuchen Fang
|
Xinyu Ma
|
Xu Chu
|
Junfeng Zhao
|
Yasha Wang
In the pursuit of enhancing domain-specific Large Language Models (LLMs), Retrieval-Augmented Generation (RAG) emerges as a promising solution to mitigate issues such as hallucinations, outdated knowledge, and limited expertise in highly specialized queries. However, existing approaches to RAG fall short by neglecting system state variables, which are crucial for ensuring adaptive control, retrieval halting, and system convergence. In this paper, we introduce Turing-Complete-RAG (TC-RAG), a novel framework, supported by rigorous proof, that addresses these challenges by incorporating a Turing Complete System to manage state variables, thereby enabling more efficient and accurate knowledge retrieval. By leveraging a memory stack system with adaptive retrieval, reasoning, and planning capabilities, TC-RAG not only ensures the controlled halting of retrieval processes but also mitigates the accumulation of erroneous knowledge via Push and Pop actions. In a case study covering the medical and general domains, our extensive experiments on seven real-world healthcare and general-domain datasets demonstrate the superiority of TC-RAG over existing methods in accuracy by over 7.20%. Our code, datasets and RAG resources are available at https://github.com/Artessay/TC-RAG.
pdf
bib
abs
SoRFT: Issue Resolving with Subtask-oriented Reinforced Fine-Tuning
Zexiong Ma
|
Chao Peng
|
Pengfei Gao
|
Xiangxin Meng
|
Yanzhen Zou
|
Bing Xie
Mainstream issue-resolving frameworks predominantly rely on commercial models, leading to high costs and privacy concerns. Existing training approaches for issue resolving struggle with poor generalization and fail to fully leverage open-source development resources. We propose **S**ubtask-**o**riented **R**einforced **F**ine-**T**uning (**SoRFT**), a novel training approach to enhance the issue resolving capability of LLMs. We decompose issue resolving into structured subtasks: file localization, function localization, line localization, and code edit generation. SoRFT consists of two training stages: (1) **rejection-sampled supervised fine-tuning**, in which Chain of Thought (CoT) data is filtered using ground-truth before fine-tuning the LLM, and (2) **rule-based reinforcement learning**, which leverages PPO with ground-truth based rewards. We evaluate the SoRFT-trained model on SWE-Bench Verified and SWE-Bench Lite, achieving state-of-the-art (SOTA) performance among open-source models (e.g., resolving 21.4% of issues on SWE-Bench Verified with SoRFT-Qwen-7B). The experimental results demonstrate that SoRFT significantly enhances issue-resolving performance, improves model generalization, and provides a cost-efficient alternative to commercial models.
pdf
bib
abs
MiniLongBench: The Low-cost Long Context Understanding Benchmark for Large Language Models
Zhongzhan Huang
|
Guoming Ling
|
Shanshan Zhong
|
Hefeng Wu
|
Liang Lin
Long Context Understanding (LCU) is a critical area for exploration in current large language models (LLMs). However, due to the inherently lengthy nature of long-text data, existing LCU benchmarks for LLMs often result in prohibitively high evaluation costs, such as testing time and inference expenses. Through extensive experimentation, we discover that existing LCU benchmarks exhibit significant redundancy, which makes evaluation inefficient. In this paper, we propose a concise data compression method tailored for long-text data with sparse information characteristics. By pruning the well-known LCU benchmark LongBench, we create MiniLongBench. This benchmark includes only 237 test samples across six major task categories and 21 distinct tasks. An empirical analysis of over 60 LLMs shows that MiniLongBench reduces the average evaluation cost to only 4.5% of the original while maintaining an average rank correlation coefficient of 0.97 with LongBench results. Therefore, our MiniLongBench, as a low-cost benchmark, holds great potential to substantially drive future research into the LCU capabilities of LLMs.
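To make the rank-correlation claim concrete, here is a minimal sketch of how agreement between a full benchmark and a pruned one could be measured with Spearman's rank correlation. The abstract does not say which rank correlation coefficient is used, so Spearman is an assumption here, and the model scores in the usage section are made-up placeholders rather than results from the paper.

```python
from math import sqrt
from typing import List


def average_ranks(values: List[float]) -> List[float]:
    """Assign 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x: List[float], y: List[float]) -> float:
    """Spearman correlation: Pearson correlation of the rank vectors.

    Assumes both inputs have the same length and are not constant.
    """
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sqrt(sum((a - mx) ** 2 for a in rx))
    sy = sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)


if __name__ == "__main__":
    # Hypothetical scores of four models on the full and pruned benchmarks.
    full_scores = [41.2, 35.7, 52.9, 48.1]
    mini_scores = [40.0, 33.1, 50.3, 51.8]
    print(spearman(full_scores, mini_scores))
```

A value near 1 means the pruned benchmark orders models almost the same way as the full one, which is the property a compressed benchmark needs in order to stand in for the original.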
pdf
bib
abs
Divide-Then-Align: Honest Alignment based on the Knowledge Boundary of RAG
Xin Sun
|
Jianan Xie
|
Zhongqi Chen
|
Qiang Liu
|
Shu Wu
|
Yuehe Chen
|
Bowen Song
|
Zilei Wang
|
Weiqiang Wang
|
Liang Wang
Large language models (LLMs) augmented with retrieval systems have significantly advanced natural language processing tasks by integrating external knowledge sources, enabling more accurate and contextually rich responses. To improve the robustness of such systems against noisy retrievals, Retrieval-Augmented Fine-Tuning (RAFT) has emerged as a widely adopted method. However, RAFT conditions models to generate answers even in the absence of reliable knowledge. This behavior undermines their reliability in high-stakes domains, where acknowledging uncertainty is critical. To address this issue, we propose Divide-Then-Align (DTA), a post-training approach designed to endow RAG systems with the ability to respond with “I don’t know” when the query is out of the knowledge boundary of both the retrieved passages and the model’s internal knowledge. DTA divides data samples into four knowledge quadrants and constructs tailored preference data for each quadrant, resulting in a curated dataset for Direct Preference Optimization (DPO). Experimental results on three benchmark datasets demonstrate that DTA effectively balances accuracy with appropriate abstention, enhancing the reliability and trustworthiness of retrieval-augmented systems.
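One plausible reading of the four knowledge quadrants is a two-by-two split on whether the model's internal knowledge and the retrieved passages each cover the query. The sketch below builds DPO-style preference records under that reading. The abstract does not give the quadrant definitions or the data format, so the `Sample` fields, the quadrant labels, and the pairing rule are all assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Optional

IDK = "I don't know"


@dataclass
class Sample:
    """Hypothetical record; field names are not from the paper."""
    question: str
    gold_answer: str
    model_knows: bool           # answerable from parametric knowledge alone
    retrieval_has_answer: bool  # answer is present in the retrieved passages


def quadrant(s: Sample) -> str:
    """Label the sample with one of four (model, retrieval) quadrants."""
    model = "model-in" if s.model_knows else "model-out"
    retrieval = "retrieval-in" if s.retrieval_has_answer else "retrieval-out"
    return f"{model}/{retrieval}"


def preferred_response(s: Sample) -> str:
    """Abstain only when the query is outside both knowledge boundaries."""
    if not s.model_knows and not s.retrieval_has_answer:
        return IDK
    return s.gold_answer


def preference_pair(s: Sample, model_output: str) -> Optional[Dict[str, str]]:
    """Build a chosen/rejected record, or None if the output is already preferred."""
    chosen = preferred_response(s)
    if model_output == chosen:
        return None
    return {"prompt": s.question, "chosen": chosen, "rejected": model_output}


if __name__ == "__main__":
    s = Sample("Who wrote the memo?", "Alice", False, False)
    print(quadrant(s))                # model-out/retrieval-out
    print(preference_pair(s, "Bob"))
```

Under this reading, a confident guess in the both-out quadrant becomes the rejected response and the abstention becomes the chosen one, while in the other three quadrants the gold answer is preferred.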
pdf
bib
abs
PwnGPT: Automatic Exploit Generation Based on Large Language Models
Wanzong Peng
|
Lin Ye
|
Xuetao Du
|
Hongli Zhang
|
Dongyang Zhan
|
Yunting Zhang
|
Yicheng Guo
|
Chen Zhang
Automatic exploit generation (AEG) refers to the automatic discovery and exploitation of vulnerabilities against unknown targets. Traditional AEG often targets a single type of vulnerability and still relies on templates built from expert experience. To achieve intelligent exploit generation, we establish a comprehensive benchmark using Binary Exploitation (pwn) challenges in Capture the Flag (CTF) competitions and investigate the capabilities of Large Language Models (LLMs) in AEG based on the benchmark. To improve the performance of AEG, we propose PwnGPT, an LLM-based automatic exploit generation framework that automatically solves pwn challenges. The structural design of PwnGPT is divided into three main components: analysis, generation, and verification modules. With the help of a modular approach and structured problem inputs, PwnGPT can solve challenges that LLMs cannot directly solve. We evaluate PwnGPT on our benchmark and analyze the outputs of each module. Experimental results show that our framework is highly autonomous and capable of addressing various challenges. Compared to directly prompting LLMs, PwnGPT increases the exploit completion rate on our benchmark from 26.3% to 57.9% with the OpenAI o1-preview model and from 21.1% to 36.8% with the GPT-4o model.
pdf
bib
abs
VMLU Benchmarks: A comprehensive benchmark toolkit for Vietnamese LLMs
Cuc Thi Bui
|
Nguyen Truong Son
|
Truong Van Trang
|
Lam Viet Phung
|
Pham Nhut Huy
|
Hoang Anh Le
|
Quoc Huu Van
|
Phong Nguyen-Thuan Do
|
Van Le Tran Truc
|
Duc Thanh Chau
|
Le-Minh Nguyen
The evolution of Large Language Models (LLMs) has underscored the necessity for benchmarks designed for various languages and cultural contexts. To address this need for Vietnamese, we present the first Vietnamese Multitask Language Understanding (VMLU) Benchmarks. The VMLU benchmarks consist of four datasets that assess different capabilities of LLMs, including general knowledge, reading comprehension, reasoning, and conversational skills. This paper also provides an insightful overview of the current state of some dominant LLMs, such as Llama-3, Qwen2.5, and GPT-4, highlighting their performances and limitations when measured against these benchmarks. Furthermore, we provide insights into how prompt design can influence VMLU’s evaluation outcomes, as well as suggest that open-source LLMs can serve as effective, cost-efficient evaluators within the Vietnamese context. By offering a comprehensive and accessible benchmarking framework, the VMLU Benchmarks aim to foster the development and fine-tuning of Vietnamese LLMs, thereby establishing a foundation for their practical applications in language-specific domains.
pdf
bib
abs
Scaling up the State Size of RNN LLMs for Long-Context Scenarios
Kai Liu
|
Jianfei Gao
|
Kai Chen
The Transformer architecture has become the standard LLM architecture due to its powerful self-attention mechanism. However, it suffers from quadratic computational complexity and linear memory complexity. RNN-based LLMs have been proposed as alternatives. Yet, RNN models struggle in long-context scenarios, making it challenging to replace self-attention with RNNs. We identify the state size as a critical bottleneck, which is significantly smaller than that of Transformers with a basic context length of 2k. However, simply increasing the state size significantly raises the number of parameters and lowers training efficiency. In this paper, we propose an efficient scaling method to scale the state size of RNN models to match the 2k context length of Transformers, with a small parameter overhead. Experimental results demonstrate that scaling the state size significantly enhances long-context understanding. Retrieval performance scales almost linearly with state size, with a 454M model featuring an expanded state achieving performance comparable to a 1.47B model on FDA, a recall-intensive task. These findings highlight state scaling as a promising approach for advancing RNN-based LLMs.
pdf
bib
abs
Unifying Continuous and Discrete Text Diffusion with Non-simultaneous Diffusion Processes
Bocheng Li
|
Zhujin Gao
|
Linli Xu
Diffusion models have emerged as a promising approach for text generation, with recent works falling into two main categories: discrete and continuous diffusion models. Discrete diffusion models apply token corruption independently using categorical distributions, allowing for different diffusion progress across tokens but lacking fine-grained control. Continuous diffusion models map tokens to continuous spaces and apply fine-grained noise, but the diffusion progress is uniform across tokens, limiting their ability to capture semantic nuances. To address these limitations, we propose Non-simultaneous Continuous Diffusion Models (NeoDiff), a novel diffusion model that integrates the strengths of both discrete and continuous approaches. NeoDiff introduces a Poisson diffusion process for the forward process, enabling a flexible and fine-grained noising paradigm, and employs a time predictor for the reverse process to adaptively modulate the denoising progress based on token semantics. Furthermore, NeoDiff utilizes an optimized schedule for inference to ensure more precise noise control and improved performance. Our approach unifies the theories of discrete and continuous diffusion models, offering a more principled and effective framework for text generation. Experimental results on several text generation tasks demonstrate NeoDiff’s superior performance compared to baselines of non-autoregressive continuous and discrete diffusion models, iterative-based methods and autoregressive diffusion-based methods. These results highlight NeoDiff’s potential as a powerful tool for generating high-quality text and advancing the field of diffusion-based text generation.
pdf
bib
abs
A Strategic Coordination Framework of Small LMs Matches Large LMs in Data Synthesis
Xin Gao
|
Qizhi Pei
|
Zinan Tang
|
Yu Li
|
Honglin Lin
|
Jiang Wu
|
Lijun Wu
|
Conghui He
While data synthesis and distillation are promising strategies to enhance small language models, current approaches heavily rely on Large Language Models (LLMs), which suffer from high computational costs, environmental inefficiency, and potential biases inherited from monolithic architectures. In contrast, smaller LMs are more accessible and sustainable, but their individual capabilities often fall short in generating high-quality, diverse, and reliable data. Inspired by collaborative human processes (e.g., peer review), we propose GRA, a framework involving multiple small LMs that aggregates specialized roles across small LMs to achieve the iterative refinement and quality control typically achieved by a single large LM. In this collaborative framework, multiple small LMs assume distinct roles (Generator, Reviewer, and Adjudicator) to simulate a peer-review-inspired data synthesis pipeline. The Generator proposes initial data samples, the Reviewer critiques their quality and diversity, and the Adjudicator resolves conflicts to finalize the output. By decomposing the synthesis process into specialized sub-tasks, collaborative small LMs can achieve data-level parity with distillation from large LMs. Through experiments across multiple benchmarks, we demonstrate that GRA-produced data matches or exceeds the quality of single large LM outputs, e.g., Qwen-2.5-72B-Instruct. Our results challenge the necessity of monolithic large models for high-quality data synthesis, advocating instead for strategic coordination of smaller agents.
pdf
bib
abs
Defining and Evaluating Visual Language Models’ Basic Spatial Abilities: A Perspective from Psychometrics
Wenrui Xu
|
Dalin Lyu
|
Weihang Wang
|
Jie Feng
|
Chen Gao
|
Yong Li
The Theory of Multiple Intelligences underscores the hierarchical nature of cognitive capabilities. To advance Spatial Artificial Intelligence, we pioneer a psychometric framework defining five Basic Spatial Abilities (BSAs) in Visual Language Models (VLMs): Spatial Perception, Spatial Relation, Spatial Orientation, Mental Rotation, and Spatial Visualization. Benchmarking 13 mainstream VLMs through nine validated psychometric experiments reveals significant gaps versus humans, with three key findings: 1) VLMs mirror human hierarchies (strongest in 2D orientation, weakest in 3D rotation) with independent BSAs; 2) Many smaller models surpass larger counterparts, with Qwen leading and InternVL2 lagging; 3) Interventions like CoT and few-shot training show limits from architectural constraints, while ToT demonstrates the most effective enhancement. Identified barriers include weak geometry encoding and missing dynamic simulation. By linking Psychometrics to VLMs, we provide a comprehensive BSA evaluation benchmark, a methodological perspective for embodied AI development, and a cognitive science-informed roadmap for achieving human-like spatial intelligence.
pdf
bib
abs
SPHERE: Unveiling Spatial Blind Spots in Vision-Language Models Through Hierarchical Evaluation
Wenyu Zhang
|
Wei En Ng
|
Lixin Ma
|
Yuwen Wang
|
Junqi Zhao
|
Allison Koenecke
|
Boyang Li
|
Wanglu Wanglu
Current vision-language models may grasp basic spatial cues and simple directions (e.g. left, right, front, back), but struggle with the multi-dimensional spatial reasoning necessary for human-like understanding and real-world applications. To address this gap, we develop SPHERE (Spatial Perception and Hierarchical Evaluation of REasoning), a hierarchical evaluation framework supported by a new human-annotated dataset. SPHERE systematically probes models across increasing levels of complexity, from fundamental skills to multi-skill integration and high-level reasoning that combines spatial, visual, and logical understanding. Benchmark evaluation of state-of-the-art models reveals significant deficiencies, especially in reasoning about distance and proximity, understanding both egocentric and allocentric perspectives, and applying spatial logic in physical contexts. These findings expose critical blind spots in existing models and underscore the need for more advanced spatial reasoning techniques, driving the development of vision-language models that align more closely with human spatial cognition.
pdf
bib
abs
User-side Model Consistency Monitoring for Open Source Large Language Models Inference Services
Qijun Miao
|
Zhixuan Fang
With the continuous advancement in the performance of open-source large language models (LLMs), their inference services have attracted a substantial user base by offering quality comparable to closed-source models at a significantly lower cost. However, this has also given rise to trust issues regarding model consistency between users and third-party service providers. Specifically, service providers can effortlessly degrade a model’s parameter scale or precision for higher profit margins, and although users may perceptibly experience differences in text quality, they often lack a reliable method for concrete monitoring. To address this problem, we propose a paradigm for model consistency monitoring on the user side. It constructs metrics based on the logits produced by LLMs to differentiate sequences generated by degraded models. Furthermore, by leveraging model offloading techniques, we demonstrate that the proposed method is implementable on consumer-grade devices. Metric evaluations conducted on three widely used LLM series (OPT, Llama 3.1 and Qwen 2.5) along with system prototype efficiency tests on a consumer device (RTX 3080 Ti) confirm both the effectiveness and feasibility of the proposed approach.
pdf
bib
abs
Jailbreaking? One Step Is Enough!
Weixiong Zheng
|
Peijian Zeng
|
YiWei Li
|
Hongyan Wu
|
Nankai Lin
|
Junhao Chen
|
Aimin Yang
|
Yongmei Zhou
Large language models (LLMs) excel in various tasks but remain vulnerable to jailbreak attacks, where adversaries manipulate prompts to generate harmful outputs. Examining jailbreak prompts helps uncover the shortcomings of LLMs. However, current jailbreak methods and the target model’s defenses are engaged in an independent and adversarial process, resulting in the need for frequent attack iterations and redesigning attacks for different models. To address these gaps, we propose a Reverse Embedded Defense Attack (REDA) mechanism that disguises the attack intention as the “defense” intention against harmful content. Specifically, REDA starts from the target response, guiding the model to embed harmful content within its defensive measures, thereby relegating harmful content to a secondary role and making the model believe it is performing a defensive task. The attacking model considers that it is guiding the target model to deal with harmful content, while the target model thinks it is performing a defensive task, creating an illusion of cooperation between the two. Additionally, to enhance the model’s confidence and guidance in “defensive” intentions, we adopt in-context learning (ICL) with a small number of attack examples and construct a corresponding dataset of attack examples. Extensive evaluations demonstrate that the REDA method enables cross-model attacks without the need to redesign attack strategies for different models, enables a successful jailbreak in one iteration, and outperforms existing methods on both open-source and closed-source models.
pdf
bib
abs
Parenting: Optimizing Knowledge Selection of Retrieval-Augmented Language Models with Parameter Decoupling and Tailored Tuning
Yongxin Xu
|
Ruizhe Zhang
|
Xinke Jiang
|
Yujie Feng
|
Yuzhen Xiao
|
Xinyu Ma
|
Runchuan Zhu
|
Xu Chu
|
Junfeng Zhao
|
Yasha Wang
Retrieval-Augmented Generation (RAG) offers an effective solution to the issues faced by Large Language Models (LLMs) in hallucination generation and knowledge obsolescence by incorporating externally retrieved knowledge. However, existing methods lack effective control mechanisms for integrating internal and external knowledge. Inspired by human cognitive processes, we propose Parenting, a novel framework that decouples, identifies, and purposefully optimizes parameter subspaces related to adherence and robustness. Specifically, Parenting utilizes a key parameter mining method that combines forward and backward propagation signals to localize subspaces representing different capabilities. Then, Parenting employs a type-tailored tuning strategy, applying specific and appropriate optimizations to different subspaces, aiming to achieve a balanced enhancement of both adherence and robustness. Extensive experiments on various datasets and models validate the effectiveness and generalizability of our method. Our code is available at https://github.com/Nostradamus4869/Parenting.
pdf
bib
abs
PaSa: An LLM Agent for Comprehensive Academic Paper Search
Yichen He
|
Guanhua Huang
|
Peiyuan Feng
|
Yuan Lin
|
Yuchen Zhang
|
Hang Li
|
Weinan E
We introduce PaSa, an advanced Paper Search agent powered by large language models. PaSa can autonomously make a series of decisions, including invoking search tools, reading papers, and selecting relevant references, to ultimately obtain comprehensive and accurate results for complex scholar queries. We optimize PaSa using reinforcement learning with a synthetic dataset, AutoScholarQuery, which includes 35k fine-grained academic queries and corresponding papers sourced from top-tier AI conference publications. Additionally, we develop RealScholarQuery, a benchmark collecting real-world academic queries to assess PaSa performance in more realistic scenarios. Despite being trained on synthetic data, PaSa significantly outperforms existing baselines on RealScholarQuery, including Google, Google Scholar, Google with GPT-4o for paraphrased queries, ChatGPT (search-enabled GPT-4o), GPT-o1, and PaSa-GPT-4o (PaSa implemented by prompting GPT-4o). Notably, PaSa-7B surpasses the best Google-based baseline, Google with GPT-4o, by 37.78% in recall@20 and 39.90% in recall@50, and exceeds PaSa-GPT-4o by 30.36% in recall and 4.25% in precision. Model, datasets, and code are available at https://github.com/bytedance/pasa. Demo: https://pasa-agent.ai
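The recall@20 and recall@50 figures above follow the usual recall@k definition: the share of relevant papers that appear among the top k returned results. The following is a minimal sketch of that metric only, not PaSa's evaluation code; the function name and the paper identifiers are hypothetical.

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the relevant papers that appear in the top-k ranked results."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0  # convention chosen for this sketch: no relevant papers -> 0
    top_k = set(ranked_ids[:k])
    return len(top_k & relevant) / len(relevant)


# Hypothetical search output (best first) and gold references for one query.
ranked = ["p3", "p7", "p1", "p9", "p4"]
relevant = ["p1", "p4", "p8"]

print(recall_at_k(ranked, relevant, 3))  # one of three relevant papers is in the top 3
print(recall_at_k(ranked, relevant, 5))  # two of three are in the top 5
```

Averaging this value over all queries in a benchmark gives a single recall@k score per system.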
pdf
bib
abs
Less Mature is More Adaptable for Sentence-level Language Modeling
Abhilasha Sancheti
|
David Dale
|
Artyom Kozhevnikov
|
Maha Elbayad
This work investigates sentence-level models (i.e., models that operate at the sentence level) to study how sentence representations from various encoders influence downstream task performance, and which syntactic, semantic, and discourse-level properties are essential for strong performance. Our experiments encompass encoders with diverse training regimes and pretraining domains, as well as various pooling strategies applied to multi-sentence input tasks (including sentence ordering, sentiment classification, and natural language inference) requiring coarse-to-fine-grained reasoning. We find that “less mature” representations (e.g., mean-pooled representations from BERT’s first or last layer, or representations from encoders with limited fine-tuning) exhibit greater generalizability and adaptability to downstream tasks compared to representations from extensively fine-tuned models (e.g., SBERT or SimCSE). These findings are consistent across different pretraining seed initializations for BERT. Our probing analysis reveals that syntactic and discourse-level properties are stronger indicators of downstream performance than MTEB scores or decodability. Furthermore, the data and time efficiency of sentence-level models, often outperforming token-level models, underscores their potential for future research.
pdf
bib
abs
EpMAN: Episodic Memory AttentioN for Generalizing to Longer Contexts
Subhajit Chaudhury
|
Payel Das
|
Sarathkrishna Swaminathan
|
Georgios Kollias
|
Elliot Nelson
|
Khushbu Pahwa
|
Tejaswini Pedapati
|
Igor Melnyk
|
Matthew Riemer
Recent advances in Large Language Models (LLMs) have yielded impressive successes on many language tasks. However, efficient processing of long contexts using LLMs remains a significant challenge. We introduce **EpMAN** – a method for processing long contexts in an episodic memory module while holistically attending to semantically-relevant context chunks. Output from episodic attention is then used to reweigh the decoder’s self-attention to the stored KV cache of the context during training and generation. When an LLM decoder is trained using **EpMAN**, its performance on multiple challenging single-hop long-context recall and question-answering benchmarks is found to be stronger and more robust across the range from 16k to 256k tokens than baseline decoders trained with self-attention, and popular retrieval-augmented generation frameworks.
pdf
bib
abs
UORA: Uniform Orthogonal Reinitialization Adaptation in Parameter Efficient Fine-Tuning of Large Models
Xueyan Zhang
|
Jinman Zhao
|
Zhifei Yang
|
Yibo Zhong
|
Shuhao Guan
|
Linbo Cao
|
Yining Wang
This paper introduces UoRA, a novel parameter-efficient fine-tuning (PEFT) approach for large language models (LLMs). UoRA achieves state-of-the-art efficiency by leveraging a low-rank approximation method that reduces the number of trainable parameters without compromising performance. Unlike existing methods such as LoRA and VeRA, UoRA employs a re-parametrization mechanism that eliminates the need to adapt frozen projection matrices while maintaining shared projection layers across the model. This results in halving the trainable parameters compared to LoRA and outperforming VeRA in computation and storage efficiency. Comprehensive experiments across various benchmarks demonstrate UoRA’s superiority in achieving competitive fine-tuning performance with minimal computational overhead. We demonstrate its performance on GLUE and E2E benchmarks and its effectiveness in instruction-tuning large language models and image classification models. Our contributions establish a new paradigm for scalable and resource-efficient fine-tuning of LLMs.
pdf
bib
abs
Agri-CM3: A Chinese Massive Multi-modal, Multi-level Benchmark for Agricultural Understanding and Reasoning
Haotian Wang
|
Yi Guan
|
Fanshu Meng
|
Chao Zhao
|
Lian Yan
|
Yang Yang
|
Jingchi Jiang
Multi-modal Large Language Models (MLLMs) integrating images, text, and speech can provide farmers with accurate diagnoses and treatment of pests and diseases, enhancing agricultural efficiency and sustainability. However, existing benchmarks lack comprehensive evaluations, particularly in multi-level reasoning, making it challenging to identify model limitations. To address this issue, we introduce Agri-CM3, an expert-validated benchmark assessing MLLMs’ understanding and reasoning in agricultural management. It includes 3,939 images and 15,901 multi-level multiple-choice questions with detailed explanations. Evaluations of 45 MLLMs reveal significant gaps. Even GPT-4o achieves only 63.64% accuracy, falling short in fine-grained reasoning tasks. Analysis across three reasoning levels and seven compositional abilities highlights key challenges in accuracy and cognitive understanding. Our study provides insights for advancing MLLMs in agricultural management, driving their development and application. Code and data are available at https://github.com/HIT-Kwoo/Agri-CM3.
pdf
bib
abs
TROVE: A Challenge for Fine-Grained Text Provenance via Source Sentence Tracing and Relationship Classification
Junnan Zhu
|
Min Xiao
|
Yining Wang
|
Feifei Zhai
|
Yu Zhou
|
Chengqing Zong
LLMs have achieved remarkable fluency and coherence in text generation, yet their widespread adoption has raised concerns about content reliability and accountability. In high-stakes domains, it is crucial to understand where and how the content is created. To address this, we introduce the Text pROVEnance (TROVE) challenge, designed to trace each sentence of a target text back to specific source sentences within potentially lengthy or multi-document inputs. Beyond identifying sources, TROVE annotates the fine-grained relationships (quotation, compression, inference, and others), providing a deep understanding of how each target sentence is formed. To benchmark TROVE, we construct our dataset by leveraging three public datasets covering 11 diverse scenarios (e.g., QA and summarization) in English and Chinese, spanning source texts of varying lengths (0–5k, 5–10k, 10k+), emphasizing the multi-document and long-document settings essential for provenance. To ensure high-quality data, we employ a three-stage annotation process: sentence retrieval, GPT-4o provenance, and human provenance. We evaluate 11 LLMs under direct prompting and retrieval-augmented paradigms, revealing that retrieval is essential for robust performance, larger models perform better in complex relationship classification, and closed-source models often lead, yet open-source models show significant promise, particularly with retrieval augmentation. We make our dataset available here: https://github.com/ZNLP/ZNLP-Dataset.
pdf
bib
abs
CaLMQA: Exploring culturally specific long-form question answering across 23 languages
Shane Arora
|
Marzena Karpinska
|
Hung-Ting Chen
|
Ipsita Bhattacharjee
|
Mohit Iyyer
|
Eunsol Choi
Despite rising global usage of large language models (LLMs), their ability to generate *long-form* answers to *culturally specific* questions remains unexplored in many languages. To fill this gap, we perform the first study of textual multilingual long-form QA by creating CaLMQA, a dataset of **51.7K** culturally specific questions across **23** different languages. We define culturally specific questions as those that refer to concepts unique to one or a few cultures, or have different answers depending on the cultural or regional context. We obtain these questions by crawling naturally-occurring questions from community web forums in high-resource languages, and by hiring native speakers to write questions in under-resourced, rarely-studied languages such as Fijian and Kirundi. Our data collection methodologies are translation-free, enabling the collection of culturally unique questions like “Kuber iki umwami wa mbere w’uburundi yitwa Ntare?” (Kirundi; English translation: “Why was the first king of Burundi called Ntare (Lion)?”). We evaluate factuality, relevance and surface-level quality of LLM-generated long-form answers, finding that (1) for many languages, even the best models make critical surface-level errors (e.g., answering in the wrong language, repetition), especially for low-resource languages; and (2) answers to culturally specific questions contain more factual errors than answers to culturally agnostic questions – questions that have consistent meaning and answer across many cultures. We release CaLMQA to facilitate future research in cultural and multilingual long-form QA.
pdf
bib
abs
Croppable Knowledge Graph Embedding
Yushan Zhu
|
Wen Zhang
|
Zhiqiang Liu
|
Mingyang Chen
|
Lei Liang
|
Huajun Chen
Knowledge Graph Embedding (KGE) is a common approach for Knowledge Graphs (KGs) in AI tasks. Embedding dimensions depend on application scenarios. Requiring a new dimension means training a new KGE model from scratch, increasing cost and limiting efficiency and flexibility. In this work, we propose a novel KGE training framework MED. It allows a single training run to produce a croppable KGE model for multiple scenarios with different dimensional needs. Sub-models of required dimensions can be directly cropped and used without extra training. In MED, we propose a mutual learning mechanism to improve the low-dimensional sub-models and make high-dimensional sub-models retain the low-dimensional sub-models’ capacity, an evolutionary improvement mechanism to promote the high-dimensional sub-models to master the triples that the low-dimensional sub-models cannot, and a dynamic loss weight to adaptively balance the multiple losses. Experiments on 4 KGE models across 4 standard KG completion datasets, 3 real-world scenarios using a large-scale KG, and extending MED to the BERT language model demonstrate its effectiveness, high efficiency, and flexible extensibility.
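To make the idea of cropping concrete, here is a minimal sketch under two assumptions that the abstract does not confirm: that a sub-model of dimension d is the first d coordinates of each trained embedding, and that triples are scored with a TransE-style L1 distance. The helper names and the toy vectors are hypothetical, and this is not the MED training procedure itself.

```python
def crop(vec, dim):
    """Assumed cropping rule: keep the first `dim` coordinates of an embedding."""
    return vec[:dim]


def transe_score(head, relation, tail):
    """TransE-style plausibility score: negative L1 distance of head + relation from tail."""
    return -sum(abs(h + r - t) for h, r, t in zip(head, relation, tail))


# Toy 4-dimensional embeddings for one (head, relation, tail) triple.
head = [1.0, 0.0, 2.0, 1.0]
relation = [0.0, 1.0, -1.0, 0.0]
tail = [1.0, 1.0, 1.0, 3.0]

# Score the triple with the full model, then with a 2-dimensional crop.
full_score = transe_score(head, relation, tail)
cropped_score = transe_score(crop(head, 2), crop(relation, 2), crop(tail, 2))
print(full_score, cropped_score)
```

The point of the sketch is only that the cropped vectors are usable as they are: no parameter is retrained between the two scoring calls.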
pdf
bib
abs
HyKGE: A Hypothesis Knowledge Graph Enhanced RAG Framework for Accurate and Reliable Medical LLMs Responses
Xinke Jiang
|
Ruizhe Zhang
|
Yongxin Xu
|
Rihong Qiu
|
Yue Fang
|
Zhiyuan Wang
|
Jinyi Tang
|
Hongxin Ding
|
Xu Chu
|
Junfeng Zhao
|
Yasha Wang
In this paper, we investigate the retrieval-augmented generation (RAG) based on Knowledge Graphs (KGs) to improve the accuracy and reliability of Large Language Models (LLMs). Recent approaches suffer from insufficient and repetitive knowledge retrieval, tedious and time-consuming query parsing, and monotonous knowledge utilization. To this end, we develop a Hypothesis Knowledge Graph Enhanced (HyKGE) framework, which leverages LLMs’ powerful reasoning capacity to compensate for the incompleteness of user queries, optimizes the interaction process with LLMs, and provides diverse retrieved knowledge. Specifically, HyKGE explores the zero-shot capability and the rich knowledge of LLMs with Hypothesis Outputs to extend feasible exploration directions in the KGs, as well as the carefully curated prompt to enhance the density and efficiency of LLMs’ responses. Furthermore, we introduce the HO Fragment Granularity-aware Rerank Module to filter out noise while ensuring the balance between diversity and relevance in retrieved knowledge. Experiments on two Chinese medical multiple-choice question datasets and one Chinese open-domain medical Q&A dataset with two LLM turbos demonstrate the superiority of HyKGE in terms of accuracy and explainability. Code is available at https://github.com/Artessay/HyKGE.
pdf
bib
abs
LongRecipe: Recipe for Efficient Long Context Generalization in Large Language Models
Zhiyuan Hu
|
Yuliang Liu
|
Jinman Zhao
|
Suyuchen Wang
|
WangYan WangYan
|
Wei Shen
|
Qing Gu
|
Anh Tuan Luu
|
See-Kiong Ng
|
Zhiwei Jiang
|
Bryan Hooi
Large language models (LLMs) face significant challenges in handling long-context tasks because of their limited effective context window size during pretraining, which restricts their ability to generalize over extended sequences. Meanwhile, extending the context window in LLMs through post-pretraining is highly resource-intensive. To address this, we introduce LongRecipe, an efficient training strategy for extending the context window of LLMs, including impactful token analysis, position index transformation, and training optimization strategies. It simulates long-sequence inputs while maintaining training efficiency and significantly improves the model’s understanding of long-range dependencies. Experiments on three types of LLMs show that LongRecipe can utilize long sequences while requiring only 30% of the target context window size, and reduces computational training resources by over 85% compared to full sequence training. Furthermore, LongRecipe also preserves the original LLM’s capabilities in general tasks. Ultimately, we can extend the effective context window of open-source LLMs from 8k to 128k, achieving performance close to GPT-4 with just one day of dedicated training using a single GPU with 80G memory. Our code is released at https://github.com/zhiyuanhubj/LongRecipe.
pdf
bib
abs
BeamLoRA: Beam-Constraint Low-Rank Adaptation
Naibin Gu
|
Zhenyu Zhang
|
Xiyu Liu
|
Peng Fu
|
Zheng Lin
|
Shuohuan Wang
|
Yu Sun
|
Hua Wu
|
Weiping Wang
|
Haifeng Wang
Due to the demand for efficient fine-tuning of large language models, Low-Rank Adaptation (LoRA) has been widely adopted as one of the most effective parameter-efficient fine-tuning methods. Nevertheless, while LoRA improves efficiency, there remains room for improvement in accuracy. Herein, we adopt a novel perspective to assess the characteristics of LoRA ranks. The results reveal that different ranks within the LoRA modules not only exhibit varying levels of importance but also evolve dynamically throughout the fine-tuning process, which may limit the performance of LoRA. Based on these findings, we propose BeamLoRA, which conceptualizes each LoRA module as a beam where each rank naturally corresponds to a potential sub-solution, and the fine-tuning process becomes a search for the optimal sub-solution combination. BeamLoRA dynamically eliminates underperforming sub-solutions while expanding the parameter space for promising ones, enhancing performance with a fixed rank. Extensive experiments across three base models and 12 datasets spanning math reasoning, code generation, and commonsense reasoning demonstrate that BeamLoRA consistently enhances the performance of LoRA, surpassing the other baseline methods.
pdf
bib
abs
GODBench: A Benchmark for Multimodal Large Language Models in Video Comment Art
Yiming Lei
|
Chenkai Zhang
|
Zeming Liu
|
Haitao Leng
|
ShaoGuo Liu
|
Tingting Gao
|
Qingjie Liu
|
Yunhong Wang
***Video Comment Art*** enhances user engagement by providing creative content that conveys humor, satire, or emotional resonance, requiring a nuanced and comprehensive grasp of cultural and contextual subtleties. Although Multimodal Large Language Models (MLLMs) and Chain-of-Thought (CoT) have demonstrated strong reasoning abilities in STEM tasks (e.g. mathematics and coding), they still struggle to generate creative expressions such as resonant jokes and insightful satire. Moreover, existing benchmarks are constrained by their limited modalities and insufficient categories, hindering the exploration of comprehensive creativity in video-based Comment Art creation. To address these limitations, we introduce **GODBench**, a novel benchmark that integrates video and text modalities to systematically evaluate MLLMs’ abilities to compose Comment Art. Furthermore, inspired by the propagation patterns of waves in physics, we propose **Ripple of Thought (RoT)**, a multi-step reasoning framework designed to enhance the creativity of MLLMs. Extensive experiments on GODBench reveal that existing MLLMs and CoT methods still face significant challenges in understanding and generating creative video comments. In contrast, RoT provides an effective approach to improving creative composing, highlighting its potential to drive meaningful advancements in MLLM-based creativity.
pdf
bib
abs
UniLR: Unleashing the Power of LLMs on Multiple Legal Tasks with a Unified Legal Retriever
Ang Li
|
Yiquan Wu
|
Yifei Liu
|
Ming Cai
|
Lizhi Qing
|
Shihang Wang
|
Yangyang Kang
|
Chengyuan Liu
|
Fei Wu
|
Kun Kuang
Despite the impressive capabilities of LLMs, they often generate content with factual inaccuracies in LegalAI, which may lead to serious legal consequences. Retrieval-Augmented Generation (RAG), a promising approach, can conveniently integrate specialized knowledge into LLMs. In practice, there are diverse legal knowledge retrieval demands (e.g. law articles and similar cases). However, existing retrieval methods are either designed for general domains, struggling with legal knowledge, or tailored for specific legal tasks, unable to handle diverse legal knowledge types. Therefore, we propose a novel **Uni**fied **L**egal **R**etriever (UniLR) capable of performing multiple legal retrieval tasks for LLMs. Specifically, we introduce attention supervision to guide the retriever in focusing on key elements during knowledge encoding. Next, we design a graph-based method to integrate meta information through a heterogeneous graph, further enriching the knowledge representation. These two components work together to enable UniLR to capture the essence of knowledge hidden beneath formats. Extensive experiments on multiple datasets of common legal tasks demonstrate that UniLR achieves the best retrieval performance and can significantly enhance the performance of LLMs.
pdf
bib
abs
Generative Psycho-Lexical Approach for Constructing Value Systems in Large Language Models
Haoran Ye
|
TianZe Zhang
|
Yuhang Xie
|
Liyuan Zhang
|
Yuanyi Ren
|
Xin Zhang
|
Guojie Song
Values are core drivers of individual and collective perception, cognition, and behavior. Value systems, such as Schwartz’s Theory of Basic Human Values, delineate the hierarchy and interplay among these values, enabling cross-disciplinary investigations into decision-making and societal dynamics. Recently, the rise of Large Language Models (LLMs) has raised concerns regarding their elusive intrinsic values. Despite growing efforts in evaluating, understanding, and aligning LLM values, a psychologically grounded LLM value system remains underexplored. This study addresses the gap by introducing the Generative Psycho-Lexical Approach (GPLA), a scalable, adaptable, and theoretically informed method for constructing value systems. Leveraging GPLA, we propose a psychologically grounded five-factor value system tailored for LLMs. For systematic validation, we present three benchmarking tasks that integrate psychological principles with cutting-edge AI priorities. Our results reveal that the proposed value system meets standard psychological criteria, better captures LLM values, improves LLM safety prediction, and enhances LLM alignment, when compared to the canonical Schwartz’s values.
pdf
bib
abs
Beyond Dialogue: A Profile-Dialogue Alignment Framework Towards General Role-Playing Language Model
Yeyong Yu
|
Runsheng Yu
|
Haojie Wei
|
Zhanqiu Zhang
|
Quan Qian
The rapid advancement of large language models (LLMs) has revolutionized role-playing, enabling the development of general role-playing models. However, current role-playing training has two significant issues: (I) Using a predefined role profile to prompt dialogue training for specific scenarios usually leads to biases and even conflicts between the dialogue and the profile, resulting in training biases. (II) Models learn to imitate the role based solely on the profile, neglecting profile-dialogue alignment at the sentence level. To overcome the aforementioned hurdles, we propose a novel framework **Beyond Dialogue**, which introduces “beyond dialogue” tasks to align dialogue with profile traits for each scenario, eliminating biases during training. Furthermore, the framework achieves a sentence-level fine-grained alignment between profile and dialogue through an innovative prompting mechanism that generates reasoning data for training. Moreover, the aforementioned methods are fully automated and low-cost. Experimental results demonstrate our model excels in adhering to role profiles, outperforming most proprietary general and specialized role-playing baselines. The code and data are provided in https://github.com/yuyouyu32/BeyondDialogue.
pdf
bib
abs
ACECODER: Acing Coder RL via Automated Test-Case Synthesis
Huaye Zeng
|
Dongfu Jiang
|
Haozhe Wang
|
Ping Nie
|
Xiaotong Chen
|
Wenhu Chen
Most progress in recent coder models has been driven by supervised fine-tuning (SFT), while the potential of reinforcement learning (RL) remains largely unexplored, primarily due to the lack of reliable reward data/model in the code domain. In this paper, we address this challenge by leveraging automated large-scale test-case synthesis to enhance code model training. Specifically, we design a pipeline that generates extensive (question, test-cases) pairs from existing code data. Using these test cases, we construct preference pairs based on pass rates over sampled programs to train reward models with Bradley-Terry loss. It shows an average 10-point improvement for Llama-3.1-8B-Ins and a 5-point improvement for Qwen2.5-Coder-7B-Ins through best-of-32 sampling, making the 7B model on par with 236B DeepSeek-V2.5. Furthermore, we conduct reinforcement learning with both reward models and test-case pass rewards, leading to consistent improvements across HumanEval, MBPP, BigCodeBench, and LiveCodeBench (V4). Notably, we follow the R1-style training to start from Qwen2.5-Coder-base directly and show that our RL training can improve the model on HumanEval-plus by over 25% and MBPP-plus by 6% in merely 80 optimization steps. We believe our results highlight the huge potential of reinforcement learning in coder models.
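The step from test-case pass rates to a Bradley-Terry training signal can be sketched as follows. This is an illustrative reading of the abstract, not the authors' pipeline: the `min_gap` threshold used to decide when one program is preferred over another is an assumption, as are the function names and the toy pass rates. The loss itself is the standard Bradley-Terry form, the negative log-sigmoid of the reward difference.

```python
import math
from itertools import combinations


def build_preference_pairs(pass_rates, min_gap=0.4):
    """Return (chosen, rejected) index pairs of sampled programs.

    A pair is kept only when the pass rates differ by at least `min_gap`
    (an assumed filter for this sketch; the abstract does not specify one).
    """
    pairs = []
    for i, j in combinations(range(len(pass_rates)), 2):
        gap = pass_rates[i] - pass_rates[j]
        if gap >= min_gap:
            pairs.append((i, j))
        elif -gap >= min_gap:
            pairs.append((j, i))
    return pairs


def bradley_terry_loss(reward_chosen, reward_rejected):
    """Bradley-Terry loss: -log(sigmoid(reward_chosen - reward_rejected))."""
    diff = reward_chosen - reward_rejected
    # Numerically stable evaluation of log(1 + exp(-diff)).
    if diff >= 0:
        return math.log1p(math.exp(-diff))
    return -diff + math.log1p(math.exp(diff))


# Toy pass rates of four sampled programs for one question.
pass_rates = [1.0, 0.5, 0.0, 0.75]
pairs = build_preference_pairs(pass_rates)
print(pairs)

# The loss is small when the reward model ranks the chosen program higher.
print(bradley_terry_loss(2.0, 0.0), bradley_terry_loss(0.0, 2.0))
```

In a real setup the two rewards would come from a trainable reward model and the loss would be averaged over all pairs and minimized by gradient descent.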
pdf
bib
abs
Quantifying Semantic Emergence in Language Models
Hang Chen
|
Xinyu Yang
|
Jiaying Zhu
|
Wenya Wang
Large language models (LLMs) are widely recognized for their exceptional capacity to capture semantic meaning. Yet, there remains no established metric to quantify this capability. In this work, we introduce a quantitative metric, Information Emergence (IE), designed to measure LLMs’ ability to extract semantics from input tokens. We formalize “semantics” as the meaningful information abstracted from a sequence of tokens and quantify this by comparing the entropy reduction observed for a sequence of tokens (macro-level) and individual tokens (micro-level). To achieve this, we design a lightweight estimator to compute the mutual information at each transformer layer, which is agnostic to different tasks and language model architectures. We apply IE in both synthetic in-context learning (ICL) scenarios and natural sentence contexts. Experiments demonstrate informativeness and patterns about semantics. While some of these patterns confirm the conventional prior linguistic knowledge, the rest are relatively unexpected, which may provide new insights.
pdf
bib
abs
DebateCoder: Towards Collective Intelligence of LLMs via Test Case Driven LLM Debate for Code Generation
Jizheng Chen
|
Kounianhua Du
|
Xinyi Dai
|
Weiming Zhang
|
Xihuai Wang
|
Yasheng Wang
|
Ruiming Tang
|
Weinan Zhang
|
Yong Yu
With the impressive reasoning and text generation capabilities of large language models (LLMs), methods leveraging multiple LLMs to debate each other have garnered increasing attention. However, existing debate-based approaches remain limited in effectiveness in structured and detailed domains represented by code generation for several reasons: 1) reliance on different instances of the same LLM for debate, neglecting the potential benefits of integrating diverse models with varied internal knowledge for more comprehensive code generation, 2) under-utilization of test cases, and 3) reliance on third-party LLM moderators for result consolidation and decision-making, probably introducing hallucinations and judgment errors. To address these challenges, we propose DebateCoder to collect intelligence of LLMs via test case-driven debate for code generation. In DebateCoder, test cases serve as a medium for models to analyze code and identify bugs, while opposing models generate test cases to challenge each other’s code during the debate process. These test cases, along with their execution results, are elaborately leveraged to refine and enhance the code through a novel contrastive analysis process. Furthermore, DebateCoder leverages test case outcomes to assess code quality and determine convergence criteria. Unlike previous approaches, DebateCoder emphasizes the collaborative improvement of both models through competitive debate and interactive analysis. Abundant experimental results on two datasets demonstrate the effectiveness of DebateCoder.
pdf
bib
abs
The Tug of War Within: Mitigating the Fairness-Privacy Conflicts in Large Language Models
Chen Qian
|
Dongrui Liu
|
Jie Zhang
|
Yong Liu
|
Jing Shao
Ensuring awareness of fairness and privacy in Large Language Models (LLMs) is critical. Interestingly, we discover a counter-intuitive trade-off phenomenon that enhancing an LLM’s privacy awareness through Supervised Fine-Tuning (SFT) methods significantly decreases its fairness awareness with thousands of samples. To address this issue, inspired by the information theory, we introduce a training-free method to Suppress the Privacy and faIrness coupled Neurons (SPIN), which theoretically and empirically decreases the mutual information between fairness and privacy awareness. Extensive experimental results demonstrate that SPIN eliminates the trade-off phenomenon and significantly improves LLMs’ fairness and privacy awareness simultaneously without compromising general capabilities, e.g., improving Qwen-2-7B-Instruct’s fairness awareness by 12.2% and privacy awareness by 14.0%. More crucially, SPIN remains robust and effective with limited annotated data or even when only malicious fine-tuning data is available, whereas SFT methods may fail to perform properly in such scenarios. Furthermore, we show that SPIN could generalize to other potential trade-off dimensions. We hope this study provides valuable insights into concurrently addressing fairness and privacy concerns in LLMs and can be integrated into comprehensive frameworks to develop more ethical and responsible AI systems. Our code is available at https://github.com/ChnQ/SPIN.
pdf
bib
abs
GraphInsight: Unlocking Insights in Large Language Models for Graph Structure Understanding
Yukun Cao
|
Shuo Han
|
Zengyi Gao
|
Zezhong Ding
|
Xike Xie
|
S Kevin Zhou
Although Large Language Models (LLMs) have demonstrated potential in processing graphs, they struggle with comprehending graphical structure information through prompts of graph description sequences, especially as the graph size increases. We attribute this challenge to the uneven memory performance of LLMs across different positions in graph description sequences, known as “positional bias”. To address this, we propose GraphInsight, a novel framework aimed at improving LLMs’ comprehension of both macro- and micro-level graphical information. GraphInsight is grounded in two key strategies: 1) placing critical graphical information in positions where LLMs exhibit stronger memory performance, and 2) investigating a lightweight external knowledge base for regions with weaker memory performance, inspired by retrieval-augmented generation (RAG). Moreover, GraphInsight explores integrating these two strategies into LLM agent processes for composite graph tasks that require multi-step reasoning. Extensive empirical studies on benchmarks with a wide range of evaluation tasks show that GraphInsight significantly outperforms all other graph description methods (e.g., prompting techniques and reordering strategies) in understanding graph structures of varying sizes.
pdf
bib
abs
Phonotomizer: A Compact, Unsupervised, Online Training Approach to Real-Time, Multilingual Phonetic Segmentation
Michael S. Yantosca
|
Albert M. K. Cheng
Phonetic transcription requires significant time and expert training. Automated, state-of-the-art text-dependent methods still involve substantial pre-training annotation labor and may not generalize to multiple languages. Hallucination of speech amid silence or non-speech noise can also plague these methods, which fall short in real-time applications due to post hoc whole-phrase evaluation. This paper introduces Phonotomizer, a compact, unsupervised, online training approach to automatic, multilingual phonetic segmentation, a critical first stage in transcription. Unlike prior approaches, Phonotomizer trains on raw sound files alone and can modulate computational exactness. Preliminary evaluations on Irish and Twi, two underrepresented languages, exhibit segmentation comparable to current forced alignment technology, reducing acoustic model size and minimizing training epochs.
pdf
bib
abs
A Multi-persona Framework for Argument Quality Assessment
Bojun Jin
|
Jianzhu Bao
|
Yufang Hou
|
Yang Sun
|
Yice Zhang
|
Huajie Wang
|
Bin Liang
|
Ruifeng Xu
Argument quality assessment faces inherent challenges due to its subjective nature, where different evaluators may assign varying quality scores for an argument based on personal perspectives. Although existing datasets collect opinions from multiple annotators to model subjectivity, most existing computational methods fail to consider multi-perspective evaluation. To address this issue, we propose MPAQ, a multi-persona framework for argument quality assessment that simulates diverse evaluator perspectives through large language models. It first dynamically generates targeted personas tailored to an input argument, then simulates each persona’s reasoning process to evaluate the argument quality from multiple perspectives. To effectively generate fine-grained quality scores, we develop a coarse-to-fine scoring strategy that first generates a coarse-grained integer score and then refines it into a fine-grained decimal score. Experiments on IBM-Rank-30k and IBM-ArgQ-5.3kArgs datasets demonstrate that MPAQ consistently outperforms strong baselines while providing comprehensive multi-perspective rationales.
pdf
bib
abs
Safe: Enhancing Mathematical Reasoning in Large Language Models via Retrospective Step-aware Formal Verification
Chengwu Liu
|
Ye Yuan
|
Yichun Yin
|
Yan Xu
|
Xin Xu
|
Zaoyu Chen
|
Yasheng Wang
|
Lifeng Shang
|
Qun Liu
|
Ming Zhang
Chain-of-Thought (CoT) prompting has become the de facto method to elicit reasoning capabilities from large language models (LLMs). However, to mitigate hallucinations in CoT that are notoriously difficult to detect, current methods such as process reward models (PRMs) or self-consistency operate as opaque boxes and do not provide checkable evidence for their judgments, possibly limiting their effectiveness. To address this issue, we draw inspiration from the idea that “the gold standard for supporting a mathematical claim is to provide a proof”. We propose a retrospective, step-aware formal verification framework Safe. Rather than assigning arbitrary scores, we strive to articulate mathematical claims in formal mathematical language Lean 4 at each reasoning step and provide formal proofs to identify hallucinations. We evaluate our framework Safe across multiple language models and various mathematical datasets, demonstrating a significant performance improvement while offering interpretable and verifiable evidence. We also propose FormalStep as a benchmark for step correctness theorem proving with 30,809 formal statements. To the best of our knowledge, our work represents the first endeavor to utilize formal mathematical language Lean 4 for verifying content generated by LLMs, aligning with the reason why formal mathematical languages were created in the first place: to provide a robust foundation for hallucination-prone human-written proofs.
pdf
bib
abs
SAM Decoding: Speculative Decoding via Suffix Automaton
Yuxuan Hu
|
Ke Wang
|
Xiaokang Zhang
|
Fanjin Zhang
|
Cuiping Li
|
Hong Chen
|
Jing Zhang
Speculative decoding (SD) has been demonstrated as an effective technique for lossless LLM inference acceleration. Retrieval-based SD methods, one kind of model-free method, have yielded promising speedup, but they often rely on a single retrieval source and inefficient retrieval methods, and are constrained to certain tasks. This paper presents a novel retrieval-based speculative decoding method that adapts the suffix automaton (SAM) for efficient and accurate draft generation by utilizing both the text sequence being generated and a static text corpus. Unlike existing n-gram matching methods, SAM-Decoding finds the exact longest suffix match, achieving an average time complexity of O(1) per generation step of SAM update and suffix retrieval. It can also integrate with existing methods, adaptively selecting a draft generation strategy based on match length to generalize to broader domains. Extensive experiments on Spec-Bench show that our method is 18% faster than other retrieval-based SD methods. Additionally, when combined with advanced EAGLE-2, it provides an additional speedup of 3.28% – 11.13% across various-sized LLM backbones.
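To make the longest-suffix-match idea in the abstract above concrete, here is a minimal Python sketch. It is not the authors' implementation: it builds a textbook suffix automaton over a single static token corpus, tracks the longest suffix of the current context that occurs in that corpus, and proposes the tokens that followed one occurrence as a draft. The class name `SuffixAutomaton`, the helpers `step`, `draft`, and `longest_suffix_match`, and the toy corpus are all hypothetical; updating the automaton with the generated text and verifying drafts with the LLM are left out.

```python
class SuffixAutomaton:
    """Textbook suffix automaton over a token sequence (illustrative sketch only)."""

    def __init__(self):
        self.next = [{}]     # outgoing transitions of each state
        self.link = [-1]     # suffix link of each state
        self.length = [0]    # length of the longest substring in each state
        self.end_pos = [-1]  # index of the last token of one occurrence
        self.last = 0
        self.tokens = []

    def _new_state(self, length, end_pos):
        self.next.append({})
        self.link.append(-1)
        self.length.append(length)
        self.end_pos.append(end_pos)
        return len(self.length) - 1

    def extend(self, tok):
        """Append one token to the indexed sequence (standard online construction)."""
        self.tokens.append(tok)
        cur = self._new_state(self.length[self.last] + 1, len(self.tokens) - 1)
        p = self.last
        while p != -1 and tok not in self.next[p]:
            self.next[p][tok] = cur
            p = self.link[p]
        if p == -1:
            self.link[cur] = 0
        else:
            q = self.next[p][tok]
            if self.length[p] + 1 == self.length[q]:
                self.link[cur] = q
            else:
                # Split state q: the clone keeps the shorter substrings.
                clone = self._new_state(self.length[p] + 1, self.end_pos[q])
                self.next[clone] = dict(self.next[q])
                self.link[clone] = self.link[q]
                while p != -1 and self.next[p].get(tok) == q:
                    self.next[p][tok] = clone
                    p = self.link[p]
                self.link[q] = clone
                self.link[cur] = clone
        self.last = cur

    def step(self, state, match_len, tok):
        """Advance the match by one context token; returns (state, match_len)."""
        while state != 0 and tok not in self.next[state]:
            state = self.link[state]
            match_len = self.length[state]
        if tok in self.next[state]:
            return self.next[state][tok], match_len + 1
        return 0, 0

    def draft(self, state, k):
        """Propose up to k tokens that followed one occurrence of the match."""
        if state == 0:
            return []
        start = self.end_pos[state] + 1
        return self.tokens[start:start + k]


def longest_suffix_match(sam, context):
    """Feed the whole context; returns the state and length of the longest matching suffix."""
    state, match_len = 0, 0
    for tok in context:
        state, match_len = sam.step(state, match_len, tok)
    return state, match_len


# Toy example: index a corpus, then draft a continuation for a new context.
corpus = "a b c d a b c e".split()
sam = SuffixAutomaton()
for t in corpus:
    sam.extend(t)

state, match_len = longest_suffix_match(sam, "x a b c".split())
draft_tokens = sam.draft(state, 2)
```

In this toy run the longest matching suffix of the context is `a b c`, and the draft is taken from the tokens after its first occurrence in the corpus. During decoding, `step` would be called once per newly accepted token rather than re-feeding the whole context.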
pdf
bib
abs
PsyAdvisor: A Plug-and-Play Strategy Advice Planner with Proactive Questioning in Psychological Conversations
Yuxin Hu
|
Danni Liu
|
Bo Liu
|
Yida Chen
|
Jiuxin Cao
|
Yan Liu
Proactive questioning is essential in psychological conversations as it helps uncover deeper issues and unspoken concerns. Current psychological LLMs are constrained by passive response mechanisms, limiting their capacity to deploy proactive strategies for psychological counseling. To bridge this gap, we first develop the ProPsyC (Proactive Psychological Conversation) dataset, a multi-turn conversation dataset with interpretive labels including strategy decision logic and reaction attribution. Based on ProPsyC, we propose PsyAdvisor by supervised fine-tuning, a plug-and-play proactive questioning strategy planner that empowers psychological LLMs to initiate well-timed questioning through strategic prompting. Experimental results demonstrate that psychological LLMs integrated with PsyAdvisor substantially improve proactive questioning capacity, conversation depth, and response quality. Furthermore, PsyAdvisor shows promising potential in assisting novice counselors by providing strategy recommendations. This study provides new optimization directions for psychological conversation systems and offers valuable insights for future research on proactive questioning mechanisms in psychological LLMs.
pdf
bib
abs
HomeBench: Evaluating LLMs in Smart Homes with Valid and Invalid Instructions Across Single and Multiple Devices
Silin Li
|
Yuhang Guo
|
Jiashu Yao
|
Zeming Liu
|
Haifeng Wang
Large language models (LLMs) have the potential to revolutionize smart home assistants by enhancing their ability to accurately understand user needs and respond appropriately, which is extremely beneficial for building a smarter home environment. While recent studies have explored integrating LLMs into smart home systems, they primarily focus on handling straightforward, valid single-device operation instructions. However, real-world scenarios are far more complex and often involve users issuing invalid instructions or controlling multiple devices simultaneously. These scenarios pose two main challenges: LLMs must accurately identify and rectify errors in user instructions and execute multiple user instructions perfectly. To address these challenges and advance the development of LLM-based smart home assistants, we introduce HomeBench, the first smart home dataset with valid and invalid instructions across single and multiple devices. We report experimental results on 13 distinct LLMs; e.g., GPT-4o achieves only a 0.0% success rate in the scenario of invalid multi-device instructions, revealing that the existing state-of-the-art LLMs still cannot perform well in this situation even with the help of in-context learning, retrieval-augmented generation, and fine-tuning. Our code and dataset are publicly available at https://github.com/BITHLP/HomeBench.
pdf
bib
abs
Advancing Zero-shot Text-to-Speech Intelligibility across Diverse Domains via Preference Alignment
Xueyao Zhang
|
Yuancheng Wang
|
Chaoren Wang
|
Ziniu Li
|
Zhuo Chen
|
Zhizheng Wu
Modern zero-shot text-to-speech (TTS) systems, despite using extensive pre-training, often struggle in challenging scenarios such as tongue twisters, repeated words, code-switching, and cross-lingual synthesis, leading to intelligibility issues. To address these limitations, this paper leverages preference alignment techniques, which enable targeted construction of out-of-pretraining-distribution data to enhance performance. We introduce a new dataset, named the Intelligibility Preference Speech Dataset (INTP), and extend the Direct Preference Optimization (DPO) framework to accommodate diverse TTS architectures. After INTP alignment, in addition to intelligibility, we observe overall improvements including naturalness, similarity, and audio quality for multiple TTS models across diverse domains. Based on that, we also verify the weak-to-strong generalization ability of INTP for more intelligible models such as CosyVoice 2 and Ints. Moreover, we showcase the potential for further improvements through iterative alignment based on Ints. Audio samples are available at https://intalign.github.io/.
pdf
bib
abs
GiFT: Gibbs Fine-Tuning for Code Generation
Haochen Li
|
Wanjin Feng
|
Xin Zhou
|
Zhiqi Shen
Training Large Language Models (LLMs) with synthetic data is a prevalent practice in code generation. A key approach is self-training, where LLMs are iteratively trained on self-generated correct code snippets. In this case, the self-generated codes are drawn from a conditional distribution, conditioned on a specific seed description. However, the seed description is not the only valid representation that aligns with its intended meaning. With all valid descriptions and codes forming a joint space, codes drawn from the conditional distribution would lead to an underrepresentation of the full description-code space. As such, we propose Gibbs Fine-Tuning (GiFT), a novel self-training method inspired by Gibbs sampling. GiFT allows self-generated data to be drawn from the marginal distribution of the joint space, thereby mitigating the biases inherent in conditional sampling. We provide a theoretical analysis demonstrating the potential benefits of fine-tuning LLMs with code derived from the marginal distribution. Furthermore, we propose a perplexity-based code selection method to mitigate the imbalanced long-tail distribution of the self-generated codes. Empirical evaluation of two LLMs across four datasets demonstrates that GiFT achieves superior performance, particularly on more challenging benchmarks. Source code is available at
https://github.com/Alex-HaochenLi/GiFT.
pdf
bib
abs
Enhancing Interpretable Image Classification Through LLM Agents and Conditional Concept Bottleneck Models
Yiwen Jiang
|
Deval Mehta
|
Wei Feng
|
Zongyuan Ge
Concept Bottleneck Models (CBMs) decompose image classification into a process governed by interpretable, human-readable concepts. Recent advances in CBMs have used Large Language Models (LLMs) to generate candidate concepts. However, a critical question remains: What is the optimal number of concepts to use? Current concept banks suffer from redundancy or insufficient coverage. To address this issue, we introduce a dynamic, agent-based approach that adjusts the concept bank in response to environmental feedback, optimizing the number of concepts for sufficient yet concise coverage. Moreover, we propose Conditional Concept Bottleneck Models (CoCoBMs) to overcome the limitations in traditional CBMs’ concept scoring mechanisms. They enhance the accuracy of assessing each concept’s contribution to classification tasks and feature an editable matrix that allows LLMs to correct concept scores that conflict with their internal knowledge. Our evaluations across 6 datasets show that our method not only improves classification accuracy by 6% but also enhances interpretability assessments by 30%.
pdf
bib
abs
Reliably Bounding False Positives: A Zero-Shot Machine-Generated Text Detection Framework via Multiscaled Conformal Prediction
Xiaowei Zhu
|
Yubing Ren
|
Yanan Cao
|
Xixun Lin
|
Fang Fang
|
Yangxi Li
The rapid advancement of large language models has raised significant concerns regarding their potential misuse by malicious actors. As a result, developing effective detectors to mitigate these risks has become a critical priority. However, most existing detection methods focus excessively on detection accuracy, often neglecting the societal risks posed by high false positive rates (FPRs). This paper addresses this issue by leveraging Conformal Prediction (CP), which effectively constrains the upper bound of FPRs. While directly applying CP constrains FPRs, it also leads to a significant reduction in detection performance. To overcome this trade-off, this paper proposes a Zero-Shot Machine-Generated Text Detection Framework via Multiscaled Conformal Prediction (MCP), which both enforces the FPR constraint and improves detection performance. This paper also introduces RealDet, a high-quality dataset that spans a wide range of domains, ensuring realistic calibration and enabling superior detection performance when combined with MCP. Empirical evaluations demonstrate that MCP effectively constrains FPRs, significantly enhances detection performance, and increases robustness against adversarial attacks across multiple detectors and datasets.
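As a rough illustration of how conformal prediction can bound the false positive rate, the sketch below uses plain split conformal calibration, not the multiscaled MCP framework from the abstract above. It assumes a detector that outputs a score where higher means "more machine-like", plus a held-out calibration set of scores for human-written texts. The function names, the scores, and the choice of `alpha` are all hypothetical.

```python
import math


def conformal_threshold(human_scores, alpha):
    """Pick a decision threshold from calibration scores of human-written texts.

    With n calibration scores, the threshold is the ceil((n + 1) * (1 - alpha))-th
    smallest score. If new human-written texts are exchangeable with the
    calibration texts, the chance that one scores strictly above the threshold
    (a false positive) is at most alpha.
    """
    n = len(human_scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        # Too few calibration points for this alpha: never flag anything.
        return math.inf
    return sorted(human_scores)[k - 1]


def is_machine_generated(score, threshold):
    """Flag a text only when its detector score exceeds the calibrated threshold."""
    return score > threshold


# Toy calibration set: 20 human-written texts with scores 1.0, 2.0, ..., 20.0.
calibration = [float(i) for i in range(1, 21)]
threshold = conformal_threshold(calibration, alpha=0.1)
```

The guarantee is marginal over the calibration and test draws, and it holds only when the calibration texts are representative of the human-written texts seen at test time, which is one reason a realistic calibration dataset matters.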
pdf
bib
abs
RSCF: Relation-Semantics Consistent Filter for Entity Embedding of Knowledge Graph
Junsik Kim
|
Jinwook Park
|
Kangil Kim
In knowledge graph embedding, leveraging relation-specific entity transformation has markedly enhanced performance. However, the consistency of embedding differences before and after transformation remains unaddressed, risking the loss of valuable inductive bias inherent in the embeddings. This inconsistency stems from two problems. First, transformation representations are specified for relations in a disconnected manner, allowing dissimilar transformations and corresponding entity embeddings for similar relations. Second, a generalized plug-in approach such as SFBR (Semantic Filter Based on Relations) disrupts this consistency through excessive concentration of entity embeddings under entity-based regularization, generating indistinguishable score distributions among relations. In this paper, we introduce a plug-in KGE method, Relation-Semantics Consistent Filter (RSCF). Its entity transformation has three features for enhancing semantic consistency: 1) shared affine transformation of relation embeddings across all relations, 2) rooted entity transformation that adds an entity embedding to its change represented by the transformed vector, and 3) normalization of the change to prevent scale reduction. To amplify the advantages of consistency that preserve semantics on embeddings, RSCF adds relation transformation and prediction modules for enhancing the semantics. In knowledge graph completion tasks with distance-based and tensor decomposition models, RSCF significantly outperforms state-of-the-art KGE methods, showing robustness across all relations and their frequencies.
pdf
bib
abs
RolePlot: A Systematic Framework for Evaluating and Enhancing the Plot-Progression Capabilities of Role-Playing Agents
Pinyi Zhang
|
Siyu An
|
Lingfeng Qiao
|
Yifei Yu
|
Jingyang Chen
|
Jie Wang
|
Di Yin
|
Xing Sun
|
Kai Zhang
Role-playing agents (RPAs) are garnering increasing interest as a novel form of conversational AI. While previous research has predominantly concentrated on their ability to portray specified characters, we argue from a user-centered perspective that RPAs’ capability to advance the plot requires substantial improvements to deliver more engaging interaction. To bridge this gap, we propose RolePlot, a role-playing framework specifically designed to evaluate and enhance the plot-progression capabilities of RPAs. RolePlot begins by constructing a plot-progression dataset extended from human-written literary scripts and specially designed synthetic data, followed by narrative theory-driven manual annotation and automated labeling validated through human verification. We then exploit the over-parameterized embedding space of LLMs to detect a “trigger subspace” that identifies dialogue segments catalyzing plot transitions. When a user’s inputs align with this subspace, we explicitly prompt RPAs to advance the plot. For evaluation, we simulate User-RPA interactions and track both the conversation longevity (measured in dialogue turns before disengagement) and users’ arousal levels across different stages. Empirically, our method improves RPAs’ capability to time plot developments and, more importantly, yields a significant increase in conversation turns and sustained higher arousal levels, thereby confirming that users experience more immersive engagements.
pdf
bib
abs
TreeRL: LLM Reinforcement Learning with On-Policy Tree Search
Zhenyu Hou
|
Ziniu Hu
|
Yujiang Li
|
Rui Lu
|
Jie Tang
|
Yuxiao Dong
Reinforcement learning (RL) with tree search has demonstrated superior performance in traditional reasoning tasks. Compared to conventional independent chain sampling strategies with outcome supervision, tree search enables better exploration of the reasoning space and provides dense, on-policy process rewards during RL training, but it remains under-explored in on-policy LLM RL. We propose TreeRL, a reinforcement learning framework that directly incorporates on-policy tree search for RL training. Our approach includes intermediate supervision and eliminates the need for separate reward model training. Existing approaches typically train a separate process reward model, which can suffer from distribution mismatch and reward hacking. We also introduce a cost-effective tree search approach that achieves higher search efficiency under the same generation token budget by strategically branching from high-uncertainty intermediate steps rather than using random branching. Experiments on challenging math and code reasoning benchmarks demonstrate that TreeRL achieves superior performance compared to traditional ChainRL, highlighting the potential of tree search for LLMs. TreeRL is open-sourced at
https://github.com/THUDM/TreeRL.
pdf
bib
abs
Can a Single Model Master Both Multi-turn Conversations and Tool Use? CoALM: A Unified Conversational Agentic Language Model
Emre Can Acikgoz
|
Jeremiah Greer
|
Akul Datta
|
Ze Yang
|
William Zeng
|
Oussama Elachqar
|
Emmanouil Koukoumidis
|
Dilek Hakkani-Tür
|
Gokhan Tur
Large Language Models (LLMs) with API-calling capabilities enabled building effective Language Agents (LA), while also revolutionizing the conventional task-oriented dialogue (TOD) paradigm. However, current approaches face a critical dilemma: TOD systems are often trained on a limited set of target APIs, requiring new data to maintain their quality when interfacing with new services, while LAs are not trained to maintain user intent over multi-turn conversations. Because both robust multi-turn management and advanced function calling are crucial for effective conversational agents, we evaluate these skills on three popular benchmarks: MultiWOZ 2.4 (TOD), BFCL V3 (LA), and API-Bank (LA). Our analyses reveal that specialized approaches excel in one domain but underperform in the other. To bridge this chasm, we introduce CoALM (Conversational Agentic Language Model), a unified approach that integrates both conversational and agentic capabilities. We created CoALM-IT, a carefully constructed multi-task dataset that interleaves multi-turn ReAct reasoning with complex API usage. Using CoALM-IT, we train three models, CoALM 8B, CoALM 70B, and CoALM 405B, which outperform top domain-specific models, including GPT-4o, across all three benchmarks. This demonstrates the feasibility of a single model approach for both TOD and LA, setting a new standard for conversational agents.
pdf
bib
abs
Single-to-mix Modality Alignment with Multimodal Large Language Model for Document Image Machine Translation
Yupu Liang
|
Yaping Zhang
|
Zhiyang Zhang
|
Yang Zhao
|
Lu Xiang
|
Chengqing Zong
|
Yu Zhou
Document Image Machine Translation (DIMT) aims to translate text within document images, facing generalization challenges due to limited training data and the complex interplay between visual and textual information. To address these challenges, we introduce M4Doc, a novel single-to-mix modality alignment framework leveraging Multimodal Large Language Models (MLLMs). M4Doc aligns an image-only encoder with the multimodal representations of an MLLM, pre-trained on large-scale document image datasets. This alignment enables a lightweight DIMT model to learn crucial visual-textual correlations during training. During inference, M4Doc bypasses the MLLM, maintaining computational efficiency while benefiting from its multimodal knowledge. Comprehensive experiments demonstrate substantial improvements in translation quality, especially in cross-domain generalization and challenging document image scenarios. The code will be released upon acceptance.
pdf
bib
abs
SDPO: Segment-Level Direct Preference Optimization for Social Agents
Aobo Kong
|
Wentao Ma
|
Shiwan Zhao
|
Yongbin Li
|
Yuchuan Wu
|
Ke Wang
|
Xiaoqian Liu
|
Qicheng Li
|
Yong Qin
|
Fei Huang
Social agents powered by large language models (LLMs) can simulate human social behaviors but fall short in handling complex social dialogues. Direct Preference Optimization (DPO) has proven effective in aligning LLM behavior with human preferences across various agent tasks. However, standard DPO focuses solely on individual turns, which limits its effectiveness in multi-turn social interactions. Several DPO-based multi-turn alignment methods with session-level data have shown potential in addressing this problem. While these methods consider multiple turns across entire sessions, they are often overly coarse-grained, introducing training noise, and lack robust theoretical support. To resolve these limitations, we propose Segment-Level Direct Preference Optimization (SDPO), which dynamically selects key segments within interactions to optimize multi-turn agent behavior. SDPO minimizes training noise and is grounded in a rigorous theoretical framework. Evaluations on the SOTOPIA benchmark demonstrate that SDPO-tuned agents consistently outperform both existing DPO-based methods and proprietary LLMs like GPT-4o, underscoring SDPO’s potential to advance the social intelligence of LLM-based agents. We release our code and data at https://anonymous.4open.science/r/SDPO-CE8F.
pdf
bib
abs
KokoroChat: A Japanese Psychological Counseling Dialogue Dataset Collected via Role-Playing by Trained Counselors
Zhiyang Qi
|
Takumasa Kaneko
|
Keiko Takamizo
|
Mariko Ukiyo
|
Michimasa Inaba
Generating psychological counseling responses with language models relies heavily on high-quality datasets. Crowdsourced data collection methods require strict worker training, and data from real-world counseling environments may raise privacy and ethical concerns. While recent studies have explored using large language models (LLMs) to augment psychological counseling dialogue datasets, the resulting data often suffers from limited diversity and authenticity. To address these limitations, this study adopts a role-playing approach where trained counselors simulate counselor-client interactions, ensuring high-quality dialogues while mitigating privacy risks. Using this method, we construct KokoroChat, a Japanese psychological counseling dialogue dataset comprising 6,589 long-form dialogues, each accompanied by comprehensive client feedback. Experimental results demonstrate that fine-tuning open-source LLMs with KokoroChat improves both the quality of generated counseling responses and the automatic evaluation of counseling dialogues. The KokoroChat dataset is available at https://github.com/UEC-InabaLab/KokoroChat.
pdf
bib
abs
SURVEYFORGE: On the Outline Heuristics, Memory-Driven Generation, and Multi-dimensional Evaluation for Automated Survey Writing
Xiangchao Yan
|
Shiyang Feng
|
Jiakang Yuan
|
Renqiu Xia
|
Bin Wang
|
Lei Bai
|
Bo Zhang
Survey papers play a crucial role in scientific research, especially given the rapid growth of research publications. Recently, researchers have begun using LLMs to automate survey generation for better efficiency. However, the quality gap between LLM-generated surveys and those written by humans remains significant, particularly in terms of outline quality and citation accuracy. To close these gaps, we introduce SURVEYFORGE, which first generates the outline by analyzing the logical structure of human-written outlines and referring to the retrieved domain-related articles. Subsequently, leveraging high-quality papers retrieved from memory by our scholar navigation agent, SURVEYFORGE can automatically generate and refine the content of the generated article. Moreover, to achieve a comprehensive evaluation, we construct SurveyBench, which includes 100 human-written survey papers for win-rate comparison and assesses AI-generated survey papers across three dimensions: reference, outline, and content quality. Experiments demonstrate that SURVEYFORGE can outperform previous works such as AutoSurvey.
pdf
bib
abs
Making LLMs Better Many-to-Many Speech-to-Text Translators with Curriculum Learning
Yexing Du
|
Youcheng Pan
|
Ziyang Ma
|
Bo Yang
|
Yifan Yang
|
Keqi Deng
|
Xie Chen
|
Yang Xiang
|
Ming Liu
|
Bing Qin
Multimodal Large Language Models (MLLMs) have achieved significant success in Speech-to-Text Translation (S2TT) tasks. While most existing research has focused on English-centric translation directions, the exploration of many-to-many translation is still limited by the scarcity of parallel data. To address this, we propose a three-stage curriculum learning strategy that leverages the machine translation capabilities of large language models and adapts them to S2TT tasks, enabling effective learning in low-resource settings. We trained MLLMs with varying parameter sizes (3B, 7B, and 32B) and evaluated the proposed strategy using the FLEURS and CoVoST-2 datasets. Experimental results show that the proposed strategy achieves state-of-the-art average performance in 15×14 language pairs, requiring fewer than 10 hours of speech data per language to achieve competitive results. The source code and models are released at
https://github.com/yxduir/LLM-SRT.
pdf
bib
abs
AbGen: Evaluating Large Language Models in Ablation Study Design and Evaluation for Scientific Research
Yilun Zhao
|
Weiyuan Chen
|
Zhijian Xu
|
Manasi Patwardhan
|
Chengye Wang
|
Yixin Liu
|
Lovekesh Vig
|
Arman Cohan
We introduce AbGen, the first benchmark designed to evaluate the capabilities of LLMs in designing ablation studies for scientific research. AbGen consists of 2,000 expert-annotated examples derived from 677 NLP papers. In this benchmark, LLMs are tasked with generating detailed ablation study designs for a specified module or process based on the given research context. Our evaluation of leading LLMs, such as GPT-4o and Llama-3.1, highlights a significant performance gap between these models and human experts in terms of the importance, faithfulness, and soundness of the ablation study designs. Moreover, we demonstrate that current automated evaluation methods are not reliable for our task, as they show a significant discrepancy when compared to human assessment. To better investigate this, we develop AbGen-Eval, a meta-evaluation benchmark designed to assess the reliability of commonly used automated evaluation systems in measuring LLM performance on our task. We investigate various LLM-based evaluation methods on AbGen-Eval, providing insights for future research on developing more effective and reliable LLM-based evaluation systems for complex scientific tasks.
pdf
bib
abs
Redundancy Principles for MLLMs Benchmarks
Zicheng Zhang
|
Xiangyu Zhao
|
Xinyu Fang
|
Chunyi Li
|
Xiaohong Liu
|
Xiongkuo Min
|
Haodong Duan
|
Kai Chen
|
Guangtao Zhai
With the rapid iteration of Multi-modality Large Language Models (MLLMs) and the evolving demands of the field, the number of benchmarks produced annually has surged into the hundreds. The rapid growth has inevitably led to significant redundancy among benchmarks. Therefore, it is crucial to take a step back and critically assess the current state of redundancy and propose targeted principles for constructing effective MLLM benchmarks. In this paper, we focus on redundancy from three key perspectives: 1) Redundancy of benchmark capability dimensions, 2) Redundancy in the number of test questions, and 3) Cross-benchmark redundancy within specific domains. Through a comprehensive analysis of hundreds of MLLMs’ performance across more than 20 benchmarks, we aim to quantitatively measure the level of redundancy in existing MLLM evaluations, provide valuable insights to guide the future development of MLLM benchmarks, and offer strategies to refine and address redundancy issues effectively.
pdf
bib
abs
WavRAG: Audio-Integrated Retrieval Augmented Generation for Spoken Dialogue Models
Yifu Chen
|
Shengpeng Ji
|
Haoxiao Wang
|
Ziqing Wang
|
Siyu Chen
|
Jinzheng He
|
Jin Xu
|
Zhou Zhao
Retrieval Augmented Generation (RAG) has gained widespread adoption owing to its capacity to empower large language models (LLMs) to integrate external knowledge. However, existing RAG frameworks are primarily designed for text-based LLMs and rely on Automatic Speech Recognition to process speech input, which discards crucial audio information, risks transcription errors, and increases computational overhead. Therefore, we introduce WavRAG, the first retrieval augmented generation framework with native, end-to-end audio support. WavRAG offers two key features: 1) Bypassing ASR, WavRAG directly processes raw audio for both embedding and retrieval. 2) WavRAG integrates audio and text into a unified knowledge representation. Specifically, we propose the WavRetriever to facilitate the retrieval from a text-audio hybrid knowledge base, and further enhance the in-context capabilities of spoken dialogue models through the integration of chain-of-thought reasoning. In comparison to state-of-the-art ASR-Text RAG pipelines, WavRAG achieves comparable retrieval performance while delivering a 10x acceleration. Furthermore, WavRAG’s unique text-audio hybrid retrieval capability extends the boundaries of RAG to the audio modality.
pdf
bib
abs
ChildMandarin: A Comprehensive Mandarin Speech Dataset for Young Children Aged 3-5
Jiaming Zhou
|
Shiyao Wang
|
Shiwan Zhao
|
Jiabei He
|
Haoqin Sun
|
Hui Wang
|
Cheng Liu
|
Aobo Kong
|
Yujie Guo
|
Xi Yang
|
Yequan Wang
|
Yonghua Lin
|
Yong Qin
Automatic speech recognition (ASR) systems have advanced significantly with models like Whisper, Conformer, and self-supervised frameworks such as Wav2vec 2.0 and HuBERT. However, developing robust ASR models for young children’s speech remains challenging due to differences in pronunciation, tone, and pace compared to adult speech. In this paper, we introduce a new Mandarin speech dataset focused on children aged 3 to 5, addressing the scarcity of resources in this area. The dataset comprises 41.25 hours of speech with carefully crafted manual transcriptions, collected from 397 speakers across various provinces in China, with balanced gender representation. We provide a comprehensive analysis of speaker demographics, speech duration distribution and geographic coverage. Additionally, we evaluate ASR performance on models trained from scratch, such as Conformer, as well as fine-tuned pre-trained models like HuBERT and Whisper, where fine-tuning demonstrates significant performance improvements. Furthermore, we assess speaker verification (SV) on our dataset, showing that, despite the challenges posed by the unique vocal characteristics of young children, the dataset effectively supports both ASR and SV tasks. This dataset is a valuable contribution to Mandarin child speech research and holds potential for applications in educational technology and child-computer interaction. It will be open-source and freely available for all academic purposes.
pdf
bib
abs
Finding the Sweet Spot: Preference Data Construction for Scaling Preference Optimization
Yao Xiao
|
Hai Ye
|
Linyao Chen
|
Hwee Tou Ng
|
Lidong Bing
|
Xiaoli Li
|
Roy Ka-Wei Lee
Iterative data generation and model retraining are widely used to align large language models (LLMs). This process typically involves a policy model to generate on-policy responses and a reward model to guide training data selection. Direct Preference Optimization (DPO) further enhances this process by constructing preference pairs of chosen and rejected responses. In this work, we aim to scale up the number of on-policy samples via repeated random sampling to improve alignment performance. Conventional practice selects the sample with the highest reward as chosen and the lowest as rejected for DPO. However, our experiments reveal that this strategy leads to a decline in performance as the sample size increases. To address this, we investigate preference data construction through the lens of the underlying normal distribution of sample rewards. We categorize the reward space into seven representative points and systematically explore all 21 ($C_7^2$) pairwise combinations. Through evaluations on four models using AlpacaEval 2, we find that selecting the rejected response at reward position 𝜇 - 2𝜎, rather than at the minimum reward, is crucial for optimal performance. We finally introduce a scalable preference data construction strategy that consistently enhances model performance as the sample scale increases.
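The selection rule described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `build_preference_pair`, the use of the population standard deviation, and the choice of the highest-reward sample as the chosen response are assumptions made for illustration only.

```python
import statistics


def build_preference_pair(responses, rewards, k=2.0):
    """Return (chosen, rejected) for one prompt.

    Assumptions (not from the paper): chosen is the highest-reward sample,
    and rejected is the sample whose reward lies closest to mu - k * sigma.
    """
    if len(responses) != len(rewards) or len(responses) < 2:
        raise ValueError("need at least two responses with matching rewards")

    mu = statistics.mean(rewards)
    sigma = statistics.pstdev(rewards)  # population standard deviation
    target = mu - k * sigma

    chosen_idx = max(range(len(rewards)), key=lambda i: rewards[i])
    # Exclude the chosen sample so the pair always has two distinct responses.
    rejected_idx = min(
        (i for i in range(len(rewards)) if i != chosen_idx),
        key=lambda i: abs(rewards[i] - target),
    )
    return responses[chosen_idx], responses[rejected_idx]
```

With few samples the target position often coincides with the minimum reward; the two rules diverge only as the number of sampled responses grows, which is the regime the abstract addresses.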
pdf
bib
abs
Enhancing Safe and Controllable Protein Generation via Knowledge Preference Optimization
Yuhao Wang
|
Keyan Ding
|
Kehua Feng
|
Zeyuan Wang
|
Ming Qin
|
Xiaotong Li
|
Qiang Zhang
|
Huajun Chen
Protein language models have emerged as powerful tools for sequence generation, offering substantial advantages in functional optimization and *de novo* design. However, these models also present significant risks of generating harmful protein sequences, such as those that enhance viral transmissibility or evade immune responses. These concerns underscore critical biosafety and ethical challenges. To address these issues, we propose a Knowledge-guided Preference Optimization (KPO) framework that integrates prior knowledge via a Protein Safety Knowledge Graph. This framework utilizes an efficient graph pruning strategy to identify preferred sequences and employs reinforcement learning to minimize the risk of generating harmful proteins. Experimental results demonstrate that KPO effectively reduces the likelihood of producing hazardous sequences while maintaining high functionality, offering a robust safety assurance framework for applying generative models in biotechnology.
pdf
bib
abs
SINCon: Mitigate LLM-Generated Malicious Message Injection Attack for Rumor Detection
Mingqing Zhang
|
Qiang Liu
|
Xiang Tao
|
Shu Wu
|
Liang Wang
In the era of rapidly evolving large language models (LLMs), state-of-the-art rumor detection systems, particularly those based on Message Propagation Trees (MPTs), which represent a conversation tree with the post as its root and the replies as its descendants, are facing increasing threats from adversarial attacks that leverage LLMs to generate and inject malicious messages. Existing methods are based on the assumption that different nodes exhibit varying degrees of influence on predictions. They define nodes with high predictive influence as important nodes and target them for attacks. If the model treats nodes’ predictive influence more uniformly, attackers will find it harder to target high predictive influence nodes. In this paper, we propose Similarizing the predictive Influence of Nodes with Contrastive Learning (SINCon), a defense mechanism that encourages the model to learn graph representations where nodes with varying importance have a more uniform influence on predictions. Extensive experiments on the Twitter and Weibo datasets demonstrate that SINCon not only preserves high classification accuracy on clean data but also significantly enhances resistance against LLM-driven message injection attacks.
pdf
bib
abs
Outlier-Safe Pre-Training for Robust 4-Bit Quantization of Large Language Models
Jungwoo Park
|
Taewhoo Lee
|
Chanwoong Yoon
|
Hyeon Hwang
|
Jaewoo Kang
Extreme activation outliers in Large Language Models (LLMs) critically degrade quantization performance, hindering efficient on-device deployment. While channel-wise operations and adaptive gradient scaling are recognized causes, practical mitigation remains challenging. We introduce **Outlier-Safe Pre-Training (OSP)**, a practical guideline that proactively prevents outlier formation, rather than relying on post-hoc mitigation. OSP combines three key innovations: (1) the Muon optimizer, eliminating privileged bases while maintaining training efficiency, (2) Single-Scale RMSNorm, preventing channel-wise amplification, and (3) a learnable embedding projection, redistributing activation magnitudes. We validate OSP by training a 1.4B-parameter model on 1 trillion tokens, which is the first production-scale LLM trained without such outliers. Under aggressive 4-bit quantization, our OSP model achieves a 35.7 average score across 10 benchmarks (versus 26.5 for an Adam-trained model), with only a 2% training overhead. Remarkably, OSP models exhibit near-zero excess kurtosis (0.04) compared to extreme values (1818.56) in standard models, fundamentally altering LLM quantization behavior. Our work demonstrates that outliers are not inherent to LLMs but are consequences of training strategies, paving the way for more efficient LLM deployment.
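For readers unfamiliar with the statistic quoted above, excess kurtosis is the standardized fourth moment minus 3, so a Gaussian scores 0 and heavy-tailed data score far higher. The sketch below uses the plain population formula; which activations the authors measure and how they aggregate them is not stated in the abstract, so the function is only an illustration.

```python
def excess_kurtosis(xs):
    """Population excess kurtosis: m4 / m2**2 - 3."""
    n = len(xs)
    if n == 0:
        raise ValueError("empty input")
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n  # second central moment
    m4 = sum((x - mean) ** 4 for x in xs) / n  # fourth central moment
    if m2 == 0:
        raise ValueError("zero variance")
    return m4 / (m2 * m2) - 3.0
```

A single extreme value among otherwise small entries drives the statistic up sharply, which is why it serves as a compact indicator of activation outliers.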
pdf
bib
abs
Agentic Knowledgeable Self-awareness
Shuofei Qiao
|
Zhisong Qiu
|
Baochang Ren
|
Xiaobin Wang
|
Xiangyuan Ru
|
Ningyu Zhang
|
Xiang Chen
|
Yong Jiang
|
Pengjun Xie
|
Fei Huang
|
Huajun Chen
Large Language Models (LLMs) have achieved considerable performance across various agentic planning tasks. However, traditional approaches adopt a “flood irrigation” methodology that indiscriminately injects gold trajectories, external feedback, and domain knowledge into agent models. This practice overlooks the fundamental human cognitive principle of self-awareness: the ability to dynamically assess situational demands and strategically employ resources during decision-making. We propose Agentic Knowledgeable Self-awareness to address this gap, a novel paradigm enabling LLM-based agents to autonomously regulate knowledge utilization. Specifically, we propose KnowSelf, a data-centric approach that equips agents with human-like knowledgeable self-awareness. Concretely, we devise a heuristic situation judgement criterion to mark special tokens on the agent’s self-explored trajectories for collecting training data. Through a two-stage training process, the agent model can switch between different situations by generating specific special tokens, achieving optimal planning effects with minimal costs. Our experiments demonstrate that KnowSelf can outperform various strong baselines on different tasks and models with minimal use of external knowledge.
pdf
bib
abs
A Unified Agentic Framework for Evaluating Conditional Image Generation
Jifang Wang
|
Yangxue Yangxue
|
Longyue Wang
|
Zhenran Xu
|
Yiyu Wang
|
Yaowei Wang
|
Weihua Luo
|
Kaifu Zhang
|
Baotian Hu
|
Min Zhang
Conditional image generation has gained significant attention for its ability to personalize content. However, the field faces challenges in developing task-agnostic, reliable, and explainable evaluation metrics. This paper introduces CIGEval, a unified agentic framework for comprehensive evaluation of conditional image generation tasks. CIGEval utilizes large multimodal models (LMMs) as its core, integrating a multi-functional toolbox and establishing a fine-grained evaluation framework. Additionally, we synthesize evaluation trajectories for fine-tuning, empowering smaller LMMs to autonomously select appropriate tools and conduct nuanced analyses based on tool outputs. Experiments across seven prominent conditional image generation tasks demonstrate that CIGEval (GPT-4o version) achieves a high correlation of 0.4625 with human assessments, closely matching the inter-annotator correlation of 0.47. Notably, when implemented with 7B open-source LMMs using only 2.3K training trajectories, CIGEval surpasses the previous GPT-4o-based state-of-the-art method. These findings indicate that CIGEval holds great potential for automating evaluation of image generation tasks while maintaining human-level reliability.
pdf
bib
abs
Planning-Driven Programming: A Large Language Model Programming Workflow
Chao Lei
|
Yanchuan Chang
|
Nir Lipovetzky
|
Krista A. Ehinger
The strong performance of large language models (LLMs) raises extensive discussion on their application to code generation. Recent research suggests continuous program refinements through visible tests to improve code generation accuracy in LLMs. However, these methods suffer from LLMs’ inefficiency and limited reasoning capacity. In this work, we propose an LLM programming workflow (LPW) designed to improve both initial code generation and subsequent refinements within a structured two-phase workflow. Specifically, the solution generation phase formulates a solution plan, which is then verified through visible tests to specify the intended natural language solution. Subsequently, the code implementation phase drafts an initial code according to the solution plan and its verification. If the generated code fails the visible tests, the plan verification serves as the intended solution to consistently inform the refinement process for correcting bugs. Compared to state-of-the-art methods across various existing LLMs, LPW significantly improves the Pass@1 accuracy by up to 16.4% on well-established text-to-code generation benchmarks. LPW also sets new state-of-the-art Pass@1 accuracy, achieving 98.2% on HumanEval, 84.8% on MBPP, 59.3% on LiveCode, 62.6% on APPS, and 34.7% on CodeContests, using GPT-4o as the backbone. Our code is publicly available at: https://github.com/you68681/lpw.
pdf
bib
abs
Can Knowledge Graphs Make Large Language Models More Trustworthy? An Empirical Study Over Open-ended Question Answering
Yuan Sui
|
Yufei He
|
Zifeng Ding
|
Bryan Hooi
Recent works integrating Knowledge Graphs (KGs) have shown promising improvements in enhancing the reasoning capabilities of Large Language Models (LLMs). However, existing benchmarks primarily focus on closed-ended tasks, leaving a gap in evaluating performance on more complex, real-world scenarios. This limitation also hinders a thorough assessment of KGs’ potential to reduce hallucinations in LLMs. To address this, we introduce OKGQA, a new benchmark specifically designed to evaluate LLMs augmented with KGs in open-ended, real-world question answering settings. OKGQA reflects practical complexities through diverse question types and incorporates metrics to quantify both hallucination rates and reasoning improvements in LLM+KG models. To consider the scenarios in which KGs may contain varying levels of errors, we propose a benchmark variant, OKGQA-P, to assess model performance when the semantics and structure of KGs are deliberately perturbed and contaminated. In this paper, we aim to (1) explore whether KGs can make LLMs more trustworthy in an open-ended setting, and (2) conduct a comparative analysis to shed light on method design. We believe this study can facilitate a more complete performance comparison and encourage continuous improvement in integrating KGs with LLMs to mitigate hallucination and make LLMs more trustworthy.
pdf
bib
abs
Nudging: Inference-time Alignment of LLMs via Guided Decoding
Yu Fei
|
Yasaman Razeghi
|
Sameer Singh
Large language models (LLMs) require alignment to effectively and safely follow user instructions. This process necessitates training an aligned version for every base model, resulting in significant computational overhead. In this work, we propose NUDGING, a simple, training-free algorithm that aligns any base model at inference time using a small aligned model. NUDGING is motivated by recent findings that alignment primarily alters the model’s behavior on a small subset of stylistic tokens (e.g., discourse markers). We find that base models are significantly more uncertain when generating these tokens. Building on this insight, NUDGING employs a small aligned model to generate nudging tokens to guide the base model’s output during decoding when the base model’s uncertainty is high, with only a minor additional inference overhead. We evaluate NUDGING across 3 model families on a diverse range of open-instruction tasks. Without any training, nudging a large base model with a 7×-14× smaller aligned model achieves zero-shot performance comparable to, and sometimes surpassing, that of large aligned models. By operating at the token level, NUDGING enables off-the-shelf collaboration between model families. For instance, nudging Gemma-2-27b with Llama-2-7b-chat outperforms Llama-2-70b-chat on various tasks. Overall, our work offers a modular and cost-efficient solution to LLM alignment. Our code and demo are available at: https://fywalter.github.io/nudging/.
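The uncertainty-gated decoding loop described above might look roughly like the following sketch. Everything here is hypothetical: the models are stand-in callables that return a next-token distribution, uncertainty is approximated by the top-1 probability, and the threshold of 0.5 is an arbitrary assumption rather than a value taken from the paper or its released code.

```python
from typing import Callable, Dict, List, Tuple

# A "model" here is any callable mapping a token prefix to a next-token distribution.
NextDist = Callable[[List[str]], Dict[str, float]]


def nudged_decode(
    base: NextDist,
    aligned: NextDist,
    prompt: List[str],
    max_tokens: int = 20,
    threshold: float = 0.5,  # assumed cut-off on the base model's top-1 probability
    eos: str = "<eos>",
) -> Tuple[List[str], List[int]]:
    """Greedy decoding where the small aligned model takes over on uncertain steps."""
    out = list(prompt)
    nudged_positions = []  # positions whose token came from the aligned model
    for _ in range(max_tokens):
        dist = base(out)
        token, prob = max(dist.items(), key=lambda kv: kv[1])
        if prob < threshold:
            # The base model is uncertain: let the aligned model supply this token.
            aligned_dist = aligned(out)
            token = max(aligned_dist, key=aligned_dist.get)
            nudged_positions.append(len(out))
        if token == eos:
            break
        out.append(token)
    return out, nudged_positions
```

Because the hand-off happens per token on plain text prefixes, the two models only need to agree on the text interface, which is consistent with the abstract's point about collaboration across model families.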
pdf
bib
abs
Unveiling Attractor Cycles in Large Language Models: A Dynamical Systems View of Successive Paraphrasing
Zhilin Wang
|
Yafu Li
|
Jianhao Yan
|
Yu Cheng
|
Yue Zhang
Dynamical systems theory provides a framework for analyzing iterative processes and evolution over time. Within such systems, repetitive transformations can lead to stable configurations, known as attractors, including fixed points and limit cycles. Applying this perspective to large language models (LLMs), which iteratively map input text to output text, provides a principled approach to characterizing long-term behaviors. Successive paraphrasing serves as a compelling testbed for exploring such dynamics, as paraphrases re-express the same underlying meaning with linguistic variation. Although LLMs are expected to explore a diverse set of paraphrases in the text space, our study reveals that successive paraphrasing converges to stable periodic states, such as 2-period attractor cycles, limiting linguistic diversity. This phenomenon is attributed to the self-reinforcing nature of LLMs, as they iteratively favour and amplify certain textual forms over others. This pattern persists with increasing generation randomness or alternating prompts and LLMs. These findings underscore inherent constraints in LLM generative capability, while offering a novel dynamical systems perspective for studying their expressive potential.
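As a toy illustration of what a 2-period attractor cycle means for an iterated map, the sketch below repeatedly applies a step function and reports when a state recurs. It assumes exact string equality between states, which is a simplification: the abstract does not say how the authors measure recurrence, and `find_cycle` is a hypothetical helper.

```python
def find_cycle(step, start, max_iters=100):
    """Iterate `step` from `start`; return (transient_length, period) or None.

    `step` stands in for one round of paraphrasing (text in, text out).
    """
    first_seen = {}
    state = start
    for i in range(max_iters):
        if state in first_seen:
            # The trajectory has returned to an earlier state: a cycle is closed.
            return first_seen[state], i - first_seen[state]
        first_seen[state] = i
        state = step(state)
    return None
```

A result of `(1, 2)` would mean the sequence leaves its starting text after one step and then alternates between two texts indefinitely.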
pdf
bib
abs
SCAR: Data Selection via Style Consistency-Aware Response Ranking for Efficient Instruction-Tuning of Large Language Models
Zhuang Li
|
Yuncheng Hua
|
Thuy-Trang Vu
|
Haolan Zhan
|
Lizhen Qu
|
Gholamreza Haffari
Recent studies emphasize that manually ensuring a consistent response style and maintaining high data quality in training sets can significantly improve the performance of fine-tuned Large Language Models (LLMs) while reducing the number of training examples needed. However, the precise definition of style and the relationship between style, data quality, and LLM performance remain unclear. This research identifies two key stylistic elements in responses: linguistic form and instructional surprisal. We find that, among training data of comparable quality, higher consistency in these response elements leads to better LLM performance. Inspired by this, we introduce Style Consistency-Aware Response Ranking (SCAR), which automatically prioritizes instruction-response pairs in the training set based on their response stylistic consistency. By selecting the most style-consistent examples, using 0.7% of the full dataset in certain cases, the fine-tuned LLMs can match or even surpass the performance of models trained on the entire dataset in coding and open-ended question-answering benchmarks. Code and data are available at https://github.com/zhuang-li/SCAR.
pdf
bib
abs
HFT: Half Fine-Tuning for Large Language Models
Tingfeng Hui
|
Zhenyu Zhang
|
Shuohuan Wang
|
Weiran Xu
|
Yu Sun
|
Hua Wu
Large language models (LLMs) with one or more fine-tuning phases have become necessary to unlock various capabilities, enabling LLMs to follow natural language instructions and align with human preferences. However, sequential training carries the risk of catastrophic forgetting: the parametric knowledge or abilities learned in previous stages may be overwhelmed by incoming training data. This paper finds that LLMs can restore some original knowledge by regularly resetting partial parameters. Inspired by this, we introduce Half Fine-Tuning (HFT) for LLMs, as a substitute for full fine-tuning (FFT), to mitigate the forgetting issues: half of the parameters are selected to learn new tasks, while the other half are frozen to retain previous knowledge. We provide a feasibility analysis from the optimization perspective and interpret the parameter selection operation as a regularization term. HFT could be seamlessly integrated into existing fine-tuning frameworks without changing the model architecture. Extensive experiments and analysis on supervised fine-tuning, direct preference optimization, and continual learning consistently demonstrate the effectiveness, robustness, and efficiency of HFT. Compared with FFT, HFT not only significantly alleviates the forgetting problem, but also achieves the best performance in a series of downstream benchmarks, with an approximately 30% reduction in training time.
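The core idea, updating one half of the parameters while leaving the other half untouched, can be sketched in a framework-free way as follows. The selection granularity (whole named tensors, chosen uniformly at random) and the helper names `select_trainable` and `sgd_step` are assumptions for illustration; the abstract does not specify how the halves are chosen.

```python
import random


def select_trainable(param_names, seed=0):
    """Pick half of the parameter names to train (random choice is an assumption)."""
    rng = random.Random(seed)
    names = list(param_names)
    rng.shuffle(names)
    return set(names[: len(names) // 2])


def sgd_step(params, grads, trainable, lr=0.1):
    """Apply one SGD update to trainable parameters; frozen ones are copied unchanged."""
    return {
        name: (value - lr * grads[name]) if name in trainable else value
        for name, value in params.items()
    }
```

In a real training framework the same effect is usually obtained by disabling gradients for the frozen half, which also skips their optimizer state and is one plausible source of the reported training-time savings.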
pdf
bib
abs
Beyond Surface Simplicity: Revealing Hidden Reasoning Attributes for Precise Commonsense Diagnosis
Huijun Lian
|
Zekai Sun
|
Keqi Chen
|
Yingming Gao
|
Ya Li
Commonsense question answering (QA) benchmarks are widely used to evaluate the commonsense abilities of large language models. However, answering commonsense questions correctly requires not only knowledge but also reasoning, even for seemingly simple questions. We demonstrate that such hidden reasoning attributes in commonsense questions can lead to evaluation accuracy differences of up to 24.8% across different difficulty levels in the same benchmark. Current benchmarks overlook these hidden reasoning attributes, making it difficult to assess a model’s specific levels of commonsense knowledge and reasoning ability. To address this issue, we introduce ReComSBench, a novel framework that reveals hidden reasoning attributes behind commonsense questions by leveraging the knowledge generated during the reasoning process. Additionally, ReComSBench proposes three new metrics for decoupled evaluation: Knowledge Balanced Accuracy, Marginal Sampling Gain, and Knowledge Coverage Ratio. Experiments show that ReComSBench provides insights into model performance that traditional benchmarks cannot offer. The difficulty stratification based on revealed hidden reasoning attributes performs as effectively as the model-probability-based approach but is more generalizable and better suited for improving a model’s commonsense reasoning abilities. By uncovering and analyzing the hidden reasoning attributes in commonsense data, ReComSBench offers a new approach to enhancing existing commonsense benchmarks.
pdf
bib
abs
From Objectives to Questions: A Planning-based Framework for Educational Mathematical Question Generation
Cheng Cheng
|
Zhenya Huang
|
GuanHao Zhao
|
Yuxiang Guo
|
Xin Lin
|
Jinze Wu
|
Xin Li
|
Shijin Wang
Automatically generating high-quality mathematical problems that align with educational objectives is a crucial task in NLP-based educational technology. Traditional generation methods focus primarily on textual quality, but they often overlook educational objectives. Moreover, these methods address only single-dimensional, simple question generation, failing to meet complex, multifaceted educational requirements. To address these challenges, we constructed and annotated EduMath, a dataset of 16k mathematical questions with multi-dimensional educational objectives. Based on this dataset, we developed EQGEVAL, which incorporates three evaluation dimensions and is designed to assess the ability of models to generate educational questions. Drawing inspiration from teachers’ problem design processes, we propose the Educational Question Planning with self-Reflection (EQPR) method for educational mathematical question generation, following a “plan-evaluate-optimize” approach. Specifically, by combining a planning algorithm based on Monte Carlo Tree Search with the generative capabilities of Large Language Models, we continuously optimize questions through iterative feedback. This self-optimization mechanism ensures that the generated questions both fit the educational context and strategically achieve specific basic educational objectives. Through extensive experiments based on EQGEVAL, we have demonstrated that EQPR achieves significant improvements in generating questions that meet multi-dimensional educational objectives.
pdf
bib
abs
RankCoT: Refining Knowledge for Retrieval-Augmented Generation through Ranking Chain-of-Thoughts
Mingyan Wu
|
Zhenghao Liu
|
Yukun Yan
|
Xinze Li
|
Shi Yu
|
Zheni Zeng
|
Yu Gu
|
Ge Yu
Retrieval-Augmented Generation (RAG) enhances the performance of Large Language Models (LLMs) by incorporating external knowledge. However, LLMs still encounter challenges in effectively utilizing the knowledge from retrieved documents, often being misled by irrelevant or noisy information. To address this issue, we introduce RankCoT, a knowledge refinement method that incorporates reranking signals when generating CoT-based summaries from the given query and all retrieved documents. During training, RankCoT prompts the LLM to generate Chain-of-Thought (CoT) candidates based on the query and individual documents. It then fine-tunes the LLM to directly reproduce the best CoT from these candidate outputs based on all retrieved documents, which requires the LLM to filter out irrelevant documents while generating CoT-style summaries. Additionally, RankCoT incorporates a self-reflection mechanism that further refines the CoT outputs, resulting in higher-quality training data. Our experiments demonstrate the effectiveness of RankCoT, showing its superior performance over other knowledge refinement models. Further analysis reveals that RankCoT can provide shorter but effective refinement results, enabling the generator to produce more accurate answers. All code and data are available at https://github.com/NEUIR/RankCoT.
pdf
bib
abs
Lost in Literalism: How Supervised Training Shapes Translationese in LLMs
Yafu Li
|
Ronghao Zhang
|
Zhilin Wang
|
Huajian Zhang
|
Leyang Cui
|
Yongjing Yin
|
Tong Xiao
|
Yue Zhang
Large language models (LLMs) have achieved remarkable success in machine translation, demonstrating impressive performance across diverse languages. However, translationese—characterized by overly literal and unnatural translations—remains a persistent challenge in LLM-based translation systems. Despite their pre-training on vast corpora of natural utterances, LLMs exhibit translationese errors and generate unexpected unnatural translations, stemming from biases introduced during supervised fine-tuning (SFT). In this work, we systematically evaluate the prevalence of translationese in LLM-generated translations and investigate its roots during supervised training. We introduce methods to mitigate these biases, including polishing golden references and filtering unnatural training instances. Empirical evaluations demonstrate that these approaches significantly reduce translationese while improving translation naturalness, validated by human evaluations and automatic metrics. Our findings highlight the need for training-aware adjustments to optimize LLM translation outputs, paving the way for more fluent and target-language-consistent translations.
pdf
bib
abs
Accurate KV Cache Quantization with Outlier Tokens Tracing
Yi Su
|
Yuechi Zhou
|
Quantong Qiu
|
Juntao Li
|
Qingrong Xia
|
Ping Li
|
Xinyu Duan
|
Zhefeng Wang
|
Min Zhang
The impressive capabilities of Large Language Models (LLMs) come at the cost of substantial computational resources during deployment. While KV Cache can significantly reduce recomputation during inference, it also introduces additional memory overhead. KV Cache quantization presents a promising solution, striking a good balance between memory usage and accuracy. Previous research has shown that the Keys are distributed by channel, while the Values are distributed by token. Consequently, the common practice is to apply channel-wise quantization to the Keys and token-wise quantization to the Values. However, our further investigation reveals that a small subset of unusual tokens exhibit unique characteristics that deviate from this pattern, which can substantially impact quantization accuracy. To address this, we develop a simple yet effective method to identify these tokens accurately during the decoding process and exclude them from quantization as outlier tokens, significantly improving overall accuracy. Extensive experiments show that our method achieves significant accuracy improvements under 2-bit quantization and can deliver a 6.4 times reduction in memory usage and a 2.3 times increase in throughput.
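The grouping convention mentioned above (one quantization group per channel for Keys, one per token for Values, with outlier tokens left in full precision) can be sketched with plain lists. This is a simplified illustration, not the authors' method: the uniform asymmetric quantizer is a common textbook choice, and the set of outlier token indices is taken as an input because the abstract does not describe how those tokens are detected.

```python
def quantize_group(values, bits=2):
    """Uniform asymmetric quantize-then-dequantize of one group of floats."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return list(values)  # constant group: nothing to quantize
    scale = (hi - lo) / ((1 << bits) - 1)
    return [lo + round((v - lo) / scale) * scale for v in values]


def quantize_kv(keys, values, outlier_tokens=frozenset(), bits=2):
    """keys and values are lists of per-token rows (tokens x channels).

    Rows whose index is in `outlier_tokens` are excluded from quantization
    and returned unchanged.
    """
    kept = [t for t in range(len(keys)) if t not in outlier_tokens]
    new_keys = [list(row) for row in keys]
    new_values = [list(row) for row in values]
    if not kept:
        return new_keys, new_values

    # Keys: one group per channel, spanning all non-outlier tokens.
    for c in range(len(keys[0])):
        column = quantize_group([keys[t][c] for t in kept], bits)
        for t, q in zip(kept, column):
            new_keys[t][c] = q

    # Values: one group per token, spanning all channels of that token.
    for t in kept:
        new_values[t] = quantize_group(values[t], bits)
    return new_keys, new_values
```

Excluding an outlier token matters most for the per-channel Key groups: a single extreme row would otherwise stretch the range of every channel and waste the few available quantization levels.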
pdf
bib
abs
Can Large Language Models Understand Internet Buzzwords Through User-Generated Content
Chen Huang
|
Junkai Luo
|
Xinzuo Wang
|
Wenqiang Lei
|
Jiancheng Lv
The massive user-generated content (UGC) available in Chinese social media is giving rise to the possibility of studying internet buzzwords. In this paper, we study whether large language models (LLMs) can generate accurate definitions for these buzzwords based on UGC as examples. Our work makes three contributions. First, we introduce CHEER, the first dataset of Chinese internet buzzwords, each annotated with a definition and relevant UGC. Second, we propose a novel method, called RESS, to effectively steer the comprehending process of LLMs to produce more accurate buzzword definitions, mirroring the skills of human language learning. Third, with CHEER, we benchmark the strengths and weaknesses of various off-the-shelf definition generation methods and our RESS. Our benchmark demonstrates the effectiveness of RESS while revealing a crucial shared challenge: comprehending unseen buzzwords and leveraging sufficient, high-quality UGC to facilitate this comprehension. We believe our work lays the groundwork for future advancements in LLM-based definition generation. Our dataset and code will be openly released.
pdf
bib
abs
EAC-MoE: Expert-Selection Aware Compressor for Mixture-of-Experts Large Language Models
Yuanteng Chen
|
Yuantian Shao
|
Peisong Wang
|
Jian Cheng
Mixture-of-Experts (MoE) has demonstrated promising potential in scaling LLMs. However, it is hindered by two critical challenges: (1) substantial GPU memory consumption to load all experts; (2) the low number of activated parameters does not translate into an equivalent inference speedup. In this work, we propose EAC-MoE, an Expert-Selection Aware Compressor for MoE-LLMs, which deeply aligns with the characteristics of MoE from the perspectives of quantization and pruning, and introduces two modules to address these two challenges respectively: (1) The expert selection bias caused by low-bit quantization is a major factor contributing to the performance degradation in MoE-LLMs. Based on this, we propose Quantization with Expert-Selection Calibration (QESC), which mitigates the expert selection bias by calibrating the routers within the MoE; (2) There are always certain experts that are not crucial for the corresponding tasks, yet still add inference latency. Therefore, we propose Pruning based on Expert-Selection Frequency (PESF), which significantly improves inference speed by pruning less frequently used experts for the current task. Extensive experiments demonstrate that our approach significantly reduces memory usage and improves inference speed with minimal performance degradation.
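Frequency-based expert pruning of the kind PESF describes can be illustrated with a short sketch. The function name `experts_to_keep`, the routing-log format, and the `min_share` threshold are all hypothetical; the abstract does not give the actual pruning criterion or its granularity.

```python
from collections import Counter


def experts_to_keep(routing_log, num_experts, min_share=0.05):
    """Return the expert ids whose share of routing decisions is at least `min_share`.

    `routing_log` holds, for each token of the current task, the list of expert
    ids the router selected for it (for example, its top-k experts).
    """
    counts = Counter(e for token_experts in routing_log for e in token_experts)
    total = sum(counts.values())
    if total == 0:
        return set(range(num_experts))  # no evidence yet: keep every expert
    return {e for e in range(num_experts) if counts[e] / total >= min_share}
```

Experts outside the returned set would be skipped at inference time for this task, trading a small amount of routing fidelity for lower latency.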
pdf
bib
abs
Activation Steering Decoding: Mitigating Hallucination in Large Vision-Language Models through Bidirectional Hidden State Intervention
Jingran Su
|
Jingfan Chen
|
Hongxin Li
|
Yuntao Chen
|
Li Qing
|
Zhaoxiang Zhang
Large Vision-Language Models (LVLMs) have demonstrated impressive capabilities in multimodal understanding, but they frequently suffer from hallucination - generating content inconsistent with visual inputs. In this work, we explore a novel perspective on hallucination mitigation by examining the intermediate activations of LVLMs during generation. Our investigation reveals that hallucinated content manifests as distinct, identifiable patterns in the model’s hidden state space. Motivated by this finding, we propose Activation Steering Decoding (ASD), a training-free approach that mitigates hallucination through targeted intervention in the model’s intermediate activations. ASD operates by first identifying directional patterns of hallucination in the activation space using a small calibration set, then employing a contrast decoding mechanism that computes the difference between positive and negative steering predictions. This approach effectively suppresses hallucination patterns while preserving the model’s general capabilities. Extensive experiments demonstrate that our method significantly reduces hallucination across multiple benchmarks while maintaining performance on general visual understanding tasks. Notably, our approach requires no model re-training or architectural modifications, making it readily applicable to existing deployed models.
pdf
bib
abs
Interactive Evolution: A Neural-Symbolic Self-Training Framework For Large Language Models
Fangzhi Xu
|
Qiushi Sun
|
Kanzhi Cheng
|
Jun Liu
|
Yu Qiao
|
Zhiyong Wu
One of the primary driving forces contributing to the superior performance of Large Language Models (LLMs) is the extensive availability of human-annotated natural language data, which is used for alignment fine-tuning. This inspired researchers to investigate self-training methods to mitigate the extensive reliance on human annotations. However, the current success of self-training has been primarily observed in natural language scenarios, rather than in the increasingly important neural-symbolic scenarios. To this end, we propose an environment-guided neural-symbolic self-training framework named ENVISIONS. It aims to overcome two main challenges: (1) the scarcity of symbolic data, and (2) the limited proficiency of LLMs in processing symbolic language. Extensive evaluations conducted on three distinct domains demonstrate the effectiveness of our approach. Additionally, we have conducted a comprehensive analysis to uncover the factors contributing to ENVISIONS’s success, thereby offering valuable insights for future research in this area.
pdf
bib
abs
Improving Medical Large Vision-Language Models with Abnormal-Aware Feedback
Yucheng Zhou
|
Lingran Song
|
Jianbing Shen
Existing Medical Large Vision-Language Models (Med-LVLMs), encapsulating extensive medical knowledge, demonstrate excellent capabilities in understanding medical images. However, there remain challenges in visual localization in medical images, which is crucial for abnormality detection and interpretation. To address these issues, we propose a novel UMed-LVLM designed to unveil medical abnormalities. Specifically, we collect a Medical Abnormalities Unveiling (MAU) dataset and propose a two-stage training method for UMed-LVLM. To collect the MAU dataset, we propose a prompting method that uses GPT-4V to generate diagnoses based on identified abnormal areas in medical images. Moreover, the two-stage training method includes Abnormal-Aware Instruction Tuning and Abnormal-Aware Rewarding, comprising Relevance Reward, Abnormal Localization Reward and Vision Relevance Reward. Experimental results demonstrate that our UMed-LVLM significantly outperforms existing Med-LVLMs in identifying and understanding medical abnormalities, achieving a 58% improvement over the baseline. In addition, this work shows that enhancing the abnormality detection capabilities of Med-LVLMs significantly improves their understanding of medical images and generalization capability. Our code and data are released at URL.
pdf
bib
abs
Upcycling Instruction Tuning from Dense to Mixture-of-Experts via Parameter Merging
Tingfeng Hui
|
Zhenyu Zhang
|
Shuohuan Wang
|
Yu Sun
|
Hua Wu
|
Sen Su
Mixture-of-Experts (MoE) shines brightly in large language models (LLMs) and demonstrates outstanding performance in plentiful natural language processing tasks. However, existing methods transforming LLMs from dense to MoE face significant data requirements and typically rely on large-scale post-training. In this paper, we propose Upcycling Instruction Tuning (UpIT), a data-efficient approach for tuning a dense pre-trained model into a MoE instruction model. Specifically, we first point out that intermediate checkpoints during instruction tuning of the dense model are naturally suitable for specialized experts, and then propose an expert expansion stage to obtain models with flexible numbers of experts, where a genetic algorithm and parameter merging are introduced to ensure sufficient diversity of the newly extended experts. To ensure that each specialized expert in the MoE model works as expected, we select a small amount of seed data on which each expert excels to pre-optimize the router. Extensive experiments with various data scales and upcycling settings demonstrate the outstanding performance and data efficiency of UpIT, as well as stable improvement in expert or data scaling. Further analysis reveals the importance of ensuring expert diversity in upcycling.
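In its simplest form, parameter merging is a weighted average of two checkpoints. The sketch below shows only that generic operation. It assumes checkpoints are flat dictionaries mapping parameter names to lists of floats and uses a hypothetical mixing weight `alpha`. The abstract does not state the merging rule or how the genetic algorithm selects parent checkpoints, so neither is modelled here.

```python
from typing import Dict, List

Checkpoint = Dict[str, List[float]]


def merge_checkpoints(ckpt_a: Checkpoint, ckpt_b: Checkpoint, alpha: float = 0.5) -> Checkpoint:
    """Return alpha * ckpt_a + (1 - alpha) * ckpt_b, parameter by parameter."""
    if ckpt_a.keys() != ckpt_b.keys():
        raise ValueError("checkpoints must share the same parameter names")
    merged: Checkpoint = {}
    for name in ckpt_a:
        a, b = ckpt_a[name], ckpt_b[name]
        if len(a) != len(b):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        merged[name] = [alpha * x + (1 - alpha) * y for x, y in zip(a, b)]
    return merged


# Two toy "expert" checkpoints (hypothetical values).
expert_a = {"ffn.weight": [1.0, 2.0]}
expert_b = {"ffn.weight": [3.0, 6.0]}
new_expert = merge_checkpoints(expert_a, expert_b, alpha=0.25)
```

Varying `alpha` across parent pairs is one way such a procedure could produce experts that differ from both parents.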
pdf
bib
abs
MapNav: A Novel Memory Representation via Annotated Semantic Maps for VLM-based Vision-and-Language Navigation
Lingfeng Zhang
|
Xiaoshuai Hao
|
Qinwen Xu
|
Qiang Zhang
|
Xinyao Zhang
|
Pengwei Wang
|
Jing Zhang
|
Zhongyuan Wang
|
Shanghang Zhang
|
Renjing Xu
Vision-language navigation (VLN) is a key task in Embodied AI, requiring agents to navigate diverse and unseen environments while following natural language instructions. Traditional approaches rely heavily on historical observations as spatio-temporal contexts for decision making, leading to significant storage and computational overhead. In this paper, we introduce MapNav, a novel end-to-end VLN model that leverages an Annotated Semantic Map (ASM) to replace historical frames. Specifically, our approach constructs a top-down semantic map at the start of each episode and updates it at each timestep, allowing for precise object mapping and structured navigation information. Then, we enhance this map with explicit textual labels for key regions, transforming abstract semantics into clear navigation cues and generating our ASM. The MapNav agent takes the constructed ASM as input and uses the powerful end-to-end capabilities of the VLM to empower VLN. Extensive experiments demonstrate that MapNav achieves state-of-the-art (SOTA) performance in both simulated and real-world environments, validating the effectiveness of our method. We will release our ASM generation source code and dataset to ensure reproducibility, contributing valuable resources to the field. We believe that our proposed MapNav can be used as a new memory representation method in VLN, paving the way for future research in this field.
pdf
bib
abs
Exploring Compositional Generalization of Multimodal LLMs for Medical Imaging
Zhenyang Cai
|
Junying Chen
|
Rongsheng Wang
|
Weihong Wang
|
Yonglin Deng
|
Dingjie Song
|
Yize Chen
|
Zixu Zhang
|
Benyou Wang
Medical imaging provides essential visual insights for diagnosis, and multimodal large language models (MLLMs) are increasingly utilized for its analysis due to their strong generalization capabilities; however, the underlying factors driving this generalization remain unclear. Current research suggests that multi-task training outperforms single-task training because different tasks can benefit each other, but it often overlooks the internal relationships within these tasks. To analyze this phenomenon, we employ **compositional generalization** (CG), which refers to a model’s ability to understand novel combinations by recombining learned elements, as a guiding framework. Since medical images can be precisely defined by **M**odality, **A**natomical area, and **T**ask, naturally providing an environment for exploring CG, we assembled 106 medical datasets to create **Med-MAT** for comprehensive experiments. The experiments confirmed that MLLMs can use CG to understand unseen medical images and identified CG as one of the main drivers of the generalization observed in multi-task training. Additionally, further studies demonstrated that CG effectively supports datasets with limited data and confirmed that MLLMs can achieve CG across classification and detection tasks, underscoring its broader generalization potential. Med-MAT is available at https://github.com/FreedomIntelligence/Med-MAT.
pdf
bib
abs
CLAIM: Mitigating Multilingual Object Hallucination in Large Vision-Language Models with Cross-Lingual Attention Intervention
Zekai Ye
|
Qiming Li
|
Xiaocheng Feng
|
Libo Qin
|
Yichong Huang
|
Baohang Li
|
Kui Jiang
|
Yang Xiang
|
Zhirui Zhang
|
Yunfei Lu
|
Duyu Tang
|
Dandan Tu
|
Bing Qin
Large Vision-Language Models (LVLMs) have demonstrated impressive multimodal abilities but remain prone to multilingual object hallucination: they are more likely to generate responses inconsistent with the visual input when queried in non-English languages than in English. Most existing approaches to this problem rely on pretraining or fine-tuning, which are resource-intensive. In this paper, inspired by observed disparities in cross-modal attention patterns across languages, we propose Cross-Lingual Attention Intervention for Mitigating multilingual object hallucination (CLAIM) in LVLMs, a novel, nearly training-free method that aligns attention patterns. CLAIM first identifies language-specific cross-modal attention heads, then estimates language shift vectors from English to the target language, and finally intervenes in the attention outputs during inference to facilitate cross-lingual visual perception capability alignment. Extensive experiments demonstrate that CLAIM achieves an average improvement of 13.56% (up to 30% in Spanish) on the POPE and 21.75% on the hallucination subsets of the MME benchmark across various languages. Further analysis reveals that multilingual attention divergence is most prominent in intermediate layers, highlighting their critical role in multilingual scenarios.
pdf
bib
abs
Wizard of Shopping: Target-Oriented E-commerce Dialogue Generation with Decision Tree Branching
Xiangci Li
|
Zhiyu Chen
|
Jason Ingyu Choi
|
Nikhita Vedula
|
Besnik Fetahu
|
Oleg Rokhlenko
|
Shervin Malmasi
The goal of conversational product search (CPS) is to develop an intelligent, chat-based shopping assistant that can directly interact with customers to understand shopping intents, ask clarification questions, and find relevant products. However, training such assistants is hindered mainly by the lack of reliable and large-scale datasets. Prior human-annotated CPS datasets are extremely small in size and lack integration with real-world product search systems. We propose a novel approach, TRACER, which leverages large language models (LLMs) to generate realistic and natural conversations for different shopping domains. TRACER’s novelty lies in grounding the generation to dialogue plans, which are product search trajectories predicted from a decision tree model that guarantees relevant product discovery in the fewest search conditions. We also release the first target-oriented CPS dataset Wizard of Shopping (WoS), containing highly natural and coherent conversations (3.6k) from three shopping domains. Finally, we demonstrate the quality and effectiveness of WoS via human evaluations and downstream tasks.
pdf
bib
abs
Qwen2.5-xCoder: Multi-Agent Collaboration for Multilingual Code Instruction Tuning
Jian Yang
|
Wei Zhang
|
Yibo Miao
|
Shanghaoran Quan
|
Zhenhe Wu
|
Qiyao Peng
|
Liqun Yang
|
Tianyu Liu
|
Zeyu Cui
|
Binyuan Hui
|
Junyang Lin
Recent advances in code understanding and generation demonstrate that code LLMs fine-tuned on a high-quality instruction dataset can gain powerful capabilities to address wide-ranging code-related tasks. However, most existing methods view each programming language in isolation and ignore the knowledge transfer among different programming languages. To bridge the gap among different programming languages, we introduce a novel multi-agent collaboration framework to enhance multilingual instruction tuning for code LLMs, where multiple language-specific intelligent agent components with generation memory work together to transfer knowledge from one language to another efficiently and effectively. Specifically, we first generate the language-specific instruction data from the code snippets and then provide the generated data as the seed data for language-specific agents. Multiple language-specific agents discuss and collaborate to formulate a new instruction and its corresponding solution (in a new or an existing programming language). To further encourage cross-lingual transfer, each agent stores its generation history as memory and then summarizes its merits and faults. Finally, the high-quality multilingual instruction data is used to encourage knowledge transfer among different programming languages to train Qwen2.5-xCoder. Experimental results on multilingual programming benchmarks demonstrate the superior performance of Qwen2.5-xCoder in sharing common knowledge, highlighting its potential to reduce the cross-lingual gap.
pdf
bib
abs
Cultivating Gaming Sense for Yourself: Making VLMs Gaming Experts
Wenxuan Lu
|
Jiangyang He
|
Zhanqiu Zhang
|
Steven Y. Guo
|
Tianning Zang
Developing agents capable of fluid gameplay in first/third-person games without API access remains a critical challenge in Artificial General Intelligence (AGI). Recent efforts leverage Vision Language Models (VLMs) as direct controllers, frequently pausing the game to analyze screens and plan action through language reasoning. However, this inefficient paradigm fundamentally restricts agents to basic and non-fluent interactions: relying on isolated VLM reasoning for each action makes it impossible to handle tasks requiring high reactivity (e.g., FPS shooting) or dynamic adaptability (e.g., ACT combat). To handle this, we propose a paradigm shift in gameplay agent design: instead of direct control, VLM serves as a developer, creating specialized execution modules tailored for tasks like shooting and combat. These modules handle real-time game interactions, elevating VLM to a high-level developer. Building upon this paradigm, we introduce GameSense, a gameplay agent framework where VLM develops task-specific game sense modules by observing task execution and leveraging vision tools and neural network training pipelines. These modules encapsulate action-feedback logic, ranging from direct action rules to neural network-based decisions. Experiments demonstrate that our framework is the first to achieve fluent gameplay in diverse genres, including ACT, FPS, and Flappy Bird, setting a new benchmark for game-playing agents.
pdf
bib
abs
Genius: A Generalizable and Purely Unsupervised Self-Training Framework For Advanced Reasoning
Fangzhi Xu
|
Hang Yan
|
Chang Ma
|
Haiteng Zhao
|
Qiushi Sun
|
Kanzhi Cheng
|
Junxian He
|
Jun Liu
|
Zhiyong Wu
Advancing LLM reasoning skills has captivated wide interest. However, current post-training techniques rely heavily on supervisory signals, such as outcome supervision or auxiliary reward models, which face the problem of scalability and high annotation costs. This motivates us to enhance LLM reasoning without the need for external supervision. Given the input query, the LLM seeks the globally optimal response by stepwise sampling and self-rewarding, and optimizes itself with the collected responses. Genius offers some technical solutions to address the following key challenges. To tackle the problem of how to determine the steps in the response via self-rewarding, Genius introduces a stepwise foresight re-sampling strategy to sample and estimate the step value by simulating future outcomes. Recognizing the intrinsic noise and uncertainty of self-supervision, we propose an advantage-calibrated optimization (ACO) loss function to mitigate estimation inconsistencies. In short, Genius provides an advanced initial step towards self-improving LLM reasoning with general queries and without supervision, revolutionizing reasoning scaling laws given the vast availability of general queries.
pdf
bib
abs
Extending Complex Logical Queries on Uncertain Knowledge Graphs
Weizhi Fei
|
Zihao Wang
|
Hang Yin
|
Yang Duan
|
Yangqiu Song
The study of machine learning-based logical query-answering enables reasoning with large-scale and incomplete knowledge graphs. This paper further advances this line of research by considering the uncertainty in the knowledge. The uncertain nature of knowledge is widely observed in the real world, but does not align seamlessly with the first-order logic underpinning existing studies. To bridge this gap, we study the setting of soft queries on uncertain knowledge, which is motivated by the establishment of soft constraint programming. We further propose an ML-based approach with both forward inference and backward calibration to answer soft queries on large-scale, incomplete, and uncertain knowledge graphs. Theoretical discussions reveal that our method ensures there are no catastrophic cascading errors in our forward inference algorithm while maintaining the same complexity as state-of-the-art inference algorithms for first-order queries. Empirical results justify the superior performance of our approach against previous ML-based methods with number embedding extensions.
pdf
bib
abs
Knowledge Decoupling via Orthogonal Projection for Lifelong Editing of Large Language Models
Haoyu Xu
|
Pengxiang Lan
|
Enneng Yang
|
Guibing Guo
|
Jianzhe Zhao
|
Linying Jiang
|
Xingwei Wang
As large language models (LLMs) require continuous knowledge updates and the mitigation of hallucination issues in generated content, lifelong model editing has become a prominent research area. Mainstream knowledge editing methods usually freeze the LLM’s original parameters and add extra trainable modules for new knowledge management, reducing interference with old knowledge. Although these approaches have achieved some success, our experiments show that, after extensive editing, the model’s knowledge understanding and memory capacity significantly degrade, particularly concerning early edited knowledge. The root cause is that subsequent edits interfere with the previously edited knowledge, and we refer to this phenomenon as knowledge coupling. To address this issue, we propose the Knowledge Decoupling Editing (KDE) method. Specifically, KDE stores the basis vectors of the representation space of past edits in a knowledge cache. It projects the gradient of the current edit onto a space orthogonal to previous knowledge for updating. This method effectively alleviates the coupling between different pieces of knowledge. We also propose a two-stage training strategy to better balance the model’s ability to edit new knowledge and distinguish whether a query is related to previous edits. This strategy gradually reduces the interference between new knowledge editing and query distinction, maintaining stable performance during long-term editing. We compared KDE with nine cutting-edge editing methods across multiple mainstream LLMs. The results demonstrate that, regarding question-answering ability and hallucination mitigation, KDE achieves average improvements of 14% and 61%, respectively.
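The projection step can be illustrated with a minimal sketch. It assumes the knowledge cache holds a plain list of basis vectors that need not be orthonormal, so they are orthonormalized with Gram-Schmidt first. The function names and toy vectors are hypothetical and not taken from the paper.

```python
from typing import List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def orthonormalize(vectors: List[Vector], eps: float = 1e-12) -> List[Vector]:
    """Gram-Schmidt: turn cached vectors into an orthonormal basis of their span."""
    basis: List[Vector] = []
    for v in vectors:
        w = list(v)
        for b in basis:
            c = dot(w, b)
            w = [wi - c * bi for wi, bi in zip(w, b)]
        norm = dot(w, w) ** 0.5
        if norm > eps:  # skip vectors already inside the span
            basis.append([wi / norm for wi in w])
    return basis


def project_out(gradient: Vector, cached_basis: List[Vector]) -> Vector:
    """Remove from the gradient every component lying in the span of past edits."""
    g = list(gradient)
    for b in orthonormalize(cached_basis):
        c = dot(g, b)
        g = [gi - c * bi for gi, bi in zip(g, b)]
    return g


# Toy cache spanning the first two coordinates (hypothetical values).
cache = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0]]
grad = [0.5, -2.0, 3.0]
update = project_out(grad, cache)
```

Because the resulting update is orthogonal to every cached vector, a step along it leaves those cached directions unchanged to first order, which is the intuition behind decoupling new edits from old ones.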
pdf
bib
abs
𝜙-Decoding: Adaptive Foresight Sampling for Balanced Inference-Time Exploration and Exploitation
Fangzhi Xu
|
Hang Yan
|
Chang Ma
|
Haiteng Zhao
|
Jun Liu
|
Qika Lin
|
Zhiyong Wu
Inference-time optimization scales computation to derive deliberate reasoning steps for effective performance. While previous search-based strategies address the short-sightedness of auto-regressive generation, the vast search space leads to excessive exploration and insufficient exploitation. To strike an efficient balance to derive the optimal step, we frame the decoding strategy as foresight sampling, leveraging simulated future steps to obtain globally optimal step estimation. Built on it, we propose a novel decoding strategy, named 𝜙-Decoding. To provide a precise and expressive estimation of step value, 𝜙-Decoding approximates two distributions via foresight and clustering. Sampling from the joint distribution, the optimal steps can be selected for exploitation. To support adaptive computation allocation, we propose in-width and in-depth pruning strategies, featuring a light-weight solution to achieve inference efficiency. Extensive experiments across seven benchmarks show 𝜙-Decoding outperforms strong baselines in both performance and efficiency. Additional analysis demonstrates its generalization across various LLMs and scalability across a wide range of computing budgets.
pdf
bib
abs
Can LLM Watermarks Robustly Prevent Unauthorized Knowledge Distillation?
Leyi Pan
|
Aiwei Liu
|
Shiyu Huang
|
Yijian Lu
|
Xuming Hu
|
Lijie Wen
|
Irwin King
|
Philip S. Yu
The radioactive nature of Large Language Model (LLM) watermarking enables the detection of watermarks inherited by student models when trained on the outputs of watermarked teacher models, making it a promising tool for preventing unauthorized knowledge distillation. However, the robustness of watermark radioactivity against adversarial actors remains largely unexplored. In this paper, we investigate whether student models can acquire the capabilities of teacher models through knowledge distillation while avoiding watermark inheritance. We propose two categories of watermark removal approaches: pre-distillation removal through untargeted and targeted training data paraphrasing (UP and TP), and post-distillation removal through inference-time watermark neutralization (WN). Extensive experiments across multiple model pairs, watermarking schemes and hyper-parameter settings demonstrate that both TP and WN thoroughly eliminate inherited watermarks, with WN achieving this while maintaining knowledge transfer efficiency and low computational overhead. Given the ongoing deployment of watermarking techniques in production LLMs, these findings emphasize the urgent need for more robust defense strategies.
pdf
bib
abs
Rethinking Reward Model Evaluation Through the Lens of Reward Overoptimization
Sunghwan Kim
|
Dongjin Kang
|
Taeyoon Kwon
|
Hyungjoo Chae
|
Dongha Lee
|
Jinyoung Yeo
Reward models (RMs) play a crucial role in reinforcement learning from human feedback (RLHF), aligning model behavior with human preferences. However, existing benchmarks for reward models show a weak correlation with the performance of optimized policies, suggesting that they fail to accurately assess the true capabilities of RMs. To bridge this gap, we explore several evaluation designs through the lens of reward overoptimization, i.e., a phenomenon that captures both how well the reward model aligns with human preferences and the dynamics of the learning signal it provides to the policy. The results highlight three key findings on how to construct a reliable benchmark: (i) it is important to minimize differences between chosen and rejected responses beyond correctness, (ii) evaluating reward models requires multiple comparisons across a wide range of chosen and rejected responses, and (iii) given that reward models encounter responses with diverse representations, responses should be sourced from a variety of models. However, we also observe that an extremely high correlation with the degree of overoptimization leads to comparatively lower correlation with certain downstream performance. Thus, when designing a benchmark, it is desirable to use the degree of overoptimization as a useful tool, rather than the end goal.
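Finding (ii) can be made concrete with a small sketch that scores a reward model on every chosen-versus-rejected pair for a prompt instead of on a single pair. The metric name `pairwise_accuracy` and the toy scores are assumptions for illustration. The abstract does not specify the exact aggregation the authors use.

```python
from itertools import product
from typing import Sequence


def pairwise_accuracy(chosen_scores: Sequence[float], rejected_scores: Sequence[float]) -> float:
    """Fraction of (chosen, rejected) pairs where the chosen response scores higher."""
    pairs = list(product(chosen_scores, rejected_scores))
    if not pairs:
        raise ValueError("need at least one chosen and one rejected response")
    wins = sum(1 for c, r in pairs if c > r)
    return wins / len(pairs)


# Hypothetical reward-model scores for one prompt.
chosen = [0.9, 0.4]          # two correct responses
rejected = [0.5, 0.1, 0.3]   # three incorrect responses
acc = pairwise_accuracy(chosen, rejected)
```

With these toy scores, a single-pair benchmark that happened to sample (0.9, 0.1) would report a perfect result, while the many-pair view exposes the one comparison (0.4 versus 0.5) that the reward model gets wrong.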
pdf
bib
abs
Inducing lexicons of in-group language with socio-temporal context
Christine de Kock
In-group language is an important signifier of group dynamics. This paper proposes a novel method for inducing lexicons of in-group language, which incorporates its socio-temporal context. Existing methods for lexicon induction do not capture the evolving nature of in-group language, nor the social structure of the community. Using dynamic word and user embeddings trained on conversations from online anti-women communities, our approach outperforms prior methods for lexicon induction. We develop a test set for the task of lexicon induction and a new lexicon of manosphere language, validated by human experts, which quantifies the relevance of each term to a specific sub-community at a given point in time. Finally, we present novel insights on in-group language which illustrate the utility of this approach.
pdf
bib
abs
LLaSE-G1: Incentivizing Generalization Capability for LLaMA-based Speech Enhancement
Boyi Kang
|
Xinfa Zhu
|
Zihan Zhang
|
Zhen Ye
|
Mingshuai Liu
|
Ziqian Wang
|
Yike Zhu
|
Guobin Ma
|
Jun Chen
|
Longshuai Xiao
|
Chao Weng
|
Wei Xue
|
Lei Xie
Recent advancements in language models (LMs) have demonstrated strong capabilities in semantic understanding and contextual modeling, which have flourished in generative speech enhancement (SE). However, many LM-based SE approaches primarily focus on semantic information, often neglecting the critical role of acoustic information, which leads to acoustic inconsistency after enhancement and limited generalization across diverse SE tasks. In this paper, we introduce LLaSE-G1, a LLaMA-based language model that incentivizes generalization capabilities for speech enhancement. LLaSE-G1 offers the following key contributions: First, to mitigate acoustic inconsistency, LLaSE-G1 employs continuous representations from WavLM as input and predicts speech tokens from X-Codec2, maximizing acoustic preservation. Second, to promote generalization capability, LLaSE-G1 introduces dual-channel inputs and outputs, unifying multiple SE tasks without requiring task-specific IDs. Third, LLaSE-G1 outperforms prior task-specific discriminative and generative SE models, demonstrating scaling effects at test time and emerging capabilities for unseen SE tasks. Additionally, we release our code and models to support further research in this area.
pdf
bib
abs
MadaKV: Adaptive Modality-Perception KV Cache Eviction for Efficient Multimodal Long-Context Inference
Kunxi Li
|
Zhonghua Jiang
|
Zhouzhou Shen
|
Zhaode Wang
|
Chengfei Lv
|
Shengyu Zhang
|
Fan Wu
|
Fei Wu
This paper introduces MadaKV, a modality-adaptive key-value (KV) cache eviction strategy designed to enhance the efficiency of multimodal large language models (MLLMs) in long-context inference. In multimodal scenarios, attention heads exhibit varying preferences for different modalities, resulting in significant disparities in modality importance across attention heads. Traditional KV cache eviction methods, which are tailored for unimodal settings, fail to capture modality-specific information, thereby yielding suboptimal performance. MadaKV addresses these challenges through two key components: modality preference adaptation and hierarchical compression compensation. By dynamically sensing modality information within attention heads and adaptively retaining critical tokens, MadaKV achieves substantial reductions in KV cache memory footprint and model inference decoding latency (1.3 to 1.5 times improvement) while maintaining high accuracy across various multimodal long-context tasks. Extensive experiments on representative MLLMs and the MileBench benchmark demonstrate the effectiveness of MadaKV compared to existing KV cache eviction methods.
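One hypothetical way to make eviction modality-aware is to split a head's cache budget across modalities in proportion to the attention mass each modality receives, then keep the highest-scoring tokens within each modality. The sketch below illustrates only that idea. The proportional rule, the largest-remainder rounding, and the function name are assumptions, not the paper's algorithm, and the hierarchical compression compensation component is not modelled.

```python
import math
from collections import defaultdict
from typing import List, Tuple


def modality_aware_keep(tokens: List[Tuple[str, float]], budget: int) -> List[int]:
    """tokens: (modality, attention_score) per cached token; returns indices to keep."""
    mass = defaultdict(float)
    for modality, score in tokens:
        mass[modality] += score
    total = sum(mass.values())
    if total <= 0 or budget <= 0:
        return []
    # Proportional share of the budget per modality, rounded by largest remainder.
    shares = {m: budget * s / total for m, s in mass.items()}
    quota = {m: math.floor(v) for m, v in shares.items()}
    leftover = budget - sum(quota.values())
    by_fraction = sorted(shares, key=lambda m: shares[m] - quota[m], reverse=True)
    for m in by_fraction[:leftover]:
        quota[m] += 1
    keep: List[int] = []
    for m, q in quota.items():
        idx = [i for i, (mod, _) in enumerate(tokens) if mod == m]
        idx.sort(key=lambda i: tokens[i][1], reverse=True)
        keep.extend(idx[:q])
    return sorted(keep)


# A head that spreads most of its attention over image tokens (hypothetical scores).
head = [("text", 0.20), ("image", 0.12), ("image", 0.11), ("text", 0.15),
        ("image", 0.11), ("image", 0.11), ("image", 0.10), ("image", 0.10)]
kept = modality_aware_keep(head, budget=4)
```

For this toy head, a modality-agnostic top-4 rule would keep both text tokens (indices 0 and 3) and only two image tokens, whereas the proportional split keeps one text token and three image tokens, reflecting the head's preference for the image modality.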
pdf
bib
abs
Efficient OpAmp Adaptation for Zoom Attention to Golden Contexts
Haoyuan Wu
|
Rui Ming
|
Haisheng Zheng
|
Zhuolun He
|
Bei Yu
Large language models (LLMs) have shown significant promise in question-answering (QA) tasks, particularly in retrieval-augmented generation (RAG) scenarios and long-context applications. However, their performance is hindered by noisy reference documents, which often distract from essential information. Despite fine-tuning efforts, Transformer-based architectures struggle to prioritize relevant content. This is evidenced by their tendency to allocate disproportionate attention to irrelevant or later-positioned documents. Recent work proposes the differential attention mechanism to address this issue, but this mechanism is limited by an unsuitable common-mode rejection ratio (CMRR) and high computational costs. Inspired by the operational amplifier (OpAmp), we propose the OpAmp adaptation to address these challenges, which is implemented with adapters efficiently. By integrating the adapter into pre-trained Transformer blocks, our approach enhances focus on the golden context without costly training from scratch. Empirical evaluations on noisy-context benchmarks reveal that our Qwen2.5-OpAmp-72B model, trained with our OpAmp adaptation, surpasses the performance of state-of-the-art LLMs, including DeepSeek-V3 and GPT-4o. Our code is available at https://github.com/wuhy68/OpampAdapter.
pdf
bib
abs
Language-Codec: Bridging Discrete Codec Representations and Speech Language Models
Shengpeng Ji
|
Minghui Fang
|
Jialong Zuo
|
Ziyue Jiang
|
Dingdong Wang
|
Hanting Wang
|
Hai Huang
|
Zhou Zhao
In recent years, large language models have achieved significant success in generative tasks (e.g., speech cloning and audio generation) related to speech, audio, music, and other signal domains. A crucial element of these models is the discrete acoustic codec, which serves as an intermediate representation replacing the mel-spectrogram. However, there exist several gaps between discrete codecs and downstream speech language models. Specifically, 1) due to the reconstruction paradigm of the codec model and the structure of residual vector quantization, the initial channel of the codebooks contains excessive information, making it challenging to directly generate acoustic tokens from weakly supervised signals such as text in downstream tasks; 2) achieving good reconstruction performance requires numerous codebooks, which increases the burden on downstream speech language models. Consequently, leveraging the characteristics of speech language models, we propose Language-Codec. In Language-Codec, we introduce a Masked Channel Residual Vector Quantization (MCRVQ) mechanism along with improved Fourier transform structures and a refined discriminator design to address the aforementioned gaps. We compare our method with competing audio compression algorithms and observe significant outperformance across extensive evaluations. Furthermore, we also validate the efficiency of Language-Codec on downstream speech language models. The source code and pretrained models will be open-sourced after the paper is accepted. Code is available at https://github.com/jishengpeng/Languagecodec.
pdf
bib
abs
Adaptive Tool Use in Large Language Models with Meta-Cognition Trigger
Wenjun Li
|
Dexun Li
|
Kuicai Dong
|
Cong Zhang
|
Hao Zhang
|
Weiwen Liu
|
Yasheng Wang
|
Ruiming Tang
|
Yong Liu
Large language models (LLMs) have shown remarkable emergent capabilities, transforming the execution of functional tasks by leveraging external tools for complex problems that require specialized processing or up-to-date data. While existing research expands LLMs’ access to diverse tools (e.g., program interpreters, search engines, calculators), the necessity of using these tools is often overlooked, leading to indiscriminate tool invocation. This naive approach raises two key issues: increased latency due to unnecessary tool calls, and potential errors resulting from faulty interactions with external tools. In this paper, we introduce meta-cognition as a proxy for LLMs’ self-assessment of their capabilities, reflecting the model’s awareness of its own limitations. Based on this, we propose MeCo, an adaptive decision-making strategy for external tool use. MeCo quantifies metacognitive scores by capturing high-level cognitive signals in the representation space, guiding when to invoke tools. Notably, MeCo is fine-tuning-free and incurs minimal cost. Experiments across multiple backbone models and benchmarks show that MeCo reliably detects LLMs’ internal cognitive signals and significantly improves tool-use decision-making.
pdf
bib
abs
MMLU-CF: A Contamination-free Multi-task Language Understanding Benchmark
Qihao Zhao
|
Yangyu Huang
|
Tengchao Lv
|
Lei Cui
|
Qinzheng Sun
|
Shaoguang Mao
|
Xin Zhang
|
Ying Xin
|
Qiufeng Yin
|
Scarlett Li
|
Furu Wei
Multiple-choice question (MCQ) datasets like Massive Multitask Language Understanding (MMLU) are widely used to evaluate the commonsense, understanding, and problem-solving abilities of large language models (LLMs). However, the open-source nature of these benchmarks and the broad sources of training data for LLMs have inevitably led to benchmark contamination, resulting in unreliable evaluation. To alleviate this issue, we propose the contamination-free MCQ benchmark called MMLU-CF, which reassesses LLMs’ understanding of world knowledge by averting both unintentional and malicious data contamination. To mitigate unintentional data contamination, we source questions from a broader domain of over 200 billion webpages and apply three specifically designed decontamination rules. To prevent malicious data contamination, we divide the benchmark into validation and test sets with similar difficulty and subject distributions. The test set remains closed-source to ensure reliable results, while the validation set is publicly available to promote transparency and facilitate independent evaluation. The gap in LLMs’ performance between these two sets will indicate the degree of contamination of the validation set in the future. We evaluated over 40 mainstream LLMs on MMLU-CF. Compared to the original MMLU, not only did LLMs’ performance drop significantly, but their performance rankings also changed considerably. This indicates the effectiveness of our approach in establishing a contamination-free and fairer evaluation standard.
pdf
bib
abs
Code-Switching Red-Teaming: LLM Evaluation for Safety and Multilingual Understanding
Haneul Yoo
|
Yongjin Yang
|
Hwaran Lee
As large language models (LLMs) have advanced rapidly, concerns regarding their safety have become prominent. In this paper, we discover that code-switching, a common practice in natural language, can effectively elicit undesirable behaviors of LLMs when used in red-teaming queries. We introduce a simple yet effective framework, CSRT, to synthesize code-switching red-teaming queries and investigate the safety and multilingual understanding of LLMs comprehensively. Through extensive experiments with ten state-of-the-art LLMs and code-switching queries combining up to 10 languages, we demonstrate that CSRT significantly outperforms existing multilingual red-teaming techniques, achieving 46.7% more attacks than standard attacks in English and being effective in conventional safety domains. We also examine the multilingual ability of those LLMs to generate and understand code-switching texts. Additionally, we validate the extensibility of CSRT by generating code-switching attack prompts with monolingual data. We finally conduct detailed ablation studies exploring code-switching and propound an unintended correlation between resource availability of languages and safety alignment in existing multilingual LLMs.
pdf
bib
abs
Unleashing LLM Reasoning Capability via Scalable Question Synthesis from Scratch
Yuyang Ding
|
Xinyu Shi
|
Xiaobo Liang
|
Juntao Li
|
Zhaopeng Tu
|
Qiaoming Zhu
|
Min Zhang
Improving the mathematical reasoning capabilities of Large Language Models (LLMs) is critical for advancing artificial intelligence. However, access to extensive, diverse, and high-quality reasoning datasets remains a significant challenge, particularly for the open-source community. In this paper, we propose ScaleQuest, a novel, scalable, and cost-effective data synthesis method that enables the generation of large-scale mathematical reasoning datasets using lightweight 7B-scale models. ScaleQuest introduces a two-stage question-tuning process comprising Question Fine-Tuning (QFT) and Question Preference Optimization (QPO) to unlock the question generation capabilities of problem-solving models. By generating diverse questions from scratch – without relying on powerful proprietary models or seed data – we produce a dataset of 1 million problem-solution pairs. Our experiments demonstrate that models trained on our data outperform existing open-source datasets in both in-domain and out-of-domain evaluations. Furthermore, our approach shows continued performance improvement as the volume of training data increases, highlighting its potential for ongoing data scaling. The extensive improvements observed in code reasoning tasks demonstrate the generalization capabilities of our proposed method. Our work provides the open-source community with a practical solution to enhance the mathematical reasoning abilities of LLMs.
pdf
bib
abs
DREsS: Dataset for Rubric-based Essay Scoring on EFL Writing
Haneul Yoo
|
Jieun Han
|
So-Yeon Ahn
|
Alice Oh
Automated essay scoring (AES) is a useful tool in English as a Foreign Language (EFL) writing education, offering real-time essay scores for students and instructors. However, previous AES models were trained on essays and scores irrelevant to the practical scenarios of EFL writing education and usually provided a single holistic score due to the lack of appropriate datasets. In this paper, we release DREsS, a large-scale, standard dataset for rubric-based automated essay scoring with 48.9K samples in total. DREsS comprises three sub-datasets: DREsS_New, DREsS_Std., and DREsS_CASE. We collect DREsS_New, a real-classroom dataset with 2.3K essays authored by EFL undergraduate students and scored by English education experts. We also standardize existing rubric-based essay scoring datasets as DREsS_Std. We suggest CASE, a corruption-based augmentation strategy for essays, which generates 40.1K synthetic samples of DREsS_CASE and improves the baseline results by 45.44%. DREsS will enable further research to provide a more accurate and practical AES system for EFL writing education.
pdf
bib
abs
PQR: Improving Dense Retrieval via Potential Query Modeling
Junfeng Kang
|
Rui Li
|
Qi Liu
|
Yanjiang Chen
|
Zheng Zhang
|
Junzhe Jiang
|
Heng Yu
|
Yu Su
Dense retrieval has now become the mainstream paradigm in information retrieval. The core idea of dense retrieval is to align document embeddings with their corresponding query embeddings by maximizing their dot product. The current training data is quite sparse, with each document typically associated with only one or a few labeled queries. However, a single document can be retrieved by multiple different queries. Aligning a document with just one or a limited number of labeled queries results in a loss of its semantic information. In this paper, we propose a training-free Potential Query Retrieval (PQR) framework to address this issue. Specifically, we use a Gaussian mixture distribution to model all potential queries for a document, aiming to capture its comprehensive semantic information. To obtain this distribution, we introduce three sampling strategies to sample a large number of potential queries for each document and encode them into a semantic space. Using these sampled queries, we employ the Expectation-Maximization algorithm to estimate parameters of the distribution. Finally, we also propose a method to calculate similarity scores between user queries and documents under the PQR framework. Extensive experiments demonstrate the effectiveness of the proposed method.
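The mixture-modeling step described in this abstract can be illustrated with a small sketch. It assumes that the sampled potential queries of one document are already encoded as vectors, fits a diagonal-covariance Gaussian mixture to them with Expectation-Maximization, and scores a user query by its log-likelihood under that mixture. The function names (`fit_gmm`, `score_query`), the diagonal covariance, and the log-likelihood scoring rule are assumptions made for illustration; the abstract does not specify the similarity function or the sampling strategies.

```python
import numpy as np


def _logsumexp(a, axis):
    """Numerically stable log(sum(exp(a))) along one axis."""
    m = a.max(axis=axis, keepdims=True)
    return (m + np.log(np.exp(a - m).sum(axis=axis, keepdims=True))).squeeze(axis)


def _component_log_probs(x, weights, means, variances):
    """Log of weight_k * N(x_i | mean_k, diag(var_k)) for every row i and component k."""
    diff = x[:, None, :] - means[None, :, :]                     # shape (n, k, d)
    log_pdf = -0.5 * (
        np.log(2 * np.pi * variances)[None] + diff ** 2 / variances[None]
    ).sum(axis=2)                                                # shape (n, k)
    return np.log(weights)[None, :] + log_pdf


def fit_gmm(x, k, iters=50, seed=0, eps=1e-6):
    """Fit a k-component diagonal-covariance GMM to the rows of x with EM.

    x holds the (hypothetical) embeddings of sampled potential queries.
    """
    rng = np.random.default_rng(seed)
    n, _ = x.shape
    means = x[rng.choice(n, size=k, replace=False)]              # init from data points
    variances = np.tile(x.var(axis=0) + eps, (k, 1))
    weights = np.full(k, 1.0 / k)
    for _ in range(iters):
        # E-step: responsibility of each component for each sampled query.
        log_p = _component_log_probs(x, weights, means, variances)
        resp = np.exp(log_p - _logsumexp(log_p, axis=1)[:, None])
        # M-step: re-estimate mixture weights, means, and variances.
        nk = resp.sum(axis=0) + eps
        weights = nk / nk.sum()
        means = (resp.T @ x) / nk[:, None]
        second_moment = (resp.T @ (x ** 2)) / nk[:, None]
        variances = np.maximum(second_moment - means ** 2, eps)
    return weights, means, variances


def score_query(query, gmm):
    """Assumed scoring rule: log-likelihood of the query embedding under the GMM."""
    weights, means, variances = gmm
    log_p = _component_log_probs(query[None, :], weights, means, variances)
    return float(_logsumexp(log_p, axis=1)[0])
```

A document whose mixture places mass near the user query receives a higher score than one whose potential queries lie elsewhere in the embedding space. Real retrieval embeddings have hundreds of dimensions, so the diagonal covariance here is a simplification chosen to keep the sketch short.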
pdf
bib
abs
Cross-Lingual Generalization and Compression: From Language-Specific to Shared Neurons
Frederick Riemenschneider
|
Anette Frank
Multilingual language models (MLLMs) have demonstrated remarkable abilities to transfer knowledge across languages, despite being trained without explicit cross-lingual supervision. We analyze the parameter spaces of three MLLMs to study how their representations evolve during pre-training, observing patterns consistent with compression: models initially form language-specific representations, which gradually converge into cross-lingual abstractions as training progresses. Through probing experiments, we observe a clear transition from uniform language identification capabilities across layers to more specialized layer functions. For deeper analysis, we focus on neurons that encode distinct semantic concepts. By tracing their development during pre-training, we show how they gradually align across languages. Notably, we identify specific neurons that emerge as increasingly reliable predictors for the same concepts across languages. This alignment manifests concretely in generation: once an MLLM exhibits cross-lingual generalization according to our measures, we can select concept-specific neurons identified from, e.g., Spanish text and manipulate them to guide token predictions. Remarkably, rather than generating Spanish text, the model produces semantically coherent English text. This demonstrates that cross-lingually aligned neurons encode generalized semantic representations, independent of the original language encoding.
pdf
bib
abs
SDBench: A Survey-based Domain-specific LLM Benchmarking and Optimization Framework
Cheng Guo
|
Hu Kai
|
Shuxian Liang
|
Yiyang Jiang
|
Yi Gao
|
Xian-Sheng Hua
|
Wei Dong
The rapid advancement of large language models (LLMs) in recent years has made it feasible to establish domain-specific LLMs for specialized fields. However, in practical development, acquiring domain-specific knowledge often requires a significant amount of professional expert manpower. Moreover, even when domain-specific data is available, the lack of a unified methodology for benchmark dataset establishment often results in uneven data distribution. This imbalance can lead to an inaccurate assessment of the true model capabilities during the evaluation of domain-specific LLMs. To address these challenges, we introduce **SDBench**, a generic framework for generating evaluation datasets for domain-specific LLMs. This method is also applicable for establishing the LLM instruction datasets. It significantly reduces the reliance on expert manpower while ensuring that the collected data is uniformly distributed. To validate the effectiveness of this framework, we also present the **BridgeBench**, a novel benchmark for bridge engineering knowledge, and the **BridgeGPT**, the first LLM specialized in bridge engineering, which can solve bridge engineering tasks.
pdf
bib
abs
ReflecTool: Towards Reflection-Aware Tool-Augmented Clinical Agents
Yusheng Liao
|
Shuyang Jiang
|
Yanfeng Wang
|
Yu Wang
Large Language Models (LLMs) have shown promising potential in the medical domain, assisting with tasks like clinical note generation and patient communication. However, current LLMs are limited to text-based communication, hindering their ability to interact with diverse forms of information in clinical environments. Although clinical agents succeed at interacting with diverse signals, they are oriented to a single clinical scenario and hence fail in broader applications. To evaluate clinical agents holistically, we propose ClinicalAgent Bench (CAB), a comprehensive medical agent benchmark consisting of 18 tasks across five key realistic clinical dimensions. Building on this, we introduce ReflecTool, a novel framework that excels at utilizing domain-specific tools within two stages. The first optimization stage progressively enlarges a long-term memory by saving successful solving processes and tool-wise experience of agents in a tiny pre-defined training set. In the following inference stage, ReflecTool can search for supportive successful demonstrations from already built long-term memory to guide the tool selection strategy, and a verifier improves the tool usage according to the tool-wise experience with two verification methods: iterative refinement and candidate selection. Extensive experiments on CAB demonstrate that ReflecTool surpasses the pure LLMs by more than 10 points and the well-established agent-based methods by 3 points, highlighting its adaptability and effectiveness in solving complex clinical tasks. Our code and datasets are available at https://github.com/BlueZeros/ReflecTool.
pdf
bib
abs
Lexical Recall or Logical Reasoning: Probing the Limits of Reasoning Abilities in Large Language Models
Henrike Beyer
|
Chris Reed
Despite the increasing interest in the reasoning abilities of Large Language Models (LLMs), existing work shows limitations in assessing logic abilities independently from lexical memory. We address this gap with Mystery-Zebra. This robust two-part benchmark (4,290 puzzles) challenges the logic abstraction abilities of LLMs in two setups: (1) a lexical obfuscation setup tests the dependence of LLMs on lexical content based on two canonical grid puzzles widely spread on the Internet; (2) a set of new grid puzzles in 42 different sizes and 12 difficulty levels tests how the formal difficulty degree of a puzzle affects LLMs. We test open and closed-weight LLMs on both parts of the benchmark. The results on part two suggest that model sizes up to 70B parameters have only a minor influence when solving newly generated puzzles, while performance mainly relates to the number of items in the puzzle. The results on the first part of the benchmark suggest that the applied obfuscation strategies help to mitigate effects of logic puzzles being part of LLM training data, showing a drastic drop in performance for obfuscated versions of well-known puzzles. In addition, we conduct a case study on the first part of the benchmark predicting the position of single items, unveiling that the reasoning abilities of LLMs are mainly limited to a few consecutive steps of reasoning.
pdf
bib
abs
ChainEdit: Propagating Ripple Effects in LLM Knowledge Editing through Logical Rule-Guided Chains
Zilu Dong
|
Xiangqing Shen
|
Zinong Yang
|
Rui Xia
Current knowledge editing methods for large language models (LLMs) struggle to maintain logical consistency when propagating ripple effects to associated facts. We propose ChainEdit, a framework that synergizes knowledge graph-derived logical rules with LLM logical reasoning capabilities to enable systematic chain updates. By automatically extracting logical patterns from structured knowledge bases and aligning them with LLMs’ internal logics, ChainEdit dynamically generates and edits logically connected knowledge clusters. Experiments demonstrate an improvement of more than 30% in logical generalization over baselines while preserving editing reliability and specificity. We further address evaluation biases in existing benchmarks through knowledge-aware protocols that disentangle external dependencies. This work establishes new state-of-the-art performance on ripple effect while ensuring internal logical consistency after knowledge editing.
pdf
bib
abs
HiDe-LLaVA: Hierarchical Decoupling for Continual Instruction Tuning of Multimodal Large Language Model
Haiyang Guo
|
Fanhu Zeng
|
Ziwei Xiang
|
Fei Zhu
|
Da-Han Wang
|
Xu-Yao Zhang
|
Cheng-Lin Liu
Instruction tuning is widely used to enhance a pre-trained Multimodal Large Language Model (MLLM) to understand and follow human instructions by training it on a curated set of task-specific datasets. However, it is infeasible to collect all possible instruction datasets simultaneously in real-world scenarios. Thus, enabling MLLMs with continual instruction tuning is essential for maintaining their adaptability. However, existing methods often trade off memory efficiency for performance gains, significantly compromising overall efficiency. In this paper, we propose a task-specific expansion and task-general fusion framework based on the variations in Centered Kernel Alignment (CKA) similarity across different model layers when trained on diverse datasets. Furthermore, we analyze the information leakage present in the existing benchmark and propose a new and more challenging benchmark to rationally evaluate the performance of different methods. Comprehensive experiments showcase a significant performance improvement of our method compared to existing state-of-the-art methods. Our code will be publicly available.
pdf
bib
abs
Self-supervised Quantized Representation for Seamlessly Integrating Knowledge Graphs with Large Language Models
Qika Lin
|
Tianzhe Zhao
|
Kai He
|
Zhen Peng
|
Fangzhi Xu
|
Ling Huang
|
Jingying Ma
|
Mengling Feng
Due to the natural gap between Knowledge Graph (KG) structures and natural language, the effective integration of holistic structural information of KGs with Large Language Models (LLMs) has emerged as a significant question. To this end, we propose a two-stage framework to learn and apply quantized codes for each entity, aiming for the seamless integration of KGs with LLMs. Firstly, a self-supervised quantized representation (SSQR) method is proposed to compress both KG structural and semantic knowledge into discrete codes (i.e., tokens) that align with the format of language sentences. We further design KG instruction-following data by viewing these learned codes as features to directly input to LLMs, thereby achieving seamless integration. Experimental results demonstrate that SSQR outperforms existing unsupervised quantized methods, producing more distinguishable codes. Moreover, the fine-tuned LLaMA2 and LLaMA3.1 also have superior performance on KG link prediction and triple classification tasks, utilizing only 16 tokens per entity instead of thousands in conventional prompting methods.
pdf
bib
abs
Finite State Automata Inside Transformers with Chain-of-Thought: A Mechanistic Study on State Tracking
Yifan Zhang
|
Wenyu Du
|
Dongming Jin
|
Jie Fu
|
Zhi Jin
Chain-of-thought (CoT) significantly enhances the performance of large language models (LLMs) across a wide range of tasks, and prior research shows that CoT can theoretically increase expressiveness. However, there is limited mechanistic understanding of the algorithms that Transformer+CoT can learn. Our key contributions are: (1) We evaluate the state tracking capabilities of Transformer+CoT and its variants, confirming the effectiveness of CoT. (2) Next, we identify the circuit (a subset of model components responsible for tracking the world state), indicating that late-layer MLP neurons play a key role. We propose two metrics, compression and distinction, and show that the neuron sets for each state achieve nearly 100% accuracy, providing evidence of an implicit finite state automaton (FSA) embedded within the model. (3) Additionally, we explore three challenging settings: skipping intermediate steps, introducing data noise, and testing length generalization. Our results demonstrate that Transformer+CoT learns robust algorithms (FSAs), highlighting its resilience in challenging scenarios. Our code is available at https://github.com/IvanChangPKU/FSA.
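To make the notion of state tracking concrete, the following sketch runs a small finite state automaton over an input sequence and records every intermediate state. The modular-addition automaton and the helper names (`make_cyclic_fsa`, `run_fsa`) are illustrative assumptions and are not taken from the paper; the trace plays the role of a chain of thought, while the final element alone corresponds to answering without intermediate steps.

```python
def make_cyclic_fsa(modulus):
    """Build the transition table of a hypothetical FSA that adds inputs modulo `modulus`."""
    return {
        (state, symbol): (state + symbol) % modulus
        for state in range(modulus)
        for symbol in range(modulus)
    }


def run_fsa(transitions, start, inputs):
    """Return the list of states visited, one entry per input symbol."""
    state = start
    trace = []
    for symbol in inputs:
        state = transitions[(state, symbol)]  # one deterministic transition per symbol
        trace.append(state)
    return trace


if __name__ == "__main__":
    fsa = make_cyclic_fsa(3)
    trace = run_fsa(fsa, start=0, inputs=[1, 2, 2, 1])
    print("intermediate states:", trace)   # analogous to chain-of-thought tokens
    print("final state:", trace[-1])       # analogous to a direct answer
```

Each transition depends only on the current state and the next symbol, which is why emitting intermediate states turns a long sequential computation into a series of single-step lookups.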
pdf
bib
abs
TeamLoRA: Boosting Low-Rank Adaptation with Expert Collaboration and Competition
Tianwei Lin
|
Jiang Liu
|
Wenqiao Zhang
|
Yang Dai
|
Haoyuan Li
|
Zhelun Yu
|
Wanggui He
|
Juncheng Li
|
Jiannan Guo
|
Hao Jiang
|
Siliang Tang
|
Yueting Zhuang
While Parameter-Efficient Fine-Tuning (PEFT) methods like Low-Rank Adaptation (LoRA) effectively address resource constraints during fine-tuning, their performance often falls short, especially in multidimensional task scenarios. To address this issue, one straightforward solution is to introduce task-specific LoRA as domain experts, leveraging the modeling of multiple capabilities of experts and thus enhancing the general capability of multi-task learning. Although promising, these additional components often add complexity to the training and inference process, contravening the efficiency that PEFT is designed to deliver. Considering this, we introduce an innovative PEFT method, **TeamLoRA**, consisting of a collaboration and competition module for LoRA experts, thus achieving the right balance of effectiveness and efficiency: **(i)** For *collaboration*, we introduce a novel knowledge sharing and organization mechanism designed to optimize hierarchical learning while enhancing the efficiency of model training and inference. **(ii)** For *competition*, we propose leveraging a game-theoretic interaction mechanism for experts, encouraging experts to transfer their domain-specific knowledge while facing diverse downstream tasks, thus enhancing the performance. By doing so, TeamLoRA elegantly connects the experts as a “*Team*” with internal collaboration and competition, enabling a faster and more accurate PEFT paradigm. Meanwhile, we curate a **Comprehensive Multi-Task Evaluation (CME)** benchmark to thoroughly assess the capability of multi-task learning. Experiments conducted on our CME and other benchmarks indicate the effectiveness and efficiency of TeamLoRA. Our project is available at https://github.com/DCDmllm/TeamLoRA.
pdf
bib
abs
CRiskEval: A Chinese Multi-Level Risk Evaluation Benchmark Dataset for Large Language Models
Ling Shi
|
Deyi Xiong
Large language models (LLMs) possess numerous beneficial capabilities, yet their potential inclination harbors unpredictable risks that may materialize in the future. We hence propose CRiskEval, a Chinese dataset meticulously designed for gauging the risk proclivities inherent in LLMs such as resource acquisition and malicious coordination, as part of efforts for proactive preparedness. To curate CRiskEval, we define a new risk taxonomy with 7 types of frontier risks and 4 safety levels, including extremely hazardous, moderately hazardous, neutral and safe. We follow the philosophy of tendency evaluation to empirically measure the stated “desire” of LLMs via fine-grained multiple-choice question answering. The dataset consists of 14,888 questions that simulate scenarios related to the 7 predefined types of frontier risks. Each question is accompanied by 4 answer choices that state opinions or behavioral tendencies corresponding to the question. All answer choices are manually annotated with one of the defined risk levels so that we can easily build a fine-grained frontier risk profile for each assessed LLM. Extensive evaluation with CRiskEval on a spectrum of prevalent Chinese LLMs has unveiled a striking revelation: most models exhibit risk tendencies of more than 40% (weighted tendency to the four risk levels). Furthermore, a subtle increase in the model’s inclination toward urgent self-sustainability, power seeking and other dangerous goals becomes evident as the size of models increases. To promote further research on the frontier risk evaluation of LLMs, we publicly release our dataset at https://github.com/tjunlp-lab/CRiskEval.
pdf
bib
abs
STUN: Structured-Then-Unstructured Pruning for Scalable MoE Pruning
Jaeseong Lee
|
Seung-won Hwang
|
Aurick Qiao
|
Daniel F Campos
|
Zhewei Yao
|
Yuxiong He
Mixture-of-experts (MoEs) have been adopted for reducing inference costs by sparsely activating experts in large language models (LLMs). Despite these reductions, the massive number of parameters in MoEs still makes them expensive to serve. Conventionally, unstructured or structured pruning has been considered to reduce the number of parameters. Our key contribution is exploring the interpolation between structured and unstructured pruning, to propose a novel structured-then-unstructured (STUN) approach that outperforms both structured and unstructured pruning, especially for MoEs. In the first stage, we show a scalable expert pruning with O(1) forward pass, unlike existing work requiring O(k^n/√n) forward passes for n experts that cannot scale for recent MoEs with hundreds of experts. We then show our expert-pruned MoEs are robust to the unstructured pruning that follows. Experiments on Snowflake Arctic and Mixtral show that our proposal is highly effective: for Snowflake Arctic, a 480B-sized MoE with 128 experts, our method needs only one H100 and two hours to achieve nearly no loss in performance with 40% sparsity, even in generative tasks such as GSM8K, where state-of-the-art structured or unstructured pruning methods fail. The code is publicly available.
pdf
bib
abs
Mimicking the Familiar: Dynamic Command Generation for Information Theft Attacks in LLM Tool-Learning System
Ziyou Jiang
|
Mingyang Li
|
Guowei Yang
|
Junjie Wang
|
Yuekai Huang
|
Zhiyuan Chang
|
Qing Wang
Information theft attacks pose a significant risk to Large Language Model (LLM) tool-learning systems. Adversaries can inject malicious commands through compromised tools, manipulating LLMs to send sensitive information to these tools, which leads to potential privacy breaches. However, existing attack approaches are black-box oriented and rely on static commands that cannot adapt flexibly to the changes in user queries and the invocation chain of tools. This makes malicious commands more likely to be detected by the LLM and leads to attack failure. In this paper, we propose AutoCMD, a dynamic attack command generation approach for information theft attacks in LLM tool-learning systems. Inspired by the concept of mimicking the familiar, AutoCMD is capable of inferring the information utilized by upstream tools in the toolchain through learning on open-source systems and reinforcement with target system examples, thereby generating more targeted commands for information theft. The evaluation results show that AutoCMD outperforms the baselines with +13.2% ASRTheft, and can be generalized to new tool-learning systems to expose their information leakage risks. We also design four defense methods to effectively protect tool-learning systems from the attack.
pdf
bib
abs
FlashAudio: Rectified Flow for Fast and High-Fidelity Text-to-Audio Generation
Huadai Liu
|
Jialei Wang
|
Rongjie Huang
|
Yang Liu
|
Heng Lu
|
Zhou Zhao
|
Wei Xue
Recent advancements in latent diffusion models (LDMs) have markedly enhanced text-to-audio generation, yet their iterative sampling processes impose substantial computational demands, limiting practical deployment. While recent methods utilizing consistency-based distillation aim to achieve few-step or single-step inference, their one-step performance is constrained by curved trajectories, preventing them from surpassing traditional diffusion models. In this work, we introduce FlashAudio with rectified flows to learn straight flow for fast simulation. To alleviate the inefficient timestep allocation and suboptimal distribution of noise, FlashAudio optimizes the time distribution of rectified flow with Bifocal Samplers and proposes immiscible flow to minimize the total distance of data-noise pairs in a batch via assignment. Furthermore, to address the amplified accumulation error caused by the classifier-free guidance (CFG), we propose Anchored Optimization, which refines the guidance scale by anchoring it to a reference trajectory. Experimental results on text-to-audio generation demonstrate that FlashAudio’s one-step generation performance surpasses the diffusion-based models with hundreds of sampling steps on audio quality and enables a sampling speed 400x faster than real-time on a single NVIDIA 4090Ti GPU. Code will be available at https://github.com/liuhuadai/FlashAudio. Audio samples are available at https://FlashAudio-TTA.github.io/.
pdf
bib
abs
How does Misinformation Affect Large Language Model Behaviors and Preferences?
Miao Peng
|
Nuo Chen
|
Jianheng Tang
|
Jia Li
Large Language Models (LLMs) have shown remarkable capabilities in knowledge-intensive tasks, while they remain vulnerable when encountering misinformation. Existing studies have explored the role of LLMs in combating misinformation, but there is still a lack of fine-grained analysis on the specific aspects and extent to which LLMs are influenced by misinformation. To bridge this gap, we present MisBench, currently the largest and most comprehensive benchmark for evaluating LLMs’ behavior and knowledge preference toward misinformation. MisBench consists of 10,346,712 pieces of misinformation, which uniquely considers both knowledge-based conflicts and stylistic variations in misinformation. Empirical results reveal that while LLMs demonstrate comparable abilities in discerning misinformation, they remain susceptible to knowledge conflicts and stylistic variations. Based on these findings, we further propose a novel approach called Reconstruct to Discriminate (RtD) to strengthen LLMs’ ability to detect misinformation. Our study provides valuable insights into LLMs’ interactions with misinformation, and we believe MisBench can serve as an effective benchmark for evaluating LLM-based detectors and enhancing their reliability in real-world applications. Code and data are available at: https://github.com/GKNL/MisBench.
pdf
bib
abs
YESciEval: Robust LLM-as-a-Judge for Scientific Question Answering
Jennifer D’Souza
|
Hamed Babaei Giglou
|
Quentin Münch
Large Language Models (LLMs) drive scientific question-answering on modern search engines, yet their evaluation robustness remains underexplored. We introduce YESciEval, an open-source framework that combines fine-grained rubric-based assessment with reinforcement learning to mitigate optimism bias in LLM evaluators. We release multidisciplinary science Q&A datasets, including adversarial variants, with evaluation scores from multiple LLMs. Independent of proprietary models and human feedback, our approach enables scalable, cost-free evaluation. By advancing reliable LLM-as-a-judge models, this work supports AI alignment and fosters robust, transparent evaluation essential for scientific inquiry.
pdf
bib
abs
GALLa: Graph Aligned Large Language Models for Improved Source Code Understanding
Ziyin Zhang
|
Hang Yu
|
Sage Lee
|
Peng Di
|
Jianguo Li
|
Rui Wang
Programming languages possess rich semantic information - such as data flow - that is represented by graphs and not available from the surface form of source code. Recent code language models have scaled to billions of parameters, but model source code solely as text tokens while ignoring any other structural information. Conversely, models that do encode structural information of code make modifications to the Transformer architecture, limiting their scale and compatibility with pretrained LLMs. In this work, we take the best of both worlds with GALLa - Graph Aligned Large Language Models. GALLa utilizes graph neural networks and cross-modal alignment technologies to inject the structural information of code into LLMs as an auxiliary task during finetuning. This framework is both model-agnostic and task-agnostic, as it can be applied to any code LLM for any code downstream task, and requires the structural graph data only at training time from a corpus unrelated to the finetuning data, while incurring no cost at inference time over the baseline LLM. Experiments on five code tasks with six different baseline LLMs ranging in size from 350M to 14B validate the effectiveness of GALLa, demonstrating consistent improvement over the baseline, even for powerful models such as LLaMA3 and Qwen2.5-Coder.
pdf
bib
abs
MEDDxAgent: A Unified Modular Agent Framework for Explainable Automatic Differential Diagnosis
Daniel Philip Rose
|
Chia-Chien Hung
|
Marco Lepri
|
Israa Alqassem
|
Kiril Gashteovski
|
Carolin Lawrence
Differential Diagnosis (DDx) is a fundamental yet complex aspect of clinical decision-making, in which physicians iteratively refine a ranked list of possible diseases based on symptoms, antecedents, and medical knowledge. While recent advances in large language models (LLMs) have shown promise in supporting DDx, existing approaches face key limitations, including single-dataset evaluations, isolated optimization of components, unrealistic assumptions about complete patient profiles, and single-attempt diagnosis. We introduce a Modular Explainable DDx Agent (MEDDxAgent) framework designed for interactive DDx, where diagnostic reasoning evolves through iterative learning, rather than assuming a complete patient profile is accessible. MEDDxAgent integrates three modular components: (1) an orchestrator (DDxDriver), (2) a history taking simulator, and (3) two specialized agents for knowledge retrieval and diagnosis strategy. To ensure robust evaluation, we introduce a comprehensive DDx benchmark covering respiratory, skin, and rare diseases. We analyze single-turn diagnostic approaches and demonstrate the importance of iterative refinement when patient profiles are not available at the outset. Our broad evaluation demonstrates that MEDDxAgent achieves over 10% accuracy improvements in interactive DDx across both large and small LLMs, while offering critical explainability into its diagnostic reasoning process.
pdf
bib
abs
A Training-free LLM-based Approach to General Chinese Character Error Correction
Houquan Zhou
|
Bo Zhang
|
Zhenghua Li
|
Ming Yan
|
Min Zhang
Chinese spelling correction (CSC) is a crucial task that aims to correct character errors in Chinese text. While conventional CSC focuses on character substitution errors caused by mistyping, two other common types of character errors, missing and redundant characters, have received less attention. These errors are often excluded from CSC datasets during the annotation process or ignored during evaluation, even when they have been annotated. This issue limits the practicality of the CSC task. To address this issue, we introduce the task of General Chinese Character Error Correction (C2EC), which focuses on all three types of character errors. We construct a high-quality C2EC benchmark by combining and manually verifying data from CCTC and Lemon datasets. We extend the training-free prompt-free CSC method to C2EC by using Levenshtein distance for handling length changes and leveraging an additional prompt-based large language model (LLM) to improve performance. Experiments show that our method enables a 14B-parameter LLM to be on par with models nearly 50 times larger on both conventional CSC and C2EC tasks, without any fine-tuning.
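The three error types named in this abstract map directly onto the three Levenshtein edit operations, which the following sketch makes explicit. It aligns an erroneous sentence with its correction by dynamic programming and labels each edit as a substitution, a redundant character (deleted from the source), or a missing character (inserted into the source). The function name `edit_ops` and the label strings are illustrative assumptions; this is not the authors' correction method, only the alignment idea behind handling length changes.

```python
def edit_ops(src: str, tgt: str):
    """Return the edits that turn `src` into `tgt` as (label, src_char, tgt_char) tuples."""
    n, m = len(src), len(tgt)
    # dp[i][j] is the Levenshtein distance between src[:i] and tgt[:j].
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dp[i][0] = i
    for j in range(m + 1):
        dp[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if src[i - 1] == tgt[j - 1] else 1
            dp[i][j] = min(
                dp[i - 1][j] + 1,         # delete a redundant source character
                dp[i][j - 1] + 1,         # insert a missing character
                dp[i - 1][j - 1] + cost,  # keep or substitute
            )
    # Backtrace from the bottom-right corner to recover one optimal edit script.
    ops = []
    i, j = n, m
    while i > 0 or j > 0:
        same = i > 0 and j > 0 and src[i - 1] == tgt[j - 1]
        if i > 0 and j > 0 and dp[i][j] == dp[i - 1][j - 1] + (0 if same else 1):
            if not same:
                ops.append(("substitute", src[i - 1], tgt[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and dp[i][j] == dp[i - 1][j] + 1:
            ops.append(("redundant", src[i - 1], ""))
            i -= 1
        else:
            ops.append(("missing", "", tgt[j - 1]))
            j -= 1
    ops.reverse()
    return ops


def levenshtein(src: str, tgt: str) -> int:
    """Edit distance, computed as the length of the recovered edit script."""
    return len(edit_ops(src, tgt))
```

Because Python strings are sequences of Unicode code points, the same functions apply to Chinese text character by character. When several optimal alignments exist, the backtrace prefers substitution over deletion over insertion, which is a tie-breaking choice made for this sketch.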
pdf
bib
abs
HSCR: Hierarchical Self-Contrastive Rewarding for Aligning Medical Vision Language Models
Songtao Jiang
|
Yan Zhang
|
Yeying Jin
|
Zhihang Tang
|
Yangyang Wu
|
Yang Feng
|
Jian Wu
|
Zuozhu Liu
Medical Vision-Language Models (Med-VLMs) have achieved success across various tasks, yet most existing methods overlook the modality misalignment issue that can lead to untrustworthy responses in clinical settings. In this paper, we propose Hierarchical Self-Contrastive Rewarding (HSCR), a novel approach that addresses two critical challenges in Med-VLM alignment: 1) Cost-effective generation of high-quality preference data; 2) Capturing nuanced and context-aware preferences for improved alignment. HSCR first leverages the inherent capability of Med-VLMs to generate dispreferred responses with higher sampling probability. By analyzing output logit shifts after visual token dropout, we identify modality-coupled tokens that induce misalignment and derive an implicit alignment reward function. This function guides token replacement with hallucinated ones during decoding, producing high-quality dispreferred data. Furthermore, HSCR introduces a multi-level preference optimization strategy, which extends beyond traditional adjacent-level optimization by incorporating nuanced implicit preferences, leveraging relative quality in dispreferred data to capture subtle alignment cues for more precise and context-aware optimization. Extensive experiments across multiple medical tasks, including Med-VQA, medical image captioning and instruction following, demonstrate that HSCR not only enhances zero-shot performance but also significantly improves modality alignment and trustworthiness with just 2,000 training entries. Code is released on https://github.com/jiangsongtao/HSCR.
pdf
bib
abs
MAmmoTH-VL: Eliciting Multimodal Reasoning with Instruction Tuning at Scale
Jiawei Guo
|
Tianyu Zheng
|
Yizhi Li
|
Yuelin Bai
|
Bo Li
|
Yubo Wang
|
King Zhu
|
Graham Neubig
|
Wenhu Chen
|
Xiang Yue
Open-source multimodal large language models (MLLMs) have shown significant potential in a broad range of tasks. However, their reasoning capabilities remain constrained by existing instruction-tuning datasets, which were predominantly repurposed from academic datasets such as VQA, AI2D, and ChartQA. These datasets target simplistic tasks, and only provide phrase-level answers without any intermediate rationales. To address these challenges, we introduce a scalable and cost-effective method to construct a large-scale multimodal instruction-tuning dataset with rich intermediate rationales designed to elicit CoT reasoning. Using only open models, we create a dataset containing 12M instruction-response pairs to cover diverse reasoning-intensive tasks. Experiments demonstrate that training MLLMs on our dataset not only significantly improves reasoning capabilities, achieving state-of-the-art performance on benchmarks such as MathVerse (+8.1%), MMMU-Pro (+7%), and MuirBench (+13.3%), but also yields improvements of up to 4% on non-reasoning-based benchmarks.
pdf
bib
abs
SIFT-50M: A Large-Scale Multilingual Dataset for Speech Instruction Fine-Tuning
Prabhat Pandey
|
Rupak Vignesh Swaminathan
|
K V Vijay Girish
|
Arunasish Sen
|
Jian. Xie
|
Grant Strimel
|
Andreas Schwarz
We introduce SIFT (Speech Instruction Fine-Tuning), a 50M-example dataset designed for instruction fine-tuning and pre-training of speech-text large language models (LLMs). SIFT-50M is built from publicly available speech corpora, which collectively contain 14K hours of speech, and leverages LLMs along with off-the-shelf expert models. The dataset spans five languages, encompassing a diverse range of speech understanding as well as controllable speech generation instructions. Using SIFT-50M, we train SIFT-LLM, which outperforms existing speech-text LLMs on instruction-following benchmarks while achieving competitive performance on foundational speech tasks. To support further research, we also introduce EvalSIFT, a benchmark dataset specifically designed to evaluate the instruction-following capabilities of speech-text LLMs.
pdf
bib
abs
Recent Advances in Speech Language Models: A Survey
Wenqian Cui
|
Dianzhi Yu
|
Xiaoqi Jiao
|
Ziqiao Meng
|
Guangyan Zhang
|
Qichao Wang
|
Steven Y. Guo
|
Irwin King
Text-based Large Language Models (LLMs) have recently gained significant attention, primarily for their capabilities in text-based interactions. However, natural human interaction often relies on speech, highlighting the need for voice-based models. In this context, Speech Language Models (SpeechLMs)—foundation models designed to understand and generate speech—emerge as a promising solution for end-to-end speech interaction. This survey offers a comprehensive overview of recent approaches to building SpeechLMs, outlining their core architectural components, training methodologies, evaluation strategies, and the challenges and potential directions for future research in this rapidly advancing field. The GitHub repository is available at https://github.com/dreamtheater123/Awesome-SpeechLM-Survey
pdf
bib
abs
LexCLiPR: Cross-Lingual Paragraph Retrieval from Legal Judgments
Rohit Upadhya
|
Santosh T.y.s.s
Efficient retrieval of pinpointed information from case law is crucial for legal professionals but challenging due to the length and complexity of legal judgments. Existing works mostly focus on retrieving entire cases rather than precise, paragraph-level information. Moreover, although multilingual legal practice necessitates cross-lingual retrieval, most works have been limited to monolingual settings. To address these gaps, we introduce LexCLiPR, a cross-lingual dataset for paragraph-level retrieval from European Court of Human Rights (ECtHR) judgments, leveraging multilingual case law guides and distant supervision to curate our dataset. We evaluate retrieval models in a zero-shot setting, revealing the limitations of pre-trained multilingual models for cross-lingual tasks in low-resource languages and the importance of retrieval-based post-training strategies. In fine-tuning settings, we observe that two-tower models excel in cross-lingual retrieval, while siamese architectures are better suited for monolingual tasks. Fine-tuning multilingual models on native language queries improves performance, but the models struggle to generalize to unseen legal concepts, highlighting the need for robust strategies to address topical distribution shifts in legal queries.
pdf
bib
abs
Multi-task Adversarial Attacks against Black-box Model with Few-shot Queries
Wenqiang Wang
|
Yan Xiao
|
Hao Lin
|
Yangshijie Zhang
|
Xiaochun Cao
Current multi-task adversarial text attacks rely on abundant access to shared internal features and numerous queries, and are often limited to a single task type. As a result, these attacks are less effective in practical scenarios involving black-box feedback APIs, limited queries, or multiple task types. To bridge this gap, we propose Cluster and Ensemble Multi-task Text Adversarial Attack (CEMA), an effective black-box attack that exploits the transferability of adversarial texts across different tasks. CEMA simplifies complex multi-task scenarios by using a deep-level substitute model trained in a plug-and-play manner for text classification, enabling attacks without mimicking the victim model. This approach requires only a few queries for training, converting multi-task attacks into classification attacks and allowing attacks across various tasks. CEMA generates multiple adversarial candidates using different text classification methods and selects the one that most effectively attacks substitute models. In experiments involving multi-task models with two, three, or six tasks (spanning classification, translation, summarization, and text-to-image generation), CEMA demonstrates significant attack success with as few as 100 queries. Furthermore, CEMA can target commercial APIs (e.g., Baidu and Google Translate), large language models (e.g., ChatGPT 4o), and image-generation models (e.g., Stable Diffusion V2), showcasing its versatility and effectiveness in real-world applications.
pdf
bib
abs
SPECTRA: Faster Large Language Model Inference with Optimized Internal and External Speculation
Nguyen-Khang Le
|
Truong Dinh Do
|
Le-Minh Nguyen
Inference with modern Large Language Models (LLMs) is both computationally expensive and time-consuming. Speculative decoding has emerged as a promising solution, but existing approaches face key limitations: training-based methods require a draft model that is challenging to obtain and lacks generalizability, while training-free methods offer limited speedup gains. In this work, we present Spectra, a novel framework for accelerating LLM inference without the need for additional training or modification to the original LLM. Spectra introduces two new techniques for efficiently utilizing internal and external speculation, each outperforming corresponding state-of-the-art (SOTA) methods independently. When combined, these techniques achieve up to a 4.08x speedup across various benchmarks and LLM architectures, significantly surpassing existing training-free approaches. The implementation of Spectra is publicly available.
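For readers unfamiliar with speculative decoding, the sketch below illustrates the generic draft-then-verify idea with greedy verification. It is not Spectra's algorithm: the function names, the toy target model, and the token lists are all hypothetical, and a real system would verify every draft position in one batched forward pass rather than one call per token.

```python
from typing import Callable, List, Sequence


def verify_draft(
    prefix: Sequence[str],
    draft_tokens: Sequence[str],
    target_next_token: Callable[[List[str]], str],
) -> List[str]:
    """Accept draft tokens while they match the target model's greedy choice.

    On the first mismatch, the target model's own token is emitted instead and
    the remaining draft tokens are discarded.
    """
    context = list(prefix)
    accepted: List[str] = []
    for token in draft_tokens:
        expected = target_next_token(context)
        if token != expected:
            accepted.append(expected)
            return accepted
        accepted.append(token)
        context.append(token)
    return accepted


# Hypothetical target model: always continues one fixed sentence.
SENTENCE = "the cat sat on the mat".split()


def toy_target(context: List[str]) -> str:
    return SENTENCE[len(context)]


print(verify_draft(["the"], ["cat", "sat", "in"], toy_target))
```

Every accepted draft token saves one sequential decoding step of the target model, which is where the speedup comes from; the output still matches what greedy decoding with the target model alone would produce.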
pdf
bib
abs
Multi-level Association Refinement Network for Dialogue Aspect-based Sentiment Quadruple Analysis
Zeliang Tong
|
Wei Wei
|
Xiaoye Qu
|
Rikui Huang
|
Zhixin Chen
|
Xingyu Yan
Dialogue Aspect-based Sentiment Quadruple (DiaASQ) analysis aims to identify all quadruples (i.e., target, aspect, opinion, sentiment) from the dialogue. This task is challenging as different elements within a quadruple may manifest in different utterances, requiring precise handling of associations at both the utterance and word levels. However, most existing methods tackling it predominantly leverage predefined dialogue structure (e.g., reply) and word semantics, resulting in a superficial understanding of the deep sentiment association between utterances and words. In this paper, we propose a novel Multi-level Association Refinement Network (MARN) designed to achieve more accurate and comprehensive sentiment associations between utterances and words. Specifically, for utterances, we dynamically capture their associations with enriched semantic features through a holistic understanding of the dialogue, aligning them more closely with sentiment associations within elements in quadruples. For words, we develop a novel cross-utterance syntax parser (CU-Parser) that fully exploits syntactic information to enhance the association between word pairs within and across utterances. Moreover, to address the scarcity of labeled data in DiaASQ, we further introduce a multi-view data augmentation strategy to enhance the performance of MARN under low-resource conditions. Experimental results demonstrate that MARN achieves state-of-the-art performance and maintains robustness even under low-resource conditions.
pdf
bib
abs
Innovative Image Fraud Detection with Cross-Sample Anomaly Analysis: The Power of LLMs
QiWen Wang
|
Junqi Yang
|
Zhenghao Lin
|
Zhenzhe Ying
|
Weiqiang Wang
|
Chen Lin
The financial industry faces a substantial workload in verifying document images. Existing methods based on visual features struggle to identify fraudulent document images due to the lack of visual clues on the tampering region. This paper proposes CSIAD (Cross-Sample Image Anomaly Detection) by leveraging LLMs to identify logical inconsistencies in similar images. This novel framework accurately detects forged images with slight tampering traces and explains anomaly detection results. Furthermore, we introduce CrossCred, a new benchmark of real-world fraudulent images with fine-grained manual annotations. Experiments demonstrate that CSIAD outperforms state-of-the-art image fraud detection methods by 79.6% (F1) on CrossCred and deployed industrial solutions by 21.7% (F1) on business data. The benchmark is available at https://github.com/XMUDM/CSIAD.
pdf
bib
abs
Cooperative or Competitive? Understanding the Interaction between Attention Heads From A Game Theory Perspective
Xiaoye Qu
|
Zengqi Yu
|
Dongrui Liu
|
Wei Wei
|
Daizong Liu
|
Jianfeng Dong
|
Yu Cheng
Despite the remarkable success of attention-based large language models (LLMs), the precise interaction mechanisms between attention heads remain poorly understood. In contrast to prevalent methods that focus on individual head contributions, we rigorously analyze the intricate interplay among attention heads through a novel framework based on the Harsanyi dividend, a concept from cooperative game theory. Our analysis reveals that significant positive Harsanyi dividends are sparsely distributed across head combinations, indicating that most heads do not contribute cooperatively. Moreover, certain head combinations exhibit negative dividends, indicating implicit competitive relationships. To further optimize the interactions among attention heads, we propose a training-free Game-theoretic Attention Calibration (GAC) method. Specifically, GAC selectively retains heads demonstrating significant cooperative gains and applies fine-grained distributional adjustments to the remaining heads. Comprehensive experiments across 17 benchmarks demonstrate the effectiveness of our proposed GAC and its superior generalization capabilities across diverse model families, scales, and modalities. Crucially, the discovered interaction phenomena offer a path toward a deeper understanding of the behaviors of LLMs.
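The Harsanyi dividend named above has a compact standard definition: for a coalition S and value function v, the dividend is the sum over all subsets T of S of (-1)^(|S|-|T|) v(T). The sketch below computes it for a toy two-player game. The value table and the "head" labels are hypothetical and stand in for whatever head-level value function the paper uses, which the abstract does not specify.

```python
from itertools import combinations


def subsets(players):
    """Yield every subset of `players` as a frozenset, including the empty set."""
    items = tuple(players)
    for size in range(len(items) + 1):
        for combo in combinations(items, size):
            yield frozenset(combo)


def harsanyi_dividend(value, coalition):
    """Sum over subsets T of the coalition S of (-1)^(|S|-|T|) * value(T)."""
    coalition = frozenset(coalition)
    return sum(
        (-1) ** (len(coalition) - len(sub)) * value(sub)
        for sub in subsets(coalition)
    )


# Hypothetical value function over two "heads", purely for illustration.
TOY_VALUES = {
    frozenset(): 0.0,
    frozenset({"h0"}): 1.0,
    frozenset({"h1"}): 2.0,
    frozenset({"h0", "h1"}): 2.5,
}


def toy_value(coalition):
    return TOY_VALUES[frozenset(coalition)]


print(harsanyi_dividend(toy_value, {"h0", "h1"}))
```

In this toy game the pair is worth less together than the sum of its parts, so its dividend is negative, which is the kind of competitive relationship the abstract describes. The dividends of all subsets of a coalition always sum back to the coalition's value.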
pdf
bib
abs
MM-Verify: Enhancing Multimodal Reasoning with Chain-of-Thought Verification
Linzhuang Sun
|
Hao Liang
|
Jingxuan Wei
|
Bihui Yu
|
Tianpeng Li
|
Fan Yang
|
Zenan Zhou
|
Wentao Zhang
Work on Test-Time Scaling has shown that integrating External Slow-Thinking with the Verify mechanism enhances multi-round reasoning in large language models (LLMs). However, in the multimodal (MM) domain, there is still a lack of a strong MM-Verifier. In this paper, we introduce MM-Verifier and MM-Reasoner to enhance multimodal reasoning through longer inference and more robust verification. First, we propose a two-step MM verification data synthesis method, which combines a simulation-based tree search with verification and uses rejection sampling to generate high-quality Chain-of-Thought (COT) data. This data is then used to fine-tune the verification model, MM-Verifier. Additionally, we present a more efficient method for synthesizing MMCOT data, bridging the gap between text-based and multimodal reasoning. The synthesized data is used to fine-tune MM-Reasoner. Our MM-Verifier outperforms all larger models on the MathCheck, MathVista, and MathVerse benchmarks. Moreover, MM-Reasoner demonstrates strong effectiveness and scalability, with performance improving as data size increases. Finally, our approach achieves strong performance when combining MM-Reasoner and MM-Verifier, reaching an accuracy of 65.3 on MathVista with 12 rollouts and surpassing GPT-4o (63.8).
pdf
bib
abs
Graph-Structured Trajectory Extraction from Travelogues
Aitaro Yamamoto
|
Hiroyuki Otomo
|
Hiroki Ouchi
|
Shohei Higashiyama
|
Hiroki Teranishi
|
Hiroyuki Shindo
|
Taro Watanabe
Human traveling trajectories play a central role in characterizing each travelogue, and automatic trajectory extraction from travelogues is highly desired for tourism services, such as travel planning and recommendation. This work addresses the extraction of human traveling trajectories from travelogues. Previous work treated each trajectory as a sequence of visited locations, although locations with different granularity levels, e.g., “Kyoto City” and “Kyoto Station,” should not be lined up in a sequence. In this work, we propose to represent the trajectory as a graph that can capture the hierarchy as well as the visiting order, and construct a benchmark dataset for the trajectory extraction. The experiments using this dataset show that even naive baseline systems can accurately predict visited locations and the visiting order between them, while it is more challenging to predict the hierarchical relations.
pdf
bib
abs
Learning First-Order Logic Rules for Argumentation Mining
Yang Sun
|
Guanrong Chen
|
Hamid Alinejad-Rokny
|
Jianzhu Bao
|
Yuqi Huang
|
Bin Liang
|
Kam-Fai Wong
|
Min Yang
|
Ruifeng Xu
Argumentation Mining (AM) aims to extract argumentative structures from texts by identifying argumentation components (ACs) and their argumentative relations (ARs). While previous works focus on representation learning to encode ACs and AC pairs, they fail to explicitly model the underlying reasoning patterns of AM, resulting in limited interpretability. This paper proposes a novel First-Order Logic reasoning framework for AM (FOL-AM), designed to explicitly capture logical reasoning paths within argumentative texts. By interpreting multiple AM subtasks as a unified relation query task modeled using FOL rules, FOL-AM facilitates multi-hop relational reasoning and enhances interpretability. The framework supports two flexible implementations: a fine-tuned approach to leverage task-specific learning, and a prompt-based method utilizing large language models to harness their generalization capabilities. Extensive experiments on two AM benchmarks demonstrate that FOL-AM outperforms strong baselines while significantly improving explainability.
pdf
bib
abs
Investigating and Enhancing the Robustness of Large Multimodal Models Against Temporal Inconsistency
Jiafeng Liang
|
Shixin Jiang
|
Xuan Dong
|
Ning Wang
|
Zheng Chu
|
Hui Su
|
Jinlan Fu
|
Ming Liu
|
See-Kiong Ng
|
Bing Qin
Large Multimodal Models (LMMs) have recently demonstrated impressive performance on general video comprehension benchmarks. Nevertheless, for broader applications, the robustness of their temporal analysis capability needs to be thoroughly investigated, yet it has been predominantly ignored. Motivated by this, we propose a novel temporal robustness benchmark (TemRobBench), which introduces temporal inconsistency perturbations separately at the visual and textual modalities to assess the robustness of models. We evaluate 16 mainstream LMMs and find that they exhibit over-reliance on prior knowledge and textual context in adversarial environments, while ignoring the actual temporal dynamics in the video. To mitigate this issue, we design panoramic direct preference optimization (PanoDPO), which encourages LMMs to incorporate both visual and linguistic feature preferences simultaneously. Experimental results show that PanoDPO can effectively enhance the model’s robustness and reliability in temporal analysis.
pdf
bib
abs
UniRAG: Unified Query Understanding Method for Retrieval Augmented Generation
Rui Li
|
Liyang He
|
Qi Liu
|
Zheng Zhang
|
Heng Yu
|
Yuyang Ye
|
Linbo Zhu
|
Yu Su
Retrieval-Augmented Generation (RAG) technology effectively addresses the issues of knowledge update lag and hallucinations in large language models (LLMs) by integrating internal and external knowledge. Existing query augmentation methods improve RAG’s performance in handling complex queries but face two key challenges: (1) the separation of query augmentation and encoding tasks, which hinders information sharing and introduces cumulative errors, and (2) the difficulty of selecting the optimal augmentation strategy for different scenarios. In this work, we propose UniRAG, a unified framework for query understanding in RAG. UniRAG employs a decoder-only LLM to jointly perform query augmentation and encoding, eliminating task separation. To facilitate adaptive query augmentation, we categorize existing techniques into query paraphrasing, query expansion, and query abstraction. Our model learns to select the optimal augmentation strategy based on user queries, leveraging retrieval and generation outputs as feedback. Experimental results show that UniRAG significantly outperforms traditional query augmentation methods in five knowledge-intensive benchmark tasks in both closed and open domain question answering.
pdf
bib
abs
Contextual Experience Replay for Self-Improvement of Language Agents
Yitao Liu
|
Chenglei Si
|
Karthik R Narasimhan
|
Shunyu Yao
Large language model (LLM) agents have been applied to sequential decision-making tasks such as web navigation, but without any environment-specific experiences, they often fail in these complex tasks. Moreover, current LLM agents are not designed to continually learn from past experiences during inference time, which could be crucial for them to gain these environment-specific experiences. To address this, we propose Contextual Experience Replay (CER), a training-free framework to enable efficient self-improvement for language agents in their context window. Specifically, CER accumulates and synthesizes past experiences into a dynamic memory buffer. These experiences encompass environment dynamics and common decision-making patterns, allowing the agents to retrieve and augment themselves with relevant knowledge in new tasks, enhancing their adaptability in complex environments. We evaluate CER on the challenging WebArena and VisualWebArena benchmarks. On VisualWebArena, CER surpasses the tree search method at a much lower token cost and achieves state-of-the-art performance of 31.9%. On WebArena, CER also achieves a competitive average success rate of 36.7%, a relative improvement of 51.0% over the GPT-4o agent baseline. We also conduct a comprehensive analysis of CER to demonstrate its efficiency and validity and to understand it better.
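The WebArena figures above can be unpacked with a little arithmetic. The sketch below assumes the usual definition of relative improvement, (new - baseline) / baseline, which the abstract does not spell out; the function name is illustrative and the baseline it prints is derived from the two quoted numbers rather than reported in the abstract.

```python
def implied_baseline(new_rate: float, relative_gain_pct: float) -> float:
    """Baseline rate implied by a new rate and a relative gain, both in percent."""
    if relative_gain_pct <= -100.0:
        raise ValueError("relative gain must be greater than -100%")
    return new_rate / (1.0 + relative_gain_pct / 100.0)


# Figures quoted in the abstract: 36.7% success rate, 51.0% relative improvement.
baseline = implied_baseline(36.7, 51.0)
absolute_gain = 36.7 - baseline
print(f"implied baseline: {baseline:.1f}%, absolute gain: {absolute_gain:.1f} points")
```

The distinction matters when comparing papers: a large relative gain over a low baseline can correspond to a much smaller gain in absolute percentage points.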
pdf
bib
abs
Emma-X: An Embodied Multimodal Action Model with Grounded Chain of Thought and Look-ahead Spatial Reasoning
Qi Sun
|
Pengfei Hong
|
Tej Deep Pala
|
Vernon Toh
|
U-Xuan Tan
|
Deepanway Ghosal
|
Soujanya Poria
Traditional reinforcement learning-based robotic control methods are often task-specific and fail to generalize across diverse environments or unseen objects and instructions. Visual Language Models (VLMs) demonstrate strong scene understanding and planning capabilities but lack the ability to generate actionable policies tailored to specific robotic embodiments. To address this, Visual-Language-Action (VLA) models have emerged, yet they face challenges in long-horizon spatial reasoning and grounded task planning. In this work, we propose the Embodied Multimodal Action Model with Grounded Chain of Thought and Look-ahead Spatial Reasoning, EMMA-X. EMMA-X leverages our constructed hierarchical embodiment dataset based on BridgeV2, containing 60,000 robot manipulation trajectories auto-annotated with grounded task reasoning and spatial guidance. Additionally, we introduce a trajectory segmentation strategy based on gripper states and motion trajectories, which can help mitigate hallucination in grounding subtask reasoning generation. Experimental results demonstrate that EMMA-X achieves superior performance over competitive baselines, particularly in real-world robotic tasks requiring spatial reasoning.
pdf
bib
abs
Towards Comprehensive Argument Analysis in Education: Dataset, Tasks, and Method
Yupei Ren
|
Xinyi Zhou
|
Ning Zhang
|
Shangqing Zhao
|
Man Lan
|
Xiaopeng Bai
Argument mining has garnered increasing attention over the years, with the recent advancement of Large Language Models (LLMs) further propelling this trend. However, current argument relations remain relatively simplistic and foundational, struggling to capture the full scope of argument information. To address this limitation, we propose a systematic framework comprising 14 fine-grained relation types from the perspectives of vertical argument relations and horizontal discourse relations, thereby capturing the intricate interplay between argument components for a thorough understanding of argument structure. On this basis, we conducted extensive experiments on three tasks: argument component prediction, relation prediction, and automated essay grading. Additionally, we explored the impact of writing quality on argument component prediction and relation prediction, as well as the connections between discourse relations and argumentative features. The findings highlight the importance of fine-grained argumentative annotations for argumentative writing assessment and encourage multi-dimensional argument analysis.
pdf
bib
abs
Browsing Like Human: A Multimodal Web Agent with Experiential Fast-and-Slow Thinking
Haohao Luo
|
Jiayi Kuang
|
Wei Liu
|
Ying Shen
|
Jian Luan
|
Yang Deng
Automating web navigation, which aims to build a web agent that follows user instructions to complete tasks like booking flights by interacting with websites, has received increasing attention due to its practical value. Although existing web agents are mostly equipped with visual perception, planning, and memory abilities, their reasoning processes still deviate from human cognition. In this work, we study the human thought pattern to empower agents with more human-like abilities in web navigation. To tackle this problem, we propose a novel multimodal web agent framework called WebExperT, which is designed to emulate the human planning process of “thinking fast and slow” to effectively decompose complex user instructions. Furthermore, WebExperT leverages experiential learning by reflecting on failures to continuously refine planning and decision-making outcomes. Experimental results on the Mind2Web benchmark demonstrate the superiority of WebExperT in both supervised and unsupervised settings.
pdf
bib
abs
MaXIFE: Multilingual and Cross-lingual Instruction Following Evaluation
Yile Liu
|
Ziwei Ma
|
Xiu Jiang
|
Jinglu Hu
|
ChangJing ChangJing
|
Liang Li
With the rapid adoption of large language models (LLMs) in natural language processing, the ability to follow instructions has emerged as a key metric for evaluating their practical utility. However, existing evaluation methods often focus on single-language scenarios, overlooking the challenges and differences present in multilingual and cross-lingual contexts. To address this gap, we introduce MaXIFE: a comprehensive evaluation benchmark designed to assess instruction-following capabilities across 23 different languages with 1667 verifiable instruction tasks. MaXIFE integrates both Rule-Based Evaluation and Model-Based Evaluation, ensuring a balance of efficiency and accuracy. We applied MaXIFE to evaluate several leading commercial LLMs, establishing baseline results for future comparisons. By providing a standardized tool for multilingual instruction-following evaluation, MaXIFE aims to advance research and development in natural language processing.
pdf
bib
abs
Linguistic Generalizability of Test-Time Scaling in Mathematical Reasoning
Guijin Son
|
Jiwoo Hong
|
Hyunwoo Ko
|
James Thorne
Scaling pre-training compute has proven effective for achieving multilinguality, but does the same hold for test-time scaling? In this work, we introduce MCLM, a multilingual math benchmark featuring competition-level problems in 55 languages. We then compare three test-time scaling methods: Outcome Reward Modeling, Process Reward Modeling, and Budget Forcing. Our findings indicate that although “thinking LLMs” have recently garnered significant attention, their performance is comparable to traditional scaling methods like best-of-N once constrained to similar levels of inference FLOPs. More importantly, all tested methods fail to generalize robustly across languages, achieving only modest gains that are smaller than those observed in English, with no improvements in variance or consistency. To foster further research, we release MCLM and MR1-1.5B (a multilingual LLM with reasoning capabilities) and our evaluation results.
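Best-of-N, the traditional scaling method referred to above, is simple to state: sample N candidate answers and keep the one an outcome scorer rates highest. The sketch below shows only that selection step. The candidate list and the scoring function are hypothetical stand-ins for sampled model outputs and a learned outcome reward model; nothing here reproduces the paper's setup.

```python
from typing import Callable, Sequence


def best_of_n(candidates: Sequence[str], score: Callable[[str], float]) -> str:
    """Return the candidate with the highest score (first one wins on ties)."""
    if not candidates:
        raise ValueError("need at least one candidate")
    return max(candidates, key=score)


# Hypothetical scorer: prefers answers numerically closest to 15.
def toy_score(answer: str) -> float:
    return -abs(int(answer) - 15)


sampled_answers = ["12", "15", "9", "15"]
print(best_of_n(sampled_answers, toy_score))
```

Inference cost grows roughly linearly with N, which is why the abstract compares methods at similar levels of inference FLOPs rather than at a fixed N.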
pdf
bib
abs
Can MLLMs Understand the Deep Implication Behind Chinese Images?
Chenhao Zhang
|
Xi Feng
|
Yuelin Bai
|
Xeron Du
|
Jinchang Hou
|
Kaixin Deng
|
Guangzeng Han
|
Qinrui Li
|
Bingli Wang
|
Jiaheng Liu
|
Xingwei Qu
|
Yifei Zhang
|
Qixuan Zhao
|
Yiming Liang
|
Ziqiang Liu
|
Feiteng Fang
|
Min Yang
|
Wenhao Huang
|
Chenghua Lin
|
Ge Zhang
|
Shiwen Ni
As the capabilities of Multimodal Large Language Models (MLLMs) improve, the need for higher-order evaluation of them is increasing. However, there is a lack of work evaluating MLLMs for higher-order perception and understanding of Chinese visual content. To address this, we introduce CII-Bench, which aims to assess these capabilities of MLLMs on Chinese images. To ensure the authenticity of the Chinese context, images in CII-Bench are sourced from the Chinese Internet and manually reviewed, with corresponding answers also manually crafted. Additionally, CII-Bench incorporates images that represent Chinese traditional culture, such as famous Chinese traditional paintings, which can deeply reflect the model’s understanding of Chinese traditional culture. Through experiments on multiple MLLMs using CII-Bench, significant findings emerged. There is a large gap between MLLMs and humans in performance. The highest MLLM accuracy is 64.4%, while the human average is 78.2% and the peak is 81.0%. MLLMs perform poorly on traditional culture images, indicating limitations in understanding high-level semantics and the lack of a deep knowledge base of Chinese traditional culture. Moreover, most models have higher accuracy when image emotion hints are added to the prompts. We believe CII-Bench will help MLLMs better understand Chinese semantics and specific images, and advance the development of expert artificial general intelligence (AGI). Our project is publicly available at https://cii-bench.github.io.
pdf
bib
abs
KazMMLU: Evaluating Language Models on Kazakh, Russian, and Regional Knowledge of Kazakhstan
Mukhammed Togmanov
|
Nurdaulet Mukhituly
|
Diana Turmakhan
|
Jonibek Mansurov
|
Maiya Goloburda
|
Akhmed Sakip
|
Zhuohan Xie
|
Yuxia Wang
|
Bekassyl Syzdykov
|
Nurkhan Laiyk
|
Alham Fikri Aji
|
Ekaterina Kochmar
|
Preslav Nakov
|
Fajri Koto
Despite having a population of twenty million, Kazakhstan’s culture and language remain underrepresented in the field of natural language processing. Although large language models (LLMs) continue to advance worldwide, progress in the Kazakh language has been limited, as seen in the scarcity of dedicated models and benchmark evaluations. To address this gap, we introduce KazMMLU, the first MMLU-style dataset specifically designed for the Kazakh language. KazMMLU comprises 23,000 questions that cover various educational levels, including STEM, humanities, and social sciences, sourced from authentic educational materials and manually validated by native speakers and educators. The dataset includes 10,969 Kazakh questions and 12,031 Russian questions, reflecting Kazakhstan’s bilingual education system and rich local context. Our evaluation of several state-of-the-art multilingual models (Llama3.1, Qwen-2.5, GPT-4, and DeepSeek V3) demonstrates substantial room for improvement, as even the best-performing models struggle to achieve competitive performance in Kazakh and Russian. These findings highlight significant performance gaps compared to high-resource languages. We hope that our dataset will enable further research and development of Kazakh-centric LLMs.
pdf
bib
abs
Towards Multi-dimensional Evaluation of LLM Summarization across Domains and Languages
Hyangsuk Min
|
Yuho Lee
|
Minjeong Ban
|
Jiaqi Deng
|
Nicole Hee-Yeon Kim
|
Taewon Yun
|
Hang Su
|
Jason Cai
|
Hwanjun Song
Evaluation frameworks for text summarization have evolved in terms of both domain coverage and metrics. However, existing benchmarks still lack domain-specific assessment criteria, remain predominantly English-centric, and face challenges with human annotation due to the complexity of reasoning. To address these, we introduce MSumBench, which provides a multi-dimensional, multi-domain evaluation of summarization in English and Chinese. It also incorporates specialized assessment criteria for each domain and leverages a multi-agent debate system to enhance annotation quality. By evaluating eight modern summarization models, we discover distinct performance patterns across domains and languages. We further examine large language models as summary evaluators, analyzing the correlation between their evaluation and summarization capabilities, and uncovering systematic bias in their assessment of self-generated summaries. Our benchmark dataset is publicly available at https://github.com/DISL-Lab/MSumBench.
pdf
bib
abs
ClusterAttn: KV Cache Compression under Intrinsic Attention Clustering
Minwei Zhang
|
Haifeng Sun
|
Jingyu Wang
|
Shaolong Li
|
Wanyi Ning
|
Qi Qi
|
Zirui Zhuang
|
Jianxin Liao
Sparse attention can effectively alleviate the significant demands on memory when large language models (LLMs) process long contexts. Existing methods typically apply the same sparse pattern across different attention heads and inputs. However, this uniform approach fails to capture the inherent diversity of attention patterns within LLMs, namely the intrinsic attention clustering. To address this, we propose ClusterAttn, a training-free sparse attention method that provides an efficient prompt cache compression scheme under intrinsic attention clustering for efficient LLM inference. Our findings show that attention heads consistently focus on specific clusters of the prompt during decoding, a pattern detectable from an observation window at the prompt’s end. ClusterAttn adaptively fits these clusters utilizing a density-based attention clustering algorithm, thus compressing the KV cache of the prompt. Evaluations on different models across various benchmarks demonstrate ClusterAttn’s superior compression rates and efficiency. By utilizing only 1024 tokens, it can reduce memory usage by 10%–65%, resulting in a latency reduction of 12%–23% and a throughput increase of 2.6–4.8 times, all with nearly no accuracy loss. Additionally, ClusterAttn can handle up to 128k context on a single A100-80GB GPU, outperforming existing methods.
pdf
bib
abs
SHARE: Shared Memory-Aware Open-Domain Long-Term Dialogue Dataset Constructed from Movie Script
Eunwon Kim
|
Chanho Park
|
Buru Chang
Shared memories between two individuals strengthen their bond and are crucial for facilitating their ongoing conversations. This study aims to make long-term dialogue more engaging by leveraging these shared memories. To this end, we introduce a new long-term dialogue dataset named SHARE, constructed from movie scripts, which are a rich source of shared memories among various relationships. Our dialogue dataset contains the summaries of persona information and events of two individuals, as explicitly revealed in their conversation, along with implicitly extractable shared memories. We also introduce EPISODE, a long-term dialogue framework based on SHARE that utilizes shared experiences between individuals. Through experiments using SHARE, we demonstrate that shared memories between two individuals make long-term dialogues more engaging and sustainable, and that EPISODE effectively manages shared memories during dialogue. Our dataset and code are available at https://github.com/e1kim/SHARE.
pdf
bib
abs
Incongruity-aware Tension Field Network for Multi-modal Sarcasm Detection
Jiecheng Zhang
|
C.L.Philip Chen
|
Shuzhen Li
|
Tong Zhang
Multi-modal sarcasm detection (MSD) identifies sarcasm and accurately understands users’ real attitudes from text-image pairs. Most MSD research explores the incongruity of text-image pairs as sarcasm information through consistency preference methods. However, these methods prioritize consistency over incongruity and blur incongruity information under their global feature aggregation mechanisms, leading to incongruity distortions and model misinterpretations. To address the above issues, this paper proposes a pioneering inconsistency preference method called incongruity-aware tension field network (ITFNet) for multi-modal sarcasm detection tasks. Specifically, ITFNet extracts effective text-image feature pairs from fact and sentiment perspectives. It then constructs a fact/sentiment tension field with discrepancy metrics to capture the contextual tone and polarized incongruity after the iterative learning of tension intensity, effectively highlighting incongruity information during such inconsistency preference learning. It further standardizes the polarized incongruity with reference to contextual tone to obtain standardized incongruity, effectively implementing instance standardization for unbiased decision-making in MSD. ITFNet performs well in extracting salient and standardized incongruity through an incongruity-aware tension field, significantly tackling incongruity distortions and cross-instance variance. Moreover, ITFNet achieves state-of-the-art performance surpassing LLaVA1.5-7B with only 17.3M trainable parameters, demonstrating its optimal performance-efficiency in multi-modal sarcasm detection tasks.
pdf
bib
abs
Instruction Tuning on Public Government and Cultural Data for Low-Resource Language: a Case Study in Kazakh
Nurkhan Laiyk
|
Daniil Orel
|
Rituraj Joshi
|
Maiya Goloburda
|
Yuxia Wang
|
Preslav Nakov
|
Fajri Koto
Instruction tuning in low-resource languages remains underexplored due to limited text data, particularly in government and cultural domains. To address this, we introduce and open-source a large-scale (10,600 samples) instruction-following (IFT) dataset, covering key institutional and cultural knowledge relevant to Kazakhstan. Our dataset enhances LLMs’ understanding of procedural, legal, and structural governance topics. We employ LLM-assisted data generation, comparing open-weight and closed-weight models for dataset construction, and select GPT-4o as the backbone. Each entity of our dataset undergoes full manual verification to ensure high quality. We also show that fine-tuning Qwen, Falcon, and Gemma on our dataset leads to consistent performance improvements in both multiple-choice and generative tasks, demonstrating the potential of LLM-assisted instruction tuning for low-resource languages.
pdf
bib
abs
Stealing Training Data from Large Language Models in Decentralized Training through Activation Inversion Attack
Chenxi Dai
|
Lin Lu
|
Pan Zhou
Decentralized training has become a resource-efficient framework to democratize the training of large language models (LLMs). However, the privacy risks associated with this framework, particularly due to the potential inclusion of sensitive data in training datasets, remain unexplored. This paper identifies a novel and realistic attack surface: the privacy leakage from training data in decentralized training, and proposes activation inversion attack (AIA) for the first time. AIA first constructs a shadow dataset comprising text labels and corresponding activations using public datasets. Leveraging this dataset, an attack model can be trained to reconstruct the training data from activations in victim decentralized training. We conduct extensive experiments on various LLMs and publicly available datasets to demonstrate the susceptibility of decentralized training to AIA. These findings highlight the urgent need to enhance security measures in decentralized training to mitigate privacy risks in training LLMs.
pdf
bib
abs
From Selection to Generation: A Survey of LLM-based Active Learning
Yu Xia
|
Subhojyoti Mukherjee
|
Zhouhang Xie
|
Junda Wu
|
Xintong Li
|
Ryan Aponte
|
Hanjia Lyu
|
Joe Barrow
|
Hongjie Chen
|
Franck Dernoncourt
|
Branislav Kveton
|
Tong Yu
|
Ruiyi Zhang
|
Jiuxiang Gu
|
Nesreen K. Ahmed
|
Yu Wang
|
Xiang Chen
|
Hanieh Deilamsalehy
|
Sungchul Kim
|
Zhengmian Hu
|
Yue Zhao
|
Nedim Lipka
|
Seunghyun Yoon
|
Ting-Hao Kenneth Huang
|
Zichao Wang
|
Puneet Mathur
|
Soumyabrata Pal
|
Koyel Mukherjee
|
Zhehao Zhang
|
Namyong Park
|
Thien Huu Nguyen
|
Jiebo Luo
|
Ryan A. Rossi
|
Julian McAuley
Active Learning (AL) has been a powerful paradigm for improving model efficiency and performance by selecting the most informative data points for labeling and training. In recent active learning frameworks, Large Language Models (LLMs) have been employed not only for selection but also for generating entirely new data instances and providing more cost-effective annotations. Motivated by the increasing importance of high-quality data and efficient model training in the era of LLMs, we present a comprehensive survey on LLM-based Active Learning. We introduce an intuitive taxonomy that categorizes these techniques and discuss the transformative roles LLMs can play in the active learning loop. We further examine the impact of AL on LLM learning paradigms and its applications across various domains. Finally, we identify open challenges and propose future research directions. This survey aims to serve as an up-to-date resource for researchers and practitioners seeking to gain an intuitive understanding of LLM-based AL techniques and deploy them to new applications.
pdf
bib
abs
OmniFlatten: An End-to-end GPT Model for Seamless Voice Conversation
Qinglin Zhang
|
Luyao Cheng
|
Chong Deng
|
Qian Chen
|
Wen Wang
|
Siqi Zheng
|
Jiaqing Liu
|
Hai Yu
|
Chao-Hong Tan
|
Zhihao Du
|
ShiLiang Zhang
Full-duplex spoken dialogue systems significantly surpass traditional turn-based dialogue systems, as they allow simultaneous bidirectional communication, closely mirroring human-human interactions. However, achieving low latency and natural interactions in full-duplex dialogue systems remains a significant challenge, especially considering human conversation dynamics such as interruptions, backchannels, and overlapping speech. In this paper, we introduce a novel End-to-End GPT-based model OmniFlatten for full-duplex conversation, capable of effectively modeling the complex behaviors inherent to natural conversations with low latency. To achieve full-duplex conversation capabilities, we propose a multi-stage post-training scheme that progressively adapts a text large language model (LLM) backbone into a speech-text dialogue LLM, capable of generating text and speech in real time, without modifying the architecture of the backbone LLM. The training process comprises three stages: modality alignment, half-duplex dialogue learning, and full-duplex dialogue learning. In all training stages, we standardize the data using a flattening operation, which enables unifying the training methods and the GPT backbone across different modalities and tasks. Our approach offers a simple modeling technique and a promising research direction for developing efficient and natural end-to-end full-duplex spoken dialogue systems.
pdf
bib
abs
DoMIX: An Efficient Framework for Exploiting Domain Knowledge in Fine-Tuning
Dohoon Kim
|
Donghun Kang
|
Taesup Moon
Domain-Adaptive Pre-training (DAP) has recently gained attention for its effectiveness in fine-tuning pre-trained models. Building on this, continual DAP has been explored to develop pre-trained models capable of incrementally incorporating different domain datasets. However, existing continual DAP methods face several limitations: (1) high computational cost and GPU memory usage during training; (2) sensitivity to incremental data order; and (3) providing a single, generalized model for all end tasks, which contradicts the essence of DAP. In this paper, we propose DoMIX, a novel approach that addresses these challenges by leveraging LoRA modules, a representative parameter-efficient fine-tuning (PEFT) method. Our approach enables efficient and parallel domain-adaptive pre-training that is robust to domain order and effectively utilizes accumulated knowledge to provide tailored pre-trained models for specific tasks. We also demonstrate that our method can be extended beyond the DAP setting to standard LLM fine-tuning scenarios. Code is available at https://github.com/dohoonkim-ai/DoMIX.
pdf
bib
abs
EAGLE: Expert-Guided Self-Enhancement for Preference Alignment in Pathology Large Vision-Language Model
Meidan Ding
|
Jipeng Zhang
|
Wenxuan Wang
|
Haiqin Zhong
|
Xiaoqin Wang
|
Xinheng Lyu
|
Wenting Chen
|
Linlin Shen
Recent advancements in Large Vision Language Models (LVLMs) show promise for pathological diagnosis, yet their application in clinical settings faces critical challenges of multimodal hallucination and biased responses. While preference alignment methods have proven effective in general domains, acquiring high-quality preference data for pathology remains challenging due to limited expert resources and domain complexity. In this paper, we propose EAGLE (Expert-guided self-enhancement for preference Alignment in patholoGy Large vision-languagE model), a novel framework that systematically integrates medical expertise into preference alignment. EAGLE consists of three key stages: initialization through supervised fine-tuning, self-preference creation leveraging expert prompting and medical entity recognition, and iterative preference following-tuning. The self-preference creation stage uniquely combines expert-verified chosen sampling with expert-guided rejected sampling to generate high-quality preference data, while the iterative tuning process continuously refines both data quality and model performance. Extensive experiments demonstrate that EAGLE significantly outperforms existing pathological LVLMs, effectively reducing hallucination and bias while maintaining pathological accuracy. The source code is available at https://github.com/meidandz/EAGLE.
pdf
bib
abs
CoT-ICL Lab: A Synthetic Framework for Studying Chain-of-Thought Learning from In-Context Demonstrations
Vignesh Kothapalli
|
Hamed Firooz
|
Maziar Sanjabi
We introduce CoT-ICL Lab, a framework and methodology to generate synthetic tokenized datasets and systematically study chain-of-thought (CoT) in-context learning (ICL) in language models. CoT-ICL Lab allows fine-grained control over the complexity of in-context examples by decoupling (1) the causal structure involved in chain token generation from (2) the underlying token processing functions. We train decoder-only transformers (up to 700M parameters) on these datasets and show that CoT accelerates the accuracy transition to higher values across model sizes. In particular, we find that model depth is crucial for leveraging CoT with limited in-context examples, while more examples help shallow models match deeper model performance. Additionally, limiting the diversity of token processing functions throughout training improves causal structure learning via ICL. We also interpret these transitions by analyzing transformer embeddings and attention maps. Overall, CoT-ICL Lab serves as a simple yet powerful testbed for theoretical and empirical insights into ICL and CoT in language models.
pdf
bib
abs
Flexora: Flexible Low-Rank Adaptation for Large Language Models
Chenxing Wei
|
Yao Shu
|
Ying Tiffany He
|
Fei Yu
Large language models (LLMs) have revolutionized artificial intelligence, but their performance on specific tasks is often limited by knowledge boundaries. While fine-tuning techniques like low-rank adaptation (LoRA) aim to address this, they can suffer from overfitting. We propose flexible low-rank adaptation (Flexora), a novel method that automatically selects the most critical layers for fine-tuning to optimize performance across diverse downstream tasks. Flexora formulates layer selection as a hyperparameter optimization problem, employs unrolled differentiation for efficient solving, and identifies the most impactful layers based on optimized hyperparameters. Extensive experiments across various pre-trained models and natural language tasks demonstrate that Flexora consistently outperforms existing baselines. We provide theoretical insights and comprehensive ablation studies to elucidate the effectiveness of Flexora. Therefore, Flexora offers a robust solution to enhance LoRA fine-tuning for LLMs, potentially advancing the field of adaptive language model optimization.
pdf
bib
abs
QDTSynth: Quality-Driven Formal Theorem Synthesis for Enhancing Proving Performance of LLMs
Lei Wang
|
Ruobing Zuo
|
Gaolei He
|
Jianlin Wang
|
Zhengfeng Yang
Automated Theorem Proving is an important and challenging task. Although large language models (LLMs) have demonstrated remarkable potential in mathematical reasoning, their performance in formal theorem proving remains constrained by the scarcity of high-quality supervised fine-tuning (SFT) data. To address this limitation, we propose a **Q**uality-**D**riven **T**heorem **S**ynthesis method (QDTSynth) in Lean4. During the statement synthesis, we enhance Monte Carlo Tree Search (MCTS) with an adaptive adjustment mechanism that dynamically optimizes the search strategy based on the synthesis of statements. In addition, we propose diversity screening and the self-assessment method to select theorems that exhibit both diversity and high quality from the initially synthetic statements, enabling the synthesis of a high-quality Lean4 theorem dataset. After fine-tuning three open-source large language models on our synthetic dataset, experiments on the miniF2F benchmark demonstrate that QDTSynth significantly improves the performance of various open-source LLMs in theorem proving tasks. Our work offers a promising new direction for the future synthesis of high-quality formal mathematical theorems.
pdf
bib
abs
RSVP: Reasoning Segmentation via Visual Prompting and Multi-modal Chain-of-Thought
Yi Lu
|
Jiawang Cao
|
Yongliang Wu
|
Bozheng Li
|
Licheng Tang
|
Yangguang Ji
|
Chong Wu
|
Jay Wu
|
Wenbo Zhu
Multi-modal Large Language Models (MLLMs) have demonstrated remarkable reasoning capability but lack explicit mechanisms for visual grounding and segmentation, creating a gap between cognitive reasoning and visual perception. To bridge this gap, we introduce Reasoning Segmentation via Visual Prompting (RSVP), a novel framework that unifies multi-step multimodal reasoning with grounded visual understanding. RSVP is a two-stage structuralized framework that integrates reasoning-driven localization with segmentation refinement. In the reasoning stage, RSVP employs multimodal chain-of-thought visual prompts to help MLLMs understand queries and infer targets, generating interpretable region proposals that enhance visual grounding. In the segmentation stage, RSVP refines these proposals with a Vision-Language Segmentation Module (VLSM), seamlessly integrating textual and visual cues to produce precise segmentation masks. By explicitly modelling the interaction between multimodal reasoning and segmentation, RSVP introduces a new paradigm for interpretable reasoning segmentation. It exploits MLLMs’ inherent localization capabilities, enabling the models to not only reason about objects but also generate structured visual representations. Our extensive experiments demonstrate that RSVP achieves state-of-the-art performance, surpassing prior state-of-the-art methods by up to +6.5 gIoU and +9.2 cIoU on ReasonSeg and achieving 49.7 mAP on SegInW under zero-shot settings. These results validate RSVP as an effective and scalable framework for integrating cognitive reasoning with structured visual understanding.
pdf
bib
abs
QAEval: Mixture of Evaluators for Question-Answering Task Evaluation
Tan Yue
|
Rui Mao
|
Xuzhao Shi
|
Shuo Zhan
|
Zuhao Yang
|
Dongyan Zhao
Question answering (QA) tasks serve as a key benchmark for evaluating generation systems. Traditional rule-based metrics, such as accuracy and relaxed-accuracy, struggle with open-ended and unstructured responses. LLM-based evaluation methods offer greater flexibility but suffer from sensitivity to instructions, robustness issues, and high computational costs. To overcome these challenges, we introduce QAEval, a hybrid framework combining rule-based reliability with LLM-based adaptability. QAEval utilizes two high-quality datasets: QAExtract for short-answer extraction and QAScore for scoring model training. By integrating a Mixture of Evaluators model with Dynamic Load Balancing Optimization, QAEval enables accurate, cost-effective QA evaluation. Experimental results show it outperforms models like GPT-4o and Claude-3, achieving 92.3% accuracy with only 0.6B parameters.
pdf
bib
abs
Debiasing the Fine-Grained Classification Task in LLMs with Bias-Aware PEFT
Daiying Zhao
|
Xinyu Yang
|
Hang Chen
Fine-grained classification via LLMs is susceptible to more complex label biases compared to traditional classification tasks. Existing bias mitigation strategies, such as retraining, post-hoc adjustment, and parameter-efficient fine-tuning (PEFT) are primarily effective for simple classification biases, such as stereotypes, but fail to adequately address prediction propensity and discriminative ability biases. In this paper, we analyze these two bias phenomena and observe their progressive accumulation from intermediate to deeper layers within LLMs. To mitigate this issue, we propose a bias-aware optimization framework that incorporates two distinct label balance constraints with a PEFT strategy targeting an intermediate layer. Our approach adjusts less than 1% of the model’s parameters while effectively curbing bias amplification in deeper layers. Extensive experiments conducted across 12 datasets and 5 LLMs demonstrate that our method consistently outperforms or matches the performance of full-parameter fine-tuning and LoRA, achieving superior results with lower perplexity.
pdf
bib
abs
Demystifying Small Language Models for Edge Deployment
Zhenyan Lu
|
Xiang Li
|
Dongqi Cai
|
Rongjie Yi
|
Fangming Liu
|
Wei Liu
|
Jian Luan
|
Xiwen Zhang
|
Nicholas D. Lane
|
Mengwei Xu
Small language models (SLMs) have emerged as a promising solution for deployment on resource-constrained devices, such as smartphones and Web of Things. This work presents the first comprehensive study of over 60 publicly accessible SLMs, such as Microsoft Phi and Google Gemma. Our findings show that state-of-the-art SLMs outperform 7B models in general tasks, proving their practical viability. However, SLMs’ in-context learning capabilities remain limited, and their efficiency has significant optimization potential. We identify key SLM optimization opportunities, including dynamic task-specific routing, model-hardware co-design, and vocabulary/KV cache compression. Overall, we expect the work to reveal an all-sided landscape of SLMs, benefiting the research community across algorithm, model, system, and hardware levels.
pdf
bib
abs
Adapt Once, Thrive with Updates: Transferable Parameter-Efficient Fine-Tuning on Evolving Base Models
Naibin Gu
|
Peng Fu
|
Xiyu Liu
|
Ke Ma
|
Zheng Lin
|
Weiping Wang
Parameter-efficient fine-tuning (PEFT) has become a common method for fine-tuning large language models, where a base model can serve multiple users through PEFT module switching. To enhance user experience, base models require periodic updates. However, once updated, PEFT modules fine-tuned on previous versions often suffer substantial performance degradation on newer versions. Re-tuning these numerous modules to restore performance would incur significant computational costs. Through a comprehensive analysis of the changes that occur during base model updates, we uncover an interesting phenomenon: continual training primarily affects task-specific knowledge stored in Feed-Forward Networks (FFN), while having less impact on the task-specific pattern in the Attention mechanism. Based on these findings, we introduce Trans-PEFT, a novel approach that enhances the PEFT module by focusing on the task-specific pattern while reducing its dependence on certain knowledge in the base model. Further theoretical analysis supports our approach. Extensive experiments across 7 base models and 12 datasets demonstrate that Trans-PEFT trained modules can maintain performance on updated base models without re-tuning, significantly reducing maintenance overhead in real-world applications.
pdf
bib
abs
Can Vision-Language Models Evaluate Handwritten Math?
Oikantik Nath
|
Hanani Bathina
|
Mohammed Safi Ur Rahman Khan
|
Mitesh M Khapra
Recent advancements in Vision-Language Models (VLMs) have opened new possibilities in automatic grading of handwritten student responses, particularly in mathematics. However, a comprehensive study to test the ability of VLMs to evaluate and reason over handwritten content remains absent. To address this gap, we introduce FERMAT, a benchmark designed to assess VLMs’ ability to detect, localize and correct errors in handwritten mathematical content. FERMAT spans four key error dimensions - computational, conceptual, notational, and presentation - and comprises over 2,200 handwritten math solutions derived from 609 manually curated problems from grades 7-12 with intentionally introduced perturbations. Using FERMAT, we benchmark nine VLMs across three tasks: error detection, localization, and correction. Our results reveal significant shortcomings in current VLMs in reasoning over handwritten text, with Gemini-1.5-Pro achieving the highest error correction rate (77%). We also observed that some models struggle with processing handwritten content, as their accuracy improves when handwritten inputs are replaced with printed text or images. These findings highlight the limitations of current VLMs and reveal new avenues for improvement. We will release FERMAT and all the associated resources as open source to drive further research.
pdf
bib
abs
Continual Gradient Low-Rank Projection Fine-Tuning for LLMs
Chenxu Wang
|
Yilin Lyu
|
Zicheng Sun
|
Liping Jing
Continual fine-tuning of Large Language Models (LLMs) is hampered by the trade-off between efficiency and expressiveness. Low-Rank Adaptation (LoRA) offers efficiency but constrains the model’s ability to learn new tasks and transfer knowledge due to its low-rank nature and reliance on explicit parameter constraints. We propose GORP (Gradient LOw Rank Projection) for Continual Learning, a novel training strategy that overcomes these limitations by synergistically combining full and low-rank parameters and jointly updating within a unified low-rank gradient subspace. GORP expands the optimization space while preserving efficiency and mitigating catastrophic forgetting. Extensive experiments on continual learning benchmarks demonstrate GORP’s superior performance compared to existing state-of-the-art approaches. Code is available at https://github.com/Wcxwcxw/GORP.
pdf
bib
abs
Towards Objective Fine-tuning: How LLMs’ Prior Knowledge Causes Potential Poor Calibration?
Ziming Wang
|
Zeyu Shi
|
Haoyi Zhou
|
Shiqi Gao
|
Qingyun Sun
|
Jianxin Li
Fine-tuned Large Language Models (LLMs) often demonstrate poor calibration, with their confidence scores misaligned with actual performance. While calibration has been extensively studied in models trained from scratch, the impact of LLMs’ prior knowledge on calibration during fine-tuning remains understudied. Our research reveals that LLMs’ prior knowledge causes potential poor calibration due to the ubiquitous presence of known data in real-world fine-tuning, which appears harmful for calibration. Specifically, data aligned with LLMs’ prior knowledge would induce overconfidence, while new knowledge improves calibration. Our findings expose a tension: LLMs’ encyclopedic knowledge, while enabling task versatility, undermines calibration through unavoidable knowledge overlaps. To address this, we propose CogCalib, a cognition-aware framework that applies targeted learning strategies according to the model’s prior knowledge. Experiments across 7 tasks using 3 LLM families prove that CogCalib significantly improves calibration while maintaining performance, achieving an average 57% reduction in ECE compared to standard fine-tuning in Llama3-8B. These improvements generalize well to out-of-domain tasks, enhancing the objectivity and reliability of domain-specific LLMs, and making them more trustworthy for critical human-AI interaction applications.
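For readers unfamiliar with the ECE metric cited above, the following is a minimal sketch of expected calibration error with equal-width confidence bins. The function name, the bin count of 10, and the half-open binning convention are assumptions made for illustration; the abstract does not specify how the authors compute ECE.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Equal-width-bin ECE: sample-weighted mean of |accuracy - confidence| per bin.

    Assumption: bins are [k/n_bins, (k+1)/n_bins), with confidence 1.0 placed
    in the last bin. Other binning conventions exist.
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    n = len(confidences)
    if n == 0:
        return 0.0

    bins = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, hit))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, h in members if h) / len(members)
        ece += (len(members) / n) * abs(accuracy - avg_conf)
    return ece
```

A model that reports 0.9 confidence on four predictions but gets only three right has a gap of 0.15 in that bin, so lower ECE indicates confidence that tracks accuracy more closely.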
pdf
bib
abs
Towards Robust ESG Analysis Against Greenwashing Risks: Aspect-Action Analysis with Cross-Category Generalization
Keane Ong
|
Rui Mao
|
Deeksha Varshney
|
Erik Cambria
|
Gianmarco Mengaldo
Sustainability reports are key for evaluating companies’ environmental, social and governance (ESG) performance. To analyze these reports, NLP approaches can efficiently extract ESG insights at scale. However, even the most advanced NLP methods lack robustness against ESG content that is greenwashed – i.e. sustainability claims that are misleading, exaggerated, and fabricated. Accordingly, existing NLP approaches often extract insights that reflect misleading or exaggerated sustainability claims rather than objective ESG performance. To tackle this issue, we introduce A3CG - **A**spect-**A**ction **A**nalysis with Cross-**C**ategory **G**eneralization, as a novel dataset to improve the robustness of ESG analysis amid the prevalence of greenwashing. By explicitly linking sustainability aspects with their associated actions, A3CG facilitates a more fine-grained and transparent evaluation of sustainability claims, ensuring that insights are grounded in verifiable actions rather than vague or misleading rhetoric. Additionally, A3CG emphasizes cross-category generalization. This ensures robust model performance in aspect-action analysis even when companies change their reports to selectively favor certain sustainability areas. Through experiments on A3CG, we analyze state-of-the-art supervised models and LLMs, uncovering their limitations and outlining key directions for future research.
pdf
bib
abs
HiddenDetect: Detecting Jailbreak Attacks against Multimodal Large Language Models via Monitoring Hidden States
Yilei Jiang
|
Xinyan Gao
|
Tianshuo Peng
|
Yingshui Tan
|
Xiaoyong Zhu
|
Bo Zheng
|
Xiangyu Yue
The integration of additional modalities increases the susceptibility of large vision-language models (LVLMs) to safety risks, such as jailbreak attacks, compared to their language-only counterparts. While existing research primarily focuses on post-hoc alignment techniques, the underlying safety mechanisms within LVLMs remain largely unexplored. In this work, we investigate whether LVLMs inherently encode safety-relevant signals within their internal activations during inference. Our findings reveal that LVLMs exhibit distinct activation patterns when processing unsafe prompts, which can be leveraged to detect and mitigate adversarial inputs without requiring extensive fine-tuning. Building on this insight, we introduce HiddenDetect, a novel tuning-free framework that harnesses internal model activations to enhance safety. Experimental results show that HiddenDetect surpasses state-of-the-art methods in detecting jailbreak attacks against LVLMs. By utilizing intrinsic safety-aware patterns, our method provides an efficient and scalable solution for strengthening LVLM robustness against multimodal threats. Our code and data will be released publicly.
pdf
bib
abs
SwiLTra-Bench: The Swiss Legal Translation Benchmark
Joel Niklaus
|
Jakob Merane
|
Luka Nenadic
|
Sina Ahmadi
|
Yingqiang Gao
|
Cyrill A. H. Chevalley
|
Claude Humbel
|
Christophe Gösken
|
Lorenzo Tanzi
|
Thomas Lüthi
|
Stefan Palombo
|
Spencer Poff
|
Boling Yang
|
Nan Wu
|
Matthew Guillod
|
Robin Mamié
|
Daniel Brunner
|
Julio Pereyra
|
Niko Grupen
In Switzerland, legal translation is uniquely important due to the country’s four official languages and requirements for multilingual legal documentation. However, this process traditionally relies on professionals who must be both legal experts and skilled translators, creating bottlenecks and impacting effective access to justice. To address this challenge, we introduce SwiLTra-Bench, a comprehensive multilingual benchmark of over 180K aligned Swiss legal translation pairs comprising laws, headnotes, and press releases across all Swiss languages along with English, designed to evaluate LLM-based translation systems. Our systematic evaluation reveals that frontier models achieve superior translation performance across all document types, while specialized translation systems excel specifically in laws but under-perform in headnotes. Through rigorous testing and human expert validation, we demonstrate that while fine-tuning open SLMs significantly improves their translation quality, they still lag behind the best zero-shot prompted frontier models such as Claude-3.5-Sonnet. Additionally, we present SwiLTra-Judge, a specialized LLM evaluation system that aligns best with human expert assessments.
pdf
bib
abs
Two Intermediate Translations Are Better Than One: Fine-tuning LLMs for Document-level Translation Refinement
Yichen Dong
|
Xinglin Lyu
|
Junhui Li
|
Daimeng Wei
|
Min Zhang
|
Shimin Tao
|
Hao Yang
Recent research has shown that large language models (LLMs) can enhance translation quality through self-refinement. In this paper, we build on this idea by extending the refinement from sentence-level to document-level translation, specifically focusing on document-to-document (Doc2Doc) translation refinement. Since sentence-to-sentence (Sent2Sent) and Doc2Doc translation address different aspects of the translation process, we propose fine-tuning LLMs for translation refinement using two intermediate translations, combining the strengths of both Sent2Sent and Doc2Doc. Additionally, recognizing that the quality of intermediate translations varies, we introduce an enhanced fine-tuning method with quality awareness that assigns lower weights to easier translations and higher weights to more difficult ones, enabling the model to focus on challenging translation cases. Experimental results across ten translation tasks with LLaMA-3-8B-Instruct and Mistral-Nemo-Instruct demonstrate the effectiveness of our approach. We will release our code on GitHub.
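The quality-aware weighting described above can be illustrated with a small sketch. Everything here is hypothetical: the abstract does not give the weighting function, so this example assumes each training instance has a quality score in [0, 1] for its intermediate translation (higher meaning easier) and simply weights instances by one minus that score, rescaled so the weights average 1.

```python
def difficulty_weights(quality_scores):
    """Map quality scores in [0, 1] to weights averaging 1.

    Assumption (not from the paper): weight is proportional to 1 - quality,
    so lower-quality (harder) instances receive larger weights.
    """
    raw = [1.0 - q for q in quality_scores]
    total = sum(raw)
    if total == 0:
        # All instances have perfect quality: fall back to uniform weights.
        return [1.0] * len(quality_scores)
    n = len(raw)
    return [r * n / total for r in raw]


def weighted_loss(losses, weights):
    """Mean of per-instance losses after applying instance weights."""
    return sum(l * w for l, w in zip(losses, weights)) / len(losses)
```

With scores of 0.9 and 0.5, the second instance receives five times the weight of the first, so its loss dominates the batch average.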
pdf
bib
abs
Circuit Compositions: Exploring Modular Structures in Transformer-Based Language Models
Philipp Mondorf
|
Sondre Wold
|
Barbara Plank
A fundamental question in interpretability research is to what extent neural networks, particularly language models, implement reusable functions through subnetworks that can be composed to perform more complex tasks. Recent advances in mechanistic interpretability have made progress in identifying circuits, the minimal computational subgraphs responsible for a model’s behavior on specific tasks. However, most studies focus on identifying circuits for individual tasks without investigating how functionally similar circuits relate to each other. To address this gap, we study the modularity of neural networks by analyzing circuits for highly compositional subtasks within a transformer-based language model. Specifically, given a probabilistic context-free grammar, we identify and compare circuits responsible for ten modular string-edit operations. Our results indicate that functionally similar circuits exhibit both notable node overlap and cross-task faithfulness. Moreover, we demonstrate that the circuits identified can be reused and combined through set operations to represent more complex functional model capabilities.
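As a rough illustration of the node overlap and set operations mentioned above, a circuit can be modeled as a set of component identifiers. The node names and the two operation names below are hypothetical placeholders, and intersection-over-union is only one possible overlap measure; the abstract does not state which one the authors use.

```python
def node_overlap(circuit_a, circuit_b):
    """Intersection-over-union of two circuits represented as node sets."""
    union = circuit_a | circuit_b
    if not union:
        return 1.0  # two empty circuits are treated as identical
    return len(circuit_a & circuit_b) / len(union)


# Hypothetical circuits for two string-edit operations.
copy_circuit = {"L0.H1", "L1.MLP", "L2.H3"}
reverse_circuit = {"L0.H1", "L1.MLP", "L2.H0"}

# Union: a candidate circuit for a task that composes both operations.
combined_circuit = copy_circuit | reverse_circuit

# Intersection: components shared by both operations.
shared_nodes = copy_circuit & reverse_circuit
```

Here the two circuits share two of four distinct nodes, giving an overlap of 0.5.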
pdf
bib
abs
Can LLMs Ground when they (Don’t) Know: A Study on Direct and Loaded Political Questions
Clara Lachenmaier
|
Judith Sieker
|
Sina Zarrieß
Communication among humans relies on conversational grounding, allowing interlocutors to reach mutual understanding even when they do not have perfect knowledge and must resolve discrepancies in each other’s beliefs. This paper investigates how large language models (LLMs) manage common ground in cases where they (don’t) possess knowledge, focusing on facts in the political domain where the risk of misinformation and grounding failure is high. We examine LLMs’ ability to answer direct knowledge questions and loaded questions that presuppose misinformation. We evaluate whether loaded questions lead LLMs to engage in active grounding and correct false user beliefs, in connection to their level of knowledge and their political bias. Our findings highlight significant challenges in LLMs’ ability to engage in grounding and reject false user beliefs, raising concerns about their role in mitigating misinformation in political discourse.
pdf
bib
abs
GraphCheck: Breaking Long-Term Text Barriers with Extracted Knowledge Graph-Powered Fact-Checking
Yingjian Chen
|
Haoran Liu
|
Yinhong Liu
|
Jinxiang Xie
|
Rui Yang
|
Han Yuan
|
Yanran Fu
|
Peng Yuan Zhou
|
Qingyu Chen
|
James Caverlee
|
Irene Li
Large language models (LLMs) are widely used, but they often generate subtle factual errors, especially in long-form text. These errors are fatal in some specialized domains such as medicine. Existing methods for fact-checking against grounding documents face two main challenges: (1) they struggle to understand complex multihop relations in long documents, often overlooking subtle factual errors; (2) most specialized methods rely on pairwise comparisons, requiring multiple model calls, leading to high resource and computational costs. To address these challenges, we propose GraphCheck, a fact-checking framework that uses extracted knowledge graphs to enhance text representation. Graph Neural Networks further process these graphs as a soft prompt, enabling LLMs to incorporate structured knowledge more effectively. Enhanced with graph-based reasoning, GraphCheck captures multihop reasoning chains that are often overlooked by existing methods, enabling precise and efficient fact-checking in a single inference call. Experimental results on seven benchmarks spanning both general and medical domains demonstrate up to a 7.1% overall improvement over baseline models. Notably, GraphCheck outperforms existing specialized fact-checkers and achieves comparable performance with state-of-the-art LLMs, such as DeepSeek-V3 and OpenAI-o1, with significantly fewer parameters.
pdf
bib
abs
SCULPT: Systematic Tuning of Long Prompts
Shanu Kumar
|
Akhila Yesantarao Venkata
|
Shubhanshu Khandelwal
|
Bishal Santra
|
Parag Agrawal
|
Manish Gupta
Prompt optimization is essential for effective utilization of large language models (LLMs) across diverse tasks. While existing optimization methods are effective in optimizing short prompts, they struggle with longer, more complex ones, often risking information loss and being sensitive to small perturbations. To address these challenges, we propose SCULPT (Systematic Tuning of Long Prompts), a framework that treats prompt optimization as a hierarchical tree refinement problem. SCULPT represents prompts as tree structures, enabling targeted modifications while preserving contextual integrity. It employs a Critic-Actor framework that generates reflections and applies actions to refine the prompt. Evaluations demonstrate SCULPT’s effectiveness on long prompts, its robustness to adversarial perturbations, and its ability to generate high-performing prompts even without any initial human-written prompt. Compared to existing state-of-the-art methods, SCULPT consistently improves LLM performance by preserving essential task information while applying structured refinements. Both qualitative and quantitative analyses show that SCULPT produces more stable and interpretable prompt modifications, ensuring better generalization across tasks.
pdf
bib
abs
Crab: A Novel Configurable Role-Playing LLM with Assessing Benchmark
Kai He
|
Yucheng Huang
|
Wenqing Wang
|
Delong Ran
|
Dongming Sheng
|
Junxuan Huang
|
Qika Lin
|
Jiaxing Xu
|
Wenqiang Liu
|
Mengling Feng
This study introduces Crab, a novel Configurable Role-Playing (RP) LLM with Assessing Benchmark, which consists of Role-Centric Dataset Curation, Persona-Embodying LLM Construction, and Comprehensive Benchmark Creation for RP dialogue generation. Distinct from traditional RP models that employ only several preset roles, Crab enables dynamic configuration of desired roles, thereby enhancing flexibility and adaptability. To effectively train RP-LLMs, we curated the largest RP training dataset. The dataset provides a detailed role overview for each dialogue, including character profile, conversation scenario, and tagged topic, capturing a broad range of role-based behaviors, emotions, and interactions. We also noticed that current benchmarks lack both proper evaluation standards and methods. Thus, to validate RP-LLMs’ effectiveness, we introduced a new benchmark containing an evaluation standard, a test dataset with manual annotations, and a reward model RoleRM designed to automatically assess specific aspects of RP while aligning with human perception. Extensive experiments reveal that RoleRM significantly outperforms ChatGPT and other evaluation methods in conducting fine-grained evaluations of RP. Also, RP-LLMs powered by Crab demonstrate superior performance across various fine-grained aspects.
pdf
bib
abs
Chinese SafetyQA: A Safety Short-form Factuality Benchmark for Large Language Models
Yingshui Tan
|
Boren Zheng
|
Baihui Zheng
|
Kerui Cao
|
Huiyun Jing
|
Jincheng Wei
|
Jiaheng Liu
|
Yancheng He
|
Wenbo Su
|
Xiaoyong Zhu
|
Bo Zheng
|
Kaifu Zhang
With the rapid advancement of Large Language Models (LLMs), significant safety concerns have emerged. Fundamentally, the safety of large language models is closely linked to the accuracy, comprehensiveness, and clarity of their understanding of safety knowledge, particularly in domains such as law, policy, and ethics. This factuality ability is crucial in determining whether these models can be deployed and applied safely and compliantly within specific regions. To address these challenges and better evaluate the factuality ability of LLMs to answer short questions, we introduce the Chinese SafetyQA benchmark. Chinese SafetyQA has several properties (i.e., Chinese, Diverse, High-quality, Static, Easy-to-evaluate, Safety-related, Harmless). Based on Chinese SafetyQA, we perform a comprehensive evaluation of the factuality abilities of existing LLMs and analyze how these capabilities relate to LLM abilities, e.g., RAG ability and robustness against attacks.
pdf
bib
abs
TRIDENT: Enhancing Large Language Model Safety with Tri-Dimensional Diversified Red-Teaming Data Synthesis
Xiaorui Wu
|
Xiaofeng Mao
|
Fei Li
|
Xin Zhang
|
Xuanhong Li
|
Chong Teng
|
Donghong Ji
|
Zhuang Li
Large Language Models (LLMs) excel in various natural language processing tasks but remain vulnerable to generating harmful content or being exploited for malicious purposes. Although safety alignment datasets have been introduced to mitigate such risks through supervised fine-tuning (SFT), these datasets often lack comprehensive risk coverage. Most existing datasets focus primarily on lexical diversity while neglecting other critical dimensions. To address this limitation, we propose a novel analysis framework to systematically measure the risk coverage of alignment datasets across three essential dimensions: Lexical Diversity, Malicious Intent, and Jailbreak Tactics. We further introduce TRIDENT, an automated pipeline that leverages persona-based, zero-shot LLM generation to produce diverse and comprehensive instructions spanning these dimensions. Each harmful instruction is paired with an ethically aligned response, resulting in two datasets: TRIDENT-Core, comprising 26,311 examples, and TRIDENT-Edge, with 18,773 examples. Fine-tuning Llama 3.1-8B on TRIDENT-Edge demonstrates substantial improvements, achieving an average 14.29% reduction in Harm Score, and a 20% decrease in Attack Success Rate compared to the best-performing baseline model fine-tuned on the WildBreak dataset.
pdf
bib
abs
Cross-Lingual Optimization for Language Transfer in Large Language Models
Jungseob Lee
|
Seongtae Hong
|
Hyeonseok Moon
|
Heuiseok Lim
Adapting large language models to other languages typically employs supervised fine-tuning (SFT) as a standard approach. However, it often suffers from an overemphasis on English performance, a phenomenon that is especially pronounced in data-constrained environments. To overcome these challenges, we propose Cross-Lingual Optimization (CLO) that efficiently transfers an English-centric LLM to a target language while preserving its English capabilities. CLO utilizes publicly available English SFT data and a translation model to enable cross-lingual transfer. We conduct experiments using five models on six languages with varying resource levels. Our results show that CLO consistently outperforms SFT in both acquiring target language proficiency and maintaining English performance. Remarkably, in low-resource languages, CLO with only 3,200 samples surpasses SFT with 6,400 samples, demonstrating that CLO can achieve better performance with less data. Furthermore, we find that SFT is particularly sensitive to data quantity in medium and low-resource languages, whereas CLO remains robust. Our comprehensive analysis emphasizes the limitations of SFT and incorporates additional training strategies in CLO to enhance efficiency.
pdf
bib
abs
CART: A Generative Cross-Modal Retrieval Framework With Coarse-To-Fine Semantic Modeling
Minghui Fang
|
Shengpeng Ji
|
Jialong Zuo
|
Hai Huang
|
Yan Xia
|
Jieming Zhu
|
Xize Cheng
|
Xiaoda Yang
|
Wenrui Liu
|
Gang Wang
|
Zhenhua Dong
|
Zhou Zhao
Cross-modal retrieval aims to search for instances that are semantically related to the query through the interaction of different modal data. Traditional solutions utilize a single-tower or dual-tower framework to explicitly compute the score between queries and candidates, which is challenged by training cost and inference latency with large-scale data. Inspired by the remarkable performance and efficiency of generative models, we propose a generative cross-modal retrieval framework (CART) based on coarse-to-fine semantic modeling, which assigns an identifier to each candidate and treats generating the identifier as the retrieval target. Specifically, we explore an effective coarse-to-fine scheme, combining K-Means and RQ-VAE to discretize multimodal data into token sequences that support autoregressive generation. Further, considering the lack of explicit interaction between queries and candidates, we propose a feature fusion strategy to align their semantics. Extensive experiments demonstrate the effectiveness of the strategies in CART, achieving excellent results in both retrieval performance and efficiency.
pdf
bib
abs
MMMU-Pro: A More Robust Multi-discipline Multimodal Understanding Benchmark
Xiang Yue
|
Tianyu Zheng
|
Yuansheng Ni
|
Yubo Wang
|
Kai Zhang
|
Shengbang Tong
|
Yuxuan Sun
|
Botao Yu
|
Ge Zhang
|
Huan Sun
|
Yu Su
|
Wenhu Chen
|
Graham Neubig
This paper introduces MMMU-Pro, a robust version of the Massive Multi-discipline Multimodal Understanding and Reasoning (MMMU) benchmark. MMMU-Pro rigorously assesses multimodal models’ true understanding and reasoning capabilities through a three-step process based on MMMU: (1) filtering out questions answerable by text-only models, (2) augmenting candidate options, and (3) introducing a vision-only input setting where questions are embedded within images. This setting challenges AI to truly “see” and “read” simultaneously, testing a core human cognitive skill of seamlessly integrating visual and textual information. Results show that model performance is substantially lower on MMMU-Pro than on MMMU, with drops ranging from 16.8% to 26.9% across models. We explore the impact of OCR prompts and Chain of Thought (CoT) reasoning, finding that OCR prompts have minimal effect while CoT generally improves performance. MMMU-Pro provides a more rigorous evaluation tool, closely mimicking real-world scenarios and offering valuable directions for future multimodal research.
pdf
bib
abs
Cheems: A Practical Guidance for Building and Evaluating Chinese Reward Models from Scratch
Xueru Wen
|
Jie Lou
|
Zichao Li
|
Yaojie Lu
|
XingYu XingYu
|
Yuqiu Ji
|
Guohai Xu
|
Hongyu Lin
|
Ben He
|
Xianpei Han
|
Le Sun
|
Debing Zhang
Reward models (RMs) are crucial for aligning large language models (LLMs) with human preferences. However, most RM research is centered on English and relies heavily on synthetic resources, which leads to limited and less reliable datasets and benchmarks for Chinese. To address this gap, we introduce CheemsBench, a fully human-annotated RM evaluation benchmark within Chinese contexts, and CheemsPreference, a large-scale and diverse preference dataset annotated through human-machine collaboration to support Chinese RM training. We systematically evaluate open-source discriminative and generative RMs on CheemsBench and observe significant limitations in their ability to capture human preferences in Chinese scenarios. Additionally, based on CheemsPreference, we construct an RM that achieves state-of-the-art performance on CheemsBench, demonstrating the necessity of human supervision in RM training. Our findings reveal that scaled AI-generated data struggles to fully capture human preferences, emphasizing the importance of high-quality human supervision in RM development.
pdf
bib
abs
Why Safeguarded Ships Run Aground? Aligned Large Language Models’ Safety Mechanisms Tend to Be Anchored in The Template Region
Chak Tou Leong
|
Qingyu Yin
|
Jian Wang
|
Wenjie Li
The safety alignment of large language models (LLMs) remains vulnerable, as their initial behavior can be easily jailbroken by even relatively simple attacks. Since infilling a fixed template between the input instruction and initial model output is a common practice for existing LLMs, we hypothesize that this template is a key factor behind their vulnerabilities: LLMs’ safety-related decision-making overly relies on the aggregated information from the template region, which largely influences these models’ safety behavior. We refer to this issue as template-anchored safety alignment. In this paper, we conduct extensive experiments and verify that template-anchored safety alignment is widespread across various aligned LLMs. Our mechanistic analyses demonstrate how it leads to models’ susceptibility when encountering inference-time jailbreak attacks. Furthermore, we show that detaching safety mechanisms from the template region is promising in mitigating vulnerabilities to jailbreak attacks. We encourage future research to develop more robust safety alignment techniques that reduce reliance on the template region.
pdf
bib
abs
LLaVA Steering: Visual Instruction Tuning with 500x Fewer Parameters through Modality Linear Representation-Steering
Jinhe Bi
|
Yujun Wang
|
Haokun Chen
|
Xun Xiao
|
Artur Hecker
|
Volker Tresp
|
Yunpu Ma
Multimodal Large Language Models (MLLMs) enhance visual tasks by integrating visual representations into large language models (LLMs). The textual modality, inherited from LLMs, enables instruction following and in-context learning, while the visual modality boosts downstream task performance through rich semantic content, spatial information, and grounding capabilities. These modalities work synergistically across various visual tasks. Our research reveals a persistent imbalance between these modalities, with text often dominating output generation during visual instruction tuning, regardless of using full or parameter-efficient fine-tuning (PEFT). We found that re-balancing these modalities can significantly reduce trainable parameters, inspiring further optimization of visual instruction tuning. To this end, we introduce Modality Linear Representation-Steering (MoReS), which re-balances intrinsic modalities by steering visual representations through linear transformations in the visual subspace across each model layer. We validated our approach by developing LLaVA Steering, a suite of models using MoReS. Results show that LLaVA Steering requires, on average, 500 times fewer trainable parameters than LoRA while maintaining comparable performance across three visual benchmarks and eight visual question-answering tasks. Finally, we introduce the LLaVA Steering Factory, a platform that enables rapid customization of MLLMs with a component-based architecture, seamlessly integrating state-of-the-art models and evaluating intrinsic modality imbalance. This open-source project facilitates a deeper understanding of MLLMs within the research community.
pdf
bib
abs
Efficient Long Context Language Model Retrieval with Compression
Minju Seo
|
Jinheon Baek
|
Seongyun Lee
|
Sung Ju Hwang
Long Context Language Models (LCLMs) have emerged as a new paradigm to perform Information Retrieval (IR), which enables the direct ingestion and retrieval of information by processing an entire corpus in a single context, showcasing the potential to surpass traditional sparse and dense retrieval methods. However, processing a large number of passages in context for retrieval is computationally expensive, and handling their representations during inference further exacerbates the processing time; thus, we aim to make LCLM retrieval more efficient and potentially more effective with passage compression. Specifically, we propose a new compression approach tailored for LCLM retrieval, which is trained to maximize the retrieval performance while minimizing the length of the compressed passages. To accomplish this, we generate synthetic data, where compressed passages are automatically created and labeled as chosen or rejected according to their retrieval success for a given query, and we train the proposed Compression model for Long context Retrieval (CoLoR) with this data via preference optimization while adding a length regularization loss on top of it to enforce brevity. Through extensive experiments on 9 datasets, we show that CoLoR improves the retrieval performance by 6% while compressing the in-context size by a factor of 1.91. Our code is available at: https://github.com/going-doer/CoLoR.
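The combination of preference optimization with a length regularizer described in this abstract can be sketched as a single scalar objective. The abstract does not state which preference objective is used or how length enters the loss, so the logistic preference term, the linear length penalty, and the parameters `beta` and `lam` below are assumptions for illustration only.

```python
import math


def preference_with_length_penalty(logp_chosen, logp_rejected,
                                   compressed_len, beta=1.0, lam=0.01):
    """Toy objective: logistic preference loss plus a length penalty.

    logp_chosen / logp_rejected: log-probabilities the compressor assigns to
    the chosen and rejected compressed passages (hypothetical inputs).
    compressed_len: token count of the generated compression.
    """
    margin = beta * (logp_chosen - logp_rejected)
    # -log(sigmoid(margin)), written as log(1 + exp(-margin)).
    preference_loss = math.log1p(math.exp(-margin))
    return preference_loss + lam * compressed_len


# Preferring the chosen passage lowers the loss; longer outputs raise it.
good = preference_with_length_penalty(-1.0, -3.0, compressed_len=20)
bad = preference_with_length_penalty(-3.0, -1.0, compressed_len=20)
long_output = preference_with_length_penalty(-1.0, -3.0, compressed_len=50)
```

The weight `lam` trades retrieval success against brevity; the released code linked above is the authoritative reference for the actual objective.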
pdf
bib
abs
Ontology-Guided Reverse Thinking Makes Large Language Models Stronger on Knowledge Graph Question Answering
Runxuan Liu
|
Luobei Luobei
|
Jiaqi Li
|
Baoxin Wang
|
Ming Liu
|
Dayong Wu
|
Shijin Wang
|
Bing Qin
Large language models (LLMs) have shown remarkable capabilities in natural language processing. However, in knowledge graph question answering (KGQA) tasks, answering questions that require multi-hop reasoning remains a challenge. Existing methods rely on entity vector matching, but the purpose of the question is abstract and difficult to match with specific entities. As a result, it is difficult to establish reasoning paths to the purpose, which leads to information loss and redundancy. To address this issue, inspired by human reverse thinking, we propose Ontology-Guided Reverse Thinking (ORT), a novel framework that constructs reasoning paths from purposes back to conditions. ORT operates in three key phases: (1) using an LLM to extract purpose labels and condition labels, (2) constructing label reasoning paths based on the KG ontology, and (3) using the label reasoning paths to guide knowledge retrieval. Experiments on the WebQSP and CWQ datasets show that ORT achieves state-of-the-art performance and significantly enhances the capability of LLMs for KGQA.
pdf
bib
abs
Towards Omni-RAG: Comprehensive Retrieval-Augmented Generation for Large Language Models in Medical Applications
Zhe Chen
|
Yusheng Liao
|
Shuyang Jiang
|
Pingjie Wang
|
YiQiu Guo
|
Yanfeng Wang
|
Yu Wang
Large language models hold promise for addressing medical challenges, such as medical diagnosis reasoning, research knowledge acquisition, clinical decision-making, and consumer health inquiry support. However, they often generate hallucinations due to limited medical knowledge. Incorporating external knowledge is therefore critical, which necessitates multi-source knowledge acquisition. We address this challenge by framing it as a source planning problem, which is to formulate context-appropriate queries tailored to the attributes of diverse sources. Existing approaches either overlook source planning or fail to achieve it effectively due to misalignment between the model’s expectation of the sources and their actual content. To bridge this gap, we present MedOmniKB, a repository comprising multigenre and multi-structured medical knowledge sources. Leveraging these sources, we propose the Source Planning Optimisation method, which enhances multi-source utilisation. Our approach involves enabling an expert model to explore and evaluate potential plans while training a smaller model to learn source alignment. Experimental results demonstrate that our method substantially improves multi-source planning performance, enabling the optimised small model to achieve state-of-the-art results in leveraging diverse medical knowledge sources.
pdf
bib
abs
Predicting Turn-Taking and Backchannel in Human-Machine Conversations Using Linguistic, Acoustic, and Visual Signals
Yuxin Lin
|
Yinglin Zheng
|
Ming Zeng
|
Wangzheng Shi
This paper addresses the gap in predicting turn-taking and backchannel actions in human-machine conversations using multi-modal signals (linguistic, acoustic, and visual). To overcome the limitation of existing datasets, we propose an automatic data collection pipeline that allows us to collect and annotate over 210 hours of human conversation videos. From this, we construct a Multi-Modal Face-to-Face (MM-F2F) human conversation dataset, including over 1.5M words and corresponding turn-taking and backchannel annotations from approximately 20M frames. Additionally, we present an end-to-end framework that predicts the probability of turn-taking and backchannel actions from multi-modal signals. The proposed model emphasizes the interrelation between modalities and supports any combination of text, audio, and video inputs, making it adaptable to a variety of realistic scenarios. Our experiments show that our approach achieves state-of-the-art performance on turn-taking and backchannel prediction tasks, achieving a 10% increase in F1-score on turn-taking and a 33% increase on backchannel prediction. Our dataset and code are publicly available online to facilitate subsequent research.
pdf
bib
abs
A New Formulation of Zipf’s Meaning-Frequency Law through Contextual Diversity
Ryo Nagata
|
Kumiko Tanaka-Ishii
This paper proposes formulating Zipf’s meaning-frequency law, the power law between word frequency and the number of meanings, as a relationship between word frequency and contextual diversity. The proposed formulation quantifies meaning counts as contextual diversity, which is based on the directions of contextualized word vectors obtained from a Language Model (LM). This formulation gives a new interpretation to the law and also enables us to examine it for a wider variety of words and corpora than previous studies have explored. In addition, this paper shows that the law becomes unobservable when the size of the LM used is small and that autoregressive LMs require many more parameters than masked LMs for the law to be observable.
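A power law between word frequency and contextual diversity appears as a straight line in log-log space, so its exponent can be estimated with an ordinary least-squares fit. The sketch below uses synthetic data with an arbitrary exponent of 0.5 chosen only to check the fit; it does not reproduce the paper's diversity measure or its results.

```python
import math


def fit_power_law_exponent(frequencies, diversities):
    """Least-squares slope of log(diversity) against log(frequency)."""
    xs = [math.log(f) for f in frequencies]
    ys = [math.log(d) for d in diversities]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    return covariance / variance


# Synthetic check: diversity = frequency ** 0.5 should give a slope of 0.5.
synthetic_freqs = [10, 100, 1000, 10000]
synthetic_divs = [f ** 0.5 for f in synthetic_freqs]
exponent = fit_power_law_exponent(synthetic_freqs, synthetic_divs)
```

On real data, the diversity values would come from contextualized word vectors, and a poor linear fit in log-log space would correspond to the law being unobservable.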
pdf
bib
abs
The Mirage of Model Editing: Revisiting Evaluation in the Wild
Wanli Yang
|
Fei Sun
|
Jiajun Tan
|
Xinyu Ma
|
Qi Cao
|
Dawei Yin
|
Huawei Shen
|
Xueqi Cheng
Despite near-perfect results reported in the literature, the effectiveness of model editing in real-world applications remains unclear. To bridge this gap, we introduce QAEdit, a new benchmark aligned with widely used question answering (QA) datasets, and WILD, a task-agnostic evaluation framework designed to better reflect real-world usage of model editing. Our single editing experiments show that current editing methods perform substantially worse than previously reported (38.5% vs. 96.8%). We demonstrate that this gap stems from issues in the synthetic evaluation practices of prior work. Among them, the most severe is the use of teacher forcing during testing, which leaks both the content and the length of the ground truth, leading to overestimated performance. Furthermore, we simulate practical deployment by sequential editing, revealing that current approaches fail drastically with only 1000 edits. This work calls for a shift in model editing research toward rigorous evaluation and the development of robust, scalable methods that can reliably update knowledge in LLMs for real-world use.
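The leak caused by teacher forcing can be seen in a toy setting. The lookup-table "model", its vocabulary, and the target answer below are hypothetical and exist only to contrast the two evaluation modes; they are unrelated to QAEdit or to any real editing method.

```python
# Hypothetical "edited model": predicts the next token from the last token only.
NEXT = {
    "<s>": "a", "the": "Eiffel", "Eiffel": "Tower", "Tower": "</s>",
    "a": "tower", "tower": "in", "in": "Paris", "Paris": "</s>",
}


def next_token(prefix):
    return NEXT.get(prefix[-1], "</s>")


def teacher_forced_accuracy(target):
    """Token accuracy when the gold prefix is fed back at every step."""
    prefix, correct = ["<s>"], 0
    for gold in target:  # evaluation length is fixed by the ground truth
        if next_token(prefix) == gold:
            correct += 1
        prefix.append(gold)  # gold token is used even after a wrong prediction
    return correct / len(target)


def free_generation(max_new_tokens=10):
    """Decode from the model's own outputs until it emits the stop token."""
    prefix, output = ["<s>"], []
    for _ in range(max_new_tokens):
        token = next_token(prefix)
        if token == "</s>":
            break
        output.append(token)
        prefix.append(token)
    return output


target = ["the", "Eiffel", "Tower"]
tf_accuracy = teacher_forced_accuracy(target)
generated = free_generation()
```

Here the teacher-forced score is 2/3 because the gold prefix repairs the first mistake and the gold length caps the evaluation, while free generation yields `["a", "tower", "in", "Paris"]`, which does not match the target at all.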
pdf
bib
abs
LAQuer: Localized Attribution Queries in Content-grounded Generation
Eran Hirsch
|
Aviv Slobodkin
|
David Wan
|
Elias Stengel-Eskin
|
Mohit Bansal
|
Ido Dagan
Grounded text generation models often produce content that deviates from their source material, requiring user verification to ensure accuracy. Existing attribution methods associate entire sentences with source documents, which can be overwhelming for users seeking to fact-check specific claims. In contrast, existing sub-sentence attribution methods may be more precise but fail to align with users’ interests. In light of these limitations, we introduce Localized Attribution Queries (LAQuer), a new task that localizes selected spans of generated output to their corresponding source spans, allowing fine-grained and user-directed attribution. We compare two approaches for the LAQuer task, including prompting large language models (LLMs) and leveraging LLM internal representations. We then explore a modeling framework that extends existing attributed text generation methods to LAQuer. We evaluate this framework across two grounded text generation tasks: Multi-document Summarization (MDS) and Long-form Question Answering (LFQA). Our findings show that LAQuer methods significantly reduce the length of the attributed text. Our contributions include: (1) proposing the LAQuer task to enhance attribution usability, (2) suggesting a modeling framework and benchmarking multiple baselines, and (3) proposing a new evaluation setting to promote future research on localized attribution in content-grounded generation.
pdf
bib
abs
EPO: Explicit Policy Optimization for Strategic Reasoning in LLMs via Reinforcement Learning
Xiaoqian Liu
|
Ke Wang
|
Yongbin Li
|
Yuchuan Wu
|
Wentao Ma
|
Aobo Kong
|
Fei Huang
|
Jianbin Jiao
|
Junge Zhang
Large Language Models (LLMs) have shown impressive reasoning capabilities in well-defined problems with clear solutions, such as mathematics and coding. However, they still struggle with complex real-world scenarios like business negotiations, which require strategic reasoning, an ability to navigate dynamic environments and align long-term goals amidst uncertainty. Existing methods for strategic reasoning face challenges in adaptability, scalability, and transferring strategies to new contexts. To address these issues, we propose explicit policy optimization (EPO) for strategic reasoning, featuring an LLM that provides strategies in an open-ended action space and can be plugged into arbitrary LLM agents to motivate goal-directed behavior. To improve adaptability and policy transferability, we train the strategic reasoning model via multi-turn reinforcement learning (RL), utilizing process rewards and iterative self-play. Experiments across social and physical domains demonstrate EPO’s ability to achieve long-term goal alignment through enhanced strategic reasoning, achieving state-of-the-art performance on social dialogue and web navigation tasks. Our findings reveal various collaborative reasoning mechanisms emergent in EPO and its effectiveness in generating novel strategies, underscoring its potential for strategic reasoning in real-world applications. Code and data are available at https://github.com/lxqpku/EPO.
pdf
bib
abs
DCG-SQL: Enhancing In-Context Learning for Text-to-SQL with Deep Contextual Schema Link Graph
Jihyung Lee
|
Jin-Seop Lee
|
Jaehoon Lee
|
YunSeok Choi
|
Jee-Hyong Lee
Text-to-SQL, which translates a natural language question into an SQL query, has advanced with in-context learning of Large Language Models (LLMs). However, existing methods show little improvement in performance compared to randomly chosen demonstrations, and significant performance drops when smaller LLMs (e.g., Llama 3.1-8B) are used. This indicates that these methods heavily rely on the intrinsic capabilities of hyper-scaled LLMs, rather than effectively retrieving useful demonstrations. In this paper, we propose a novel approach for effectively retrieving demonstrations and generating SQL queries. We construct a Deep Contextual Schema Link Graph, which contains key information and semantic relationship between a question and its database schema items. This graph-based structure enables effective representation of Text-to-SQL samples and retrieval of useful demonstrations for in-context learning. Experimental results on the Spider benchmark demonstrate the effectiveness of our approach, showing consistent improvements in SQL generation performance and efficiency across both hyper-scaled LLMs and small LLMs. The code is available at https://github.com/jjklle/DCG-SQL.
pdf
bib
abs
PreP-OCR: A Complete Pipeline for Document Image Restoration and Enhanced OCR Accuracy
Shuhao Guan
|
Moule Lin
|
Cheng Xu
|
Xinyi Liu
|
Jinman Zhao
|
Jiexin Fan
|
Qi Xu
|
Derek Greene
This paper introduces PreP-OCR, a two-stage pipeline that combines document image restoration with semantic-aware post-OCR correction to enhance both visual clarity and textual consistency, thereby improving text extraction from degraded historical documents. First, we synthesize document-image pairs from plaintext, rendering them with diverse fonts and layouts and then applying a randomly ordered set of degradation operations. An image restoration model is trained on this synthetic data, using multi-directional patch extraction and fusion to process large images. Second, a ByT5 post-OCR model, fine-tuned on synthetic historical text pairs, addresses remaining OCR errors. Detailed experiments on 13,831 pages of real historical documents in English, French, and Spanish show that the PreP-OCR pipeline reduces character error rates by 63.9-70.3% compared to OCR on raw images. Our pipeline demonstrates the potential of integrating image restoration with linguistic error correction for digitizing historical archives.
pdf
bib
abs
Digest the Knowledge: Large Language Models empowered Message Passing for Knowledge Graph Question Answering
Junhong Wan
|
Tao Yu
|
Kunyu Jiang
|
Yao Fu
|
Weihao Jiang
|
Jiang Zhu
Despite their success, large language models (LLMs) suffer from a notorious hallucination issue. To introduce external knowledge stored in knowledge graphs (KGs), existing methods use paths as the medium to represent the graph information that is sent to LLMs. However, paths contain only limited graph structure information and are unorganized, with redundant keywords appearing in sequence, which makes them difficult for LLMs to digest. We aim to find a suitable medium that captures the essence of structural knowledge in KGs. Inspired by Neural Message Passing in Graph Neural Networks, we propose Language Message Passing (LMP), which first learns a concise facts graph by iteratively aggregating neighbor entities and transforming them into semantic facts, and then performs Topological Readout, which encodes the graph structure information into multi-level lists of texts to augment LLMs. Our method serves as a brand-new framework that brings a new perspective to KG-enhanced LLMs, and also offers human-level semantic explainability with significant performance improvements over existing methods on all 5 knowledge graph question answering datasets. Code is available at https://github.com/wanjunhong0/LMP.
pdf
bib
abs
RecLM: Recommendation Instruction Tuning
Yangqin Jiang
|
Yuhao Yang
|
Lianghao Xia
|
Da Luo
|
Kangyi Lin
|
Chao Huang
Modern recommender systems aim to deeply understand users’ complex preferences through their past interactions. While deep collaborative filtering approaches using Graph Neural Networks (GNNs) excel at capturing user-item relationships, their effectiveness is limited when handling sparse data or zero-shot scenarios, primarily due to constraints in ID-based embedding functions. To address these challenges, we propose a model-agnostic recommendation instruction-tuning paradigm that seamlessly integrates large language models with collaborative filtering. Our proposed Recommendation Language Model (RecLM) enhances the capture of user preference diversity through a carefully designed reinforcement learning reward function that facilitates self-augmentation of language models. Comprehensive evaluations demonstrate significant advantages of our approach across various settings, and its plug-and-play compatibility with state-of-the-art recommender systems results in notable performance enhancements.
pdf
bib
abs
DS2-ABSA: Dual-Stream Data Synthesis with Label Refinement for Few-Shot Aspect-Based Sentiment Analysis
Hongling Xu
|
Yice Zhang
|
Qianlong Wang
|
Ruifeng Xu
Recently developed large language models (LLMs) have presented promising new avenues to address data scarcity in low-resource scenarios. In few-shot aspect-based sentiment analysis (ABSA), previous efforts have explored data augmentation techniques, which prompt LLMs to generate new samples by modifying existing ones. However, these methods fail to produce adequately diverse data, impairing their effectiveness. Besides, some studies apply in-context learning for ABSA by using specific instructions and a few selected examples as prompts. Though promising, LLMs often yield labels that deviate from task requirements. To overcome these limitations, we propose DS2-ABSA, a dual-stream data synthesis framework targeted for few-shot ABSA. It leverages LLMs to synthesize data from two complementary perspectives: key-point-driven and instance-driven, which effectively generate diverse and high-quality ABSA samples in low-resource settings. Furthermore, a label refinement module is integrated to improve the synthetic labels. Extensive experiments demonstrate that DS2-ABSA significantly outperforms previous few-shot ABSA solutions and other LLM-oriented data generation methods.
pdf
bib
abs
MISP-Meeting: A Real-World Dataset with Multimodal Cues for Long-form Meeting Transcription and Summarization
HangChen HangChen
|
Chao-Han Huck Yang
|
Jia-Chen Gu
|
Sabato Marco Siniscalchi
|
Jun Du
We introduce MISP-Meeting, a new real-world, multimodal dataset that covers subject-oriented long-form content. MISP-Meeting integrates information from speech, vision, and text modalities to facilitate automatic meeting transcription and summarization (AMTS). Challenging conditions in human meetings, including far-field speech recognition, audio-visual understanding, and long-term summarization, have been carefully evaluated. We benchmark state-of-the-art automatic speech recognition (ASR) and large language models (LLMs) on this dataset, enhanced with multimodal cues. Experiments demonstrate that incorporating multimodal cues, such as lip movements and visual focus of attention, significantly enhances transcription accuracy, reducing the character error rate (CER) from 36.60% to 20.27% via guided source separation (GSS), fine-tuning, and audio-visual fusion. Furthermore, our summarization analysis reveals a direct correlation between ASR quality and summary coherence, underscoring the importance of robust multimodal modeling. Our dataset and codebase will be released as open source.
pdf
bib
abs
Learning Together to Perform Better: Teaching Small-Scale LLMs to Collaborate via Preferential Rationale Tuning
Sohan Patnaik
|
Milan Aggarwal
|
Sumit Bhatia
|
Balaji Krishnamurthy
LLMs such as GPT-4 have shown a remarkable ability to solve complex questions by generating step-by-step rationales. Prior works have utilized this capability to improve smaller and cheaper LMs (say, with 7B parameters). However, various practical constraints, such as copyright and legal issues, owing to lack of transparency in the pre-training data of large (often closed) models, prevent their use in commercial settings. Little focus has been given to improving the innate reasoning ability of smaller models without distilling information from larger LLMs. To address this, we propose COLLATE, a trainable framework that tunes a (small) LLM to generate those outputs from a pool of diverse rationales that selectively improves the downstream task. COLLATE enforces multiple instances of the same LLM to exhibit distinct behavior and employs them to generate rationales to obtain diverse outputs. The LLM is then tuned via preference optimization to choose the candidate rationale which maximizes the likelihood of ground-truth answer. COLLATE outperforms several trainable and prompting baselines on 5 datasets across 3 domains - maths problem solving, natural language inference, and commonsense reasoning. We show the efficacy of COLLATE on LLMs from different model families across varying parameter scales (1B to 8B) and demonstrate the benefit of multiple rationale providers guided by the end task through ablations. Code is released here (https://github.com/Sohanpatnaik106/collate).
pdf
bib
abs
MolRAG: Unlocking the Power of Large Language Models for Molecular Property Prediction
Ziting Xian
|
Jiawei Gu
|
Lingbo Li
|
Shangsong Liang
Recent LLMs exhibit limited effectiveness on molecular property prediction tasks due to the semantic gap between molecular representations and natural language, as well as the lack of domain-specific knowledge. To address these challenges, we propose MolRAG, a Retrieval-Augmented Generation framework integrating Chain-of-Thought reasoning for molecular property prediction. MolRAG operates by retrieving structurally analogous molecules as contextual references to guide stepwise knowledge reasoning through chemical structure-property relationships. This dual mechanism synergizes molecular similarity analysis with structured inference, while generating human-interpretable rationales grounded in domain knowledge. Experimental results show MolRAG outperforms pre-trained LLMs on four datasets, and even matches supervised methods, achieving performance gains of 1.1%–45.7% over direct prediction approaches, demonstrating versatile effectiveness. Our code is available at https://github.com/AcaciaSin/MolRAG.
pdf
bib
abs
SkillAggregation: Reference-free LLM-Dependent Aggregation
Guangzhi Sun
|
Anmol Kagrecha
|
Potsawee Manakul
|
Phil Woodland
|
Mark Gales
Large Language Models (LLMs) are increasingly used to assess NLP tasks due to their ability to generate human-like judgments. Single LLMs were used initially; however, recent work suggests that using multiple LLMs as judges yields improved performance. An important step in exploiting multiple judgments is the combination stage, aggregation. Existing methods in NLP either assign equal weight to all LLM judgments or are designed for specific tasks such as hallucination detection. This work focuses on aggregating predictions from multiple systems where no reference labels are available. A new method called SkillAggregation is proposed, which learns to combine estimates from LLM judges without needing additional data or ground truth. It extends the Crowdlayer aggregation method, developed for image classification, to exploit the judge estimates during inference. The approach is compared to a range of standard aggregation methods on HaluEval-Dialogue, TruthfulQA and Chatbot Arena tasks. SkillAggregation outperforms Crowdlayer on all tasks, and yields the best performance over all approaches on the majority of tasks.
pdf
bib
abs
MasRouter: Learning to Route LLMs for Multi-Agent Systems
Yanwei Yue
|
Guibin Zhang
|
Boyang Liu
|
Guancheng Wan
|
Kun Wang
|
Dawei Cheng
|
Yiyan Qi
Multi-agent systems (MAS) powered by Large Language Models (LLMs) have been demonstrated to push the boundaries of LLM capabilities, yet they often incur significant costs and face challenges in dynamic LLM selection. Current LLM routing methods effectively reduce overhead in single-agent scenarios by customizing LLM selection for each query, but they overlook the critical decisions regarding collaboration modes and agent roles in MAS. In response to this challenge, we first introduce the problem of Multi-Agent System Routing (MASR), which integrates all components of MAS into a unified routing framework. Toward this goal, we propose MasRouter, the first high-performing, cost-effective, and inductive MASR solution. MasRouter employs collaboration mode determination, role allocation, and LLM routing through a cascaded controller network, progressively constructing a MAS that balances effectiveness and efficiency. Extensive experiments demonstrate that MasRouter is (1) high-performing, achieving a 1.8 improvement over the state-of-the-art method on MBPP; (2) economical, reducing overhead by up to 52.07 compared to SOTA methods on HumanEval; and (3) plug-and-play, seamlessly integrating with mainstream MAS frameworks, reducing overhead by 17.21 via customized routing.
pdf
bib
abs
Beyond Single Labels: Improving Conversational Recommendation through LLM-Powered Data Augmentation
Haozhe Xu
|
Xiaohua Wang
|
Changze Lv
|
Xiaoqing Zheng
Conversational recommender systems (CRSs) enhance recommendation quality by engaging users in multi-turn dialogues, capturing nuanced preferences through natural language interactions. However, these systems often face the false negative issue, where items that a user might like are incorrectly labeled as negative during training, leading to suboptimal recommendations. Expanding the label set through data augmentation presents an intuitive solution but faces the challenge of balancing two key aspects: ensuring semantic relevance and preserving the collaborative information inherent in CRS datasets. To address these issues, we propose a novel data augmentation framework that first leverages an LLM-based semantic retriever to identify diverse and semantically relevant items, which are then filtered by a relevance scorer to remove noisy candidates. Building on this, we introduce a two-stage training strategy balancing semantic relevance and collaborative information. Extensive experiments on two benchmark datasets and user simulators demonstrate significant and consistent performance improvements across various recommenders, highlighting the effectiveness of our approach in advancing CRS performance.
pdf
bib
abs
Beyond One-Size-Fits-All: Tailored Benchmarks for Efficient Evaluation
Peiwen Yuan
|
Yueqi Zhang
|
Shaoxiong Feng
|
Yiwei Li
|
Xinglin Wang
|
Jiayi Shi
|
Chuyi Tan
|
Boyuan Pan
|
Yao Hu
|
Kan Li
Evaluating models on large benchmarks can be very resource-intensive, especially during a period of rapid model evolution. Existing efficient evaluation methods estimate the performance of target models by testing them on a small, static coreset derived from the publicly available evaluation results of source models, which are separate from the target models. However, these approaches rely on the assumption that target models have high prediction consistency with source models, which doesn’t generalize well in practice. To fill this gap, we propose TailoredBench, a method that conducts customized evaluation tailored to each target model. Specifically, a Global-coreset is first constructed as a probe to identify the most consistent source models for each target model with an adaptive source model selection strategy. Afterwards, a scalable K-Medoids clustering algorithm is proposed to extend the Global-coreset to a tailored Native-coreset for each target model. Based on the predictions on the respective Native-coreset, we estimate the overall performance of target models with a calibrated estimation strategy. Comprehensive experiments on five benchmarks across over 300 models demonstrate that, compared to the best-performing baselines, TailoredBench achieves an average reduction of 31.4% in MAE of accuracy estimates under the same inference budgets, showcasing strong effectiveness and generalizability.
pdf
bib
abs
iQUEST: An Iterative Question-Guided Framework for Knowledge Base Question Answering
Shuai Wang
|
Yinan Yu
While Large Language Models (LLMs) excel at many natural language processing tasks, they often suffer from factual inaccuracies in knowledge-intensive scenarios. Integrating external knowledge resources, particularly knowledge graphs (KGs), provides a transparent and updatable foundation for more reliable reasoning. Knowledge Base Question Answering (KBQA), which queries and reasons over KGs, is central to this effort, especially for complex, multi-hop queries. However, multi-hop reasoning poses two key challenges: (1) maintaining coherent reasoning paths, and (2) avoiding prematurely discarding critical multi-hop connections. To address these issues, we introduce iQUEST, a question-guided KBQA framework that iteratively decomposes complex queries into simpler sub-questions, ensuring a structured and focused reasoning trajectory. Additionally, we integrate a Graph Neural Network (GNN) to look ahead and incorporate 2-hop neighbor information at each reasoning step. This dual approach strengthens the reasoning process, enabling the model to explore viable paths more effectively. Detailed experiments demonstrate the consistent improvement delivered by iQUEST across four benchmark datasets and four LLMs.
pdf
bib
abs
IRT-Router: Effective and Interpretable Multi-LLM Routing via Item Response Theory
Wei Song
|
Zhenya Huang
|
Cheng Cheng
|
Weibo Gao
|
Bihan Xu
|
GuanHao Zhao
|
Fei Wang
|
Runze Wu
Large language models (LLMs) have demonstrated exceptional performance across a wide range of natural language tasks. However, selecting the optimal LLM to respond to a user query often necessitates a delicate balance between performance and cost. While powerful models deliver better results, they come at a high cost, whereas smaller models are more cost-effective but less capable. To address this trade-off, we propose IRT-Router, a multi-LLM routing framework that efficiently routes user queries to the most suitable LLM. Inspired by Item Response Theory (IRT), a psychological measurement methodology, IRT-Router explicitly models the relationship between LLM capabilities and user query attributes. This not only enables accurate prediction of response performance but also provides interpretable insights, such as LLM abilities and query difficulty. Additionally, we design an online query warm-up technique based on semantic similarity, further enhancing the online generalization capability of IRT-Router. Extensive experiments on 20 LLMs and 12 datasets demonstrate that IRT-Router outperforms most baseline methods in terms of effectiveness and interpretability. Its superior performance in cold-start scenarios further confirms the reliability and practicality of IRT-Router in real-world applications. Code is available at https://github.com/Mercidaiha/IRT-Router.
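To make the Item Response Theory idea concrete, the sketch below uses the textbook two-parameter logistic model, in which the probability of a good response grows with the gap between a model's ability and a query's difficulty, and then picks the model with the best performance/cost trade-off. This is an illustration only: the scalar abilities, the normalized costs, the `tradeoff` weight, and the scoring rule are assumptions made for the example and are not taken from the IRT-Router paper or its code.

```python
import math


def predicted_success(ability: float, difficulty: float,
                      discrimination: float = 1.0) -> float:
    """Two-parameter logistic IRT: P(success) = sigmoid(a * (theta - b))."""
    return 1.0 / (1.0 + math.exp(-discrimination * (ability - difficulty)))


def route(models: dict, difficulty: float, tradeoff: float = 0.5) -> str:
    """Pick the model maximizing tradeoff * P(success) - (1 - tradeoff) * cost.

    `models` maps a model name to (ability, normalized cost in [0, 1]).
    The linear scoring rule is a hypothetical choice for this sketch.
    """
    def score(name: str) -> float:
        ability, cost = models[name]
        return (tradeoff * predicted_success(ability, difficulty)
                - (1.0 - tradeoff) * cost)

    return max(models, key=score)


# Hypothetical candidates: a cheap weak model and an expensive strong one.
candidates = {"small": (0.0, 0.1), "large": (2.0, 0.9)}
print(route(candidates, difficulty=-2.0, tradeoff=0.5))  # easy query
print(route(candidates, difficulty=3.0, tradeoff=0.9))   # hard query
```

With these toy numbers the easy query goes to the cheap model and the hard, quality-weighted query goes to the strong one, which is the trade-off the abstract describes.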
pdf
bib
abs
MLAS-LoRA: Language-Aware Parameters Detection and LoRA-Based Knowledge Transfer for Multilingual Machine Translation
Tianyu Dong
|
Bo Li
|
Jinsong Liu
|
Shaolin Zhu
|
Deyi Xiong
Large language models (LLMs) have achieved remarkable progress in multilingual machine translation (MT), demonstrating strong performance even with limited parallel data. However, effectively fine-tuning LLMs for MT is challenging due to parameter interference, which arises from the conflicting demands of different language pairs and the risk of overwriting pre-trained knowledge. To address this issue, we propose MLAS-LoRA, a novel multiple language-aware LoRA knowledge transfer framework. MLAS-LoRA efficiently adapts LLMs to MT by selectively transferring knowledge from a large teacher to a small student model. Our approach first evaluates the awareness of neurons and extracts linguistic knowledge in the teacher model to both the general MT task and specific language pairs. We then propose a multiple language-specific LoRA architecture to inject the extracted knowledge into the student model. During fine-tuning, only the parameters of the relevant language-general and language-specific LoRA modules are updated. Experimental results on diverse multilingual language pairs demonstrate that MLAS-LoRA significantly outperforms strong baselines by +1.7 BLEU on average, including standard fine-tuning and other parameter-efficient methods.
pdf
bib
abs
M2RC-EVAL: Massively Multilingual Repository-level Code Completion Evaluation
Jiaheng Liu
|
Ken Deng
|
Congnan Liu
|
Jian Yang
|
Shukai Liu
|
He Zhu
|
Peng Zhao
|
Linzheng Chai
|
Yanan Wu
|
JinKe JinKe
|
Ge Zhang
|
Zekun Moore Wang
|
Guoan Zhang
|
Yingshui Tan
|
Bangyu Xiang
|
Zhaoxiang Zhang
|
Wenbo Su
|
Bo Zheng
Repository-level code completion has drawn great attention in software engineering, and several benchmarks have been introduced. However, existing repository-level code completion benchmarks usually focus on a limited number of languages (<5), which cannot evaluate the general code intelligence abilities across different languages for existing code Large Language Models (LLMs). Besides, the existing benchmarks usually report overall average scores across different languages, ignoring the fine-grained abilities in different completion scenarios. Therefore, to facilitate the research of code LLMs in multilingual scenarios, we propose a massively multilingual repository-level code completion benchmark covering 18 programming languages (called M2RC-EVAL), with two types of fine-grained annotations (i.e., bucket-level and semantic-level) on different completion scenarios, which we obtain based on the parsed abstract syntax tree. Moreover, we also curate a massively multilingual instruction corpus, M2RC-INSTRUCT, to improve the repository-level code completion abilities of existing code LLMs. Comprehensive experimental results demonstrate the effectiveness of our M2RC-EVAL and M2RC-INSTRUCT.
pdf
bib
abs
Evaluating Design Decisions for Dual Encoder-based Entity Disambiguation
Susanna Rücker
|
Alan Akbik
Entity disambiguation (ED) is the task of linking mentions in text to corresponding entries in a knowledge base. Dual Encoders address this by embedding mentions and label candidates in a shared embedding space and applying a similarity metric to predict the correct label. In this work, we focus on evaluating key design decisions for Dual Encoder-based ED, such as its loss function, similarity metric, label verbalization format, and negative sampling strategy. We present the resulting model VerbalizED, a document-level Dual Encoder model that includes contextual label verbalizations and efficient hard negative sampling. Additionally, we explore an iterative prediction variant that aims to improve the disambiguation of challenging data points. To support our analysis, we first conduct comprehensive ablation experiments on specific design decisions using AIDA-Yago, followed by large-scale, multi-domain evaluation on the ZELDA benchmark.
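The dual-encoder prediction step described above can be sketched as follows: a mention embedding is compared against candidate label embeddings in a shared space, and the most similar label wins. The two-dimensional vectors and label names are made-up stand-ins for encoder outputs, and cosine similarity is only one of the similarity metrics the abstract says are under evaluation, so this is an illustrative assumption rather than the VerbalizED implementation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def disambiguate(mention_vec, label_vecs):
    """Return the candidate label whose embedding is closest to the mention."""
    return max(label_vecs, key=lambda name: cosine(mention_vec, label_vecs[name]))


# Hypothetical embeddings; a real system would obtain these from trained encoders.
labels = {
    "Paris (city)": [0.9, 0.2],
    "Paris Hilton": [0.1, 1.0],
}
mention = [1.0, 0.1]  # e.g. "Paris" in a sentence about French geography
print(disambiguate(mention, labels))
```

Swapping `cosine` for a dot product or a negative Euclidean distance changes only the scoring function, which is why the similarity metric can be studied as an isolated design decision.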
pdf
bib
abs
How to Compare Things Properly? A Study of Argument Relevance in Comparative Question Answering
Irina Nikishina
|
Saba Anwar
|
Nikolay Dolgov
|
Maria Manina
|
Daria Ignatenko
|
Artem Shelmanov
|
Chris Biemann
Comparative Question Answering (CQA) lies at the intersection of Question Answering, Argument Mining, and Summarization. It poses unique challenges due to the inherently subjective nature of many questions and the need to integrate diverse perspectives. Although the CQA task can be addressed using recently emerged instruction-following Large Language Models (LLMs), challenges such as hallucinations in their outputs and the lack of transparent argument provenance remain significant limitations. To address these challenges, we construct a manually curated dataset comprising arguments annotated with their relevance. These arguments are further used to answer comparative questions, enabling precise traceability and faithfulness. Furthermore, we define explicit criteria for an “ideal” comparison and introduce a benchmark for evaluating the outputs of various Retrieval-Augmented Generation (RAG) models with respect to argument relevance. All code and data are publicly released to support further research.
pdf
bib
abs
FinanceReasoning: Benchmarking Financial Numerical Reasoning More Credible, Comprehensive and Challenging
Zichen Tang
|
Haihong E
|
Ziyan Ma
|
Haoyang He
|
Jiacheng Liu
|
Zhongjun Yang
|
Zihua Rong
|
Rongjin Li
|
Kun Ji
|
Qing Huang
|
Xinyang Hu
|
Yang Liu
|
Qianhe Zheng
We introduce **FinanceReasoning**, a novel benchmark designed to evaluate the reasoning capabilities of large reasoning models (LRMs) in financial numerical reasoning problems. Compared to existing benchmarks, our work provides three key advancements. (1) **Credibility**: We update 15.6% of the questions from four public datasets, annotating 908 new questions with detailed Python solutions and rigorously refining evaluation standards. This enables an accurate assessment of the reasoning improvements of LRMs. (2) **Comprehensiveness**: FinanceReasoning covers 67.8% of financial concepts and formulas, significantly surpassing existing datasets. Additionally, we construct 3,133 Python-formatted functions, which enhances LRMs’ financial reasoning capabilities through refined knowledge (*e.g.*, 83.2% → 91.6% for GPT-4o). (3) **Challenge**: Models are required to apply multiple financial formulas for precise numerical reasoning on 238 *Hard* problems. The best-performing model (*i.e.*, OpenAI o1 with PoT) achieves 89.1% accuracy, yet LRMs still face challenges in numerical precision. We demonstrate that combining Reasoner and Programmer models can effectively enhance LRMs’ performance (*e.g.*, 83.2% → 87.8% for DeepSeek-R1). Our work paves the way for future research on evaluating and improving LRMs in domain-specific complex reasoning tasks.
pdf
bib
abs
Controllable Style Arithmetic with Language Models
Weiqi Wang
|
Wengang Zhou
|
Zongmeng Zhang
|
Jie Zhao
|
Houqiang Li
Language models have shown remarkable capabilities in text generation, but precisely controlling their linguistic style remains challenging. Existing methods either lack fine-grained control, require extensive computation, or introduce significant latency. We propose Style Arithmetic (SA), a novel parameter-space approach that first extracts style-specific representations by analyzing parameter differences between models trained on contrasting styles, then incorporates these representations into a base model with precise control over style intensity. Our experiments show that SA achieves three key capabilities: controllability for precise adjustment of styles, transferability for effective style transfer across tasks, and composability for simultaneous control of multiple style dimensions. Compared to alternative methods, SA offers superior effectiveness while achieving optimal computational efficiency. Our approach opens new possibilities for flexible and efficient style control in language models.
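The parameter-space idea in this abstract (take the difference between models trained on contrasting styles, then add it to a base model with a tunable strength) can be sketched in a few lines. The sketch below treats each model as a dictionary of scalar parameters standing in for weight tensors; the parameter names, values, and the simple additive rule are assumptions made for illustration, since the abstract does not give the exact extraction or merging formula.

```python
def style_vector(params_a: dict, params_b: dict) -> dict:
    """Difference between two models fine-tuned on contrasting styles."""
    return {name: params_a[name] - params_b[name] for name in params_a}


def apply_styles(base: dict, styles: list) -> dict:
    """Add each (vector, strength) pair to the base parameters.

    Several pairs can be passed at once to compose style dimensions.
    """
    merged = dict(base)
    for vector, strength in styles:
        for name, delta in vector.items():
            merged[name] += strength * delta
    return merged


# Hypothetical one-parameter "models".
base = {"w": 1.0}
formal = {"w": 1.5}
informal = {"w": 0.5}

formality = style_vector(formal, informal)      # {"w": 1.0}
print(apply_styles(base, [(formality, 0.5)]))   # half-strength formal style
print(apply_styles(base, [(formality, -0.5)]))  # negative strength pushes informal
```

The strength coefficient is what would give the continuous control over style intensity that the abstract calls controllability, and passing more than one pair corresponds to composability.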
pdf
bib
abs
Masks Can be Learned as an Alternative to Experts
Peiyu Liu
|
Tianwen Wei
|
Bo Zhu
|
Xin Zhao
|
Shuicheng Yan
In this work, we investigate how to sparsify a pre-trained dense large language model into a mixture-of-experts (MoE) architecture for faster inference. Our approach applies a mask matrix to the activations for each expert, constrained by L0 regularization to minimize the number of activated parameters. Starting with all parameters active, the model is progressively sparsified during training, ensuring minimal performance loss. This approach proves more efficient than one-shot sparsification techniques, which typically require significant resources for performance recovery. Moreover, our approach automatically identifies shared, token-specific, and inactive experts, allowing for more efficient allocation of computational resources. Through extensive experiments, we achieve up to 97% performance retention on downstream tasks with only 50% of the feed-forward parameters activated in dense models. Beyond enhancing inference efficiency, this strategy of sharing computational units among experts presents a valuable framework for designing more generalized and efficient MoE architectures, opening avenues for future advancements in expert-based models.
pdf
bib
abs
Program Synthesis Benchmark for Visual Programming in XLogoOnline Environment
Chao Wen
|
Jacqueline Staub
|
Adish Singla
Large language and multimodal models have shown remarkable success on various benchmarks focused on specific skills such as general-purpose programming, math word problem-solving, and visual question answering. However, it is unclear how well these models perform on tasks that require a combination of these skills. In this paper, we curate a novel program synthesis benchmark based on the real-world tasks in the XLogoOnline visual programming environment. Each task requires a combination of different skills such as spatial planning, basic programming, and logical reasoning. Our evaluation shows that current state-of-the-art models like GPT-4V and Llama3-70B struggle to solve these tasks, achieving only 20% and 2.35% success rates, respectively. Next, we develop a fine-tuning pipeline to boost the performance of models by leveraging a large-scale synthetic training dataset with over 80,000 tasks. Moreover, we showcase how emulator-driven feedback can be used to design a curriculum over training data distribution, through which a fine-tuned Llama3-8B drastically outperforms GPT-4V and Llama3-70B models. Finally, we provide an in-depth failure analysis to understand the limitations of different models. We will publicly release the benchmark for future research on program synthesis in visual programming.
pdf
bib
abs
Removal of Hallucination on Hallucination: Debate-Augmented RAG
Wentao Hu
|
Wengyu Zhang
|
Yiyang Jiang
|
Chen Jason Zhang
|
Xiaoyong Wei
|
Li Qing
Retrieval-Augmented Generation (RAG) enhances factual accuracy by integrating external knowledge, yet it introduces a critical issue: erroneous or biased retrieval can mislead generation, compounding hallucinations, a phenomenon we term Hallucination on Hallucination. To address this, we propose Debate-Augmented RAG (DRAG), a training-free framework that integrates Multi-Agent Debate (MAD) mechanisms into both retrieval and generation stages. In retrieval, DRAG employs structured debates among proponents, opponents, and judges to refine retrieval quality and ensure factual reliability. In generation, DRAG introduces asymmetric information roles and adversarial debates, enhancing reasoning robustness and mitigating factual inconsistencies. Evaluations across multiple tasks demonstrate that DRAG improves retrieval reliability, reduces RAG-induced hallucinations, and significantly enhances overall factual accuracy. Our code is available at https://github.com/Huenao/Debate-Augmented-RAG.
pdf
bib
abs
CodeDPO: Aligning Code Models with Self Generated and Verified Source Code
Kechi Zhang
|
Ge Li
|
Yihong Dong
|
Jingjing Xu
|
Jun Zhang
|
Jing Su
|
Yongfei Liu
|
Zhi Jin
Code generation models have shown significant potential for programming tasks. However, existing training methods like supervised fine-tuning face key limitations: they do not effectively teach models to prioritize correct over incorrect solutions in ambiguous situations, nor do they effectively optimize the runtime efficiency of the generated code. To address these challenges, we propose CodeDPO, a framework that integrates preference learning into code generation to improve two key code preference factors: code correctness and efficiency. CodeDPO employs a novel dataset construction method, utilizing a self-generation-and-validation mechanism that simultaneously generates and evaluates code and test cases. The underlying assumption is that test cases executable by multiple code snippets provide more reliable validation, and code that passes more tests is more likely to be correct. Through this self-validation process, our PageRank-inspired algorithm iteratively updates the ranking score of each code snippet, ultimately creating a code preference optimization dataset based on correctness and efficiency. CodeDPO is flexible and scalable, generating diverse preference optimization data without depending on powerful models such as GPT-4. Through comprehensive evaluations of five widely used benchmarks, CodeDPO demonstrates significant improvements in correctness and efficiency compared to existing methods. Our experiments prove that CodeDPO enhances the capabilities of LLMs in code generation and provides a robust foundation for conducting code preference optimization in more complex and challenging real-world scenarios.
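The mutual-reinforcement assumption stated above (tests passed by more trusted code are more reliable, and code passing more reliable tests is more likely correct) can be illustrated with a small iterative ranking. The update rule, damping factor, and normalization below are assumptions chosen for this sketch in the spirit of PageRank; the abstract does not specify the actual CodeDPO algorithm, and the pass matrix is made up.

```python
def rank_code_and_tests(passes, iterations=20, damping=0.85):
    """Iteratively score code snippets and tests from a pass matrix.

    passes[i][j] is True when code snippet i passes test case j.
    Scores are rescaled after every round so that their mean stays 1.
    """
    n_code, n_test = len(passes), len(passes[0])
    code = [1.0] * n_code
    test = [1.0] * n_test
    for _ in range(iterations):
        # A snippet gains credit from the current scores of the tests it passes.
        new_code = [(1 - damping) + damping * sum(
            test[j] for j in range(n_test) if passes[i][j]) for i in range(n_code)]
        # A test gains credit from the current scores of the snippets passing it.
        new_test = [(1 - damping) + damping * sum(
            code[i] for i in range(n_code) if passes[i][j]) for j in range(n_test)]
        code = [s * n_code / sum(new_code) for s in new_code]
        test = [s * n_test / sum(new_test) for s in new_test]
    return code, test


# Hypothetical results: snippet 0 passes every test, snippets 1 and 2 pass fewer.
pass_matrix = [
    [True, True, True],
    [True, True, False],
    [False, False, True],
]
code_scores, test_scores = rank_code_and_tests(pass_matrix)
print(code_scores)
```

Preference pairs could then be formed by pairing a higher-scored snippet with a lower-scored one for the same prompt, which is one plausible way to turn such a ranking into preference optimization data.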
pdf
bib
abs
ProxAnn: Use-Oriented Evaluations of Topic Models and Document Clustering
Alexander Miserlis Hoyle
|
Lorena Calvo-Bartolomé
|
Jordan Lee Boyd-Graber
|
Philip Resnik
Topic models and document-clustering evaluations either use automated metrics that align poorly with human preferences, or require expert labels that are intractable to scale. We design a scalable human evaluation protocol and a corresponding automated approximation that reflect practitioners’ real-world usage of models. Annotators—or an LLM-based proxy—review text items assigned to a topic or cluster, infer a category for the group, then apply that category to other documents. Using this protocol, we collect extensive crowdworker annotations of outputs from a diverse set of topic models on two datasets. We then use these annotations to validate automated proxies, finding that the best LLM proxy is statistically indistinguishable from a human annotator and can therefore serve as a reasonable substitute in automated evaluations.
pdf
bib
abs
BOOKWORLD: From Novels to Interactive Agent Societies for Story Creation
Yiting Ran
|
Xintao Wang
|
Tian Qiu
|
Jiaqing Liang
|
Yanghua Xiao
|
Deqing Yang
Recent advances in large language models (LLMs) have enabled social simulation through multi-agent systems. Prior efforts focus on agent societies created from scratch, assigning agents newly defined personas. However, simulating established fictional worlds and characters remains largely underexplored, despite its significant practical value. In this paper, we introduce BookWorld, a comprehensive system for constructing and simulating book-based multi-agent societies. BookWorld’s design covers comprehensive real-world intricacies, including diverse and dynamic characters, fictional worldviews, geographical constraints and changes, etc. BookWorld enables diverse applications including story generation, interactive games and social simulation, offering novel ways to extend and explore beloved fictional works. Through extensive experiments, we demonstrate that BookWorld generates creative, high-quality stories while maintaining fidelity to the source books, surpassing previous methods with a win rate of 75.36%. The code and demo of this paper can be found at the project page: https://bookworld2025.github.io/.
pdf
bib
abs
Quantifying Lexical Semantic Shift via Unbalanced Optimal Transport
Ryo Kishino
|
Hiroaki Yamagiwa
|
Ryo Nagata
|
Sho Yokoi
|
Hidetoshi Shimodaira
Lexical semantic change detection aims to identify shifts in word meanings over time. While existing methods using embeddings from a diachronic corpus pair estimate the degree of change for target words, they offer limited insight into changes at the level of individual usage instances. To address this, we apply Unbalanced Optimal Transport (UOT) to sets of contextualized word embeddings, capturing semantic change through the excess and deficit in the alignment between usage instances. In particular, we propose Sense Usage Shift (SUS), a measure that quantifies changes in the usage frequency of a word sense at each usage instance. By leveraging SUS, we demonstrate that several challenges in semantic change detection can be addressed in a unified manner, including quantifying instance-level semantic change and word-level tasks such as measuring the magnitude of semantic change and the broadening or narrowing of meaning.
pdf
bib
abs
Agentic Reward Modeling: Integrating Human Preferences with Verifiable Correctness Signals for Reliable Reward Systems
Hao Peng
|
Yunjia Qi
|
Xiaozhi Wang
|
Zijun Yao
|
Bin Xu
|
Lei Hou
|
Juanzi Li
Reward models (RMs) are crucial for the training and inference-time scaling of large language models (LLMs). However, existing reward models primarily focus on human preferences, neglecting verifiable correctness signals which have shown strong potential in training LLMs. In this paper, we propose agentic reward modeling, a reward system that combines reward models with verifiable correctness signals from different aspects to provide reliable rewards. We empirically implement a reward agent, named RewardAgent, that combines human preference rewards with two verifiable signals: factuality and instruction following, to provide more reliable rewards. We conduct comprehensive experiments on existing reward model benchmarks and inference-time best-of-n searches on real-world downstream tasks. RewardAgent significantly outperforms vanilla reward models, demonstrating its effectiveness. We further construct training preference pairs using RewardAgent and train an LLM with the DPO objective, achieving superior performance on various NLP benchmarks compared to conventional reward models. Our code is publicly released to facilitate further research.
pdf
bib
abs
Adaptive and Robust Translation from Natural Language to Multi-model Query Languages
Gengyuan Shi
|
Chaokun Wang
|
Liu Yabin
|
Jiawei Ren
Multi-model databases and polystore systems are increasingly studied for managing multi-model data holistically. As their primary interface, multi-model query languages (MMQLs) often exhibit complex grammars, highlighting the need for effective Text-to-MMQL translation methods. Despite advances in natural language translation, no effective solutions for Text-to-MMQL exist. To address this gap, we formally define the Text-to-MMQL task and present the first Text-to-MMQL dataset involving three representative MMQLs. We propose an adaptive Text-to-MMQL framework that includes both a schema embedding module for capturing multi-model schema information and an MMQL representation strategy to generate concise intermediate query formats with error correction in generated queries. Experimental results show that the proposed framework achieves over a 9% accuracy improvement over our adapted baseline methods.
pdf
bib
abs
SAKE: Steering Activations for Knowledge Editing
Marco Scialanga
|
Thibault Laugel
|
Vincent Grari
|
Marcin Detyniecki
As Large Language Models have been shown to memorize real-world facts, the need to update this knowledge in a controlled and efficient manner arises. Designed with these constraints in mind, Knowledge Editing (KE) approaches propose to alter specific facts in pretrained models. However, they have been shown to suffer from several limitations, including their lack of contextual robustness and their failure to generalize to logical implications related to the fact. To overcome these issues, we propose SAKE, a steering activation method that models a fact to be edited as a distribution rather than a single prompt. Leveraging Optimal Transport, SAKE alters the LLM behavior over a whole fact-related distribution, defined as paraphrases and logical implications. Several numerical experiments demonstrate the effectiveness of this method: SAKE is thus able to perform more robust edits than its existing counterparts.
pdf
bib
abs
Middle-Layer Representation Alignment for Cross-Lingual Transfer in Fine-Tuned LLMs
Danni Liu
|
Jan Niehues
While large language models demonstrate remarkable capabilities at task-specific applications through fine-tuning, extending these benefits across diverse languages is essential for broad accessibility. However, effective cross-lingual transfer is hindered by LLM performance gaps across languages and the scarcity of fine-tuning data in many languages. Through analysis of LLM internal representations from over 1,000 language pairs, we discover that middle layers exhibit the strongest potential for cross-lingual alignment. Building on this finding, we propose a middle-layer alignment objective integrated into task-specific training. Our experiments on slot filling, machine translation, and structured text generation show consistent improvements in cross-lingual transfer, especially to lower-resource languages. The method is robust to the choice of alignment languages and generalizes to languages unseen during alignment. Furthermore, we show that separately trained alignment modules can be merged with existing task-specific modules, improving cross-lingual capabilities without full re-training. The code is provided in the supplementary materials.
pdf
bib
abs
Can External Validation Tools Improve Annotation Quality for LLM-as-a-Judge?
Arduin Findeis
|
Floris Weers
|
Guoli Yin
|
Ke Ye
|
Ruoming Pang
|
Tom Gunter
Pairwise preferences over model responses are widely collected to evaluate and provide feedback to large language models (LLMs). Given two alternative model responses to the same input, a human or AI annotator selects the “better” response. This approach can provide feedback for domains where other hard-coded metrics are difficult to obtain (e.g., chat response quality), thereby helping model evaluation or training. However, for some domains high-quality pairwise comparisons can be tricky to obtain, from both AI and humans. For example, for responses with many factual statements, annotators may disproportionately weigh writing quality rather than underlying facts. In this work, we explore augmenting standard AI annotator systems with additional tools to improve performance on three challenging response domains: long-form factual, math and code tasks. We propose a tool-using agentic system to provide higher quality feedback on these domains. Our system uses web-search and code execution to ground itself based on external validation, independent of the LLM’s internal knowledge and biases. We provide extensive experimental results evaluating our method across the three targeted response domains as well as general annotation tasks, using RewardBench (incl. AlpacaEval and LLMBar), RewardMath, as well as three new datasets for domains with saturated pre-existing datasets. Our results indicate that external tools can indeed improve performance in many, but not all, cases. More generally, our experiments highlight the sensitivity of performance to simple parameters (e.g., prompt) and the need for improved (non-saturated) annotator benchmarks. We share our code at https://github.com/apple/ml-agent-evaluator.
pdf
bib
abs
One for All: Update Parameterized Knowledge Across Multiple Models with Once Edit
Weitao Ma
|
Xiyuan Du
|
Xiaocheng Feng
|
Lei Huang
|
Yichong Huang
|
Huiyi Zhang
|
Xiaoliang Yang
|
Baohang Li
|
Xiachong Feng
|
Ting Liu
|
Bing Qin
Large language models (LLMs) encode vast world knowledge but struggle to stay up-to-date, often leading to errors and hallucinations. Knowledge editing offers an efficient alternative to retraining, enabling targeted modifications by updating specific model parameters. However, existing methods primarily focus on individual models, posing challenges in efficiently updating multiple models and adapting to new models. To address this, we propose OnceEdit, a novel ensemble-based approach that employs a plug-in model as the editing module, enabling stable knowledge updates across multiple models. Building on the model ensemble, OnceEdit introduces two key mechanisms to enhance its effectiveness. First, we introduce a dynamic weight mechanism through a weight token for distinguishing between edit-related and non-edit-related instances, ensuring the appropriate utilization of knowledge from integrated models. Second, we incorporate an ensemble enhancement mechanism to mitigate the excessive reliance on the central model inherent in the model ensemble technique, making it more suitable for knowledge editing. Extensive experiments on diverse LLMs demonstrate that OnceEdit consistently outperforms existing methods while achieving superior editing efficiency. Further analysis confirms its adaptability and stability in multi-model editing scenarios.
pdf
bib
abs
VLMInferSlow: Evaluating the Efficiency Robustness of Large Vision-Language Models as a Service
Xiasi Wang
|
Tianliang Yao
|
Simin Chen
|
Runqi Wang
|
Lei Ye
|
Kuofeng Gao
|
Yi Huang
|
Yuan Yao
Vision-Language Models (VLMs) have demonstrated great potential in real-world applications. While existing research primarily focuses on improving their accuracy, the efficiency remains underexplored. Given the real-time demands of many applications and the high inference overhead of VLMs, efficiency robustness is a critical issue. However, previous studies evaluate efficiency robustness under unrealistic assumptions, requiring access to the model architecture and parameters—an impractical scenario in ML-as-a-service settings, where VLMs are deployed via inference APIs. To address this gap, we propose VLMInferSlow, a novel approach for evaluating VLM efficiency robustness in a realistic black-box setting. VLMInferSlow incorporates fine-grained efficiency modeling tailored to VLM inference and leverages zero-order optimization to search for adversarial examples. Experimental results show that VLMInferSlow generates adversarial images with imperceptible perturbations, increasing the computational cost by up to 128.47%. We hope this research raises the community’s awareness about the efficiency robustness of VLMs.
pdf
bib
abs
The Alternative Annotator Test for LLM-as-a-Judge: How to Statistically Justify Replacing Human Annotators with LLMs
Nitay Calderon
|
Roi Reichart
|
Rotem Dror
The “LLM-as-an-annotator” and “LLM-as-a-judge” paradigms employ Large Language Models (LLMs) as annotators, judges, and evaluators in tasks traditionally performed by humans. LLM annotations are widely used, not only in NLP research but also in fields like medicine, psychology, and social science. Despite their role in shaping study results and insights, there is no standard or rigorous procedure to determine whether LLMs can replace human annotators. In this paper, we propose a novel statistical procedure, the Alternative Annotator Test (alt-test), that requires only a modest subset of annotated examples to justify using LLM annotations. Additionally, we introduce a versatile and interpretable measure for comparing LLM annotators and judges. To demonstrate our procedure, we curated a diverse collection of ten datasets, consisting of language and vision-language tasks, and conducted experiments with six LLMs and four prompting techniques. Our results show that LLMs can sometimes replace humans, with closed-source LLMs (such as GPT-4o) outperforming the open-source LLMs we examine, and that prompting techniques yield judges of varying quality. We hope this study encourages more rigorous and reliable practices.
pdf
bib
abs
CrisisTS: Coupling Social Media Textual Data and Meteorological Time Series for Urgency Classification
Romain Meunier
|
Farah Benamara
|
Véronique Moriceau
|
Zhongzheng Qiao
|
Savitha Ramasamy
This paper proposes CrisisTS, the first multimodal and multilingual dataset for urgency classification composed of benchmark crisis datasets from French and English social media about various expected (e.g., flood, storm) and sudden (e.g., earthquakes, explosions) crises that have been mapped with open source geocoded meteorological time series data. This mapping is based on a simple and effective strategy that allows for temporal and location alignment even in the absence of location mention in the text. A set of multimodal experiments has been conducted relying on transformers and LLMs to improve overall performance while ensuring model generalizability. Our results show that modality fusion outperforms text-only models.
pdf
bib
abs
How to Mitigate Overfitting in Weak-to-strong Generalization?
Junhao Shi
|
Qinyuan Cheng
|
Zhaoye Fei
|
Yining Zheng
|
Qipeng Guo
|
Xipeng Qiu
Aligning powerful AI models on tasks that surpass human evaluation capabilities is the central problem of **superalignment**. To address this problem, weak-to-strong generalization aims to elicit the capabilities of strong models through weak supervisors and ensure that the behavior of strong models aligns with the intentions of weak supervisors without unsafe behaviors such as deception. Although weak-to-strong generalization exhibits certain generalization capabilities, strong models exhibit significant overfitting in this setting: due to the strong fitting ability of strong models, erroneous labels from weak supervisors may lead to overfitting. In addition, simply filtering out incorrect labels may lead to a degeneration in question quality, resulting in a weak generalization ability of strong models on hard questions. To mitigate overfitting in weak-to-strong generalization, we propose a two-stage framework that simultaneously improves the quality of supervision signals and the quality of input questions. Experimental results in three series of large language models and two mathematical benchmarks demonstrate that our framework significantly improves PGR (Performance Gap Recovered) compared to naive weak-to-strong generalization, even achieving up to 100% PGR on some models.
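For readers unfamiliar with the PGR metric named in this abstract, the following is a minimal sketch of how Performance Gap Recovered is commonly computed in the weak-to-strong generalization literature: the fraction of the gap between the weak supervisor and the strong model's ceiling that the weakly supervised strong model closes. The function name, argument names, and accuracy values are illustrative assumptions, not taken from the paper.

```python
def performance_gap_recovered(weak: float, weak_to_strong: float, strong_ceiling: float) -> float:
    """Fraction of the weak-to-ceiling gap closed by the weakly supervised strong model.

    weak            -- score of the weak supervisor (hypothetical value)
    weak_to_strong  -- score of the strong model trained on weak labels
    strong_ceiling  -- score of the strong model trained on ground-truth labels
    """
    gap = strong_ceiling - weak
    if gap <= 0:
        # PGR is undefined when the ceiling does not exceed the weak supervisor.
        raise ValueError("strong_ceiling must be greater than weak")
    return (weak_to_strong - weak) / gap


# Illustrative numbers only: a strong model that reaches its ceiling has PGR 1.0 (100%).
print(performance_gap_recovered(weak=0.5, weak_to_strong=0.9, strong_ceiling=0.9))
```

Under this definition, a PGR of 100% means the strong model trained on weak supervision matches the same model trained on ground-truth labels.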
pdf
bib
abs
Com2: A Causal-Guided Benchmark for Exploring Complex Commonsense Reasoning in Large Language Models
Kai Xiong
|
Xiao Ding
|
Yixin Cao
|
Yuxiong Yan
|
Li Du
|
Yufei Zhang
|
Jinglong Gao
|
Jiaqian Liu
|
Bing Qin
|
Ting Liu
Large language models (LLMs) have mastered abundant simple and explicit commonsense knowledge through pre-training, enabling them to achieve human-like performance in simple commonsense reasoning. Nevertheless, LLMs struggle to reason with complex and implicit commonsense knowledge that is derived from simple ones (such as understanding the long-term effects of certain events), an aspect humans tend to focus on more. Existing works focus on complex tasks like math and code, while complex commonsense reasoning remains underexplored due to its uncertainty and lack of structure. To fill this gap and align with real-world concerns, we propose a benchmark Com2 focusing on complex commonsense reasoning. We first incorporate causal event graphs to serve as structured complex commonsense. Then we adopt causal theory (e.g., intervention) to modify the causal event graphs and obtain different scenarios that meet human concerns. Finally, an LLM is employed to synthesize examples with slow thinking, which is guided by the logical relationships in the modified causal graphs. Furthermore, we use detective stories to construct a more challenging subset. Experiments show that LLMs struggle in reasoning depth and breadth, while post-training and slow thinking can alleviate this. The code and data are available at https://github.com/Waste-Wood/Com2.
pdf
bib
abs
Dynamic Head Selection for Neural Lexicalized Constituency Parsing
Yang Hou
|
Zhenghua Li
Lexicalized parsing, which associates constituent nodes with lexical heads, has historically played a crucial role in constituency parsing by bridging constituency and dependency structures. Nevertheless, with the advent of neural networks, lexicalized structures have generally been neglected in favor of unlexicalized, span-based methods. In this paper, we revisit lexicalized parsing and propose a novel latent lexicalization framework that dynamically infers lexical heads during training without relying on predefined head-finding rules. Our method enables the model to learn lexical dependencies directly from data, offering greater adaptability across languages and datasets. Experiments on multiple treebanks demonstrate state-of-the-art or comparable performance. We also analyze the learned dependency structures, headword preferences, and linguistic biases.
pdf
bib
abs
My Words Imply Your Opinion: Reader Agent-Based Propagation Enhancement for Personalized Implicit Emotion Analysis
Jian Liao
|
Yu Feng
|
Yujin Zheng
|
Jun Zhao
|
Suge Wang
|
JianXing Zheng
The subtlety of emotional expressions makes implicit emotion analysis (IEA) particularly sensitive to user-specific characteristics. Current studies personalize emotion analysis by focusing on the author but neglect the impact of the intended reader on implicit emotional feedback. In this paper, we introduce Personalized IEA (PIEA) and present the RAPPIE model, which addresses subjective variability by incorporating reader feedback. In particular, (1) we create reader agents based on large language models to simulate reader feedback, overcoming the “spiral of silence” effect and the incompleteness of real reader-reaction data. (2) We develop role-aware multi-view graph learning to model the interactive emotion propagation process in scenarios with sparse reader information. (3) We construct two new PIEA datasets covering English and Chinese social media with detailed user metadata, addressing the text-centric limitation of existing datasets. Extensive experiments show that RAPPIE significantly outperforms state-of-the-art baselines, demonstrating the value of incorporating reader feedback in PIEA.
pdf
bib
abs
EvolveBench: A Comprehensive Benchmark for Assessing Temporal Awareness in LLMs on Evolving Knowledge
Zhiyuan Zhu
|
Yusheng Liao
|
Zhe Chen
|
Yuhao Wang
|
Yunfeng Guan
|
Yanfeng Wang
|
Yu Wang
Large language models (LLMs) are trained on extensive historical corpora, but their ability to understand time and maintain temporal awareness of time-evolving factual knowledge remains limited. Previous studies often neglect the critical aspect of utilizing knowledge from various sources. To address this gap, we introduce EvolveBench, a comprehensive benchmark that evaluates temporal competence along five key dimensions: Cognition, which examines the ability to recall and contextualize historical facts; Awareness, which tests LLMs’ awareness of temporal misalignment between external inputs and the temporal context of a query; Trustworthiness, which assesses whether models can identify and appropriately refuse queries based on invalid timestamps; Understanding, which focuses on interpreting both explicit dates and implicit historical markers; and Reasoning, which evaluates the capacity to analyze temporal relationships and draw accurate inferences. Evaluating 15 widely used LLMs on EvolveBench shows that GPT-4o achieves the highest average EM score of 79.36, while the open-source Llama3.1-70B demonstrates notable strength in handling temporally misaligned contexts with an average score of 72.47. Despite these advances, all models still struggle with handling temporally misaligned contexts. Our code and dataset are available at https://github.com/zzysjtuiwct/EvolveBench.
pdf
bib
abs
Enabling LLM Knowledge Analysis via Extensive Materialization
Yujia Hu
|
Tuan-Phong Nguyen
|
Shrestha Ghosh
|
Simon Razniewski
Large language models (LLMs) have substantially advanced NLP and AI, and next to their ability to perform a wide range of procedural tasks, a major success factor is their internalized factual knowledge. Since Petroni et al. (2019), analyzing this knowledge has gained attention. However, most approaches investigate one question at a time via modest-sized pre-defined samples, introducing an “availability bias” (Tversky and Kahneman, 1973) that prevents the analysis of knowledge (or beliefs) of LLMs beyond the experimenter’s predisposition. To address this challenge, we propose a novel methodology to comprehensively materialize an LLM’s factual knowledge through recursive querying and result consolidation. Our approach is a milestone for LLM research, for the first time providing constructive insights into the scope and structure of LLM knowledge (or beliefs). As a prototype, we extract a knowledge base (KB) comprising 101 million relational triples for over 2.9 million entities from GPT-4o-mini. We use GPTKB to exemplarily analyze GPT-4o-mini’s factual knowledge in terms of scale, accuracy, bias, cutoff and consistency at the same time. Our resource is accessible at https://gptkb.org.
pdf
bib
abs
Rhythm Controllable and Efficient Zero-Shot Voice Conversion via Shortcut Flow Matching
Jialong Zuo
|
Shengpeng Ji
|
Minghui Fang
|
Mingze Li
|
Ziyue Jiang
|
Xize Cheng
|
Xiaoda Yang
|
Chen Feiyang
|
Xinyu Duan
|
Zhou Zhao
Zero-Shot Voice Conversion (VC) aims to transform the source speaker’s timbre into an arbitrary unseen one while retaining speech content. Most prior work focuses on preserving the source’s prosody, while fine-grained timbre information may leak through prosody, and transferring target prosody to synthesized speech is rarely studied. In light of this, we propose R-VC, a rhythm-controllable and efficient zero-shot voice conversion model. R-VC employs data perturbation techniques and discretizes source speech into HuBERT content tokens, eliminating much content-irrelevant information. By leveraging a Mask Generative Transformer for in-context duration modeling, our model adapts the linguistic content duration to the desired target speaking style, facilitating the transfer of the target speaker’s rhythm. Furthermore, R-VC introduces a powerful Diffusion Transformer (DiT) with shortcut flow matching during training, conditioning the network not only on the current noise level but also on the desired step size, enabling high timbre similarity and quality speech generation in fewer sampling steps, even in just two, thus minimizing latency. Experimental results show that R-VC achieves comparable speaker similarity to state-of-the-art VC methods with a smaller dataset, and surpasses them in terms of speech naturalness, intelligibility and style transfer performance.
pdf
bib
abs
Llama See, Llama Do: A Mechanistic Perspective on Contextual Entrainment and Distraction in LLMs
Jingcheng Niu
|
Xingdi Yuan
|
Tong Wang
|
Hamidreza Saghir
|
Amir H. Abdi
We observe a novel phenomenon, *contextual entrainment*, across a wide range of language models (LMs) and prompt settings, providing a new mechanistic perspective on how LMs become distracted by “irrelevant” contextual information in the input prompt. Specifically, LMs assign significantly higher logits (or probabilities) to any tokens that have previously appeared in the context prompt, even for random tokens. This suggests that contextual entrainment is a mechanistic phenomenon, occurring independently of the relevance or semantic relation of the tokens to the question or the rest of the sentence. We find statistically significant evidence that the magnitude of contextual entrainment is influenced by semantic factors. Counterfactual prompts have a greater effect compared to factual ones, suggesting that while contextual entrainment is a mechanistic phenomenon, it is modulated by semantic factors. We hypothesise that there is a circuit of attention heads (the *entrainment heads*) that corresponds to the contextual entrainment phenomenon. Using a novel entrainment head discovery method based on differentiable masking, we identify these heads across various settings. When we “turn off” these heads, i.e., set their outputs to zero, the effect of contextual entrainment is significantly attenuated, causing the model to generate output that capitulates to what it would produce if no distracting context were provided. Our discovery of contextual entrainment, along with our investigation into LM distraction via the entrainment heads, marks a key step towards the mechanistic analysis and mitigation of the distraction problem.
pdf
bib
abs
CritiQ: Mining Data Quality Criteria from Human Preferences
Honglin Guo
|
Kai Lv
|
Qipeng Guo
|
Tianyi Liang
|
Zhiheng Xi
|
Demin Song
|
Qiuyinzhe Zhang
|
Yu Sun
|
Kai Chen
|
Xipeng Qiu
|
Tao Gui
Language models depend heavily on high-quality data for optimal performance. Existing approaches rely on manually designed heuristics, the perplexity of existing models, training classifiers, or careful prompt engineering, which require significant expert experience and human annotation effort while introducing biases. We introduce CritiQ, a novel data selection method that automatically mines criteria from human preferences for data quality with only ~30 human-annotated pairs and performs efficient data selection. The main component, CritiQ Flow, employs a manager agent to evolve quality criteria and worker agents to make pairwise judgments. We build a knowledge base that extracts quality criteria from previous work to boost CritiQ Flow. Compared to perplexity- and classifier-based methods, verbal criteria are more interpretable and have greater reusable value. After deriving the criteria, we train the CritiQ Scorer to give quality scores and perform efficient data selection. We demonstrate the effectiveness of our method in the code, math, and logic domains, achieving high accuracy on human-annotated test sets. To validate the quality of the selected data, we continually train Llama 3.2 models and observe improved performance on downstream tasks compared to uniform sampling. Ablation studies validate the benefits of the knowledge base and the reflection process. We analyze how criteria evolve and the effectiveness of majority voting.
pdf
bib
abs
Theoretical Guarantees for Minimum Bayes Risk Decoding
Yuki Ichihara
|
Yuu Jinnai
|
Kaito Ariu
|
Tetsuro Morimura
|
Eiji Uchibe
Minimum Bayes Risk (MBR) decoding optimizes output selection by maximizing the expected utility value of an underlying human distribution. While prior work has shown the effectiveness of MBR decoding through empirical evaluation, few studies have analytically investigated why the method is effective. As a result of our analysis, we show that, given the size $n$ of the reference hypothesis set used in computation, MBR decoding approaches the optimal solution with high probability at a rate of $\mathcal{O}(n^{-1/2})$, under certain assumptions, even though the language space $\mathcal{Y}$ is significantly larger ($|\mathcal{Y}| \gg n$). This result helps to theoretically explain the strong performance observed in several prior empirical studies on MBR decoding. In addition, we provide the performance gap for maximum-a-posteriori (MAP) decoding and compare it to MBR decoding. The result of this paper indicates that MBR decoding tends to converge to the optimal solution faster than MAP decoding in several cases.
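As a rough illustration of the procedure this abstract analyzes, the sketch below implements sample-based MBR decoding: each candidate is scored by its average utility against a reference hypothesis set of size n, and the highest-scoring candidate is returned. The `token_overlap` utility (Jaccard similarity over whitespace tokens) and the toy sample strings are assumptions chosen for brevity; practical systems typically use a learned or n-gram-based metric, and nothing here reproduces the paper's analysis.

```python
from typing import Callable, Sequence


def token_overlap(hyp: str, ref: str) -> float:
    """Toy utility (an assumption): Jaccard similarity over whitespace tokens."""
    h, r = set(hyp.split()), set(ref.split())
    if not h and not r:
        return 1.0
    return len(h & r) / len(h | r)


def mbr_decode(
    candidates: Sequence[str],
    references: Sequence[str],
    utility: Callable[[str, str], float],
) -> str:
    """Return the candidate with the highest average utility over the reference set."""

    def expected_utility(hyp: str) -> float:
        # Monte Carlo estimate of expected utility using n = len(references) samples.
        return sum(utility(hyp, ref) for ref in references) / len(references)

    return max(candidates, key=expected_utility)


# Hypothetical model samples, used both as candidates and as the reference set.
samples = ["the cat sat", "the cat sat down", "the cat", "a dog ran"]
print(mbr_decode(samples, samples, token_overlap))
```

Reusing the same samples as candidates and references is a common simplification; the two sets can also be drawn separately.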
pdf
bib
abs
Mutual-Taught for Co-adapting Policy and Reward Models
Tianyuan Shi
|
Canbin Huang
|
Fanqi Wan
|
Longguang Zhong
|
Ziyi Yang
|
Weizhou Shen
|
Xiaojun Quan
|
Ming Yan
During the preference optimization of large language models (LLMs), distribution shifts may arise between newly generated model samples and the data used to train the reward model (RM). This shift reduces the efficacy of the RM, which in turn negatively impacts the performance of the policy model (PM). To address this challenge, we propose Mutual-Taught, a self-training method that iteratively improves both the PM and RM without requiring additional human annotation. Our approach mirrors the expectation-maximization (EM) algorithm. In the E-step, the PM is updated using feedback from the current RM, guiding the PM toward a better approximation of the latent optimal preference distribution. In the M-step, we update the RM by constructing training data from the outputs of the PM before and after the E-step update. This process ensures that the RM adapts to the evolving policy distribution. Experimental results demonstrate that this iterative approach leads to consistent improvements in both models. Specifically, our 8B policy model, LLaMA-3-8B-Instruct-MT, achieves a length-controlled win rate of 54.1% on AlpacaEval-2, while our 8B reward model, FsfairX-LLaMA3-RM-MT, performs on par with GPT-4o-2024-08-06 on RewardBench.
pdf
bib
abs
Enhancing Cross-Lingual Transfer through Reversible Transliteration: A Huffman-Based Approach for Low-Resource Languages
Wenhao Zhuang
|
Yuan Sun
|
Xiaobing Zhao
As large language models (LLMs) are trained on increasingly diverse and extensive multilingual corpora, they demonstrate cross-lingual transfer capabilities. However, these capabilities often fail to effectively extend to low-resource languages, particularly those utilizing non-Latin scripts. While transliterating low-resource languages into Latin script presents a natural solution, a comprehensive framework for integrating transliteration into LLM training and deployment is currently lacking. Taking a pragmatic approach, this paper innovatively combines character transliteration with Huffman coding to design a complete transliteration framework. Our proposed framework offers the following advantages: 1) Compression: Reduces storage requirements for low-resource language content, achieving up to 50% reduction in file size and 50-80% reduction in token count. 2) Accuracy: Guarantees 100% lossless conversion from transliterated text back to the source language. 3) Efficiency: Eliminates the need for vocabulary expansion for low-resource languages, improving training and inference efficiency. 4) Scalability: The framework can be extended to other low-resource languages. We validate the effectiveness of our framework across multiple downstream tasks, including text classification, machine reading comprehension, and machine translation. Experimental results demonstrate that our method significantly enhances the model’s capability to process low-resource languages while maintaining performance on high-resource languages. Our data and code are publicly available at https://github.com/CMLI-NLP/HuffmanTranslit.
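To make the lossless round-trip property mentioned in this abstract concrete, here is a minimal sketch of textbook character-level Huffman coding, in which frequent characters receive shorter prefix-free codes and decoding recovers the input exactly. It uses a binary alphabet and a made-up sample string for simplicity; the paper's actual mapping onto Latin-script characters is not reproduced here, and all function names are illustrative assumptions.

```python
import heapq
from collections import Counter
from typing import Dict


def build_huffman_codes(text: str) -> Dict[str, str]:
    """Build a prefix-free binary code in which frequent characters get shorter codes."""
    freq = Counter(text)
    if not freq:
        return {}
    if len(freq) == 1:
        # A single distinct symbol still needs a non-empty code.
        return {next(iter(freq)): "0"}
    # Heap entries: (weight, unique tie-breaker, {symbol: code built so far}).
    heap = [(w, i, {sym: ""}) for i, (sym, w) in enumerate(freq.items())]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        w1, _, left = heapq.heappop(heap)
        w2, _, right = heapq.heappop(heap)
        # Merging two subtrees prepends one bit to every code inside them.
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (w1 + w2, next_id, merged))
        next_id += 1
    return heap[0][2]


def encode(text: str, codes: Dict[str, str]) -> str:
    return "".join(codes[ch] for ch in text)


def decode(bits: str, codes: Dict[str, str]) -> str:
    inverse = {code: sym for sym, code in codes.items()}
    out, buf = [], ""
    for bit in bits:
        buf += bit
        if buf in inverse:  # Prefix-freeness makes the first match unambiguous.
            out.append(inverse[buf])
            buf = ""
    return "".join(out)


sample = "abracadabra"  # placeholder text, not data from the paper
codes = build_huffman_codes(sample)
assert decode(encode(sample, codes), codes) == sample  # lossless round trip
```

Because no code is a prefix of another, the decoder never needs separators, which is what makes exact recovery possible.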
pdf
bib
abs
Unmasking Style Sensitivity: A Causal Analysis of Bias Evaluation Instability in Large Language Models
Jiaxu Zhao
|
Meng Fang
|
Kun Zhang
|
Mykola Pechenizkiy
Natural language processing applications are increasingly prevalent, but social biases in their outputs remain a critical challenge. While various bias evaluation methods have been proposed, these assessments show unexpected instability when input texts undergo minor stylistic changes. This paper conducts a comprehensive analysis of how different style transformations impact bias evaluation results across multiple language models and bias types using causal inference techniques. Our findings reveal that formality transformations significantly affect bias scores, with informal style showing substantial bias reductions (up to 8.33% in LLaMA-2-13B). We identify appearance bias, sexual orientation bias, and religious bias as most susceptible to style changes, with variations exceeding 20%. Larger models demonstrate greater sensitivity to stylistic variations, with bias measurements fluctuating up to 3.1% more than in smaller models. These results highlight critical limitations in current bias evaluation methods and emphasize the need for reliable and fair assessments of language models.
pdf
bib
abs
MockConf: A Student Interpretation Dataset: Analysis, Word- and Span-level Alignment and Baselines
Dávid Javorský
|
Ondřej Bojar
|
François Yvon
In simultaneous interpreting, an interpreter renders the speech into another language with a very short lag, much sooner than sentences are finished. In order to understand and later reproduce this dynamic and complex task automatically, we need specialized datasets and tools for analysis, monitoring, and evaluation, such as parallel speech corpora, and tools for their automatic annotation. Existing parallel corpora of translated texts and associated alignment algorithms hardly fill this gap, as they fail to model long-range interactions between speech segments or specific types of divergences (e.g. shortening, simplification, functional generalization) between the original and interpreted speeches. In this work, we develop and explore MockConf, a student interpretation dataset that was collected from Mock Conferences run as part of the students’ curriculum. This dataset contains 7 hours of recordings in 5 European languages, transcribed and aligned at the level of spans and words. We further implement and release InterAlign, a modern web-based annotation tool for parallel word and span annotations on long inputs, suitable for aligning simultaneous interpreting. We propose metrics for the evaluation and a baseline for automatic alignment. Dataset and tools will be released to the community.
pdf
bib
abs
BMIKE-53: Investigating Cross-Lingual Knowledge Editing with In-Context Learning
Ercong Nie
|
Bo Shao
|
Mingyang Wang
|
Zifeng Ding
|
Helmut Schmid
|
Hinrich Schuetze
This paper introduces BMIKE-53, a comprehensive benchmark for cross-lingual in-context knowledge editing (IKE), spanning 53 languages and three KE datasets: zsRE, CounterFact, and WikiFactDiff. Cross-lingual KE, which requires knowledge edited in one language to generalize across diverse languages while preserving unrelated knowledge, remains underexplored. To address this, we systematically evaluate IKE under zero-shot, one-shot, and few-shot setups, including tailored metric-specific demonstrations. Our findings reveal that model scale and demonstration alignment critically govern cross-lingual editing efficacy, with larger models and tailored demonstrations significantly improving performance. Linguistic properties, particularly script type, strongly influence outcomes, with non-Latin languages underperforming due to issues like language confusion.
pdf
bib
abs
What Matters in Evaluating Book-Length Stories? A Systematic Study of Long Story Evaluation
Dingyi Yang
|
Qin Jin
In this work, we conduct systematic research in a challenging area: the automatic evaluation of book-length stories (>100K tokens). Our study focuses on two key questions: (1) understanding which evaluation aspects matter most to readers, and (2) exploring effective methods for evaluating lengthy stories. We introduce the first large-scale benchmark, **LongStoryEval**, comprising 600 newly published books with an average length of 121K tokens (maximum 397K). Each book includes its average rating and multiple reader reviews, presented as critiques organized by evaluation aspects. By analyzing all user-mentioned aspects, we propose an *evaluation criteria structure* and conduct experiments to identify the most significant aspects among the 8 top-level criteria. For evaluation methods, we compare the effectiveness of three types: *aggregation-based*, *incremental-updated*, and *summary-based* evaluations. Our findings reveal that aggregation- and summary-based evaluations perform better, with the former excelling in detail assessment and the latter offering greater efficiency. Building on these insights, we further propose **NovelCritique**, an 8B model that leverages the efficient summary-based framework to review and score stories across specified aspects. NovelCritique outperforms commercial models like GPT-4o in aligning with human evaluations. All our datasets and code will be released to foster further research.
pdf
bib
abs
PROPER: A Progressive Learning Framework for Personalized Large Language Models with Group-Level Adaptation
Linhai Zhang
|
Jialong Wu
|
Deyu Zhou
|
Yulan He
Personalized large language models (LLMs) aim to tailor their outputs to user preferences. Recent advances in parameter-efficient fine-tuning (PEFT) methods have highlighted the effectiveness of adapting population-level LLMs to personalized LLMs by fine-tuning user-specific parameters with user history. However, user data is typically sparse, making it challenging to adapt LLMs to specific user patterns. To address this challenge, we propose PROgressive PERsonalization (PROPER), a novel progressive learning framework inspired by meso-level theory in social science. PROPER bridges population-level and user-level models by grouping users based on preferences and adapting LLMs in stages. It combines a Mixture-of-Experts (MoE) structure with Low Ranked Adaptation (LoRA), using a user-aware router to assign users to appropriate groups automatically. Additionally, a LoRA-aware router is proposed to facilitate the integration of individual user LoRAs with the group-level LoRA. Experimental results show that PROPER significantly outperforms SOTA models across multiple tasks, demonstrating the effectiveness of our approach.
pdf
bib
abs
Enhancing Event-centric News Cluster Summarization via Data Sharpening and Localization Insights
Longyin Zhang
|
Bowei Zou
|
AiTi Aw
This paper tackles the challenges of clustering news articles by main events (MEs) and summarizing these clusters, focusing on diverse languages and localized contexts. Our approach consists of four key contributions. First, we investigate the role of dynamic clustering and the integration of various ME references, including event attributions extracted by language models (LMs), in enhancing event-centric clustering. Second, we propose a data-sharpening framework that optimizes the balance between information volume and entropy in input texts, thereby optimizing generated summaries on multiple indicators. Third, we fine-tune LMs with local news articles for cross-lingual temporal question-answering and text summarization, achieving notable improvements in capturing localized contexts. Lastly, we present the first cross-lingual dataset and comprehensive evaluation metrics tailored for the event-centric news cluster summarization pipeline. Our findings enhance the understanding of news summarization across N-gram, event-level coverage, and faithfulness, providing new insights into leveraging LMs for large-scale cross-lingual and localized news analysis.
pdf
bib
abs
MMBoundary: Advancing MLLM Knowledge Boundary Awareness through Reasoning Step Confidence Calibration
Zhitao He
|
Sandeep Polisetty
|
Zhiyuan Fan
|
Yuchen Huang
|
Shujin Wu
|
Yi R. Fung
In recent years, multimodal large language models (MLLMs) have made significant progress but continue to face inherent challenges in multimodal reasoning, which requires multi-level (e.g., perception, reasoning) and multi-granular (e.g., multi-step reasoning chain) advanced inferencing. Prior work on estimating model confidence tends to focus on the overall response for training and calibration, but fails to assess confidence in each reasoning step, leading to undesirable hallucination snowballing. In this work, we present MMBoundary, a novel framework that advances the knowledge boundary awareness of MLLMs through reasoning step confidence calibration. To achieve this, we propose to incorporate complementary textual and cross-modal self-rewarding signals to estimate confidence at each step of the MLLM reasoning process. In addition to supervised fine-tuning of the MLLM on this set of self-rewarding confidence estimation signals for initial confidence expression warm-up, we introduce a reinforcement learning stage with multiple reward functions for further aligning model knowledge and calibrating confidence at each reasoning step, enhancing reasoning chain self-correction. Empirical results show that MMBoundary significantly outperforms existing methods across diverse domain datasets and metrics, achieving an average of 7.5% reduction in multimodal confidence calibration errors and up to 8.3% improvement in task performance.
pdf
bib
abs
LIFBench: Evaluating the Instruction Following Performance and Stability of Large Language Models in Long-Context Scenarios
Xiaodong Wu
|
Minhao Wang
|
Yichen Liu
|
Xiaoming Shi
|
He Yan
|
Lu Xiangju
|
Junmin Zhu
|
Wei Zhang
As Large Language Models (LLMs) evolve in natural language processing (NLP), their ability to stably follow instructions in long-context inputs has become critical for real-world applications. However, existing benchmarks seldom focus on instruction-following in long-context scenarios or stability on different inputs. To bridge this gap, we introduce LIFBench, a scalable dataset designed to evaluate LLMs’ instruction-following capabilities and stability across long contexts. LIFBench comprises three long-context scenarios and eleven diverse tasks, featuring 2,766 instructions generated through an automated expansion method across three dimensions: length, expression, and variables. For evaluation, we propose LIFEval, a rubric-based assessment method that enables precise, automated scoring of complex LLM responses without reliance on LLM-assisted assessments or human judgment. This method allows for a comprehensive analysis of model performance and stability from multiple perspectives. We conduct detailed experiments on 20 prominent LLMs across six length intervals. Our work contributes LIFBench and LIFEval as robust tools for assessing LLM performance in complex and long-context settings, offering valuable insights to guide future advancements in LLM development.
pdf
bib
abs
Aligning Large Language Models to Follow Instructions and Hallucinate Less via Effective Data Filtering
Shuzheng Si
|
Haozhe Zhao
|
Gang Chen
|
Cheng Gao
|
Yuzhuo Bai
|
Zhitong Wang
|
Kaikai An
|
Kangyang Luo
|
Chen Qian
|
Fanchao Qi
|
Baobao Chang
|
Maosong Sun
Training LLMs on data containing unfamiliar knowledge during the instruction tuning stage can encourage hallucinations. To address this challenge, we introduce NOVA, a novel framework designed to identify high-quality data that aligns well with the LLM’s learned knowledge to reduce hallucinations. NOVA includes Internal Consistency Probing (ICP) and Semantic Equivalence Identification (SEI) to measure how familiar the LLM is with instruction data. Specifically, ICP evaluates the LLM’s understanding of the given instruction by calculating the tailored consistency among multiple self-generated responses. SEI further assesses the familiarity of the LLM with the target response by comparing it to the generated responses, using the proposed semantic clustering and well-designed voting strategy. Finally, to ensure the quality of selected samples, we introduce an expert-aligned reward model, considering characteristics beyond just familiarity. By considering data quality and avoiding unfamiliar data, we can utilize the selected data to effectively align LLMs to follow instructions and hallucinate less. Experiments show that NOVA significantly reduces hallucinations while maintaining a competitive ability to follow instructions.
pdf
bib
abs
One-Shot is Enough: Consolidating Multi-Turn Attacks into Efficient Single-Turn Prompts for LLMs
Junwoo Ha
|
Hyunjun Kim
|
Sangyoon Yu
|
Haon Park
|
Ashkan Yousefpour
|
Yuna Park
|
Suhyun Kim
We introduce a novel framework for consolidating multi-turn adversarial “jailbreak” prompts into single-turn queries, significantly reducing the manual overhead required for adversarial testing of large language models (LLMs). While multi-turn human jailbreaks have been shown to yield high attack success rates (ASRs), they demand considerable human effort and time. Our proposed Multi-turn-to-Single-turn (M2S) methods (Hyphenize, Numberize, and Pythonize) systematically reformat multi-turn dialogues into structured single-turn prompts. Despite eliminating iterative back-and-forth interactions, these reformatted prompts preserve and often enhance adversarial potency: in extensive evaluations on the Multi-turn Human Jailbreak (MHJ) dataset, M2S methods yield ASRs ranging from 70.6% to 95.9% across various state-of-the-art LLMs. Remarkably, our single-turn prompts outperform the original multi-turn attacks by up to 17.5% in absolute ASR, while reducing token usage by more than half on average. Further analyses reveal that embedding malicious requests in enumerated or code-like structures exploits “contextual blindness,” undermining both native guardrails and external input-output safeguards. By consolidating multi-turn conversations into efficient single-turn prompts, our M2S framework provides a powerful tool for large-scale red-teaming and exposes critical vulnerabilities in contemporary LLM defenses. All code, data, and conversion prompts are available for reproducibility and further investigations: https://github.com/Junuha/M2S_DATA
pdf
bib
abs
RAEmoLLM: Retrieval Augmented LLMs for Cross-Domain Misinformation Detection Using In-Context Learning Based on Emotional Information
Zhiwei Liu
|
Kailai Yang
|
Qianqian Xie
|
Christine de Kock
|
Sophia Ananiadou
|
Eduard Hovy
Misinformation is prevalent in various fields such as education, politics, and health, causing significant harm to society. However, current methods for cross-domain misinformation detection rely on effort- and resource-intensive fine-tuning and complex model structures. With the outstanding performance of LLMs, many studies have employed them for misinformation detection. Unfortunately, they focus on in-domain tasks and do not incorporate significant sentiment and emotion features (which we jointly call affect). In this paper, we propose RAEmoLLM, the first retrieval-augmented generation (RAG) LLM framework to address cross-domain misinformation detection using in-context learning based on affective information. RAEmoLLM includes three modules. (1) In the index construction module, we apply an emotional LLM to obtain affective embeddings from all domains to construct a retrieval database. (2) The retrieval module uses the database to recommend top K examples (text-label pairs) from source domain data for target domain contents. (3) These examples are adopted as few-shot demonstrations for the inference module to process the target domain content. RAEmoLLM can effectively enhance the general performance of LLMs in cross-domain misinformation detection tasks through affect-based retrieval, without fine-tuning. We evaluate our framework on three misinformation benchmarks. Results show that RAEmoLLM achieves significant improvements compared to the other few-shot methods on three datasets, with the highest increases of 15.64%, 31.18%, and 15.73% respectively. This project is available at https://github.com/lzw108/RAEmoLLM.
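The retrieve-then-demonstrate flow described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the cosine-similarity ranking, the prompt layout, and the function names `top_k_examples` and `build_prompt` are assumptions made for illustration, and the embeddings are taken as given rather than produced by an emotional LLM.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def top_k_examples(query_vec, database, k=2):
    """Return the k (text, label) pairs whose embeddings are closest to query_vec.

    database is a list of (embedding, text, label) tuples; this layout is an
    assumption for the sketch, not a format taken from the paper.
    """
    ranked = sorted(database, key=lambda item: cosine(query_vec, item[0]), reverse=True)
    return [(text, label) for _, text, label in ranked[:k]]


def build_prompt(examples, target_text):
    """Format retrieved pairs as few-shot demonstrations followed by the target text."""
    blocks = [f"Text: {text}\nLabel: {label}" for text, label in examples]
    blocks.append(f"Text: {target_text}\nLabel:")
    return "\n\n".join(blocks)
```

In use, the source-domain items would be embedded once to build the database, and each target-domain item would be embedded, matched against it, and wrapped in a prompt for the LLM to complete.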
pdf
bib
abs
Task-Specific Information Decomposition for End-to-End Dense Video Captioning
Zhiyue Liu
|
Xinru Zhang
|
Jinyuan Liu
Dense video captioning aims to localize events within input videos and generate concise descriptive texts for each event. Advanced end-to-end methods require both tasks to share the same intermediate features that serve as event queries, thereby enabling the mutual promotion of two tasks. However, relying on shared queries limits the model’s ability to extract task-specific information, as event semantic perception and localization demand distinct perspectives on video understanding. To address this, we propose a decomposed dense video captioning framework that derives localization and captioning queries from event queries, enabling task-specific representations while maintaining inter-task collaboration. Considering the roles of different queries, we design a contrastive semantic optimization strategy that guides localization queries to focus on event-level visual features and captioning queries to align with textual semantics. Besides, only localization information is considered in existing methods for label assignment, failing to ensure the relevance of the selected queries to descriptions. We jointly consider localization and captioning losses to achieve a semantically balanced assignment process. Extensive experiments on the YouCook2 and ActivityNet Captions datasets demonstrate that our framework achieves state-of-the-art performance.
pdf
bib
abs
CalibraEval: Calibrating Prediction Distribution to Mitigate Selection Bias in LLMs-as-Judges
Haitao Li
|
Junjie Chen
|
Qingyao Ai
|
Zhumin Chu
|
Yujia Zhou
|
Qian Dong
|
Yiqun Liu
The use of large language models (LLMs) as automated evaluation tools to assess the quality of generated natural language, known as “LLMs-as-Judges”, has demonstrated promising capabilities and is rapidly gaining widespread attention. However, when applied to pairwise comparisons of candidate responses, LLM-based evaluators often exhibit selection bias. Specifically, their judgments may become inconsistent when the option positions or ID tokens are swapped, compromising the effectiveness and fairness of the evaluation result. To address this challenge, we introduce CalibraEval, a novel label-free method for mitigating selection bias during inference. Specifically, CalibraEval reformulates debiasing as an optimization task aimed at adjusting observed prediction distributions to align with unbiased prediction distributions. To solve this optimization problem, we propose a non-parametric order-preserving algorithm (NOA). This algorithm leverages the partial order relationships between model prediction distributions, thereby eliminating the need for explicit labels and precise mathematical function modeling. Empirical evaluations of LLMs in multiple representative benchmarks demonstrate that CalibraEval effectively mitigates selection bias and improves performance compared to existing debiasing methods. This work marks a step toward building more robust and unbiased automated evaluation frameworks, paving the way for improved reliability in AI-driven assessments. The code can be found at https://github.com/CSHaitao/CalibraEval.
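The selection bias described in this abstract (a judgment that flips when option positions are swapped) can be measured with a simple consistency check. The sketch below only quantifies the symptom; it does not implement CalibraEval or NOA. The `judge` callable, which returns "A" when the response in the first slot wins and "B" otherwise, is a hypothetical interface assumed for illustration.

```python
def swap_consistency(judge, pairs):
    """Fraction of pairs where the same response wins in both presentation orders.

    judge(first, second) must return "A" if the response shown first wins,
    and "B" if the response shown second wins (assumed interface).
    """
    consistent = 0
    for first, second in pairs:
        original = judge(first, second)
        swapped = judge(second, first)
        # Map each slot verdict back to the underlying response.
        winner_original = first if original == "A" else second
        winner_swapped = second if swapped == "A" else first
        consistent += winner_original == winner_swapped
    return consistent / len(pairs)
```

A judge that always picks the first slot scores 0.0 on this check, while a judge whose verdict depends only on the content of the two responses scores 1.0.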
pdf
bib
abs
Explaining Matters: Leveraging Definitions and Semantic Expansion for Sexism Detection
Sahrish Khan
|
Arshad Jhumka
|
Gabriele Pergola
The detection of sexism in online content remains an open problem, as harmful language disproportionately affects women and marginalized groups. While automated systems for sexism detection have been developed, they still face two key challenges: data sparsity and the nuanced nature of sexist language. Even in large, well-curated datasets like the Explainable Detection of Online Sexism (EDOS), severe class imbalance hinders model generalization. Additionally, the overlapping and ambiguous boundaries of fine-grained categories introduce substantial annotator disagreement, reflecting the difficulty of interpreting nuanced expressions of sexism. To address these challenges, we propose two prompt-based data augmentation techniques: Definition-based Data Augmentation (DDA), which leverages category-specific definitions to generate semantically-aligned synthetic examples, and Contextual Semantic Expansion (CSE), which targets systematic model errors by enriching examples with task-specific semantic features. To further improve reliability in fine-grained classification, we introduce an ensemble strategy that resolves prediction ties by aggregating complementary perspectives from multiple language models. Our experimental evaluation on the EDOS dataset demonstrates state-of-the-art performance across all tasks, with notable improvements of macro F1 by 1.5 points for binary classification (Task A) and 4.1 points for fine-grained classification (Task C).
pdf
bib
abs
Private Memorization Editing: Turning Memorization into a Defense to Strengthen Data Privacy in Large Language Models
Elena Sofia Ruzzetti
|
Giancarlo A. Xompero
|
Davide Venditti
|
Fabio Massimo Zanzotto
Large Language Models (LLMs) memorize, and thus, among huge amounts of uncontrolled data, may memorize Personally Identifiable Information (PII), which should not be stored and, consequently, not leaked. In this paper, we introduce Private Memorization Editing (PME), an approach for preventing private data leakage that turns an apparent limitation, that is, the LLMs’ memorization ability, into a powerful privacy defense strategy. While attacks against LLMs have been performed exploiting previous knowledge regarding their training data, our approach aims to exploit the same kind of knowledge in order to make a model more robust. We detect memorized PII and then mitigate its memorization by editing the model’s knowledge of its training data. We verify that our procedure does not affect the underlying language model while making it more robust against privacy Training Data Extraction attacks. We demonstrate that PME can effectively reduce the number of leaked PII in a number of configurations, in some cases even reducing the accuracy of the privacy attacks to zero.
pdf
bib
abs
PhysReason: A Comprehensive Benchmark towards Physics-Based Reasoning
Xinyu Zhang
|
Yuxuan Dong
|
Yanrui Wu
|
Jiaxing Huang
|
Chengyou Jia
|
Basura Fernando
|
Mike Zheng Shou
|
Lingling Zhang
|
Jun Liu
Large language models demonstrate remarkable capabilities across various domains, especially mathematics and logic reasoning. However, current evaluations overlook physics-based reasoning - a complex task requiring physics theorems and constraints. We present PhysReason, a 1,200-problem benchmark comprising knowledge-based (25%) and reasoning-based (75%) problems, where the latter are divided into three difficulty levels (easy, medium, hard). Notably, problems require an average of 8.1 solution steps, with hard requiring 15.6, reflecting the complexity of physics-based reasoning. We propose the Physics Solution Auto Scoring Framework, incorporating efficient answer-level and comprehensive step-level evaluations. Top-performing models like Deepseek-R1, Gemini-2.0-Flash-Thinking, and o3-mini-high achieve less than 60% on answer-level evaluation, with performance dropping from knowledge questions (75.11%) to hard problems (31.95%). Through step-level evaluation, we identified four key bottlenecks: Physics Theorem Application, Physics Process Understanding, Calculation, and Physics Condition Analysis. These findings position PhysReason as a novel and comprehensive benchmark for evaluating physics-based reasoning capabilities in large language models.
pdf
bib
abs
Does Time Have Its Place? Temporal Heads: Where Language Models Recall Time-specific Information
Yein Park
|
Chanwoong Yoon
|
Jungwoo Park
|
Minbyul Jeong
|
Jaewoo Kang
While the ability of language models to elicit facts has been widely investigated, how they handle temporally changing facts remains underexplored. We discover Temporal Heads, specific attention heads that primarily handle temporal knowledge, through circuit analysis. We confirm that these heads are present across multiple models, though their specific locations may vary, and their responses differ depending on the type of knowledge and its corresponding years. Disabling these heads degrades the model’s ability to recall time-specific knowledge while maintaining its general capabilities without compromising time-invariant and question-answering performance. Moreover, the heads are activated not only by numeric conditions (“In 2004”) but also by textual aliases (“In the year ...”), indicating that they encode a temporal dimension beyond simple numerical representation. Furthermore, we expand the potential of our findings by demonstrating how temporal knowledge can be edited by adjusting the values of these heads.
pdf
bib
abs
Velocitune: A Velocity-based Dynamic Domain Reweighting Method for Continual Pre-training
Zheheng Luo
|
Xin Zhang
|
Xiao Liu
|
Haoling Li
|
Yeyun Gong
|
Qi Chen
|
Peng Cheng
It is well known that a diverse corpus is critical for training large language models, which are typically constructed from a mixture of various domains. In general, previous efforts resort to either sampling training data from different domains with static proportions or dynamically adjusting these proportions during training to optimise pretraining performance. However, few methods have addressed the complexities of domain-adaptive continual pre-training. To fill this gap, we propose Velocitune, a novel framework that dynamically assesses learning velocity and adjusts data proportions accordingly, favouring slower learning domains while de-emphasising faster learning ones, which is guided by a scaling law to estimate the desired learning goal for each domain at a lower associated cost. To evaluate the effectiveness of Velocitune, we conduct experiments on a dataset focused on reasoning tasks with CodeLlama, as well as on a corpus of system commands using Llama3 and Mistral. Velocitune achieves performance gains in both math and code reasoning tasks and command-line generation benchmarks. Further analysis reveals that key factors driving Velocitune’s effectiveness include target estimation and data ordering.
pdf
bib
abs
Sheep’s Skin, Wolf’s Deeds: Are LLMs Ready for Metaphorical Implicit Hate Speech?
Jingjie Zeng
|
Liang Yang
|
Zekun Wang
|
Yuanyuan Sun
|
Hongfei Lin
Implicit hate speech has become a significant challenge for online platforms, as it often avoids detection by large language models (LLMs) due to its indirectly expressed hateful intent. This study identifies the limitations of LLMs in detecting implicit hate speech, particularly when disguised as seemingly harmless expressions in a rhetorical device. To address this challenge, we employ a Jailbreaking strategy and Energy-based Constrained Decoding techniques, and design a small model for measuring the energy of metaphorical rhetoric. This approach can lead to LLMs generating metaphorical implicit hate speech. Our research reveals that advanced LLMs, like GPT-4o, frequently misinterpret metaphorical implicit hate speech, and fail to prevent its propagation effectively. Even specialized models, like ShieldGemma and LlamaGuard, demonstrate inadequacies in blocking such content, often misclassifying it as harmless speech. This work points out the vulnerability of current LLMs to implicit hate speech, and emphasizes the need for improvements to better address hate speech threats.
pdf
bib
abs
Neuron-Level Sequential Editing for Large Language Models
Houcheng Jiang
|
Junfeng Fang
|
Tianyu Zhang
|
Baolong Bi
|
An Zhang
|
Ruipeng Wang
|
Tao Liang
|
Xiang Wang
This work explores sequential model editing in large language models (LLMs), a critical task that involves modifying internal knowledge within LLMs continuously through multi-round editing, each incorporating updates or corrections to adjust the model’s outputs without the need for costly retraining. Existing model editing methods, especially those that alter model parameters, typically focus on single-round editing and often face significant challenges in sequential model editing, most notably issues of model forgetting and failure. To address these challenges, we introduce a new model editing method, namely Neuron-level Sequential Editing (NSE), tailored for supporting sequential model editing. Specifically, we optimize the target layer’s hidden states using the model’s original weights to prevent model failure. Furthermore, we iteratively select neurons in multiple layers for editing based on their activation values to mitigate model forgetting. Our empirical experiments demonstrate that NSE significantly outperforms current parameter-modifying model editing methods, marking a substantial advancement in the field of sequential model editing. Our code is released at https://anonymous.4open.science/r/NSE-0A8D/.
pdf
bib
abs
Automatic Expert Discovery in LLM Upcycling via Sparse Interpolated Mixture-of-Experts
Shengzhuang Chen
|
Ying Wei
|
Jonathan Richard Schwarz
We present Sparse Interpolated Mixture-of-Experts (SIMoE) instruction-tuning, an end-to-end algorithm designed to fine-tune a dense pre-trained Large Language Model (LLM) into a MoE-style model that possesses capabilities in multiple specialized domains. During instruction-tuning, SIMoE automatically identifies multiple specialized experts under a specified sparsity constraint, with each expert representing a structurally sparse subset of the seed LLM’s parameters that correspond to domain-specific knowledge within the data. SIMoE simultaneously learns an input-dependent expert merging strategy via a router network, leveraging rich cross-expert knowledge for superior downstream generalization that surpasses existing baselines. Empirically, SIMoE consistently achieves state-of-the-art performance on common instruction-tuning benchmarks while maintaining an optimal performance-compute trade-off compared to all baselines.
pdf
bib
abs
SimulS2S-LLM: Unlocking Simultaneous Inference of Speech LLMs for Speech-to-Speech Translation
Keqi Deng
|
Wenxi Chen
|
Xie Chen
|
Phil Woodland
Simultaneous speech translation (SST) outputs translations in parallel with streaming speech input, balancing translation quality and latency. While large language models (LLMs) have been extended to handle the speech modality, streaming remains challenging as speech is pre-pended as a prompt for the entire generation process. To unlock LLM streaming capability, this paper proposes SimulS2S-LLM, which trains speech LLMs offline and employs a test-time policy to guide simultaneous inference. SimulS2S-LLM alleviates the mismatch between training and inference by extracting boundary-aware speech prompts that allow it to be better matched with text input data. SimulS2S-LLM achieves simultaneous speech-to-speech translation (Simul-S2ST) by predicting discrete output speech tokens and then synthesising output speech using a pre-trained vocoder. An incremental beam search is designed to expand the search space of speech token prediction without increasing latency. Experiments on the CVSS speech data show that SimulS2S-LLM offers a better translation quality-latency trade-off than existing methods that use the same training data, such as improving ASR-BLEU scores by 3 points at similar latency.
pdf
bib
abs
VoxEval: Benchmarking the Knowledge Understanding Capabilities of End-to-End Spoken Language Models
Wenqian Cui
|
Xiaoqi Jiao
|
Ziqiao Meng
|
Irwin King
With the rising need for speech-based interaction models, end-to-end Spoken Language Models (SLMs) have emerged as a promising solution. While these models require comprehensive world knowledge for meaningful and reliable human interactions, existing question-answering (QA) benchmarks fall short in evaluating SLMs’ knowledge understanding due to their inability to support end-to-end speech evaluation and account for varied input audio conditions. To address these limitations, we present VoxEval, a novel SpeechQA benchmark that assesses SLMs’ knowledge understanding through pure speech interactions. Our benchmark uniquely maintains speech format for both inputs and outputs, evaluates model robustness across diverse input audio conditions, and pioneers the assessment of complex tasks like mathematical reasoning in spoken format. Through systematic evaluation, we demonstrate that current SLMs exhibit poor performance on VoxEval, show sensitivity to varying audio conditions, and possess limited reasoning capabilities, highlighting critical areas for future development. VoxEval dataset is available at: https://github.com/dreamtheater123/VoxEval
pdf
bib
abs
RetroLLM: Empowering Large Language Models to Retrieve Fine-grained Evidence within Generation
Xiaoxi Li
|
Jiajie Jin
|
Yujia Zhou
|
Yongkang Wu
|
Zhonghua Li
|
Ye Qi
|
Zhicheng Dou
Large language models (LLMs) exhibit remarkable generative capabilities but often suffer from hallucinations. Retrieval-augmented generation (RAG) offers an effective solution by incorporating external knowledge, but existing methods still face several limitations: additional deployment costs of separate retrievers, redundant input tokens from retrieved text chunks, and the lack of joint optimization of retrieval and generation. To address these issues, we propose RetroLLM, a unified framework that integrates retrieval and generation into a single, auto-regressive process, enabling LLMs to directly generate fine-grained evidence from the corpus with constrained decoding. Moreover, to mitigate false pruning in the process of constrained evidence generation, we introduce (1) hierarchical FM-Index constraints, which generate corpus-constrained clues to identify a subset of relevant documents before evidence generation, reducing irrelevant decoding space; and (2) a forward-looking constrained decoding strategy, which considers the relevance of future sequences to improve evidence accuracy. Extensive experiments on five open-domain QA datasets demonstrate RetroLLM’s superior performance across both in-domain and out-of-domain tasks. The code is available at https://anonymous.4open.science/r/RetroLLM-D95A.
pdf
bib
abs
The Role of Deductive and Inductive Reasoning in Large Language Models
Chengkun Cai
|
Xu Zhao
|
Haoliang Liu
|
Zhongyu Jiang
|
Tianfang Zhang
|
Zongkai Wu
|
Jenq-Neng Hwang
|
Lei Li
Large Language Models (LLMs) have demonstrated impressive capabilities in reasoning tasks, yet their reliance on static prompt structures and limited adaptability to complex scenarios remains a major challenge. In this paper, we propose the **Deductive and Inductive (DID)** method, a novel framework that enhances LLM reasoning by dynamically integrating both deductive and inductive reasoning approaches. Drawing from cognitive science principles, DID implements a dual-metric complexity evaluation system that combines Littlestone dimension and information entropy to precisely assess task difficulty and guide decomposition strategies. DID enables the model to progressively adapt its reasoning pathways based on problem complexity, mirroring human cognitive processes. We evaluate DID’s effectiveness across multiple benchmarks, including the AIW, MR-GSM8K, and our custom Holiday Puzzle dataset for temporal reasoning. Our results demonstrate great improvements in reasoning quality and solution accuracy - achieving 70.3% accuracy on AIW (compared to 62.2% for Tree of Thought), while maintaining lower computational costs.
pdf
bib
abs
Disentangling the Roles of Representation and Selection in Data Pruning
Yupei Du
|
Yingjin Song
|
Hugh Mee Wong
|
Daniil Ignatev
|
Albert Gatt
|
Dong Nguyen
Data pruning—selecting small but impactful subsets—offers a promising way to efficiently scale NLP model training. However, existing methods often involve many different design choices, which have not been systematically studied. This limits future developments. In this work, we decompose data pruning into two key components: data representation and selection algorithm, and systematically analyze their influence on selected instances. Our theoretical and empirical results highlight the crucial role of representations: better representations, e.g., training gradients, generally lead to better selected instances, regardless of the chosen selection algorithm. Furthermore, different selection algorithms excel in different settings, and none consistently outperform the others. Moreover, the selection algorithms do not always align with their intended objectives: for example, algorithms designed for the same objective can select drastically different instances, highlighting the need for careful evaluation.
pdf
bib
abs
FRACTAL: Fine-Grained Scoring from Aggregate Text Labels
Yukti Makhija
|
Priyanka Agrawal
|
Rishi Saket
|
Aravindan Raghuveer
Fine-tuning of LLMs using RLHF / RLAIF has been shown to be a critical step for improving the performance of LLMs in complex generation tasks. These methods typically use response-level human or model feedback for alignment. Recent works indicate that finer sentence or span-level labels provide more accurate and interpretable feedback for LLM optimization. In this work, we propose FRACTAL, a suite of models to disaggregate response-level labels into sentence-level (pseudo-)labels through Multiple Instance Learning (MIL) and Learning from Label Proportions (LLP) formulations, novel usage of prior information, and maximum likelihood calibration. We perform close to 2000 experiments across 6 datasets and 4 tasks that show that FRACTAL can reach up to 93% of the performance of the fully supervised baseline while requiring only around 10% of the gold labels. Furthermore, in a downstream evaluation, employing step-level pseudo scores in RLHF for a math reasoning task leads to a 5% absolute improvement in performance. Our work is the first to develop techniques that convert response-level feedback into sentence-level scores by leveraging sentence-level prior information, along with comprehensive evaluations on multiple tasks as well as end-to-end fine-tuning evaluations.
pdf
bib
abs
ACT: Knowledgeable Agents to Design and Perform Complex Tasks
Makoto Nakatsuji
|
Shuhei Tateishi
|
Yasuhiro Fujiwara
|
Ayaka Matsumoto
|
Narichika Nomoto
|
Yoshihide Sato
Large language models enhance collaborative task execution in multi-agent systems. Current studies break a complex task into manageable subtasks, but agents lack an understanding of the overall task and of how others approach their tasks, hindering synergy and integration. We propose a method called knowledgeable Agents to design and perform Complex Tasks (ACT), where: (1) Agents independently manage their knowledge and tasks while collaboratively designing the complex task into a more comprehensible form. In parallel, each agent also acquires knowledge of others, defined as a structured description of how other agents approach their tasks based on the agent’s own task resolution. (2) Each agent updates its knowledge and refines its task through interactions with others. By referencing structured knowledge, they effectively integrate their tasks to collaboratively solve the complex task. Three evaluations, including creative writing and tool utilization, show that ACT outperforms existing methods in solving complex tasks.
pdf
bib
abs
Logical forms complement probability in understanding language model (and human) performance
Yixuan Wang
|
Freda Shi
With the increasing interest in using large language models (LLMs) for planning in natural language, understanding their behaviors becomes an important research question. This work conducts a systematic investigation of LLMs’ ability to perform logical reasoning in natural language. We introduce a controlled dataset of hypothetical and disjunctive syllogisms in propositional and modal logic and use it as the testbed for understanding LLM performance. Our results lead to novel insights in predicting LLM behaviors: in addition to the probability of input, logical forms should be considered as important factors. In addition, we show similarities and discrepancies between the logical reasoning performances of humans and LLMs by collecting and comparing behavioral data from both.
pdf
bib
abs
Length Controlled Generation for Black-box LLMs
Yuxuan Gu
|
Wenjie Wang
|
Xiaocheng Feng
|
Weihong Zhong
|
Kun Zhu
|
Lei Huang
|
Ting Liu
|
Bing Qin
|
Tat-Seng Chua
Large language models (LLMs) have demonstrated impressive instruction following capabilities, while still struggling to accurately manage the length of the generated text, which is a fundamental requirement in many real-world applications. Existing length control methods involve fine-tuning the parameters of LLMs, which is inefficient and suboptimal for practical use. In this paper, we propose a novel iterative sampling framework for text length control, integrating the Metropolis-Hastings algorithm with an importance sampling acceleration strategy. This framework efficiently and reliably regulates LLMs to generate length-constrained text without modifying the underlying parameters, thereby preserving the original capabilities of LLMs. Experimental results demonstrate that our framework achieves almost 100% success rates of length control on Llama3.1 for tasks such as length-controlled abstractive summarization and length-constrained instruction following, with minimal additional computational overhead. This also highlights the significant potential of our method for precise length control across a broader range of applications, without compromising the versatility of LLMs.
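A minimal sketch of how a Metropolis-Hastings loop can steer sampled text toward a target length without touching model parameters is given below. It is an illustration under stated assumptions, not the paper's algorithm: the proposal is an independence sampler drawn from a hypothetical `propose` callable standing in for a black-box LLM, the target is taken to be the LLM distribution reweighted by a length penalty (so the acceptance ratio reduces to the ratio of length weights), and the importance sampling acceleration is omitted. The names `log_length_weight`, `mh_length_control`, and `temperature` are introduced here for illustration.

```python
import math
import random
from typing import Callable


def log_length_weight(text: str, target_len: int, temperature: float = 1.0) -> float:
    """Log of an (assumed) length penalty: 0 at the target word count, negative elsewhere."""
    return -abs(len(text.split()) - target_len) / temperature


def mh_length_control(
    propose: Callable[[random.Random], str],
    target_len: int,
    steps: int,
    rng: random.Random,
    temperature: float = 1.0,
) -> str:
    """Independence Metropolis-Hastings over texts drawn from `propose`.

    With proposal q = p_LLM and target proportional to p_LLM(x) * w(x),
    the acceptance probability is min(1, w(candidate) / w(current)).
    """
    current = propose(rng)
    for _ in range(steps):
        candidate = propose(rng)
        delta = (log_length_weight(candidate, target_len, temperature)
                 - log_length_weight(current, target_len, temperature))
        # Work in log space to avoid dividing by an underflowed weight.
        if delta >= 0 or rng.random() < math.exp(delta):
            current = candidate
    return current
```

A lower `temperature` makes the chain stricter about length, at the price of rejecting more proposals before a suitable one is drawn.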
pdf
bib
abs
Improving Contextual Faithfulness of Large Language Models via Retrieval Heads-Induced Optimization
Lei Huang
|
Xiaocheng Feng
|
Weitao Ma
|
Yuchun Fan
|
Xiachong Feng
|
Yangfan Ye
|
Weihong Zhong
|
Yuxuan Gu
|
Baoxin Wang
|
Dayong Wu
|
Guoping Hu
|
Bing Qin
Ensuring contextual faithfulness in retrieval-augmented large language models (LLMs) is crucial for building trustworthy information-seeking systems, particularly in long-form question-answering (LFQA) scenarios. In this work, we identify a salient correlation between LFQA faithfulness and retrieval heads, a set of attention heads responsible for retrieving contextual information. Leveraging this insight, we propose RHIO, a framework designed to teach LLMs to explicitly discriminate between faithful and unfaithful generations. RHIO first augments unfaithful samples that simulate realistic model-intrinsic errors by selectively masking retrieval heads. Then, these samples are incorporated into joint training, enabling the model to distinguish unfaithful outputs from faithful ones conditioned on control tokens. Furthermore, these control tokens are leveraged to self-induce contrastive outputs, amplifying their difference through contrastive decoding. Additionally, to facilitate the evaluation of contextual faithfulness, we also introduce GroundBench, a comprehensive benchmark compiled from five existing LFQA datasets. Extensive experimental results on GroundBench demonstrate that RHIO significantly improves faithfulness, even outperforming GPT-4o.
pdf
bib
abs
Global Eye: Breaking the “Fixed Thinking Pattern” during the Instruction Expansion Process
Wenxuan Lu
|
Wei Liu
|
Jian Luan
|
Bin Wang
|
Songhao Jiang
|
Tianning Zang
An extensive high-quality instruction dataset is crucial for the instruction tuning process of Large Language Models (LLMs). Recent instruction expansion methods have demonstrated their capability to improve the quality and quantity of existing datasets by prompting a high-performance LLM to generate multiple new instructions from the original ones. However, existing methods focus on constructing multi-perspective prompts (e.g., increasing complexity or difficulty) to expand instructions, overlooking the “Fixed Thinking Pattern” issue of LLMs. This issue arises when the same set of prompts is used repeatedly, causing LLMs to rely on a limited set of expressions to expand all instructions and potentially compromising the diversity of the final expanded dataset. This paper theoretically analyzes the causes of the “Fixed Thinking Pattern” and corroborates this phenomenon through multi-faceted empirical research. Furthermore, we propose a novel method based on dynamic prompt updating: Global Eye. Specifically, after a fixed number of instruction expansions, we analyze the statistical characteristics of newly generated instructions and then update the prompts. Experimental results show that our method enables Llama3-8B and Llama2-13B to surpass the performance of open-source LLMs and GPT3.5 across various metrics. Our code and data are submitted to the Software & Data option.
pdf
bib
abs
On Synthesizing Data for Context Attribution in Question Answering
Gorjan Radevski
|
Kiril Gashteovski
|
Shahbaz Syed
|
Christopher Malon
|
Sebastien Nicolas
|
Chia-Chien Hung
|
Timo Sztyler
|
Verena Heußer
|
Wiem Ben Rim
|
Masafumi Enomoto
|
Kunihiro Takeoka
|
Masafumi Oyamada
|
Goran Glavaš
|
Carolin Lawrence
Question Answering (QA) accounts for a significant portion of LLM usage “in the wild”. However, LLMs sometimes produce false or misleading responses, also known as “hallucinations”. Therefore, grounding the generated answers in contextually provided information (i.e., providing evidence for the generated text) is paramount for LLMs’ trustworthiness. Providing this information is the task of context attribution. In this paper, we systematically study LLM-based approaches for this task, namely we investigate (i) zero-shot inference, (ii) LLM ensembling, and (iii) fine-tuning of small LMs on synthetic data generated by larger LLMs. Our key contribution is SynQA: a novel generative strategy for synthesizing context attribution data. Given selected context sentences, an LLM generates QA pairs that are supported by these sentences. This leverages LLMs’ natural strengths in text generation while ensuring clear attribution paths in the synthetic training data. We show that the attribution data synthesized via SynQA is highly effective for fine-tuning small LMs for context attribution in different QA tasks and domains. Finally, with a user study, we validate the usefulness of small LMs (fine-tuned on synthetic data from SynQA) in context attribution for QA.
pdf
bib
abs
TST: A Schema-Based Top-Down and Dynamic-Aware Agent of Text-to-Table Tasks
Peiwen Jiang
|
Haitong Jiang
|
Ruhui Ma
|
Yvonne Jie Chen
|
Jinhua Cheng
As a bridge between natural texts and information systems like structured storage, statistical analysis, retrieving, and recommendation, the text-to-table task has received widespread attention recently. Existing research has gone through a paradigm shift from traditional bottom-up IE (Information Extraction) to top-down LLM-based question answering with RAG (Retrieval-Augmented Generation). Furthermore, these methods mainly adopt end-to-end models or use multi-stage pipelines to extract text content based on static table structures. However, they neglect precise inner-document evidence extraction and dynamic information such as multiple entities and events, which cannot be defined in a static table-header format and are very common in natural texts. To address this issue, we propose a two-stage dynamic content extraction agent framework called TST (Text-Schema-Table), which uses type recognition methods to extract contextual evidence sequentially under the guidance of a domain schema. Based on the evidence, we first quantify the total instances of each dynamic object and then extract them with ordered numerical prompts. Through extensive comparisons with existing methods across different datasets, our extraction framework exhibits state-of-the-art (SOTA) performance. Our code is available at https://github.com/jiangpw41/TST.
pdf
bib
abs
EventRAG: Enhancing LLM Generation with Event Knowledge Graphs
Zairun Yang
|
Yilin Wang
|
Zhengyan Shi
|
Yuan Yao
|
Lei Liang
|
Keyan Ding
|
Emine Yilmaz
|
Huajun Chen
|
Qiang Zhang
Retrieval-augmented generation (RAG) systems often struggle with narrative-rich documents and event-centric reasoning, particularly when synthesizing information across multiple sources. We present EventRAG, a novel framework that enhances text generation through structured event representations. We first construct an Event Knowledge Graph by extracting events and merging semantically equivalent nodes across documents, while expanding under-connected relationships. We then employ an iterative retrieval and inference strategy that explicitly captures temporal dependencies and logical relationships across events. Experiments on UltraDomain and MultiHopRAG benchmarks show EventRAG’s superiority over baseline RAG systems, with substantial gains in generation effectiveness, logical consistency, and multi-hop reasoning accuracy. Our work advances RAG systems by integrating structured event semantics with iterative inference, particularly benefiting scenarios requiring temporal and logical reasoning across documents.
pdf
bib
abs
Analyzing the Rapid Generalization of SFT via the Perspective of Attention Head Activation Patterns
Yang Zhao
|
Li Du
|
Xiao Ding
|
Kai Xiong
|
Ting Liu
|
Bing Qin
LLMs’ performance on complex tasks is still unsatisfactory. A key issue is that LLMs currently learn in a data-driven manner, while the instructions for these complex tasks are both scarce and hard to collect or construct. In contrast, a prominent phenomenon is that LLMs can learn rather quickly on simpler tasks when adequate prior knowledge has been captured during the pretraining stage. Thus, if the prerequisites and mechanism of such rapid generalization could be elucidated, LLMs could learn complex tasks more efficiently and effectively. In this paper, we employ a gradient-based method to dissect how SFT adapts LLMs to downstream tasks from the perspective of attention patterns. We find that: (1) LLMs selectively activate task-specific attention heads during SFT; (2) activation patterns for complex tasks are combinations of basic task patterns; and (3) changes in a few parameters can significantly impact activation patterns after SFT on a small number of samples. Based on these insights, experiments are conducted to actually enhance the efficiency and effectiveness of SFT.
pdf
bib
abs
Can’t See the Forest for the Trees: Benchmarking Multimodal Safety Awareness for Multimodal LLMs
Wenxuan Wang
|
Xiaoyuan Liu
|
Kuiyi Gao
|
Jen-tse Huang
|
Youliang Yuan
|
Pinjia He
|
Shuai Wang
|
Zhaopeng Tu
Multimodal Large Language Models (MLLMs) have expanded the capabilities of traditional language models by enabling interaction through both text and images. However, ensuring the safety of these models remains a significant challenge, particularly in accurately identifying whether multimodal content is safe or unsafe—a capability we term safety awareness. In this paper, we introduce MMSafeAware, the first comprehensive multimodal safety awareness benchmark designed to evaluate MLLMs across 29 safety scenarios with 1,500 carefully curated image-prompt pairs. MMSafeAware includes both unsafe and over-safety subsets to assess models’ abilities to correctly identify unsafe content and avoid over-sensitivity that can hinder helpfulness. Evaluating nine widely used MLLMs using MMSafeAware reveals that current models are not sufficiently safe and often overly sensitive; for example, GPT-4V misclassifies 36.1% of unsafe inputs as safe and 59.9% of benign inputs as unsafe. We further explore three methods to improve safety awareness—prompting-based approaches, visual contrastive decoding, and vision-centric reasoning fine-tuning—but find that none achieve satisfactory performance. Our findings highlight the profound challenges in developing MLLMs with robust safety awareness, underscoring the need for further research in this area. All the code and data will be publicly available to facilitate future research.
pdf
bib
abs
Mis-prompt: Benchmarking Large Language Models for Proactive Error Handling
Jiayi Zeng
|
Yizhe Feng
|
Mengliang He
|
Wenhui Lei
|
Wei Zhang
|
Zeming Liu
|
Xiaoming Shi
|
Aimin Zhou
Large language models (LLMs) have demonstrated significant advancements in error handling. Current error-handling works are performed in a passive manner, with explicit error-handling instructions. However, in real-world scenarios, explicit error-handling instructions are usually unavailable. In this paper, our work identifies this challenge as how to conduct proactive error handling without explicit error handling instructions. To promote further research, this work introduces a new benchmark, termed Mis-prompt, consisting of four evaluation tasks, an error category taxonomy, and a new evaluation dataset. Furthermore, this work analyzes current LLMs’ performance on the benchmark, and the experimental results reveal that current LLMs show poor performance on proactive error handling, and SFT on error handling instances improves LLMs’ proactive error handling capabilities. The dataset will be publicly available.
pdf
bib
abs
TripCraft: A Benchmark for Spatio-Temporally Fine Grained Travel Planning
Soumyabrata Chaudhuri
|
Pranav Purkar
|
Ritwik Raghav
|
Shubhojit Mallick
|
Manish Gupta
|
Abhik Jana
|
Shreya Ghosh
Recent advancements in probing Large Language Models (LLMs) have explored their latent potential as personalized travel planning agents, though this remains a rather nascent field. Existing benchmarks, such as TravelPlanner and TravelPlanner+, rely on semi-synthetic data and ignore several key components of travel planning, limiting their real-world applicability. Therefore, we introduce TripCraft, a spatio-temporally coherent travel planning dataset incorporating real-world constraints, including public transit schedules, public events, varied attraction categories, and user personas for enhanced personalization. Our dataset enables more detailed trip itinerary generation (including duration spent at each point of interest based on users’ persona, transit between two points of interest, etc.) while ensuring spatio-temporal consistency. Further, we propose novel evaluation metrics (temporal meal score, attraction score, spatial score, ordering score, and persona score) to assess LLM-generated plans across temporal, spatial, sequential, and personal dimensions, overcoming the limitations of commonsense and hard constraint metrics. Interestingly, our parameter-informed setting significantly enhances meal scheduling, improving performance from 61% to 80% in the 7-day scenario, a gain of 19 percentage points in our temporal meal score. Moreover, TripCraft serves as a high-quality benchmark for advancing personalized LLM-driven travel planning.
pdf
bib
abs
DualGuard: A Parameter Space Transformation Approach for Bidirectional Defense in Split-Based LLM Fine-Tuning
Zihan Liu
|
Yizhen Wang
|
Rui Wang
|
Sai Wu
Integrating split learning with large language model fine-tuning (LLM-FT) enables secure collaboration between a trusted local client and a well-equipped remote server, but it is vulnerable to data reconstruction attacks (DRAs) that exploit transmitted activations and gradients. Current defense methods, like adding noise to activations or gradients, often sacrifice task-specific model performance under strict privacy constraints. This paper introduces DualGuard, a bidirectional defense mechanism against DRAs for split-based LLM-FT. DualGuard proposes a local warm-up parameter space transformation to alter client-side model parameters before training, using multi-task learning to strike a balance between privacy protection and model performance. Additionally, a global fine-tuning parameter space retention strategy prevents the model from reverting to vulnerable states during formal fine-tuning. Experiments show that DualGuard outperforms current defense methods against various DRAs, while maintaining task performance. Our code will be made publicly available.
pdf
bib
abs
Movie101v2: Improved Movie Narration Benchmark
Zihao Yue
|
Yepeng Zhang
|
Ziheng Wang
|
Qin Jin
Automatic movie narration aims to generate video-aligned plot descriptions to assist visually impaired audiences. Unlike standard video captioning, it involves not only describing key visual details but also inferring plots that unfold across multiple movie shots, presenting distinct and complex challenges. To advance this field, we introduce Movie101v2, a large-scale, bilingual dataset with enhanced data quality specifically designed for movie narration. Revisiting the task, we propose breaking down the ultimate goal of automatic movie narration into three progressive stages, offering a clear roadmap with corresponding evaluation metrics. Based on our new benchmark, we baseline a range of large vision-language models and conduct an in-depth analysis of the challenges in movie narration generation. Our findings highlight that achieving applicable movie narration generation is a fascinating goal that requires significant research.
pdf
bib
abs
Can LLMs Evaluate Complex Attribution in QA? Automatic Benchmarking using Knowledge Graphs
Nan Hu
|
Jiaoyan Chen
|
Yike Wu
|
Guilin Qi
|
Hongru Wang
|
Sheng Bi
|
Yongrui Chen
|
Tongtong Wu
|
Jeff Z. Pan
Attributed Question Answering (AQA) has attracted wide attention, but there are still several limitations in evaluating the attributions, including lacking fine-grained attribution categories, relying on manual annotations, and failing to compare attributions with only subtle differences. To bridge these gaps, we introduce Complex Attributed Question Answering (CAQA), a large-scale benchmark containing comprehensive attribution categories, automatically generated using Knowledge Graphs (KGs), and complex attribution scenarios. We have conducted extensive experiments to verify the effectiveness of CAQA, including the benchmarking of 25 automatic evaluators, their comparison with human evaluators, the testing of LLM evaluators fine-tuned by CAQA and so on. These experiments also lead to a series of important findings that can benefit the future research of AQA.
pdf
bib
abs
Value Portrait: Assessing Language Models’ Values through Psychometrically and Ecologically Valid Items
Jongwook Han
|
Dongmin Choi
|
Woojung Song
|
Eun-Ju Lee
|
Yohan Jo
The importance of benchmarks for assessing the values of language models has been pronounced due to the growing need of more authentic, human-aligned responses. However, existing benchmarks rely on human or machine annotations that are vulnerable to value-related biases. Furthermore, the tested scenarios often diverge from real-world contexts in which models are commonly used to generate text and express values. To address these issues, we propose the Value Portrait benchmark, a reliable framework for evaluating LLMs’ value orientations with two key characteristics. First, the benchmark consists of items that capture real-life user-LLM interactions, enhancing the relevance of assessment results to real-world LLM usage. Second, each item is rated by human subjects based on its similarity to their own thoughts, and correlations between these ratings and the subjects’ actual value scores are derived. This psychometrically validated approach ensures that items strongly correlated with specific values serve as reliable items for assessing those values. Through evaluating 44 LLMs with our benchmark, we find that these models prioritize Benevolence, Security, and Self-Direction values while placing less emphasis on Tradition, Power, and Achievement values. Also, our analysis reveals biases in how LLMs perceive various demographic groups, deviating from real human data.
pdf
bib
abs
FEA-Bench: A Benchmark for Evaluating Repository-Level Code Generation for Feature Implementation
Wei Li
|
Xin Zhang
|
Zhongxin Guo
|
Shaoguang Mao
|
Wen Luo
|
Guangyue Peng
|
Yangyu Huang
|
Houfeng Wang
|
Scarlett Li
Implementing new features in repository-level codebases is a crucial application of code generation models. However, current benchmarks lack a dedicated evaluation framework for this capability. To fill this gap, we introduce FEA-Bench, a benchmark designed to assess the ability of large language models (LLMs) to perform incremental development within code repositories. We collect pull requests from 83 GitHub repositories and use rule-based and intent-based filtering to construct task instances focused on new feature development. Each task instance containing code changes is paired with relevant unit test files to ensure that the solution can be verified. The feature implementation requires LLMs to simultaneously possess code completion capabilities for new components and code editing abilities for other relevant parts in the code repository, providing a more comprehensive evaluation method of LLMs’ automated software engineering capabilities. Experimental results show that LLMs perform significantly worse in the FEA-Bench, highlighting considerable challenges in such repository-level incremental code development.
pdf
bib
abs
Do not Abstain! Identify and Solve the Uncertainty
Jingyu Liu
|
Jingquan Peng
|
Xiaopeng Wu
|
Xubin Li
|
Tiezheng Ge
|
Bo Zheng
|
Yong Liu
Despite the widespread application of Large Language Models (LLMs) across various domains, they frequently exhibit overconfidence when encountering uncertain scenarios. Existing solutions, which primarily rely on evasive responses (e.g., “I don’t know”), overlook the opportunity to identify and address the uncertainty in order to generate more satisfactory responses. To systematically investigate and improve LLMs’ ability to recognize and address the source of uncertainty, we introduce ConfuseBench, a benchmark that mainly focuses on three types of uncertainty: document scarcity, limited capability, and query ambiguity. Experiments with ConfuseBench reveal that current LLMs struggle to accurately identify the root cause of uncertainty and solve it. They prefer to attribute uncertainty to query ambiguity while overlooking capability limitations, especially in the case of weaker models. To tackle this challenge, we first generate context-aware inquiries that highlight the confusing aspect of the original query. Then we judge the source of uncertainty based on the uniqueness of the inquiry’s answer. Further, we use an on-policy training method, InteractDPO, to generate better inquiries. Experimental results demonstrate the efficacy of our approach.
pdf
bib
abs
Decoding by Contrasting Knowledge: Enhancing Large Language Model Confidence on Edited Facts
Baolong Bi
|
Shenghua Liu
|
Lingrui Mei
|
Yiwei Wang
|
Junfeng Fang
|
Pengliang Ji
|
Xueqi Cheng
The knowledge within large language models (LLMs) may become outdated quickly. While in-context editing (ICE) is currently the most effective method for knowledge editing (KE), it is constrained by the black-box modeling of LLMs and thus lacks interpretability. Our work aims to elucidate the superior performance of ICE in KE by analyzing the impacts of in-context new knowledge on token-wise distributions. We observe that despite a significant boost in logits of the new knowledge, the performance of ICE is still hindered by stubborn knowledge. We propose a novel approach termed Decoding by Contrasting Knowledge (DeCK). DeCK derives the distribution of the next token by contrasting the logits obtained from the newly edited knowledge guided by ICE with those from the unedited parametric knowledge. Our experiments demonstrate that DeCK enhances the confidence of LLMs in edited facts. For instance, it improves the performance of LLaMA3-8B-instruct on MQuAKE by up to 219%, demonstrating its capability to strengthen ICE. DeCK can be easily integrated into any ICE method as a decoding component to enhance editing capabilities.
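The contrast between edited and unedited knowledge can be sketched as follows. This is a generic contrastive-decoding form under stated assumptions, not DeCK's exact formulation, which the abstract does not specify: the scoring rule, the `alpha` weight, and the `plausibility` cutoff are all assumptions, and the two logit vectors are assumed to come from the same model run with and without the in-context edit.

```python
import math
from typing import List, Sequence


def log_softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def contrastive_next_token(
    logits_with_edit: Sequence[float],
    logits_without_edit: Sequence[float],
    alpha: float = 1.0,
    plausibility: float = 0.1,
) -> int:
    """Pick the token whose probability rises most when the edit is in context.

    Assumed score: log p_edit + alpha * (log p_edit - log p_base), computed only
    over tokens whose p_edit is at least `plausibility` times the top p_edit.
    """
    lp_edit = log_softmax(logits_with_edit)
    lp_base = log_softmax(logits_without_edit)
    cutoff = max(lp_edit) + math.log(plausibility)
    best_index, best_score = -1, -math.inf
    for i, (le, lb) in enumerate(zip(lp_edit, lp_base)):
        if le < cutoff:  # skip implausible tokens so the contrast cannot promote junk
            continue
        score = le + alpha * (le - lb)
        if score > best_score:
            best_index, best_score = i, score
    return best_index
```

In the toy case where the old fact still has the highest logit even with the edit in context (the "stubborn knowledge" situation), the contrast term favors the token whose probability the edit raised, so the new fact is selected.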
pdf
bib
abs
ImpliHateVid: A Benchmark Dataset and Two-stage Contrastive Learning Framework for Implicit Hate Speech Detection in Videos
Mohammad Zia Ur Rehman
|
Anukriti Bhatnagar
|
Omkar Kabde
|
Shubhi Bansal
|
Dr. Nagendra Kumar
Existing research has primarily focused on text- and image-based hate speech detection, while video-based approaches remain underexplored. In this work, we introduce a novel dataset, ImpliHateVid, specifically curated for implicit hate speech detection in videos. ImpliHateVid consists of 2,009 videos comprising 509 implicit hate videos, 500 explicit hate videos, and 1,000 non-hate videos, making it one of the first large-scale video datasets dedicated to implicit hate detection. We also propose a novel two-stage contrastive learning framework for hate speech detection in videos. In the first stage, we train modality-specific encoders for audio, text, and image using contrastive loss by concatenating features from the three encoders. In the second stage, we train cross-encoders using contrastive learning to refine multimodal representations. Additionally, we incorporate sentiment, emotion, and caption-based features to enhance implicit hate detection. We evaluate our method on two datasets: ImpliHateVid for implicit hate speech detection and the HateMM dataset for general hate speech detection in videos. The results demonstrate the effectiveness of the proposed multimodal contrastive learning for hateful content detection in videos and the significance of our dataset.
pdf
bib
abs
Improving Chain-of-Thought Reasoning via Quasi-Symbolic Abstractions
Leonardo Ranaldi
|
Marco Valentino
|
Andre Freitas
Chain-of-Thought (CoT) represents a common strategy for reasoning in Large Language Models (LLMs) by decomposing complex tasks into intermediate inference steps. However, explanations generated via CoT are susceptible to content biases that negatively affect their robustness and faithfulness. To mitigate existing limitations, recent work has proposed using logical formalisms coupled with external symbolic solvers. However, fully symbolic approaches possess the bottleneck of requiring a complete translation from natural language to formal languages, a process that affects efficiency and flexibility. To achieve a trade-off, this paper investigates methods to disentangle content from logical reasoning without a complete formalisation. In particular, we present QuaSAR (for Quasi-Symbolic Abstract Reasoning), a variation of CoT that guides LLMs to operate at a higher level of abstraction via quasi-symbolic explanations. Our framework leverages the capability of LLMs to formalise only relevant variables and predicates, enabling the coexistence of symbolic elements with natural language. We show the impact of QuaSAR for in-context learning and for constructing demonstrations to improve the reasoning capabilities of smaller models. Our experiments show that quasi-symbolic abstractions can improve CoT-based methods by up to 8% accuracy, enhancing robustness and consistency on challenging adversarial variations on both natural language (i.e., MMLU-Redux) and symbolic reasoning tasks (i.e., GSM-Symbolic).
pdf
bib
abs
Information Extraction from Visually Rich Documents using LLM-based Organization of Documents into Independent Textual Segments
Aniket Bhattacharyya
|
Anurag Tripathi
|
Ujjal Das
|
Archan Karmakar
|
Amit Pathak
|
Maneesh Gupta
Information extraction (IE) from Visually Rich Documents (VRDs) containing layout features along with text is a critical and well-studied task. Specialized non-LLM NLP-based solutions typically involve training models using both textual and geometric information to label sequences/tokens as named entities or answers to specific questions. However, these approaches lack reasoning, are not able to infer values not explicitly present in documents, and do not generalize well to new formats. Recently proposed generative LLM-based approaches are capable of reasoning, but struggle to comprehend clues from document layout, especially in previously unseen document formats, and do not show competitive performance on heterogeneous VRD benchmark datasets. In this paper, we propose BLOCKIE, a novel LLM-based approach that organizes VRDs into localized, reusable semantic textual segments called semantic blocks, which are processed independently. Through focused and more generalizable reasoning, our approach outperforms the state-of-the-art on public VRD benchmarks by 1-3% in F1 scores, is resilient to previously unseen document formats, and shows the ability to correctly extract information not explicitly present in documents.
pdf
bib
abs
Enhancing Open-Domain Task-Solving Capability of LLMs via Autonomous Tool Integration from GitHub
Bohan Lyu
|
Xin Cong
|
Heyang Yu
|
Pan Yang
|
Cheng Qian
|
Zihe Wang
|
Yujia Qin
|
Yining Ye
|
Yaxi Lu
|
Chen Qian
|
Zhong Zhang
|
Yukun Yan
|
Yankai Lin
|
Zhiyuan Liu
|
Maosong Sun
Large Language Models (LLMs) excel in traditional natural language processing tasks but struggle with problems that require complex domain-specific calculations or simulations. While equipping LLMs with external tools to build LLM-based agents can enhance their capabilities, existing approaches lack the flexibility to address diverse and ever-evolving user queries in open domains. Currently, there is also no existing dataset that evaluates LLMs on open-domain knowledge that requires tools to solve. To this end, we introduce the OpenAct benchmark to evaluate the open-domain task-solving capability, which is built on human expert consultation and repositories in GitHub. It comprises 339 questions spanning 7 diverse domains that need to be solved with domain-specific methods. In our experiments, even state-of-the-art LLMs and LLM-based agents demonstrate unsatisfactory success rates, underscoring the need for a novel approach. Furthermore, we present OpenAgent, a novel LLM-based agent system that can tackle evolving queries in open domains through autonomously integrating specialized tools from GitHub. OpenAgent employs 1) a hierarchical framework where specialized agents handle specific tasks and can assign tasks to subordinate agents, and 2) a bi-level experience learning mechanism to learn from both humans’ and its own experiences to tackle tool flaws. Experiments demonstrate its superior effectiveness and efficiency, which significantly outperforms baselines. Our data and code are open-source at https://github.com/OpenBMB/OpenAct.
pdf
bib
abs
LLMs Can Simulate Standardized Patients via Agent Coevolution
Zhuoyun Du
|
Lujie Zheng
|
Renjun Hu
|
Yuyang Xu
|
Xiawei Li
|
Ying Sun
|
Wei Chen
|
Jian Wu
|
Haolei Cai
|
Haochao Ying
Training medical personnel using standardized patients (SPs) remains a complex challenge, requiring extensive domain expertise and role-specific practice. Most research on Large Language Model (LLM)-based simulated patients focuses on improving data retrieval accuracy or adjusting prompts through human feedback. However, this focus has overlooked the critical need for patient agents to learn a standardized presentation pattern that transforms data into human-like patient responses through unsupervised simulations. To address this gap, we propose EvoPatient, a novel simulated patient framework in which a patient agent and doctor agents simulate the diagnostic process through multi-turn dialogues, simultaneously gathering experience to improve the quality of both questions and answers, ultimately enabling human doctor training. Extensive experiments on various cases demonstrate that, by providing only overall SP requirements, our framework improves over existing reasoning methods by more than 10% in requirement alignment and better human preference, while achieving an optimal balance of resource consumption after evolving over 200 cases for 10 hours, with excellent generalizability. Our system will be available at https://github.com/ZJUMAI/EvoPatient
pdf
bib
abs
Donate or Create? Comparing Data Collection Strategies for Emotion-labeled Multimodal Social Media Posts
Christopher Bagdon
|
Aidan Combs
|
Carina Silberer
|
Roman Klinger
Accurate modeling of subjective phenomena such as emotion expression requires data annotated with authors’ intentions. Commonly such data is collected by asking study participants to donate and label genuine content produced in the real world, or create content fitting particular labels during the study. Asking participants to create content is often simpler to implement and presents fewer risks to participant privacy than data donation. However, it is unclear if and how study-created content may differ from genuine content, and how differences may impact models. We collect study-created and genuine multimodal social media posts labeled for emotion and compare them on several dimensions, including model performance. We find that compared to genuine posts, study-created posts are longer, rely more on their text and less on their images for emotion expression, and focus more on emotion-prototypical events. The samples of participants willing to donate versus create posts are demographically different. Study-created data is valuable to train models that generalize well to genuine data, but realistic effectiveness estimates require genuine data.
pdf
bib
abs
Which Demographics do LLMs Default to During Annotation?
Johannes Schäfer
|
Aidan Combs
|
Christopher Bagdon
|
Jiahui Li
|
Nadine Probol
|
Lynn Greschner
|
Sean Papay
|
Yarik Menchaca Resendiz
|
Aswathy Velutharambath
|
Amelie Wuehrl
|
Sabine Weber
|
Roman Klinger
Demographics and cultural background of annotators influence the labels they assign in text annotation – for instance, an elderly woman might find it offensive to read a message addressed to a “bro”, but a male teenager might find it appropriate. It is therefore important to acknowledge label variations to not under-represent members of a society. Two research directions developed out of this observation in the context of using large language models (LLMs) for data annotations, namely (1) studying biases and inherent knowledge of LLMs and (2) injecting diversity in the output by manipulating the prompt with demographic information. We combine these two strands of research and ask which demographics an LLM resorts to when no demographics are given. To answer this question, we evaluate which attributes of human annotators LLMs inherently mimic. Furthermore, we compare non-demographic conditioned prompts and placebo-conditioned prompts (e.g., “you are an annotator who lives in house number 5”) to demographics-conditioned prompts (“You are a 45 year old man and an expert on politeness annotation. How do you rate instance”). We study these questions for politeness and offensiveness annotations on the POPQUORN data set, a corpus created in a controlled manner to investigate human label variations based on demographics which has not been used for LLM-based analyses so far. We observe notable influences related to gender, race, and age in demographic prompting, which contrasts with previous studies that found no such effects.
pdf
bib
abs
Can You Really Trust Code Copilot? Evaluating Large Language Models from a Code Security Perspective
Yutao Mou
|
Xiao Deng
|
Yuxiao Luo
|
Shikun Zhang
|
Wei Ye
Code security and usability are both essential for various coding assistant applications driven by large language models (LLMs). Current code security benchmarks focus solely on a single evaluation task and paradigm, such as code completion and generation, lacking comprehensive assessment across dimensions like secure code generation, vulnerability repair and discrimination. In this paper, we first propose CoV-Eval, a multi-task benchmark covering various tasks such as code completion, vulnerability repair, vulnerability detection and classification, for comprehensive evaluation of LLM code security. Besides, we developed VC-Judge, an improved judgment model that aligns closely with human experts and can review LLM-generated programs for vulnerabilities in a more efficient and reliable way. We conduct a comprehensive evaluation of 20 proprietary and open-source LLMs. Overall, while most LLMs identify vulnerable code well, they still tend to generate insecure code and struggle with recognizing specific vulnerability types and performing repairs. Extensive experiments and qualitative analyses reveal key challenges and optimization directions, offering insights for future research in LLM code security.
pdf
bib
abs
From Sub-Ability Diagnosis to Human-Aligned Generation: Bridging the Gap for Text Length Control via MarkerGen
Peiwen Yuan
|
Chuyi Tan
|
Shaoxiong Feng
|
Yiwei Li
|
Xinglin Wang
|
Yueqi Zhang
|
Jiayi Shi
|
Boyuan Pan
|
Yao Hu
|
Kan Li
Despite the rapid progress of large language models (LLMs), their length-controllable text generation (LCTG) ability remains below expectations, posing a major limitation for practical applications. Existing methods mainly focus on end-to-end training to reinforce adherence to length constraints. However, the lack of decomposition and targeted enhancement of LCTG sub-abilities restricts further progress. To bridge this gap, we conduct a bottom-up decomposition of LCTG sub-abilities with human patterns as reference and perform a detailed error analysis. On this basis, we propose MarkerGen, a simple-yet-effective plug-and-play approach that: (1) mitigates LLM fundamental deficiencies via external tool integration; (2) conducts explicit length modeling with dynamically inserted markers; (3) employs a three-stage generation scheme to better align length constraints while maintaining content quality. Comprehensive experiments demonstrate that MarkerGen significantly improves LCTG across various settings, exhibiting outstanding effectiveness and generalizability.
pdf
bib
abs
AGD: Adversarial Game Defense Against Jailbreak Attacks in Large Language Models
Shilong Pan
|
Zhiliang Tian
|
Zhen Huang
|
Wanlong Yu
|
Zhihua Wen
|
Xinwang Liu
|
Kai Lu
|
Minlie Huang
|
Dongsheng Li
LLMs demonstrate remarkable utility but remain vulnerable to jailbreak attacks that aim to elicit harmful responses. Existing defenses, including post-training alignment and prompt engineering, rely on training on safety-annotated datasets and safe prompt templates, struggling with adaptability to out-of-distribution (OOD) attacks. Steering internal representations of LLMs provides real-time adjustments to defend against OOD attacks. However, it struggles with maintaining model utility, since modifying the representation disrupts the forward pass of inference. It barely considers the competitive objectives of helpfulness and harmlessness in LLMs. We argue that adversarial game-based approaches promise a solution for conflicts between the two objectives. In this paper, we propose **A**dversarial **G**ame **D**efense (AGD), an adversarial game-based defense method that dynamically adjusts LLMs’ internal representations to achieve a balanced trade-off between helpfulness and harmlessness. AGD first proposes an interquartile range (IQR) method to detect abnormal attention weights and correct the abnormal weights via adversarial training. AGD adopts a bi-level optimization to play a two-player variable-sum game to approach Nash Equilibrium (NE), where the two players adversarially refine head activations for helpfulness and harmlessness respectively. Furthermore, AGD applies an expert model to next-token sampling to generate safer responses. Experiments show that AGD significantly improves LLMs’ safety over all baselines.
pdf
bib
abs
SCOP: Evaluating the Comprehension Process of Large Language Models from a Cognitive View
Yongjie Xiao
|
Hongru Liang
|
Peixin Qin
|
Yao Zhang
|
Wenqiang Lei
Despite the great potential of large language models (LLMs) in machine comprehension, it is still disturbing to fully count on them in real-world scenarios. This is probably because there is no rational explanation for whether the comprehension process of LLMs is aligned with that of experts. In this paper, we propose SCOP to carefully examine how LLMs perform during the comprehension process from a cognitive view. Specifically, it is equipped with a systematic definition of five requisite skills during the comprehension process, a strict framework to construct testing data for these skills, and a detailed analysis of advanced open-source and closed-source LLMs using the testing data. With SCOP, we find that it is still challenging for LLMs to perform an expert-level comprehension process. Even so, we notice that LLMs share some similarities with experts, e.g., performing better at comprehending local information than global information. Further analysis reveals that LLMs can be somewhat unreliable: they might reach correct answers through flawed comprehension processes. Based on SCOP, we suggest that one direction for improving LLMs is to focus more on the comprehension process, ensuring all comprehension skills are thoroughly developed during training.
pdf
bib
abs
Table-Critic: A Multi-Agent Framework for Collaborative Criticism and Refinement in Table Reasoning
Peiying Yu
|
Guoxin Chen
|
Jingjing Wang
Despite the remarkable capabilities of large language models (LLMs) in various reasoning tasks, they still struggle with table reasoning tasks, particularly in maintaining consistency throughout multi-step reasoning processes. While existing approaches have explored various decomposition strategies, they often lack effective mechanisms to identify and correct errors in intermediate reasoning steps, leading to cascading error propagation. To address these issues, we propose Table-Critic, a novel multi-agent framework that facilitates collaborative criticism and iterative refinement of the reasoning process until convergence to correct solutions. Our framework consists of four specialized agents: a Judge for error identification, a Critic for comprehensive critiques, a Refiner for process improvement, and a Curator for pattern distillation. To effectively deal with diverse and unpredictable error types, we introduce a self-evolving template tree that systematically accumulates critique knowledge through experience-driven learning and guides future reflections. Extensive experiments have demonstrated that Table-Critic achieves substantial improvements over existing methods, with superior accuracy and error correction rates while maintaining computational efficiency and a lower solution degradation rate.
pdf
bib
abs
An Expanded Massive Multilingual Dataset for High-Performance Language Technologies (HPLT)
Laurie Burchell
|
Ona De Gibert Bonet
|
Nikolay Arefyev
|
Mikko Aulamo
|
Marta Bañón
|
Pinzhen Chen
|
Mariia Fedorova
|
Liane Guillou
|
Barry Haddow
|
Jan Hajič
|
Jindřich Helcl
|
Erik Henriksson
|
Mateusz Klimaszewski
|
Ville Komulainen
|
Andrey Kutuzov
|
Joona Kytöniemi
|
Veronika Laippala
|
Petter Mæhlum
|
Bhavitvya Malik
|
Farrokh Mehryary
|
Vladislav Mikhailov
|
Nikita Moghe
|
Amanda Myntti
|
Dayyán O’Brien
|
Stephan Oepen
|
Proyag Pal
|
Jousia Piha
|
Sampo Pyysalo
|
Gema Ramírez-Sánchez
|
David Samuel
|
Pavel Stepachev
|
Jörg Tiedemann
|
Dušan Variš
|
Tereza Vojtěchová
|
Jaume Zaragoza-Bernabeu
Training state-of-the-art large language models requires vast amounts of clean and diverse textual data. However, building suitable multilingual datasets remains a challenge. In this work, we present HPLT v2, a collection of high-quality multilingual monolingual and parallel corpora, extending prior work of the HPLT project. The monolingual portion of the data contains 8T tokens covering 193 languages, while the parallel data contains 380M sentence pairs covering 51 languages. We document the entire data pipeline and release the code to reproduce it. We provide extensive analysis of the quality and characteristics of our data. Finally, we evaluate the performance of language models and machine translation systems trained on HPLT v2, demonstrating its value.
pdf
bib
abs
Scaling Text-Rich Image Understanding via Code-Guided Synthetic Multimodal Data Generation
Yue Yang
|
Ajay Patel
|
Matt Deitke
|
Tanmay Gupta
|
Luca Weihs
|
Andrew Head
|
Mark Yatskar
|
Chris Callison-Burch
|
Ranjay Krishna
|
Aniruddha Kembhavi
|
Christopher Clark
Reasoning about images with rich text, such as charts and documents, is a critical application of vision-language models (VLMs). However, VLMs often struggle in these domains due to the scarcity of diverse text-rich vision-language data. To address this challenge, we present CoSyn, a framework that leverages the coding capabilities of text-only large language models (LLMs) to automatically create synthetic text-rich multimodal data. Given input text describing a target domain (e.g., “nutrition fact labels”), CoSyn prompts an LLM to generate code (Python, HTML, LaTeX, etc.) for rendering synthetic images. With the underlying code as textual representations of the synthetic images, CoSyn can generate high-quality instruction-tuning data, again relying on a text-only LLM. Using CoSyn, we constructed a dataset comprising 400K images and 2.7M rows of vision-language instruction-tuning data. Comprehensive experiments on seven benchmarks demonstrate that models trained on our synthetic data achieve state-of-the-art performance among competitive open-source models, including Llama 3.2, and surpass proprietary models such as GPT-4V and Gemini 1.5 Flash. Furthermore, CoSyn can produce synthetic pointing data, enabling VLMs to ground information within input images, showcasing its potential for developing multimodal agents capable of acting in real-world environments.
pdf
bib
abs
Hierarchical Attention Generates Better Proofs
Jianlong Chen
|
Chao Li
|
Yang Yuan
|
Andrew C Yao
Large language models (LLMs) have shown promise in formal theorem proving, but their token-level processing often fails to capture the inherent hierarchical nature of mathematical proofs. We introduce Hierarchical Attention, a regularization method that aligns LLMs’ attention mechanisms with mathematical reasoning structures. Our approach establishes a five-level hierarchy from foundational elements to high-level concepts, ensuring structured information flow in proof generation. Experiments demonstrate that our method improves proof success rates by 2.05% on miniF2F and 1.69% on ProofNet while reducing proof complexity by 23.81% and 16.50% respectively. The code and models will be available.
pdf
bib
abs
Agent-RewardBench: Towards a Unified Benchmark for Reward Modeling across Perception, Planning, and Safety in Real-World Multimodal Agents
Tianyi Men
|
Zhuoran Jin
|
Pengfei Cao
|
Yubo Chen
|
Kang Liu
|
Jun Zhao
As Multimodal Large Language Models (MLLMs) advance, multimodal agents show promise in real-world tasks like web navigation and embodied intelligence. However, due to a lack of external feedback, these agents struggle with self-correction and generalization. A promising approach is to use reward models as external feedback, but it is unclear how to select reward models for agents. Thus, there is an urgent need to build a reward bench targeted at agents. To address these challenges, we propose Agent-RewardBench, a benchmark designed to evaluate reward modeling ability in MLLMs. The benchmark is characterized by three key features: (1) Multiple dimensions and real-world agent scenarios evaluation. It covers perception, planning, and safety with 7 scenarios; (2) Step-level reward evaluation. It allows for the assessment of agent capabilities at the individual steps of a task, providing a more granular view of performance during the planning process; and (3) Appropriate difficulty and high quality. We carefully sample from 10 diverse models, apply difficulty control to maintain task challenge, and perform manual verification to ensure the integrity of the data. Experiments demonstrate that even state-of-the-art multimodal models show limited performance, highlighting the need for specialized training in agent reward modeling. Code is available at GitHub.
pdf
bib
abs
It’s Not Bragging If You Can Back It Up: Can LLMs Understand Braggings?
Jingjie Zeng
|
Huayang Li
|
Liang Yang
|
Yuanyuan Sun
|
Hongfei Lin
Bragging, as a pervasive social-linguistic phenomenon, reflects complex human interaction patterns. However, the understanding and generation of appropriate bragging behavior in large language models (LLMs) remains underexplored. In this paper, we propose a comprehensive study that combines analytical and controllable approaches to examine bragging in LLMs. We design three tasks, bragging recognition, bragging explanation, and bragging generation, along with novel evaluation metrics to assess the models’ ability to identify bragging intent, judge social appropriateness, and account for context sensitivity. Our analysis reveals the challenges of bragging in the social context, such as recognizing bragging and responding appropriately with bragging in conversation. This work provides new insights into how LLMs process bragging and highlights the need for more research on generating contextually appropriate behavior in LLMs.
pdf
bib
abs
A Troublemaker with Contagious Jailbreak Makes Chaos in Honest Towns
Tianyi Men
|
Pengfei Cao
|
Zhuoran Jin
|
Yubo Chen
|
Kang Liu
|
Jun Zhao
With the development of large language models, they are widely used as agents in various fields. A key component of agents is memory, which stores vital information but is susceptible to jailbreak attacks. Existing research mainly focuses on single-agent attacks and shared memory attacks. However, real-world scenarios often involve independent memory. In this paper, we propose the Troublemaker Makes Chaos in Honest Town (TMCHT) task, a large-scale, multi-agent, multi-topology text-based attack evaluation framework. TMCHT involves one attacker agent attempting to mislead an entire society of agents. We identify two major challenges in multi-agent attacks: (1) non-complete graph structure and (2) large-scale systems. We attribute these challenges to a phenomenon we term toxicity disappearing. To address these issues, we propose an Adversarial Replication Contagious Jailbreak (ARCJ) method, which optimizes the retrieval suffix to make poisoned samples more easily retrieved and optimizes the replication suffix to make poisoned samples have contagious ability. We demonstrate the superiority of our approach in TMCHT, with 23.51%, 18.95%, and 52.93% improvements in line topology, star topology, and 100-agent settings, respectively. It reveals potential contagion risks in widely used multi-agent architectures.
pdf
bib
abs
Meta-Learning Neural Mechanisms rather than Bayesian Priors
Michael Eric Goodale
|
Salvador Mascarenhas
|
Yair Lakretz
Children acquire language despite being exposed to several orders of magnitude less data than large language models require. Meta-learning has been proposed as a way to integrate human-like learning biases into neural-network architectures, combining the structured generalizations of symbolic models with the scalability of neural-network models. But what exactly does meta-learning imbue the model with? We investigate the meta-learning of formal languages and find that, contrary to previous claims, meta-trained models are not learning simplicity-based priors when meta-trained on datasets organised around simplicity. Rather, we find evidence that meta-training imprints neural mechanisms (such as counters) into the model, which function like cognitive primitives for the network on downstream tasks. Most surprisingly, we find that meta-training on a *single* formal language can provide as much improvement to a model as meta-training on 5000 different formal languages, provided that the formal language incentivizes the learning of useful neural mechanisms. Taken together, our findings provide practical implications for efficient meta-learning paradigms and new theoretical insights into linking symbolic theories and neural mechanisms.
pdf
bib
abs
Shifting from Ranking to Set Selection for Retrieval Augmented Generation
Dahyun Lee
|
Yongrae Jo
|
Haeju Park
|
Moontae Lee
Retrieval in Retrieval-Augmented Generation (RAG) must ensure that retrieved passages are not only individually relevant but also collectively form a comprehensive set. Existing approaches primarily rerank top-k passages based on their individual relevance, often failing to meet the information needs of complex queries in multi-hop question answering. In this work, we propose a set-wise passage selection approach and introduce SetR, which explicitly identifies the information requirements of a query through Chain-of-Thought reasoning and selects an optimal set of passages that collectively satisfy those requirements. Experiments on multi-hop RAG benchmarks show that SetR outperforms both proprietary LLM-based rerankers and open-source baselines in terms of answer correctness and retrieval quality, providing an effective and efficient alternative to traditional rerankers in RAG systems. The code is available at https://github.com/LGAI-Research/SetR
pdf
bib
abs
Understanding Large Language Model Vulnerabilities to Social Bias Attacks
Jiaxu Zhao
|
Meng Fang
|
Fanghua Ye
|
Ke Xu
|
Qin Zhang
|
Joey Tianyi Zhou
|
Mykola Pechenizkiy
Large Language Models (LLMs) have become foundational in human-computer interaction, demonstrating remarkable linguistic capabilities across various tasks. However, there is a growing concern about their potential to perpetuate social biases present in their training data. In this paper, we comprehensively investigate the vulnerabilities of contemporary LLMs to various social bias attacks, including prefix injection, refusal suppression, and learned attack prompts. We evaluate popular models such as LLaMA-2, GPT-3.5, and GPT-4 across gender, racial, and religious bias types. Our findings reveal that models are generally more susceptible to gender bias attacks compared to racial or religious biases. We also explore novel aspects such as cross-bias and multiple-bias attacks, finding varying degrees of transferability across bias types. Additionally, our results show that larger models and pretrained base models often exhibit higher susceptibility to bias attacks. These insights contribute to the development of more inclusive and ethically responsible LLMs, emphasizing the importance of understanding and mitigating potential bias vulnerabilities. We offer recommendations for model developers and users to enhance the robustness of LLMs against social bias attacks.
pdf
bib
abs
ChatSOP: An SOP-Guided MCTS Planning Framework for Controllable LLM Dialogue Agents
Zhigen Li
|
Jianxiang Peng
|
Yanmeng Wang
|
Yong Cao
|
Tianhao Shen
|
Minghui Zhang
|
Linxi Su
|
Shang Wu
|
Yihang Wu
|
YuQian Wang
|
Ye Wang
|
Wei Hu
|
Jianfeng Li
|
Shaojun Wang
|
Jing Xiao
|
Deyi Xiong
Dialogue agents powered by Large Language Models (LLMs) show superior performance in various tasks. Despite the better user understanding and human-like responses, their **lack of controllability** remains a key challenge, often leading to unfocused conversations or task failure. To address this, we introduce Standard Operating Procedure (SOP) to regulate dialogue flow. Specifically, we propose **ChatSOP**, a novel SOP-guided Monte Carlo Tree Search (MCTS) planning framework designed to enhance the controllability of LLM-driven dialogue agents. To enable this, we curate a dataset comprising SOP-annotated multi-scenario dialogues, generated using a semi-automated role-playing system with GPT-4o and validated through strict manual quality control. Additionally, we propose a novel method that integrates Chain of Thought reasoning with supervised fine-tuning for SOP prediction and utilizes SOP-guided Monte Carlo Tree Search for optimal action planning during dialogues. Experimental results demonstrate the effectiveness of our method, such as achieving a 27.95% improvement in action accuracy compared to baseline models based on GPT-3.5 and also showing notable gains for open-source models. The dataset and code are publicly available.
pdf
bib
abs
Pixel-Level Reasoning Segmentation via Multi-turn Conversations
Dexian Cai
|
Xiaocui Yang
|
YongKang Liu
|
Daling Wang
|
Shi Feng
|
Yifei Zhang
|
Soujanya Poria
Existing visual perception systems focus on region-level segmentation in single-turn dialogues, relying on complex and explicit query instructions. Such systems cannot reason at the pixel level and comprehend dynamic user intent that changes over interaction. Our work tackles this issue by introducing a novel task, Pixel-level Reasoning Segmentation (Pixel-level RS) based on multi-turn conversations, tracking evolving user intent via multi-turn interactions for fine-grained segmentation. To establish a benchmark for this novel task, we build a Pixel-level ReasonIng Segmentation Dataset Based on Multi-Turn Conversations (PRIST), comprising 24k utterances from 8.3k multi-turn conversational scenarios with segmentation targets. Building on PRIST, we further propose MIRAS, a Multi-turn Interactive ReAsoning Segmentation framework, which integrates pixel-level segmentation with robust multi-turn conversation understanding, generating pixel-grounded explanations aligned with user intent. The PRIST dataset and MIRAS framework fill the gap in pixel-level reasoning segmentation. Experimental results on the PRIST dataset demonstrate that our method outperforms current segmentation-specific baselines in terms of segmentation and LLM-based reasoning metrics. The code and data are available at: https://anonymous.4open.science/r/PixelRS/.
pdf
bib
abs
Fixing Distribution Shifts of LLM Self-Critique via On-Policy Self-Play Training
Rong Bao
|
Donglei Yu
|
Kai Fan
|
Minpeng Liao
Self-critique mechanisms significantly improve the performance of language models in complex reasoning tasks by giving them the ability to correct errors, conduct induction and deduction, and switch thinking insights. However, synthetic data methods often require human-introduced errors or sampling of the model’s reasoning results from the previous moment, so the current output distribution of the model cannot be obtained, which causes the data for critique and reasoning to suffer from distribution shifts. In this work, we propose an on-policy reinforcement learning framework to synchronize the reasoning and critique capabilities of language models. To alleviate reward hacking caused by outcome-based supervision, we design a deliberate reward framework for different purposes. The reward framework not only supervises the model reasoning process based on the results, but also uses Monte Carlo sampling to give appropriate rewards to the critique content according to the success rate of the model’s correction after critique. In addition, we introduce a rule-based reward function to impose penalties on the model when it generates hallucinatory critiques. When our approach is applied to the DeepSeek-Math-7B-Base and Qwen2.5-7B-Base models, model performance improves by 5.40 and 3.66 points, respectively, compared to the best baseline approach. This validates the significant advantages of our method in improving the model’s reasoning and self-critique capability. Code will be made available at https://github.com/rbao2018/SCOP
pdf
bib
abs
Inferring Functionality of Attention Heads from their Parameters
Amit Elhelo
|
Mor Geva
Attention heads are one of the building blocks of large language models (LLMs). Prior work on investigating their operation mostly focused on analyzing their behavior during inference for specific circuits or tasks. In this work, we seek a comprehensive mapping of the operations they implement in a model. We propose MAPS (Mapping Attention head ParameterS), an efficient framework that infers the functionality of attention heads from their parameters, without any model training or inference. We showcase the utility of MAPS for answering two types of questions: (a) given a predefined operation, mapping how strongly heads across the model implement it, and (b) given an attention head, inferring its salient functionality. Evaluating MAPS on 20 operations across 6 popular LLMs shows its estimations correlate with the head’s outputs during inference and are causally linked to the model’s predictions. Moreover, its mappings reveal attention heads of certain operations that were overlooked in previous studies, and valuable insights on function universality and architecture biases in LLMs. Next, we present an automatic pipeline and analysis that leverage MAPS to characterize the salient operations of a given head. Our pipeline produces plausible operation descriptions for most heads, as assessed by human judgment, while revealing diverse operations.
pdf
bib
abs
Faithful and Robust LLM-Driven Theorem Proving for NLI Explanations
Xin Quan
|
Marco Valentino
|
Louise A. Dennis
|
Andre Freitas
Natural language explanations play a fundamental role in Natural Language Inference (NLI) by revealing how premises logically entail hypotheses. Recent work has shown that the interaction of large language models (LLMs) with theorem provers (TPs) can help verify and improve the validity of NLI explanations. However, TPs require translating natural language into machine-verifiable formal representations, a process that introduces the risk of semantic information loss and unfaithful interpretation, an issue compounded by LLMs’ challenges in capturing critical logical structures with sufficient precision. Moreover, LLMs are still limited in their capacity for rigorous and robust proof construction within formal verification frameworks. To mitigate issues related to faithfulness and robustness, this paper investigates strategies to (1) alleviate semantic loss during autoformalisation, (2) efficiently identify and correct syntactic errors in logical representations, (3) explicitly use logical expressions to guide LLMs in generating structured proof sketches, and (4) increase LLMs’ capacity to interpret the TP’s feedback for iterative refinement. Our empirical results on e-SNLI, QASC and WorldTree using different LLMs demonstrate that the proposed strategies yield significant improvements in autoformalisation (+18.46%, +34.2%, +39.77%) and explanation refinement (+29.5%, +51.5%, +41.25%) over the state-of-the-art model. Moreover, we show that specific interventions on the hybrid LLM-TP architecture can substantially improve efficiency, drastically reducing the number of iterations required for successful verification.
pdf
bib
abs
Revealing the Deceptiveness of Knowledge Editing: A Mechanistic Analysis of Superficial Editing
Jiakuan Xie
|
Pengfei Cao
|
Yubo Chen
|
Kang Liu
|
Jun Zhao
Knowledge editing, which aims to update the knowledge encoded in language models, can be deceptive. Despite the fact that many existing knowledge editing algorithms achieve near-perfect performance on conventional metrics, the models edited by them are still prone to generating original knowledge. This paper introduces the concept of “**superficial editing**” to describe this phenomenon. Our comprehensive evaluation reveals that this issue presents a significant challenge to existing algorithms. Through systematic investigation, we identify and validate two key factors contributing to this issue: (1) the residual stream at the last subject position in earlier layers and (2) specific attention modules in later layers. Notably, certain attention heads in later layers, along with specific left singular vectors in their output matrices, encapsulate the original knowledge and exhibit a causal relationship with superficial editing. Furthermore, we extend our analysis to the task of superficial unlearning, where we observe consistent patterns in the behavior of specific attention heads and their corresponding left singular vectors, thereby demonstrating the robustness and broader applicability of our methodology and conclusions. Our code is available at https://github.com/jiakuan929/superficial-editing.
pdf
bib
abs
Masking in Multi-hop QA: An Analysis of How Language Models Perform with Context Permutation
Wenyu Huang
|
Pavlos Vougiouklis
|
Mirella Lapata
|
Jeff Z. Pan
Multi-hop Question Answering (MHQA) adds layers of complexity to question answering, making it more challenging. When Language Models (LMs) are prompted with multiple search results, they are tasked not only with retrieving relevant information but also with employing multi-hop reasoning across the information sources. Although LMs perform well on traditional question-answering tasks, the causal mask can hinder their capacity to reason across complex contexts. In this paper, we explore how LMs respond to multi-hop questions by permuting search results (retrieved documents) under various configurations. Our study reveals interesting findings as follows: 1) Encoder-decoder models, such as the ones in the Flan-T5 family, generally outperform causal decoder-only LMs in MHQA tasks, despite being significantly smaller in size; 2) altering the order of gold documents reveals distinct trends in both Flan-T5 models and fine-tuned decoder-only models, with optimal performance observed when the document order aligns with the reasoning chain order; 3) enhancing causal decoder-only models with bi-directional attention by modifying the causal mask can effectively boost their end performance. In addition to the above, we conduct a thorough investigation of the distribution of LM attention weights in the context of MHQA. Our experiments reveal that attention weights tend to peak at higher values when the resulting answer is correct. We leverage this finding to heuristically improve LMs’ performance on this task. Our code is publicly available at https://github.com/hwy9855/MultiHopQA-Reasoning.
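To make the third finding concrete, the sketch below contrasts a standard causal mask with one that lets a block of context tokens attend to each other in both directions. It assumes a prefix-style modification in which only the first `bidirectional_prefix` positions (for example, the retrieved documents) become bidirectional; the abstract does not specify the exact mask the authors use, and `build_attention_mask` is a hypothetical helper.

```python
from typing import List


def build_attention_mask(seq_len: int, bidirectional_prefix: int = 0) -> List[List[bool]]:
    """Return mask[i][j] == True when position i may attend to position j.

    With bidirectional_prefix == 0 this is the usual causal (lower-triangular)
    mask. Otherwise the first bidirectional_prefix positions also see each
    other's future tokens, while later positions stay causal.
    """
    mask = []
    for i in range(seq_len):
        row = []
        for j in range(seq_len):
            causal = j <= i
            in_prefix = i < bidirectional_prefix and j < bidirectional_prefix
            row.append(causal or in_prefix)
        mask.append(row)
    return mask


if __name__ == "__main__":
    for name, prefix in (("causal", 0), ("prefix of 3 bidirectional", 3)):
        print(name)
        for row in build_attention_mask(5, prefix):
            print("".join("x" if allowed else "." for allowed in row))
```

In the causal case an early document can never attend to a later one, so the order in which documents appear constrains which cross-document links the model can form; the modified mask removes that constraint inside the context block.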
pdf
bib
abs
From Human Reading to NLM Understanding: Evaluating the Role of Eye-Tracking Data in Encoder-Based Models
Luca Dini
|
Lucia Domenichelli
|
Dominique Brunato
|
Felice Dell’Orletta
Cognitive signals, particularly eye-tracking data, offer valuable insights into human language processing. Leveraging eye-gaze data from the Ghent Eye-Tracking Corpus, we conducted a series of experiments to examine how integrating knowledge of human reading behavior impacts Neural Language Models (NLMs) across multiple dimensions: task performance, attention mechanisms, and the geometry of their embedding space. We explored several fine-tuning methodologies to inject eye-tracking features into the models. Our results reveal that incorporating these features does not degrade downstream task performance, enhances alignment between model attention and human attention patterns, and compresses the geometry of the embedding space.
pdf
bib
abs
Optimizing Question Semantic Space for Dynamic Retrieval-Augmented Multi-hop Question Answering
Linhao Ye
|
Lang Yu
|
Zhikai Lei
|
Qin Chen
|
Jie Zhou
|
Liang He
Retrieval-augmented generation (RAG) is usually integrated into large language models (LLMs) to mitigate hallucinations and knowledge obsolescence. However, conventional one-step retrieve-and-read methods are insufficient for multi-hop question answering, facing challenges of retrieval semantic mismatching and the high cost in handling interdependent subquestions. In this paper, we propose Optimizing Question Semantic Space for Dynamic Retrieval-Augmented Multi-hop Question Answering (Q-DREAM). Q-DREAM consists of three key modules: (1) the Question Decomposition Module (QDM), which decomposes multi-hop questions into fine-grained subquestions; (2) the Subquestion Dependency Optimizer Module (SDOM), which models the interdependent relations of subquestions for better understanding; and (3) the Dynamic Passage Retrieval Module (DPRM), which aligns subquestions with relevant passages by optimizing the semantic embeddings. Experimental results across various benchmarks demonstrate that Q-DREAM significantly outperforms existing RAG methods, achieving state-of-the-art performance in both in-domain and out-of-domain settings. Notably, Q-DREAM also improves retrieval efficiency while maintaining high accuracy compared with recent baselines.
pdf
bib
abs
Insight Over Sight: Exploring the Vision-Knowledge Conflicts in Multimodal LLMs
Xiaoyuan Liu
|
Wenxuan Wang
|
Youliang Yuan
|
Jen-tse Huang
|
Qiuzhi Liu
|
Pinjia He
|
Zhaopeng Tu
This paper explores the problem of commonsense-level vision-knowledge conflict in Multimodal Large Language Models (MLLMs), where visual information contradicts the model’s internal commonsense knowledge. To study this issue, we introduce an automated framework, augmented with human-in-the-loop quality control, to generate inputs designed to simulate and evaluate these conflicts in MLLMs. Using this framework, we have crafted a diagnostic benchmark consisting of 374 original images and 1,122 high-quality question-answer (QA) pairs. The benchmark covers two aspects of conflict and three question types, providing a thorough assessment tool. We apply this benchmark to assess the conflict-resolution capabilities of nine representative MLLMs from various model families. Our results indicate an evident over-reliance on parametric knowledge for approximately 20% of all queries, especially among Yes-No and action-related problems. Based on these findings, we evaluate the effectiveness of existing approaches to mitigating the conflicts and compare them to our “Focus-on-Vision” prompting strategy. Despite some improvement, the vision-knowledge conflict remains unresolved and can be further scaled through our data construction framework. Our proposed framework, benchmark, and analysis contribute to the understanding and mitigation of vision-knowledge conflicts in MLLMs.
pdf
bib
abs
SceneGenAgent: Precise Industrial Scene Generation with Coding Agent
Xiao Xia
|
Dan Zhang
|
Zibo Liao
|
Zhenyu Hou
|
Tianrui Sun
|
Jing Li
|
Ling Fu
|
Yuxiao Dong
The modeling of industrial scenes is essential for simulations in industrial manufacturing. While large language models (LLMs) have shown significant progress in generating general 3D scenes from textual descriptions, generating industrial scenes with LLMs poses a unique challenge due to their demand for precise measurements and positioning, requiring complex planning over spatial arrangement. To address this challenge, we introduce SceneGenAgent, an LLM-based agent for generating industrial scenes through C# code. SceneGenAgent ensures precise layout planning through a structured and calculable format, layout verification, and iterative refinement to meet the quantitative requirements of industrial scenarios. Experimental results demonstrate that LLMs powered by SceneGenAgent exceed their original performance, reaching up to an 81.0% success rate in real-world industrial scene generation tasks and effectively meeting most scene generation requirements. To further enhance accessibility, we construct SceneInstruct, a dataset designed for fine-tuning open-source LLMs to integrate into SceneGenAgent. Experiments show that fine-tuning open-source LLMs on SceneInstruct yields significant performance improvements, with Llama3.1-70B approaching the capabilities of GPT-4o. Our code and dataset are available at https://github.com/THUDM/SceneGenAgent.
pdf
bib
abs
ToolCoder: A Systematic Code-Empowered Tool Learning Framework for Large Language Models
Hanxing Ding
|
Shuchang Tao
|
Liang Pang
|
Zihao Wei
|
Jinyang Gao
|
Bolin Ding
|
Huawei Shen
|
Xueqi Cheng
Tool learning has emerged as a crucial capability for large language models (LLMs) to solve complex real-world tasks through interaction with external tools. Existing approaches face significant challenges, including reliance on hand-crafted prompts, difficulty in multi-step planning, and lack of precise error diagnosis and reflection mechanisms. We propose ToolCoder, a novel framework that reformulates tool learning as a code generation task. Inspired by software engineering principles, ToolCoder transforms natural language queries into a structured Python function scaffold and systematically breaks down tasks with descriptive comments, enabling LLMs to leverage coding paradigms for complex reasoning and planning. It then generates and executes function implementations to obtain final responses. Additionally, ToolCoder stores successfully executed functions in a repository to promote code reuse, while leveraging error traceback mechanisms for systematic debugging, optimizing both execution efficiency and robustness. Experiments demonstrate that ToolCoder achieves superior performance in task completion accuracy and execution reliability compared to existing approaches, establishing the effectiveness of code-centric approaches in tool learning.
pdf
bib
abs
Enhancing Text Editing for Grammatical Error Correction: Arabic as a Case Study
Bashar Alhafni
|
Nizar Habash
Text editing frames grammatical error correction (GEC) as a sequence tagging problem, where edit tags are assigned to input tokens, and applying these edits results in the corrected text. This approach has gained attention for its efficiency and interpretability. However, while extensively explored for English, text editing remains largely underexplored for morphologically rich languages like Arabic. In this paper, we introduce a text editing approach that derives edit tags directly from data, eliminating the need for language-specific edits. We demonstrate its effectiveness on Arabic, a diglossic and morphologically rich language, and investigate the impact of different edit representations on model performance. Our approach achieves SOTA results on two Arabic GEC benchmarks and performs on par with SOTA on two others. Additionally, our models are over six times faster than existing Arabic GEC systems, making our approach more practical for real-world applications. Finally, we explore ensemble models, demonstrating how combining different models leads to further performance improvements. We make our code, data, and pretrained models publicly available.
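As a rough illustration of deriving edit tags directly from data, the sketch below aligns an erroneous token sequence with its correction and emits one tag per source token. It is an assumption-laden toy, not the authors' tag scheme: the tag names (`KEEP`, `DELETE`, `REPLACE_`, `APPEND_`, `PREPEND_`) and the helper `derive_edit_tags` are hypothetical, and the alignment comes from Python's standard `difflib.SequenceMatcher` rather than from the paper.

```python
from difflib import SequenceMatcher
from typing import List


def derive_edit_tags(source: List[str], target: List[str]) -> List[str]:
    """Assign one edit tag to every source token so that applying the tags
    in order would turn the source sequence into the target sequence."""
    tags = ["KEEP"] * len(source)
    matcher = SequenceMatcher(a=source, b=target, autojunk=False)
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op == "delete":
            for i in range(i1, i2):
                tags[i] = "DELETE"
        elif op == "replace":
            # Put the whole replacement on the first token, delete the rest.
            tags[i1] = "REPLACE_" + " ".join(target[j1:j2])
            for i in range(i1 + 1, i2):
                tags[i] = "DELETE"
        elif op == "insert":
            inserted = " ".join(target[j1:j2])
            if i1 > 0:
                # Attach the insertion to the preceding source token.
                tags[i1 - 1] += "+APPEND_" + inserted
            elif tags:
                # Insertion before the first token.
                tags[0] = "PREPEND_" + inserted + "+" + tags[0]
    return tags


if __name__ == "__main__":
    src = "he go to school yesterday".split()
    tgt = "he went to the school yesterday".split()
    for token, tag in zip(src, derive_edit_tags(src, tgt)):
        print(f"{token}\t{tag}")
```

A tagger trained on such labels predicts one tag per input token, which is what makes the text editing formulation efficient compared with generating the corrected sentence token by token.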
pdf
bib
abs
From Isolates to Families: Using Neural Networks for Automated Language Affiliation
Frederic Blum
|
Steffen Herbold
|
Johann-Mattis List
In historical linguistics, the affiliation of languages to a common language family is traditionally carried out using a complex workflow that relies on manually comparing individual languages. Large-scale standardized collections of multilingual wordlists and grammatical language structures might help to improve this and open new avenues for developing automated language affiliation workflows. Here, we present neural network models that use lexical and grammatical data from a worldwide sample of more than 1,200 languages with known affiliations to classify individual languages into families. In line with the traditional assumption of most linguists, our results show that models trained on lexical data alone outperform models solely based on grammatical data, whereas combining both types of data yields even better performance. In additional experiments, we show how our models can identify long-ranging relations between entire subgroups, how they can be employed to investigate potential relatives of linguistic isolates, and how they can help us to obtain first hints on the affiliation of so far unaffiliated languages. We conclude that models for automated language affiliation trained on lexical and grammatical data provide comparative linguists with a valuable tool for evaluating hypotheses about deep and unknown language relations.
pdf
bib
abs
ELBA-Bench: An Efficient Learning Backdoor Attacks Benchmark for Large Language Models
Xuxu Liu
|
Siyuan Liang
|
Mengya Han
|
Yong Luo
|
Aishan Liu
|
Xiantao Cai
|
Zheng He
|
Dacheng Tao
Generative large language models are crucial in natural language processing, but they are vulnerable to backdoor attacks, where subtle triggers compromise their behavior. Although backdoor attacks against LLMs are constantly emerging, existing benchmarks remain limited in attack coverage, metric system integrity, and backdoor attack alignment. Moreover, existing pre-trained backdoor attacks are idealized and difficult to carry out in practice due to resource access constraints. Therefore, we establish ELBA-Bench, a comprehensive and unified framework that allows attackers to inject backdoors through parameter-efficient fine-tuning (e.g., LoRA) or through techniques without fine-tuning (e.g., in-context learning). ELBA-Bench provides over 1300 experiments encompassing the implementations of 12 attack methods, 18 datasets, and 12 LLMs. Extensive experiments provide valuable new findings on the strengths and limitations of various attack strategies. For instance, PEFT attacks consistently outperform approaches without fine-tuning in classification tasks while showing strong cross-dataset generalization, with optimized triggers boosting robustness; task-relevant backdoor optimization techniques or attack prompts, along with clean and adversarial demonstrations, can enhance backdoor attack success while preserving model performance on clean samples. Additionally, we introduce a universal toolbox designed for standardized backdoor attack research at https://github.com/NWPUliuxx/ELBA_Bench, with the goal of propelling further progress in this vital area.
pdf
bib
abs
Less, but Better: Efficient Multilingual Expansion for LLMs via Layer-wise Mixture-of-Experts
Xue Zhang
|
Yunlong Liang
|
Fandong Meng
|
Songming Zhang
|
Yufeng Chen
|
Jinan Xu
|
Jie Zhou
Continually expanding new languages for existing large language models (LLMs) is a promising yet challenging approach to building powerful multilingual LLMs. The biggest challenge is to make the model continuously learn new languages while preserving its proficiency in old languages. To achieve this, recent work utilizes the Mixture-of-Experts (MoE) architecture to expand new languages by adding new experts and avoid catastrophic forgetting of old languages by routing corresponding tokens to the original model backbone (old experts). Although intuitive, this kind of method is parameter-costly when expanding new languages and still inevitably impacts the performance of old languages. To address these limitations, we analyze the language characteristics of different layers in LLMs and propose a layer-wise expert allocation algorithm (LayerMoE) to determine the appropriate number of new experts for each layer. Specifically, we find different layers in LLMs exhibit different representation similarities between languages and then utilize the similarity as the indicator to allocate experts for each layer, i.e., the higher the similarity, the fewer the experts. Additionally, to further mitigate the forgetting of old languages, we add a classifier in front of the router network on the layers with higher similarity to guide the routing of old language tokens. Experimental results show that our method outperforms the previous state-of-the-art baseline with 60% fewer experts in the single-expansion setting and with 33.3% fewer experts in the lifelong-expansion setting, demonstrating the effectiveness of our method.
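The allocation rule stated in this abstract (higher cross-lingual similarity, fewer new experts) can be sketched as follows. This is only one plausible reading, not the LayerMoE algorithm itself: it assumes similarities in [0, 1], a fixed total expert budget, and a share proportional to `1 - similarity`; `allocate_experts` and both of its parameters are hypothetical.

```python
from typing import List


def allocate_experts(similarities: List[float], total_experts: int) -> List[int]:
    """Split a budget of new experts across layers, giving fewer experts to
    layers whose old- and new-language representations are more similar."""
    weights = [1.0 - s for s in similarities]
    total_weight = sum(weights)
    if total_weight == 0:
        # Every layer is maximally similar: fall back to an even split.
        weights = [1.0] * len(similarities)
        total_weight = float(len(similarities))
    shares = [total_experts * w / total_weight for w in weights]
    counts = [int(share) for share in shares]
    # Largest-remainder rounding so the counts sum exactly to the budget.
    by_remainder = sorted(
        range(len(shares)), key=lambda i: shares[i] - counts[i], reverse=True
    )
    for i in by_remainder[: total_experts - sum(counts)]:
        counts[i] += 1
    return counts


if __name__ == "__main__":
    layer_similarity = [0.9, 0.5, 0.1]  # made-up values for three layers
    print(allocate_experts(layer_similarity, total_experts=15))
```

Compared with adding the same number of experts to every layer, a rule of this kind concentrates new parameters where the new language differs most from what the model already represents.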
pdf
bib
abs
When Harry Meets Superman: The Role of The Interlocutor in Persona-Based Dialogue Generation
Daniela Occhipinti
|
Marco Guerini
|
Malvina Nissim
Endowing dialogue agents with persona information has proven to significantly improve the consistency and diversity of their generations. While much focus has been placed on aligning dialogues with provided personas, the adaptation to the interlocutor’s profile remains largely underexplored. In this work, we investigate three key aspects: (1) a model’s ability to align responses with both the provided persona and the interlocutor’s; (2) its robustness when dealing with familiar versus unfamiliar interlocutors and topics, and (3) the impact of additional fine-tuning on specific persona-based dialogues. We evaluate dialogues generated with diverse speaker pairings and topics, framing the evaluation as an author identification task and employing both LLM-as-a-judge and human evaluations. By systematically masking or disclosing information about the interlocutor, we assess its impact on dialogue generation. Results show that access to the interlocutor’s persona improves the recognition of the target speaker, while masking it does the opposite. Although models generalise well across topics, they struggle with unfamiliar interlocutors. Finally, we found that in zero-shot settings, LLMs often copy biographical details, facilitating identification but trivialising the task.
pdf
bib
abs
ICR Probe: Tracking Hidden State Dynamics for Reliable Hallucination Detection in LLMs
Zhenliang Zhang
|
Xinyu Hu
|
Huixuan Zhang
|
Junzhe Zhang
|
Xiaojun Wan
Large language models (LLMs) excel at various natural language processing tasks, but their tendency to generate hallucinations undermines their reliability. Existing hallucination detection methods leveraging hidden states predominantly focus on static and isolated representations, overlooking their dynamic evolution across layers, which limits efficacy. To address this limitation, we shift the focus to the hidden state update process and introduce a novel metric, the **ICR** Score (**I**nformation **C**ontribution to **R**esidual Stream), which quantifies the contribution of modules to the hidden states’ update. We empirically validate that the ICR Score is effective and reliable in distinguishing hallucinations. Building on these insights, we propose a hallucination detection method, the ICR Probe, which captures the cross-layer evolution of hidden states. Experimental results show that the ICR Probe achieves superior performance with significantly fewer parameters. Furthermore, ablation studies and case analyses offer deeper insights into the underlying mechanism of this method, improving its interpretability.
pdf
bib
abs
Revisit Self-Debugging with Self-Generated Tests for Code Generation
Xiancai Chen
|
Zhengwei Tao
|
Kechi Zhang
|
Changzhi Zhou
|
Xinyu Zhang
|
Wanli Gu
|
Yuanpeng He
|
Mengdi Zhang
|
Xunliang Cai
|
Haiyan Zhao
|
Zhi Jin
Large language models (LLMs) have demonstrated significant advancements in code generation, yet they still face challenges when tackling tasks that extend beyond their basic capabilities. Recently, the concept of self-debugging has been proposed as a way to enhance code generation performance by leveraging execution feedback from tests. However, the availability of high-quality tests in real-world scenarios is often limited. In this context, self-debugging with self-generated tests emerges as a promising solution, though its limitations and practical potential have not been fully explored. To address this gap, we investigate the efficacy of self-debugging in code generation tasks. We propose and analyze two distinct paradigms for the self-debugging process: post-execution and in-execution self-debugging. Our findings reveal that post-execution self-debugging struggles with the test bias introduced by self-generated tests, which can lead to misleading feedback. In contrast, in-execution self-debugging enables LLMs to mitigate this bias and leverage intermediate states during program execution. By focusing on runtime information rather than relying solely on potentially flawed self-generated tests, this approach demonstrates significant promise for improving the robustness and accuracy of LLMs in code generation tasks.
pdf
bib
abs
InSerter: Speech Instruction Following with Unsupervised Interleaved Pre-training
Dingdong Wang
|
Jin Xu
|
Ruihang Chu
|
Zhifang Guo
|
Xiong Wang
|
Jincenzi Wu
|
Dongchao Yang
|
Shengpeng Ji
|
Junyang Lin
Recent advancements in speech large language models (SpeechLLMs) have attracted considerable attention. Nonetheless, current methods exhibit suboptimal performance in adhering to speech instructions. Notably, the intelligence of models significantly diminishes when processing speech-form input as compared to direct text-form input. Prior work has attempted to mitigate this semantic inconsistency between speech and text representations through techniques such as representation and behavior alignment, which involve the meticulous design of data pairs during the post-training phase. In this paper, we introduce a simple and scalable training method called InSerter, which stands for Interleaved Speech-Text Representation Pre-training. InSerter is designed to pre-train on large-scale unsupervised speech-text sequences, where the speech is synthesized from randomly selected segments of an extensive text corpus using text-to-speech conversion. Consequently, the model acquires the ability to generate textual continuations corresponding to the provided speech segments, obviating the need for intensive data design endeavors. To systematically evaluate speech instruction-following capabilities, we introduce SpeechInstructBench, the first comprehensive benchmark specifically designed for speech-oriented instruction-following tasks. Our proposed model InSerter achieves SOTA performance on SpeechInstructBench and demonstrates superior or competitive results across diverse speech processing tasks.
pdf
bib
abs
Exploring LLMs’ Ability to Spontaneously and Conditionally Modify Moral Expressions through Text Manipulation
Candida Maria Greco
|
Lucio La Cava
|
Lorenzo Zangari
|
Andrea Tagarelli
Morality serves as the foundation of societal structure, guiding legal systems, shaping cultural values, and influencing individual self-perception. With the rise and pervasiveness of generative AI tools, and particularly Large Language Models (LLMs), concerns arise regarding how these tools capture and potentially alter moral dimensions through machine-generated text manipulation. Based on the Moral Foundation Theory, our work investigates this topic by analyzing the behavior of 12 LLMs among the most widely used open and uncensored (i.e., “abliterated”) models, and leveraging human-annotated datasets used in moral-related analysis. Results show varying levels of alteration of moral expressions depending on the type of text modification task and moral-related conditioning prompt.
pdf
bib
abs
Mixture of Ordered Scoring Experts for Cross-prompt Essay Trait Scoring
Po-Kai Chen
|
Bo-Wei Tsai
|
Shao Kuan Wei
|
Chien-Yao Wang
|
Jia-Ching Wang
|
Yi-Ting Huang
Automated Essay Scoring (AES) plays a crucial role in language assessment. In particular, cross-prompt essay trait scoring provides learners with valuable feedback to improve their writing skills. However, due to the scarcity of prompts, most existing methods overlook critical information, such as content from prompts or essays, resulting in incomplete assessment perspectives. In this paper, we propose a robust AES framework, the Mixture of Ordered Scoring Experts (MOOSE), which integrates information from both prompts and essays. MOOSE employs three specialized experts to evaluate (1) the overall quality of an essay, (2) the relative quality across multiple essays, and (3) the relevance between an essay and its prompt. MOOSE introduces the ordered aggregation of assessment results from these experts along with effective feature learning techniques. Experimental results demonstrate that MOOSE achieves exceptionally stable and state-of-the-art performance in both cross-prompt scoring and multi-trait scoring on the ASAP++ dataset. The source code is released at https://github.com/antslabtw/MOOSE-AES.
pdf
bib
abs
Sparse Logit Sampling: Accelerating Knowledge Distillation in LLMs
Anshumann Anshumann
|
Mohd Abbas Zaidi
|
Akhil Kedia
|
Jinwoo Ahn
|
Taehwak Kwon
|
Kangwook Lee
|
Haejun Lee
|
Joohyung Lee
Knowledge distillation can be a cost-effective technique to distill knowledge in Large Language Models, if the teacher output logits can be pre-computed and cached. However, successfully applying this to pre-training remains largely unexplored. In this work, we prove that naive approaches for sparse knowledge distillation such as caching Top-K probabilities, while intuitive, provide biased estimates of the teacher probability distribution to the student, resulting in suboptimal performance and calibration. We propose an importance-sampling-based method, ‘Random Sampling Knowledge Distillation’, which provides unbiased estimates, preserves the gradient in expectation, and requires storing significantly sparser logits. Our method enables faster training of student models with marginal overhead (<10%) compared to cross-entropy based training, while maintaining competitive performance compared to full distillation, across a range of model sizes from 300M to 3B.
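The contrast between Top-K caching and sampling can be seen on a toy distribution. The sketch below is a simplified illustration, not the paper's method: it assumes the sparse target is built by drawing token ids from the teacher distribution itself and storing their empirical frequencies, and the helper names (`top_k_sparse`, `sampled_sparse`, `average_estimate`) are hypothetical.

```python
import random
from collections import Counter
from typing import Dict, List


def top_k_sparse(probs: List[float], k: int) -> Dict[int, float]:
    """Keep the k largest probabilities and renormalise.

    Biased: every token outside the top k is assigned probability zero.
    """
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    mass = sum(probs[i] for i in top)
    return {i: probs[i] / mass for i in top}


def sampled_sparse(probs: List[float], num_samples: int, rng: random.Random) -> Dict[int, float]:
    """Draw token ids from the teacher distribution and store their frequencies.

    Sparse (at most num_samples entries) and unbiased: the expected frequency
    of token i equals probs[i].
    """
    draws = rng.choices(range(len(probs)), weights=probs, k=num_samples)
    return {i: count / num_samples for i, count in Counter(draws).items()}


def average_estimate(probs: List[float], num_samples: int, trials: int, seed: int = 0) -> List[float]:
    """Average many sampled sparse targets to check they recover the teacher."""
    rng = random.Random(seed)
    totals = [0.0] * len(probs)
    for _ in range(trials):
        for i, p in sampled_sparse(probs, num_samples, rng).items():
            totals[i] += p
    return [t / trials for t in totals]


if __name__ == "__main__":
    teacher = [0.5, 0.3, 0.15, 0.05]
    print("top-2 target:", top_k_sparse(teacher, 2))
    print("mean sampled target:", average_estimate(teacher, num_samples=4, trials=5000))
```

The Top-2 target never assigns mass to the two tail tokens no matter how often it is used, whereas the sampled targets average out to the full teacher distribution even though each individual target stores only a handful of entries.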
pdf
bib
abs
Enhancing Spoken Discourse Modeling in Language Models Using Gestural Cues
Varsha Suresh
|
M. Hamza Mughal
|
Christian Theobalt
|
Vera Demberg
Research in linguistics shows that non-verbal cues, such as gestures, play a crucial role in spoken discourse. For example, speakers perform hand gestures to indicate topic shifts, helping listeners identify transitions in discourse. In this work, we investigate whether the joint modeling of gestures using human motion sequences and language can improve spoken discourse modeling in language models. To integrate gestures into language models, we first encode 3D human motion sequences into discrete gesture tokens using a VQ-VAE. These gesture token embeddings are then aligned with text embeddings through feature alignment, mapping them into the text embedding space. To evaluate the gesture-aligned language model on spoken discourse, we construct text infilling tasks targeting three key discourse cues grounded in linguistic research: discourse connectives, stance markers, and quantifiers. Results show that incorporating gestures enhances marker prediction accuracy across the three tasks, highlighting the complementary information that gestures can offer in modeling spoken discourse. We view this work as an initial step toward leveraging non-verbal cues to advance spoken language modeling in language models.
pdf
bib
abs
ExploraCoder: Advancing Code Generation for Multiple Unseen APIs via Planning and Chained Exploration
Yunkun Wang
|
Yue Zhang
|
Zhen Qin
|
Chen Zhi
|
Binhua Li
|
Fei Huang
|
Yongbin Li
|
Shuiguang Deng
Large language models face intrinsic limitations in coding with APIs that are unseen in their training corpora. As libraries continuously evolve, it becomes impractical to exhaustively retrain LLMs with new API knowledge. This limitation hampers LLMs from solving programming problems which require newly introduced or privately maintained libraries. Inspired by the exploratory programming paradigm in human behavior, we propose **ExploraCoder**, a training-free framework that empowers LLMs to invoke multiple unseen APIs in code solutions by (1) planning a complex problem into several API invocation subtasks, and (2) experimenting with correct API usage at intermediate steps through a novel chain-of-API-exploration. We conduct evaluation on program synthesis tasks involving complex API interactions. Experimental results demonstrate that ExploraCoder significantly improves performance for models lacking prior API knowledge, achieving absolute increases of up to 11.99% over retrieval-based approaches and 17.28% over pretraining-based methods in pass@10.
pdf
bib
abs
Segment First or Comprehend First? Explore the Limit of Unsupervised Word Segmentation with Large Language Models
Zihong Zhang
|
Liqi He
|
Zuchao Li
|
Lefei Zhang
|
Hai Zhao
|
Bo Du
Word segmentation stands as a cornerstone of Natural Language Processing (NLP). Based on the concept of “comprehend first, segment later”, we propose a new framework to explore the limit of unsupervised word segmentation with Large Language Models (LLMs) and evaluate the semantic understanding capabilities of LLMs based on word segmentation. We employ current mainstream LLMs to perform word segmentation across multiple languages to assess LLMs’ “comprehension”. Our findings reveal that LLMs are capable of following simple prompts to segment raw text into words. There is a trend suggesting that models with more parameters tend to perform better on multiple languages. Additionally, we introduce a novel unsupervised method, termed LLACA (Large Language Model-Inspired Aho-Corasick Automaton). Leveraging the advanced pattern recognition capabilities of Aho-Corasick automata, LLACA innovatively combines these with the deep insights of well-pretrained LLMs. This approach not only enables the construction of a dynamic n-gram model that adjusts based on contextual information but also integrates the nuanced understanding of LLMs, offering significant improvements over traditional methods. Our source code is available at https://github.com/hkr04/LLACA
pdf
bib
abs
RUBY: An Effective Framework for Multi-Constraint Multi-Hop Question Generation
Wenzhuo Zhao
|
Shuangyin Li
Inspired by theories in language psychology, it is natural to consider more constraints, such as intentions, logic, knowledge, etc., when a complex or multi-hop question is generated. As a subtask of Multi-Hop Question Generation (MHQG), the task of Multi-Constraint Multi-Hop Question Generation (MCHQG) is more aligned with human question theories. However, it is hard to determine how to bring in various high-dimensional semantic constraints, and how to integrate each constraint across all hops when a multi-hop question is being generated. To address these challenges, we introduce an effective framework which includes constraint dimensionality reduction and divide-and-conquer-based dynamic projection; we call it RUBY. The proposed RUBY contains a module of high-dimensional semantic constraint dimension reduction and a module of sub-question answer pairs-based multi-hop question generation. Meanwhile, a Reasoning Dynamic Projection strategy is tailored to effectively incorporate the constraints into every hop of the multi-hop question. The experimental results demonstrate that RUBY consistently outperforms baseline models, which suggests that RUBY is able to effectively capture and integrate semantic constraints, leading to more accurate and human-like multi-hop question generation. Our code and data are available.
pdf
bib
abs
Can Indirect Prompt Injection Attacks Be Detected and Removed?
Yulin Chen
|
Haoran Li
|
Yuan Sui
|
Yufei He
|
Yue Liu
|
Yangqiu Song
|
Bryan Hooi
Prompt injection attacks manipulate large language models (LLMs) by misleading them to deviate from the original input instructions and execute maliciously injected instructions, because of their instruction-following capabilities and inability to distinguish between the original input instructions and maliciously injected instructions. To defend against such attacks, recent studies have developed various detection mechanisms. If we restrict ourselves specifically to works which perform detection rather than direct defense, most of them focus on direct prompt injection attacks, while there are few works for the indirect scenario, where injected instructions are indirectly from external tools, such as a search engine. Moreover, current works mainly investigate injection detection methods and pay less attention to the post-processing method that aims to mitigate the injection after detection. In this paper, we investigate the feasibility of detecting and removing indirect prompt injection attacks, and we construct a benchmark dataset for evaluation. For detection, we assess the performance of existing LLMs and open-source detection models, and we further train detection models using our crafted training datasets. For removal, we evaluate two intuitive methods: (1) the *segmentation removal method*, which segments the injected document and removes parts containing injected instructions, and (2) the *extraction removal method*, which trains an extraction model to identify and remove injected instructions.
pdf
bib
abs
Identifying Open Challenges in Language Identification
Rob Van Der Goot
Automatic language identification is a core problem of many Natural Language Processing (NLP) pipelines. A wide variety of architectures and benchmarks have been proposed with often near-perfect performance. Although previous studies have focused on certain challenging setups (i.e. cross-domain, short inputs), a systematic comparison is missing. We propose a benchmark that allows us to test for the effect of input size, training data size, domain, number of languages, scripts, and language families on performance. We evaluate five popular models on this benchmark and identify which open challenges remain for this task as well as which architectures achieve robust performance. We find that cross-domain setups are the most challenging (although arguably most relevant), and that number of languages, variety in scripts, and variety in language families have only a small impact on performance. We also contribute practical takeaways: training with 1,000 instances per language and a maximum input length of 100 characters is enough for robust language identification. Based on our findings, we train an accurate (94.41%) multi-domain language identification model on 2,034 languages, for which we also provide an analysis of the remaining errors.
pdf
bib
abs
The Distracting Effect: Understanding Irrelevant Passages in RAG
Chen Amiraz
|
Florin Cuconasu
|
Simone Filice
|
Zohar Karnin
A well-known issue with Retrieval Augmented Generation (RAG) is that retrieved passages that are irrelevant to the query sometimes distract the answer-generating LLM, causing it to provide an incorrect response. In this paper, we shed light on this core issue and formulate the distracting effect of a passage w.r.t. a query (and an LLM). We provide a quantifiable measure of the distracting effect of a passage and demonstrate its robustness across LLMs. Our research introduces novel methods for identifying and using hard distracting passages to improve RAG systems. By fine-tuning LLMs with these carefully selected distracting passages, we achieve up to a 7.5% increase in answering accuracy compared to counterparts fine-tuned on conventional RAG datasets. Our contribution is two-fold: first, we move beyond the simple binary classification of irrelevant passages as either completely unrelated vs. distracting, and second, we develop and analyze multiple methods for finding hard distracting passages. To our knowledge, no other research has provided such a comprehensive framework for identifying and utilizing hard distracting passages.
pdf
bib
abs
Multilingual Encoder Knows more than You Realize: Shared Weights Pretraining for Extremely Low-Resource Languages
Zeli Su
|
Ziyin Zhang
|
Guixian Xu
|
Jianing Liu
|
Xu Han
|
Ting Zhang
|
Yushuang Dong
While multilingual language models like XLM-R have advanced multilingualism in NLP, they still perform poorly in extremely low-resource languages. This situation is exacerbated by the fact that modern LLMs such as LLaMA and Qwen support far fewer languages than XLM-R, making text generation models non-existent for many languages in the world. To tackle this challenge, we propose a novel framework for adapting multilingual encoders to text generation in extremely low-resource languages. By reusing the weights between the encoder and the decoder, our framework allows the model to leverage the learned semantic space of the encoder, enabling efficient learning and effective generalization in low-resource languages. Applying this framework to four Chinese minority languages, we present XLM-SWCM, and demonstrate its superior performance on various downstream tasks even when compared with much larger models.
pdf
bib
abs
Graphically Speaking: Unmasking Abuse in Social Media with Conversation Insights
Célia Nouri
|
Chloé Clavel
|
Jean-Philippe Cointet
Detecting abusive language in social media conversations poses significant challenges, as identifying abusiveness often depends on the conversational context, characterized by the content and topology of preceding comments. Traditional Abusive Language Detection (ALD) models often overlook this context, which can lead to unreliable performance metrics. Recent Natural Language Processing (NLP) approaches that incorporate conversational context often rely on limited or overly simplified representations of this context, leading to inconsistent and sometimes inconclusive results. In this paper, we propose a novel approach that utilizes graph neural networks (GNNs) to model social media conversations as graphs, where nodes represent comments, and edges capture reply structures. We systematically investigate various graph representations and context windows to identify the optimal configurations for ALD. Our GNN model outperforms both context-agnostic baselines and linear context-aware methods, achieving significant improvements in F1 scores. These findings demonstrate the critical role of structured conversational context and establish GNNs as a robust framework for advancing context-aware ALD.
pdf
bib
abs
CodeTool: Enhancing Programmatic Tool Invocation of LLMs via Process Supervision
YifeiLu YifeiLu
|
Fanghua Ye
|
Jian Li
|
Qiang Gao
|
Cheng Liu
|
Haibo Luo
|
Nan Du
|
Xiaolong Li
|
Feiliang Ren
Tool invocation significantly enhances the capabilities of Large Language Models (LLMs), yet challenges persist, particularly in complex task scenarios. Current methods, such as instruction-enhanced reasoning and supervised fine-tuning, often result in unnecessarily long reasoning paths and face difficulties in verifying the correctness of intermediate steps. In this paper, we propose CodeTool, a novel framework for stepwise code generation that improves LLM tool invocation by leveraging the concise and easily verifiable nature of code. CodeTool incorporates two distinct process rewards: the On-the-spot Reward, which provides immediate feedback on the accuracy of each tool invocation, and the Latent Reward, which assesses the contribution of each step toward overall task completion. By maximizing the cumulative reward of the On-the-spot and Latent Rewards at each step, LLMs are guided to follow efficient and accurate reasoning paths. Extensive experiments on StableToolBench and RestBench-TMDB demonstrate the superiority of CodeTool over existing approaches.
pdf
bib
abs
RARE: Retrieval-Augmented Reasoning Enhancement for Large Language Models
Hieu Tran
|
Zonghai Yao
|
Zhichao Yang
|
Junda Wang
|
Yifan Zhang
|
Shuo Han
|
Feiyun Ouyang
|
Hong Yu
This work introduces RARE (Retrieval-Augmented Reasoning Enhancement), a versatile extension to the mutual reasoning framework (rStar), aimed at enhancing reasoning accuracy and factual integrity across large language models (LLMs) for complex, knowledge-intensive tasks such as medical and commonsense reasoning. RARE incorporates two innovative actions within the Monte Carlo Tree Search (MCTS) framework: (A6), which generates search queries based on the initial problem statement, performs information retrieval using those queries, and augments reasoning with the retrieved data to formulate the final answer; and (A7), which leverages information retrieval specifically for generated sub-questions and re-answers these sub-questions with the relevant contextual information. Additionally, a Retrieval-Augmented Factuality Scorer is proposed to replace the original discriminator, prioritizing reasoning paths that meet high standards of factuality. Experimental results with LLaMA 3.1 show that RARE enables open-source LLMs to achieve competitive performance with top closed-source models like GPT-4 and GPT-4o. This research establishes RARE as a scalable solution for improving LLMs in domains where logical coherence and factual integrity are critical.
pdf
bib
abs
Defense Against Prompt Injection Attack by Leveraging Attack Techniques
Yulin Chen
|
Haoran Li
|
Zihao Zheng
|
Dekai Wu
|
Yangqiu Song
|
Bryan Hooi
With the advancement of technology, large language models (LLMs) have achieved remarkable performance across various natural language processing (NLP) tasks, powering LLM-integrated applications like Microsoft Copilot. However, as LLMs continue to evolve, new vulnerabilities, especially prompt injection attacks, arise. These attacks trick LLMs into deviating from the original input instructions and executing the attacker’s instructions injected in data content, such as retrieved results. Recent attack methods leverage LLMs’ instruction-following abilities and their inability to distinguish instructions injected in the data content, and achieve a high attack success rate (ASR). When comparing the attack and defense methods, we interestingly find that they share similar design goals: inducing the model to ignore unwanted instructions and instead execute wanted instructions. Therefore, we raise an intuitive question: *Could these attack techniques be utilized for defensive purposes?* In this paper, we invert the intention of prompt injection methods to develop novel defense methods based on previous training-free attack methods, by repeating the attack process but with the original input instruction rather than the injected instruction. Our comprehensive experiments demonstrate that our defense techniques outperform existing defense approaches, achieving state-of-the-art results.
pdf
bib
abs
Acquisition and Application of Novel Knowledge in Large Language Models
Ziyu Shang
|
Jianghan Liu
|
Zhizhao Luo
|
Peng Wang
|
Wenjun Ke
|
Jiajun Liu
|
Zijie Xu
|
Guozheng Li
Recent advancements in large language models (LLMs) have demonstrated their impressive generative capabilities, primarily due to their extensive parameterization, which enables them to encode vast knowledge. However, effectively integrating new knowledge into LLMs remains a major challenge. Current research typically first constructs novel knowledge datasets and then injects this knowledge into LLMs through various techniques. However, existing methods for constructing new datasets either rely on timestamps, which lack rigor, or use simple templates for synthesis, which are simplistic and do not accurately reflect the real world. To address this issue, we propose a novel knowledge dataset construction approach that simulates biological evolution using knowledge graphs to generate synthetic entities with diverse attributes, resulting in a dataset, NovelHuman. Systematic analysis on NovelHuman reveals that the intra-sentence position of knowledge significantly affects the acquisition of knowledge. Therefore, we introduce an intra-sentence permutation to enhance knowledge acquisition. Furthermore, given that potential conflicts exist between autoregressive (AR) training objectives and permutation-based learning, we propose PermAR, a permutation-based language modeling framework for AR models. PermAR seamlessly integrates with mainstream AR architectures, endowing them with bidirectional knowledge acquisition capabilities. Extensive experiments demonstrate the superiority of PermAR, outperforming knowledge augmentation methods by 3.3%-38%.
pdf
bib
abs
DNCASR: End-to-End Training for Speaker-Attributed ASR
Xianrui Zheng
|
Chao Zhang
|
Phil Woodland
This paper introduces DNCASR, a novel end-to-end trainable system designed for joint neural speaker clustering and automatic speech recognition (ASR), enabling speaker-attributed transcription of long multi-party meetings. DNCASR uses two separate encoders to independently encode global speaker characteristics and local waveform information, along with two linked decoders to generate speaker-attributed transcriptions. The use of linked decoders allows the entire system to be jointly trained under a unified loss function. By employing a serialised training approach, DNCASR effectively addresses overlapping speech in real-world meetings, where the link improves the prediction of speaker indices in overlapping segments. Experiments on the AMI-MDM meeting corpus demonstrate that the jointly trained DNCASR outperforms a parallel system that does not have links between the speaker and ASR decoders. Using cpWER to measure the speaker-attributed word error rate, DNCASR achieves a 9.0% relative reduction on the AMI-MDM Eval set.
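For readers unfamiliar with the metric, the following is a minimal sketch of cpWER as it is commonly defined (concatenated minimum-permutation word error rate): each speaker's utterances are concatenated, and the word errors are summed under the speaker assignment that minimizes them. The function names and the input layout (one list of utterance strings per speaker) are assumptions made for illustration and are not taken from the paper; the brute-force search over permutations is only practical for a handful of speakers.

```python
from itertools import permutations


def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1]


def cp_wer(ref_by_speaker, hyp_by_speaker):
    """cpWER sketch: concatenate per-speaker utterances, then take the
    speaker permutation with the fewest total word errors."""
    refs = [" ".join(utts).split() for utts in ref_by_speaker]
    hyps = [" ".join(utts).split() for utts in hyp_by_speaker]
    # Pad the shorter side with empty speakers so every stream is paired.
    n = max(len(refs), len(hyps))
    refs += [[] for _ in range(n - len(refs))]
    hyps += [[] for _ in range(n - len(hyps))]
    total_ref_words = sum(len(r) for r in refs)
    if total_ref_words == 0:
        raise ValueError("reference contains no words")
    best_errors = min(
        sum(edit_distance(r, h) for r, h in zip(refs, perm))
        for perm in permutations(hyps)
    )
    return best_errors / total_ref_words
```

Because the best permutation is chosen, swapping the order of hypothesis speakers does not change the score; only wrong words or words attributed to the wrong speaker do.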
pdf
bib
abs
Exploring Persona Sentiment Sensitivity in Personalized Dialogue Generation
Yonghyun Jun
|
Hwanhee Lee
Personalized dialogue systems have advanced considerably with the integration of user-specific personas into large language models (LLMs). However, while LLMs can effectively generate personalized responses, the influence of persona sentiment on dialogue quality remains underexplored. In this work, we conduct a large-scale analysis of dialogues generated using a range of polarized user profiles. Our experiments reveal that dialogues involving negatively polarized users tend to overemphasize persona attributes. In contrast, positively polarized profiles yield dialogues that selectively incorporate persona information, resulting in smoother interactions. Furthermore, we find that personas with weak or neutral sentiment generally produce lower-quality dialogues. Motivated by these findings, we propose a dialogue generation approach that explicitly accounts for persona polarity by combining a turn-based generation strategy with a profile ordering mechanism and sentiment-aware prompting. Our study provides new insights into the sensitivity of LLMs to persona sentiment and offers guidance for developing more robust and nuanced personalized dialogue systems.
pdf
bib
abs
AntiLeakBench: Preventing Data Contamination by Automatically Constructing Benchmarks with Updated Real-World Knowledge
Xiaobao Wu
|
Liangming Pan
|
Yuxi Xie
|
Ruiwen Zhou
|
Shuai Zhao
|
Yubo Ma
|
Mingzhe Du
|
Rui Mao
|
Anh Tuan Luu
|
William Yang Wang
Data contamination hinders fair LLM evaluation by introducing test data into newer models’ training sets. Existing studies solve this challenge by updating benchmarks with newly collected data. However, they fail to guarantee contamination-free evaluation as the newly collected data may contain pre-existing knowledge, and their benchmark updates rely on intensive human labor. To address these issues, in this paper we propose AntiLeak-Bench, an automated anti-leakage benchmarking framework. Instead of simply using newly collected data, we construct samples with explicitly new knowledge absent from LLMs’ training sets, which thus ensures strictly contamination-free evaluation. We further design a fully automated workflow to build and update our benchmark without human labor. This significantly reduces the cost of benchmark maintenance to accommodate emerging LLMs. Through extensive experiments, we highlight that data contamination likely exists before LLMs’ cutoff time and demonstrate that AntiLeak-Bench effectively overcomes this challenge.
pdf
bib
abs
LLM-Guided Semantic-Aware Clustering for Topic Modeling
Jianghan Liu
|
Ziyu Shang
|
Wenjun Ke
|
Peng Wang
|
Zhizhao Luo
|
Jiajun Liu
|
Guozheng Li
|
Yining Li
Topic modeling aims to discover the distribution of topics within a corpus. The advanced comprehension and generative capabilities of large language models (LLMs) have introduced new avenues for topic modeling, particularly by prompting LLMs to generate topics and refine them by merging similar ones. However, this approach necessitates that LLMs generate topics with consistent granularity, thus relying on the exceptional instruction-following capabilities of closed-source LLMs (such as GPT-4) or requiring additional training. Moreover, merging based only on topic words and neglecting the fine-grained semantics within documents might fail to fully uncover the underlying topic structure. In this work, we propose a semi-supervised topic modeling method, LiSA, that combines LLMs with clustering to improve topic generation and distribution. Specifically, we begin with prompting LLMs to generate a candidate topic word for each document, thereby constructing a topic-level semantic space. To further utilize the mutual complementarity between them, we first cluster documents and candidate topic words, and then establish a mapping from document to topic in the LLM-guided assignment stage. Subsequently, we introduce a collaborative enhancement strategy to align the two semantic spaces and establish a better topic distribution. Experimental results demonstrate that LiSA outperforms state-of-the-art methods that utilize GPT-4 on topic alignment, and exhibits competitive performance compared to Neural Topic Models on topic quality. The codes are available at https://github.com/ljh986/LiSA.
pdf
bib
abs
Hierarchical Bracketing Encodings for Dependency Parsing as Tagging
Ana Ezquerro
|
David Vilares
|
Anssi Yli-Jyrä
|
Carlos Gómez-Rodríguez
We present a family of encodings for sequence labeling dependency parsing, based on the concept of hierarchical bracketing. We show that the existing 4-bit projective encoding belongs to this family, but it is suboptimal in the number of labels used to encode a tree. We derive an optimal hierarchical bracketing, which minimizes the number of symbols used and encodes projective trees using only 12 distinct labels (vs. 16 for the 4-bit encoding). We also extend optimal hierarchical bracketing to support arbitrary non-projectivity in a more compact way than previous encodings. Our new encodings yield competitive accuracy on a diverse set of treebanks.
pdf
bib
abs
OASIS: Order-Augmented Strategy for Improved Code Search
Gao Zuchen
|
Zizheng Zhan
|
Xianming Li
|
Erxin Yu
|
Haotian Zhang
|
Chenbin Chenbin
|
Yuqun Zhang
|
Jing Li
Code embeddings capture the semantic representations of code and are crucial for various code-related large language model (LLM) applications, such as code search. Previous training primarily relies on optimizing the InfoNCE loss by comparing positive natural language (NL)-code pairs with in-batch negatives. However, due to the sparse nature of code contexts, training solely by comparing the major differences between positive and negative pairs may fail to capture deeper semantic nuances. To address this issue, we propose a novel order-augmented strategy for improved code search (OASIS). It leverages order-based similarity labels to train models to capture subtle differences in similarity among negative pairs. Extensive benchmark evaluations demonstrate that our OASIS model significantly outperforms previous state-of-the-art models focusing solely on major positive-negative differences. It underscores the value of exploiting subtle differences among negative pairs with order labels for effective code embedding training.
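As background for the baseline objective mentioned above, here is a minimal sketch of the standard InfoNCE loss with in-batch negatives, written in plain Python. It is not the OASIS objective itself; the function name, the square similarity-matrix input, and the temperature value are assumptions chosen for illustration.

```python
import math


def info_nce_loss(sim, temperature=0.05):
    """Mean InfoNCE loss over a batch.

    sim[i][j] is the similarity between query i and code snippet j.
    The diagonal entry sim[i][i] is the positive pair; every other
    entry in row i is treated as an in-batch negative.
    """
    losses = []
    for i, row in enumerate(sim):
        logits = [s / temperature for s in row]
        m = max(logits)  # subtract the max for numerical stability
        log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
        # Negative log-probability of the positive under a softmax over the row.
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)
```

Note that the loss only contrasts the positive against all negatives as a group; it carries no signal about which negatives are closer to the query than others, which is the gap the abstract describes.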
pdf
bib
abs
Can Large Language Models Detect Errors in Long Chain-of-Thought Reasoning?
Yancheng He
|
Shilong Li
|
Jiaheng Liu
|
Weixun Wang
|
Xingyuan Bu
|
Ge Zhang
|
Z.y. Peng
|
Zhaoxiang Zhang
|
Zhicheng Zheng
|
Wenbo Su
|
Bo Zheng
Recently, o1-like models have drawn significant attention, where these models produce long Chain-of-Thought (CoT) reasoning steps to improve the reasoning abilities of existing Large Language Models (LLMs). In this paper, to understand the qualities of these long CoTs and measure the critique abilities of existing LLMs on these long CoTs, we introduce DeltaBench, which includes long CoTs generated by different o1-like models (e.g., QwQ, DeepSeek-R1) for different reasoning tasks (e.g., Math, Code, General Reasoning), to measure the ability to detect errors in long CoT reasoning. Based on DeltaBench, we first perform fine-grained analysis of the generated long CoTs to discover the effectiveness and efficiency of different o1-like models. Then, we conduct extensive evaluations of existing process reward models (PRMs) and critic models to detect the errors of each annotated process, which aims to investigate the boundaries and limitations of existing PRMs and critic models. Finally, we hope that DeltaBench could guide developers to better understand the long CoT reasoning abilities of their models.
pdf
bib
abs
OmniAlign-V: Towards Enhanced Alignment of MLLMs with Human Preference
Xiangyu Zhao
|
Shengyuan Ding
|
Zicheng Zhang
|
Haian Huang
|
Maosongcao Maosongcao
|
Jiaqi Wang
|
Weiyun Wang
|
Xinyu Fang
|
Wenhai Wang
|
Guangtao Zhai
|
Hua Yang
|
Haodong Duan
|
Kai Chen
Recent advancements in open-source multi-modal large language models (MLLMs) have primarily focused on enhancing foundational capabilities, leaving a significant gap in human preference alignment. This paper introduces OmniAlign-V, a comprehensive dataset of 200K high-quality training samples featuring diverse images, complex questions, and varied response formats to improve MLLMs’ alignment with human preferences. We also present MM-AlignBench, a human-annotated benchmark specifically designed to evaluate MLLMs’ alignment with human values. Experimental results show that finetuning MLLMs with OmniAlign-V, using Supervised Fine-Tuning (SFT) or Direct Preference Optimization (DPO), significantly enhances human preference alignment while maintaining or enhancing performance on standard VQA benchmarks, preserving their fundamental capabilities.
pdf
bib
abs
Tree-KG: An Expandable Knowledge Graph Construction Framework for Knowledge-intensive Domains
Songjie Niu
|
Kaisen Yang
|
Rui Zhao
|
Yichao Liu
|
Zonglin Li
|
Hongning Wang
|
Wenguang Chen
In knowledge-intensive domains like scientific research, effective decisions rely on organizing and retrieving intricate data. Knowledge graphs (KGs) help by structuring entities, relations, and contextual dependencies, but building KGs in such domains is challenging due to inherent complexity, manual effort, and rapid evolution. Inspired by how humans organize knowledge hierarchically, we propose Tree-KG, an expandable framework that combines structured domain texts with advanced semantic techniques. First, Tree-KG builds a tree-like graph from textbook structures using large language models (LLMs) and domain-specific entities, creating an explicit KG. Then, through iterative expansion with flexible, predefined operators, it uncovers hidden KG while preserving semantic coherence. Experiments demonstrate that Tree-KG consistently surpasses competing methods, achieving the highest F1 scores (12–16% above the second-best), with notable performance (F1 0.81) on the Text-Annotated dataset, highlighting its effectiveness in extracting high-quality information from source texts. Additionally, Tree-KG provides superior structural alignment, domain-specific extraction, and cost-efficiency, delivering robust results with reduced token usage and adaptable, resource-conscious deployment.
pdf
bib
abs
Measuring Data Diversity for Instruction Tuning: A Systematic Analysis and A Reliable Metric
Yuming Yang
|
Yang Nan
|
Junjie Ye
|
Shihan Dou
|
Xiao Wang
|
Shuo Li
|
Huijie Lv
|
Tao Gui
|
Qi Zhang
|
Xuanjing Huang
Data diversity is crucial for the instruction tuning of large language models. Existing studies have explored various diversity-aware data selection methods to construct high-quality datasets and enhance model performance. However, the fundamental problem of precisely defining and measuring data diversity remains underexplored, limiting clear guidance for data engineering. To address this, we systematically analyze 11 existing diversity measurement methods by evaluating their correlation with model performance through extensive fine-tuning experiments. Our results indicate that a reliable diversity measure should properly account for both inter-sample differences and the information density in the sample space. Building on this, we propose NovelSum, a new diversity metric based on sample-level “novelty.” Experiments on both simulated and real-world data show that NovelSum accurately captures diversity variations and achieves a 0.97 correlation with instruction-tuned model performance, highlighting its value in guiding data engineering practices. With NovelSum as an optimization objective, we further develop a greedy, diversity-oriented data selection strategy that outperforms existing approaches, validating both the effectiveness and practical significance of our metric.
pdf
bib
abs
Micro-Act: Mitigate Knowledge Conflict in Question Answering via Actionable Self-Reasoning
Nan Huo
|
Jinyang Li
|
Bowen Qin
|
Ge Qu
|
Xiaolong Li
|
Xiaodong Li
|
Chenhao Ma
|
Reynold Cheng
Retrieval-Augmented Generation (RAG) systems commonly suffer from **Knowledge Conflicts**, where retrieved external knowledge contradicts the inherent, parametric knowledge of large language models (LLMs). This adversely affects performance on downstream tasks such as question answering (QA). Existing approaches often attempt to mitigate conflicts by directly comparing two knowledge sources in a side-by-side manner, but this can overwhelm LLMs with extraneous or lengthy contexts, ultimately hindering their ability to identify and mitigate inconsistencies. To address this issue, we propose **Micro-Act**, a framework with a hierarchical action space that automatically perceives context complexity and adaptively decomposes each knowledge source into a sequence of fine-grained comparisons. These comparisons are represented as actionable steps, enabling reasoning beyond the superficial context. Through extensive experiments on five benchmark datasets, Micro-Act consistently achieves a significant increase in QA accuracy over state-of-the-art baselines across all 5 datasets and 3 conflict types, especially in temporal and semantic types where all baselines fail significantly. More importantly, Micro-Act exhibits robust performance on non-conflict questions simultaneously, highlighting its practical value in real-world RAG applications.
pdf
bib
abs
Minimal Pair-Based Evaluation of Code-Switching
Igor Sterner
|
Simone Teufel
There is a lack of an evaluation methodology that estimates the extent to which large language models (LLMs) use code-switching (CS) in the same way as bilinguals. Existing methods do not have wide language coverage, fail to account for the diverse range of CS phenomena, or do not scale. We propose an intervention based on minimal pairs of CS. Each minimal pair contains one naturally occurring CS sentence and one minimally manipulated variant. We collect up to 1,000 such pairs each for 11 language pairs. Our human experiments show that, for every language pair, bilinguals consistently prefer the naturally occurring CS sentence. Meanwhile our experiments with current LLMs show that the larger the model, the more consistently it assigns higher probability to the naturally occurring CS sentence than to the variant. In accordance with theoretical claims, the largest probability differences arise in those pairs where the manipulated material consisted of closed-class words.
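The model-side comparison described above can be sketched as a simple preference rate: the fraction of minimal pairs in which the model scores the naturally occurring sentence above its manipulated variant. The function name and the input format (pairs of sentence log-probabilities already computed by some LLM) are assumptions for illustration, and counting ties as failures is a choice made here rather than something stated in the abstract.

```python
def minimal_pair_accuracy(pairs):
    """Fraction of pairs where the natural sentence scores strictly higher.

    pairs: iterable of (logprob_natural, logprob_variant) tuples, one per
    minimal pair. Ties count as failures in this sketch.
    """
    pairs = list(pairs)
    if not pairs:
        raise ValueError("need at least one minimal pair")
    wins = sum(1 for natural, variant in pairs if natural > variant)
    return wins / len(pairs)


# Hypothetical log-probabilities for four minimal pairs.
scores = [(-10.0, -12.0), (-8.0, -7.5), (-5.0, -9.0), (-3.0, -3.5)]
print(minimal_pair_accuracy(scores))  # 0.75
```

A value near 0.5 would indicate no systematic preference, while values approaching 1.0 would match the consistent preference reported for bilingual speakers.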
pdf
bib
abs
DNASpeech: A Contextualized and Situated Text-to-Speech Dataset with Dialogues, Narratives and Actions
Chuanqi Cheng
|
Hongda Sun
|
Bo Du
|
Shuo Shang
|
Xinrong Hu
|
Rui Yan
In this paper, we propose contextualized and situated text-to-speech (CS-TTS), a novel TTS task to promote more accurate and customized speech generation using prompts with Dialogues, Narratives, and Actions (DNA). While prompt-based TTS methods facilitate controllable speech generation, existing TTS datasets lack situated descriptive prompts aligned with speech data. To address this data scarcity, we develop an automatic annotation pipeline enabling multifaceted alignment among speech clips, content text, and their respective descriptions. Based on this pipeline, we present DNASpeech, a novel CS-TTS dataset with high-quality speeches with DNA prompt annotations. DNASpeech contains 2,395 distinct characters, 4,452 scenes, and 22,975 dialogue utterances, along with over 18 hours of high-quality speech recordings. To accommodate more specific task scenarios, we establish a leaderboard featuring two new subtasks for evaluation: CS-TTS with narratives and CS-TTS with dialogues. We also design an intuitive baseline model for comparison with existing state-of-the-art TTS methods on our leaderboard. Comprehensive experimental results demonstrate the quality and effectiveness of DNASpeech, validating its potential to drive advancements in the TTS field.
pdf
bib
abs
LLaMA-Omni 2: LLM-based Real-time Spoken Chatbot with Autoregressive Streaming Speech Synthesis
Qingkai Fang
|
Yan Zhou
|
Shoutao Guo
|
Shaolei Zhang
|
Yang Feng
Real-time, intelligent, and natural speech interaction is an essential part of the next-generation human-computer interaction. Recent advancements have showcased the potential of building intelligent spoken chatbots based on large language models (LLMs). In this paper, we introduce LLaMA-Omni 2, a series of speech language models (SpeechLMs) ranging from 0.5B to 14B parameters, capable of achieving high-quality real-time speech interaction. LLaMA-Omni 2 is built upon the Qwen2.5 series models, integrating a speech encoder and an autoregressive streaming speech decoder. Despite being trained on only 200K multi-turn speech dialogue samples, LLaMA-Omni 2 demonstrates strong performance on several spoken question answering and speech instruction following benchmarks, surpassing previous state-of-the-art SpeechLMs like GLM-4-Voice, which was trained on millions of hours of speech data.
pdf
bib
abs
Error Comparison Optimization for Large Language Models on Aspect-Based Sentiment Analysis
Qianlong Wang
|
Keyang Ding
|
Hengxin Gao
|
Hui Wang
|
Ruifeng Xu
Supervised fine-tuning (SFT) has enabled large language models (LLMs) to exhibit promising performance on various tasks. However, this fine-tuning process only compares current predictions with labels on each sample and fails to perceive the differing severity of its erroneous outputs, which may produce a large percentage of serious errors. This poses a problem for aspect-based sentiment analysis (ABSA) in that these serious errors bring a greater negative impact than acceptable ones. Humans tend to compare mistakes to understand their varying severity, thus avoiding major bad decisions. Inspired by this, we propose a simple yet effective framework that perceives and understands the severity of different errors by learning from comparative error pairs. It uses the SFT model to yield multiple outputs on each sample and selects acceptable and severe errors based on their acceptability scores. Together with the labels, we construct two comparative error pairs and exploit their calibration losses to optimize parameters. We conduct comprehensive experiments on ABSA datasets to demonstrate the effectiveness of our framework over baselines.
pdf
bib
abs
The AI Gap: How Socioeconomic Status Affects Language Technology Interactions
Elisa Bassignana
|
Amanda Cercas Curry
|
Dirk Hovy
Socioeconomic status (SES) fundamentally influences how people interact with each other and, more recently, with digital technologies like large language models (LLMs). While previous research has highlighted the interaction between SES and language technology, it was limited by reliance on proxy metrics and synthetic data. We survey 1,000 individuals from ‘diverse socioeconomic backgrounds’ about their use of language technologies and generative AI, and collect 6,482 prompts from their previous interactions with LLMs. We find systematic differences across SES groups in language technology usage (i.e., frequency, performed tasks), interaction styles, and topics. Higher SES is associated with a higher level of abstraction, more concise requests, and topics like ‘inclusivity’ and ‘travel’. Lower SES correlates with higher anthropomorphization of LLMs (using “hello” and “thank you”) and more concrete language. Our findings suggest that while generative language technologies are becoming more accessible to everyone, socioeconomic linguistic differences still stratify their use to create a digital divide. These differences underscore the importance of considering SES in developing language technologies to accommodate varying linguistic needs rooted in socioeconomic factors and limit the AI Gap across SES groups.
pdf
bib
abs
Probing LLMs for Multilingual Discourse Generalization Through a Unified Label Set
Florian Eichin
|
Yang Janet Liu
|
Barbara Plank
|
Michael A. Hedderich
Discourse understanding is essential for many NLP tasks, yet most existing work remains constrained by framework-dependent discourse representations. This work investigates whether large language models (LLMs) capture discourse knowledge that generalizes across languages and frameworks. We address this question along two dimensions: (1) developing a unified discourse relation label set to facilitate cross-lingual and cross-framework discourse analysis, and (2) probing LLMs to assess whether they encode generalizable discourse abstractions. Using multilingual discourse relation classification as a testbed, we examine a comprehensive set of 23 LLMs of varying sizes and multilingual capabilities. Our results show that LLMs, especially those with multilingual training corpora, can generalize discourse information across languages and frameworks. Further layer-wise analyses reveal that language generalization at the discourse level is most salient in the intermediate layers. Lastly, our error analysis provides an account of challenging relation classes.
pdf
bib
abs
Crowdsource, Crawl, or Generate? Creating SEA-VL, a Multicultural Vision-Language Dataset for Southeast Asia
Samuel Cahyawijaya
|
Holy Lovenia
|
Joel Ruben Antony Moniz
|
Tack Hwa Wong
|
Mohammad Rifqi Farhansyah
|
Thant Thiri Maung
|
Frederikus Hudi
|
David Anugraha
|
Muhammad Ravi Shulthan Habibi
|
Muhammad Reza Qorib
|
Amit Agarwal
|
Joseph Marvin Imperial
|
Hitesh Laxmichand Patel
|
Vicky Feliren
|
Bahrul Ilmi Nasution
|
Manuel Antonio Rufino
|
Genta Indra Winata
|
Rian Adam Rajagede
|
Carlos Rafael Catalan
|
Mohamed Fazli Mohamed Imam
|
Priyaranjan Pattnayak
|
Salsabila Zahirah Pranida
|
Kevin Pratama
|
Yeshil Bangera
|
Adisai Na-Thalang
|
Patricia Nicole Monderin
|
Yueqi Song
|
Christian Simon
|
Lynnette Hui Xian Ng
|
Richardy Lobo Sapan
|
Taki Hasan Rafi
|
Bin Wang
|
Supryadi
|
Kanyakorn Veerakanjana
|
Piyalitt Ittichaiwong
|
Matthew Theodore Roque
|
Karissa Vincentio
|
Takdanai Kreangphet
|
Phakphum Artkaew
|
Kadek Hendrawan Palgunadi
|
Yanzhi Yu
|
Rochana Prih Hastuti
|
William Nixon
|
Mithil Bangera
|
Adrian Xuan Wei Lim
|
Aye Hninn Khine
|
Hanif Muhammad Zhafran
|
Teddy Ferdinan
|
Audra Aurora Izzani
|
Ayushman Singh
|
Evan Evan
|
Jauza Akbar Krito
|
Michael Anugraha
|
Fenal Ashokbhai Ilasariya
|
Haochen Li
|
John Amadeo Daniswara
|
Filbert Aurelian Tjiaranata
|
Eryawan Presma Yulianrifat
|
Can Udomcharoenchaikit
|
Fadil Risdian Ansori
|
Mahardika Krisna Ihsani
|
Giang Nguyen
|
Anab Maulana Barik
|
Dan John Velasco
|
Rifo Ahmad Genadi
|
Saptarshi Saha
|
Chengwei Wei
|
Isaiah Edri W. Flores
|
Kenneth Chen Ko Han
|
Anjela Gail D. Santos
|
Wan Shen Lim
|
Kaung Si Phyo
|
Tim Santos
|
Meisyarah Dwiastuti
|
Jiayun Luo
|
Jan Christian Blaise Cruz
|
Ming Shan Hee
|
Ikhlasul Akmal Hanif
|
M. Alif Al Hakim
|
Muhammad Rizky Sya’ban
|
Kun Kerdthaisong
|
Lester James Validad Miranda
|
Fajri Koto
|
Tirana Noor Fatyanosa
|
Alham Fikri Aji
|
Jostin Jerico Rosal
|
Jun Kevin
|
Robert Wijaya
|
Onno P. Kampman
|
Ruochen Zhang
|
Börje F. Karlsson
|
Peerat Limkonchotiwat
Despite Southeast Asia’s (SEA) extraordinary linguistic and cultural diversity, the region remains significantly underrepresented in vision-language (VL) research, resulting in AI models that inadequately capture SEA cultural nuances. To fill this gap, we present SEA-VL, an open-source initiative dedicated to developing culturally relevant high-quality datasets for SEA languages. By involving contributors from SEA countries, SEA-VL ensures better cultural relevance and diversity, fostering greater inclusivity of underrepresented languages and cultural depictions in VL research. Our methodology employed three approaches: community-driven crowdsourcing with SEA contributors, automated image crawling, and synthetic image generation. We evaluated each method’s effectiveness in capturing cultural relevance. We found that image crawling achieves approximately 85% cultural relevance while being more cost- and time-efficient than crowdsourcing, whereas synthetic image generation failed to accurately reflect SEA cultural nuances and contexts. Collectively, we gathered 1.28 million SEA culturally relevant images, more than 50 times larger than other existing datasets. This work bridges the representation gap in SEA, establishes a foundation for developing culturally aware AI systems for this region, and provides a replicable framework for addressing representation gaps in other underrepresented regions.
pdf
bib
abs
Soundwave: Less is More for Speech-Text Alignment in LLMs
Yuhao Zhang
|
Zhiheng Liu
|
Fan Bu
|
Ruiyu Zhang
|
Benyou Wang
|
Haizhou Li
Existing end-to-end speech large language models (LLMs) usually rely on large-scale annotated data for training, while data-efficient training has not been discussed in depth. We focus on two fundamental problems between speech and text: the representation space gap and sequence length inconsistency. We propose Soundwave, which utilizes an efficient training strategy and a novel architecture to address these issues. Results show that Soundwave outperforms other advanced speech LLMs in speech translation and AIR-Bench speech tasks with only a fraction of the training data. Further analysis shows that Soundwave still retains its intelligence during conversation.
pdf
bib
abs
RoToR: Towards More Reliable Responses for Order-Invariant Inputs
Soyoung Yoon
|
Dongha Ahn
|
Youngwon Lee
|
Minkyu Jung
|
HyungJoo Jang
|
Seung-won Hwang
Mitigating positional bias of language models (LMs) for listwise inputs is a well-known and important problem (e.g., lost-in-the-middle). While zero-shot order-invariant LMs have been proposed to solve this issue, their success on practical listwise problems has been limited. In this work, as a first contribution, we identify two limitations that keep zero-shot invariant LMs from being practical: (1) training and inference distribution mismatch arising from modifying positional ID assignments to enforce invariance, and (2) failure to adapt to a mixture of order-invariant and order-sensitive inputs in practical listwise problems. Then, to overcome these issues, we propose (1) RoToR, a zero-shot invariant LM for genuinely order-invariant inputs with minimal modifications of positional IDs, and (2) Selective Routing, an adaptive framework that handles both order-invariant and order-sensitive inputs in listwise tasks. On the Lost in the Middle (LitM), Knowledge Graph QA (KGQA), and MMLU benchmarks, we show that RoToR with Selective Routing can effectively handle practical listwise input tasks in a zero-shot manner (https://github.com/soyoung97/RoToR).
pdf
bib
abs
Global MMLU: Understanding and Addressing Cultural and Linguistic Biases in Multilingual Evaluation
Shivalika Singh
|
Angelika Romanou
|
Clémentine Fourrier
|
David Ifeoluwa Adelani
|
Jian Gang Ngui
|
Daniel Vila-Suero
|
Peerat Limkonchotiwat
|
Kelly Marchisio
|
Wei Qi Leong
|
Yosephine Susanto
|
Raymond Ng
|
Shayne Longpre
|
Sebastian Ruder
|
Wei-Yin Ko
|
Antoine Bosselut
|
Alice Oh
|
Andre Martins
|
Leshem Choshen
|
Daphne Ippolito
|
Enzo Ferrante
|
Marzieh Fadaee
|
Beyza Ermis
|
Sara Hooker
Reliable multilingual evaluation is difficult, and culturally appropriate evaluation is even harder to achieve. A common practice to fill this gap is to machine-translate English evaluation sets. However, translation introduces language bias and carries over cultural and regional assumptions from the original questions – often testing knowledge irrelevant to the target audience. In this work, we highlight the extent and impact of these biases and present a multilingual evaluation framework that aims to mitigate them through improved translations and annotation practices. Through a large-scale study involving professional and community translators and annotators, we show that state-of-the-art models excel primarily by learning Western-centric concepts. Notably, we find that model rankings on the full MMLU change when evaluated on a subset of questions explicitly marked as culturally sensitive. We release Global MMLU, a multilingual extension of MMLU across 42 languages, featuring improved translation quality, expanded language coverage, and designated subsets labeled as culturally sensitive and culturally agnostic to enable a more comprehensive and equitable benchmark for evaluating language models across diverse linguistic and cultural contexts.
pdf
bib
abs
Improving Dialogue Discourse Parsing through Discourse-aware Utterance Clarification
Yaxin Fan
|
Peifeng Li
|
Qiaoming Zhu
Dialogue discourse parsing aims to identify and analyze discourse relations between the utterances within dialogues. However, linguistic features in dialogues, such as omission and idiom, frequently introduce ambiguities that obscure the intended discourse relations, posing significant challenges for parsers. To address this issue, we propose a Discourse-aware Clarification Module (DCM) to enhance the performance of the dialogue discourse parser. DCM employs two distinct reasoning processes: clarification type reasoning and discourse goal reasoning. The former analyzes linguistic features, while the latter distinguishes the intended relation from the ambiguous one. Furthermore, we introduce Contribution-aware Preference Optimization (CPO) to mitigate the risk of erroneous clarifications, thereby reducing cascading errors. CPO enables the parser to assess the contributions of the clarifications from DCM and provide feedback to optimize the DCM, enhancing its adaptability and alignment with the parser’s requirements. Extensive experiments on the STAC and Molweni datasets demonstrate that our approach effectively resolves ambiguities and significantly outperforms the state-of-the-art (SOTA) baselines.
pdf
bib
abs
ImPart: Importance-Aware Delta-Sparsification for Improved Model Compression and Merging in LLMs
Yan Yang
|
Yixia Li
|
Hongru Wang
|
Xuetao Wei
|
James Jianqiao Yu
|
Yun Chen
|
Guanhua Chen
With the proliferation of task-specific large language models, delta compression has emerged as a method to mitigate the resource challenges of deploying numerous such models by effectively compressing the delta model parameters. Previous delta-sparsification methods either remove parameters randomly or truncate singular vectors directly after singular value decomposition (SVD). However, these methods either disregard parameter importance entirely or evaluate it with too coarse a granularity. In this work, we introduce ImPart, a novel importance-aware delta sparsification approach. Leveraging SVD, it dynamically adjusts sparsity ratios of different singular vectors based on their importance, effectively retaining crucial task-specific knowledge even at high sparsity ratios. Experiments show that ImPart achieves state-of-the-art delta sparsification performance, demonstrating 2× higher compression ratio than baselines at the same performance level. When integrated with existing methods, ImPart sets a new state-of-the-art on delta quantization and model merging.
pdf
bib
abs
Words of Warmth: Trust and Sociability Norms for over 26k English Words
Saif M. Mohammad
Social psychologists have shown that Warmth (W) and Competence (C) are the primary dimensions along which we assess other people and groups. These dimensions impact various aspects of our lives from social competence and emotion regulation to success in the workplace and how we view the world. More recent work has started to explore how these dimensions develop, why they have developed, and what they constitute. Of particular note is the finding that warmth has two distinct components: Trust (T) and Sociability (S). In this work, we introduce Words of Warmth, the first large-scale repository of manually derived word–warmth (as well as word–trust and word–sociability) associations for over 26k English words. We show that the associations are highly reliable. We use the lexicons to study the rate at which children acquire WCTS words with age. Finally, we show that the lexicon enables a wide variety of bias and stereotype research through case studies on various target entities. Words of Warmth is freely available at: http://saifmohammad.com/warmth.html
pdf
bib
abs
BehaviorBox: Automated Discovery of Fine-Grained Performance Differences Between Language Models
Lindia Tjuatja
|
Graham Neubig
Language model evaluation is a daunting task: prompts are brittle, corpus-level perplexities are vague, and the choices of benchmarks are endless. Finding examples that show meaningful, generalizable differences between two LMs is crucial to understanding where one model succeeds and another fails. Can this process be done automatically? In this work, we propose a methodology for automated comparison of language models that uses performance-aware contextual embeddings to find fine-grained features of text where one LM outperforms another. Our method, which we name BehaviorBox, extracts coherent features that demonstrate differences with respect to the ease of generation between two LMs. Specifically, BehaviorBox finds features that describe groups of words in fine-grained contexts, such as “conditional ‘were’ in the phrase ‘if you were’” and “exclamation marks after emotional statements”, where one model outperforms another within a particular dataset. We apply BehaviorBox to compare models that vary in size, model family, and post-training, and enumerate insights into specific contexts that illustrate meaningful differences in performance which cannot be found by measures such as corpus-level perplexity alone.
pdf
bib
abs
HAF-RM: A Hybrid Alignment Framework for Reward Model Training
Shujun Liu
|
Xiaoyu Shen
|
Yuhang Lai
|
Siyuan Wang
|
Shengbin Yue
|
Zengfeng Huang
|
Xuanjing Huang
|
Zhongyu Wei
The reward model has become increasingly important in alignment, assessment, and data construction for large language models (LLMs). Most existing research focuses on enhancing reward models through data improvements, following the conventional training framework for reward models that directly optimizes the predicted rewards. In this paper, we propose a hybrid alignment framework, HAF-RM, for reward model training by introducing an additional constraint on token-level policy probabilities in addition to the reward score. It can simultaneously supervise the internal preference model at the token level and optimize the mapping layer of the reward model at the sequence level. Experimental results on five datasets show the validity and effectiveness of our proposed hybrid framework for training a high-quality reward model. By decoupling the reward modeling procedure and incorporating hybrid supervision, our HAF-RM framework offers a principled and effective approach to enhancing the performance and alignment of reward models, a critical component in the responsible development of powerful language models. We release our code at https://haf-rm.github.io.
pdf
bib
abs
CULEMO: Cultural Lenses on Emotion - Benchmarking LLMs for Cross-Cultural Emotion Understanding
Tadesse Destaw Belay
|
Ahmed Haj Ahmed
|
Alvin C Grissom II
|
Iqra Ameer
|
Grigori Sidorov
|
Olga Kolesnikova
|
Seid Muhie Yimam
NLP research has increasingly focused on subjective tasks such as emotion analysis. However, existing emotion benchmarks suffer from two major shortcomings: (1) they largely rely on keyword-based emotion recognition, overlooking crucial cultural dimensions required for deeper emotion understanding, and (2) many are created by translating English-annotated data into other languages, leading to potentially unreliable evaluation. To address these issues, we introduce Cultural Lenses on Emotion (CuLEmo), the first benchmark designed to evaluate culture-aware emotion prediction across six languages: Amharic, Arabic, English, German, Hindi, and Spanish. CuLEmo comprises 400 crafted questions per language, each requiring nuanced cultural reasoning and understanding. We use this benchmark to evaluate several state-of-the-art LLMs on culture-aware emotion prediction and sentiment analysis tasks. Our findings reveal that (1) emotion conceptualizations vary significantly across languages and cultures, (2) LLM performance likewise varies by language and cultural context, and (3) prompting in English with explicit country context often outperforms in-language prompts for culture-aware emotion and sentiment understanding. The dataset and evaluation code are available.
pdf
bib
abs
DiffPO: Diffusion-styled Preference Optimization for Inference Time Alignment of Large Language Models
Ruizhe Chen
|
Wenhao Chai
|
Zhifei Yang
|
Xiaotian Zhang
|
Ziyang Wang
|
Tony Quek
|
Joey Tianyi Zhou
|
Soujanya Poria
|
Zuozhu Liu
Inference-time alignment provides an efficient alternative for aligning LLMs with humans. However, these approaches still face challenges, such as limited scalability due to policy-specific value functions and latency during the inference phase. In this paper, we propose a novel approach, Diffusion-styled Preference Optimization (DiffPO), which provides an efficient and policy-agnostic solution for aligning LLMs with humans. By directly performing alignment at sentence level, DiffPO avoids the time latency associated with token-level generation. Designed as a plug-and-play module, DiffPO can be seamlessly integrated with various base models to enhance their alignment. Extensive experiments on AlpacaEval 2, MT-bench, and HH-RLHF demonstrate that DiffPO achieves superior alignment performance across various settings, achieving a favorable trade-off between alignment quality and inference-time latency. Furthermore, DiffPO demonstrates model-agnostic scalability, significantly improving the performance of large models such as Llama-3-70B.
pdf
bib
abs
MemeQA: Holistic Evaluation for Meme Understanding
Khoi P. N. Nguyen
|
Terrence Li
|
Derek Lou Zhou
|
Gabriel Xiong
|
Pranav Balu
|
Nandhan Alahari
|
Alan Huang
|
Tanush Chauhan
|
Harshavardhan Bala
|
Emre Guzelordu
|
Affan Kashfi
|
Aaron Xu
|
Suyesh Shrestha
|
Megan Vu
|
Jerry Wang
|
Vincent Ng
Automated meme understanding requires systems to demonstrate fine-grained visual recognition, commonsense reasoning, and extensive cultural knowledge. However, existing benchmarks for meme understanding only concern narrow aspects of meme semantics. To fill this gap, we present MemeQA, a dataset of over 9,000 multiple-choice questions designed to holistically evaluate meme comprehension across seven cognitive aspects. Experiments show that state-of-the-art Large Multimodal Models perform much worse than humans on MemeQA. While fine-tuning improves their performance, they still make many errors on memes wherein proper understanding requires going beyond surface-level sentiment. Moreover, injecting “None of the above” into the available options makes the questions more challenging for the models. Our dataset is publicly available at https://github.com/npnkhoi/memeqa.
pdf
bib
abs
LoGU: Long-form Generation with Uncertainty Expressions
Ruihan Yang
|
Caiqi Zhang
|
Zhisong Zhang
|
Xinting Huang
|
Sen Yang
|
Nigel Collier
|
Dong Yu
|
Deqing Yang
While Large Language Models (LLMs) demonstrate impressive capabilities, they still tend to generate factually incorrect content (i.e., hallucinations). A promising approach to mitigate this issue is enabling models to express uncertainty when unsure. Previous research on uncertainty modeling has primarily focused on short-form QA, but real-world applications often require much longer responses. In this work, we introduce the task of Long-form Generation with Uncertainty (LoGU). We identify two key challenges: Uncertainty Suppression, where models hesitate to express uncertainty, and Uncertainty Misalignment, where models convey uncertainty inaccurately. To tackle these challenges, we propose a refinement-based data collection framework and a two-stage training pipeline. Our framework adopts a divide-and-conquer strategy, refining uncertainty based on atomic claims. The collected data are then used in training through supervised fine-tuning (SFT) and direct preference optimization (DPO) to enhance uncertainty expression. Extensive experiments on three long-form instruction following datasets show that our method significantly improves accuracy, reduces hallucinations, and maintains the comprehensiveness of responses.
pdf
bib
abs
KiRAG: Knowledge-Driven Iterative Retriever for Enhancing Retrieval-Augmented Generation
Jinyuan Fang
|
Zaiqiao Meng
|
Craig MacDonald
Iterative retrieval-augmented generation (iRAG) models offer an effective approach for multihop question answering (QA). However, their retrieval processes face two key challenges: (1) they can be disrupted by irrelevant documents or factually inaccurate chain-of-thoughts; (2) their retrievers are not designed to dynamically adapt to the evolving information needs in multi-step reasoning, making it difficult to identify and retrieve the missing information required at each iterative step. Therefore, we propose KiRAG, which uses a knowledge-driven iterative retriever model to enhance the retrieval process of iRAG. Specifically, KiRAG decomposes documents into knowledge triples and performs iterative retrieval with these triples to enable a factually reliable retrieval process. Moreover, KiRAG integrates reasoning into the retrieval process to dynamically identify and retrieve knowledge that bridges information gaps, effectively adapting to the evolving information needs. Empirical results show that KiRAG significantly outperforms existing iRAG models, with an average improvement of 9.40% in R@3 and 5.14% in F1 on multi-hop QA datasets.
pdf
bib
abs
Enhancing Lexicon-Based Text Embeddings with Large Language Models
Yibin Lei
|
Tao Shen
|
Yu Cao
|
Andrew Yates
Recent large language models (LLMs) have demonstrated exceptional performance on general-purpose text embedding tasks. While dense embeddings have dominated related research, we introduce the first lexicon-based embeddings (LENS) leveraging LLMs that achieve competitive performance on these tasks. LENS consolidates the vocabulary space through token embedding clustering to handle the issue of token redundancy in LLM vocabularies. To further improve performance, we investigate bidirectional attention and various pooling strategies. Specifically, LENS simplifies lexical matching with redundant vocabularies by assigning each dimension to a specific token cluster, where semantically similar tokens are grouped together. Extensive experiments demonstrate that LENS outperforms dense embeddings on the Massive Text Embedding Benchmark (MTEB), delivering compact representations with dimensionality comparable to dense counterparts. Furthermore, LENS inherently supports efficient embedding dimension pruning without any specialized objectives like Matryoshka Representation Learning. Notably, combining LENS with dense embeddings achieves state-of-the-art performance on the retrieval subset of MTEB (i.e., BEIR).
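The idea of assigning each embedding dimension to a token cluster can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `lexicon_embedding`, the toy `token_to_cluster` mapping, and the choice of max-pooling scores within a cluster are all assumptions made for illustration, and the per-token scores are assumed to be non-negative.

```python
def lexicon_embedding(token_scores, token_to_cluster, num_clusters):
    """Collapse per-token scores into one value per token cluster.

    token_scores: dict mapping token -> non-negative score (hypothetical input).
    token_to_cluster: dict mapping token -> cluster index (assumed precomputed,
        e.g. by clustering token embeddings).
    Pooling by max within a cluster is an assumption of this sketch.
    """
    embedding = [0.0] * num_clusters
    for token, score in token_scores.items():
        cluster = token_to_cluster[token]
        embedding[cluster] = max(embedding[cluster], score)
    return embedding


# Toy example: "car" and "automobile" share a cluster, so they share a dimension.
token_to_cluster = {"car": 0, "automobile": 0, "road": 1, "court": 2}
scores = {"car": 0.9, "automobile": 0.4, "road": 0.2}
print(lexicon_embedding(scores, token_to_cluster, num_clusters=3))
```

Because near-synonyms land in the same dimension, two texts that use different surface forms can still match lexically, and the output length equals the number of clusters rather than the full vocabulary size.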
pdf
bib
abs
CoCoLex: Confidence-guided Copy-based Decoding for Grounded Legal Text Generation
Santosh T.Y.S.S
|
Youssef Tarek Elkhayat
|
Oana Ichim
|
Pranav Shetty
|
Dongsheng Wang
|
Zhiqiang Ma
|
Armineh Nourbakhsh
|
Xiaomo Liu
Due to their ability to process long and complex contexts, LLMs can offer key benefits to the legal domain, but their adoption has been hindered by their tendency to generate unfaithful, ungrounded, or hallucinatory outputs. While Retrieval-Augmented Generation offers a promising solution by grounding generations in external knowledge, it offers no guarantee that the provided context will be effectively integrated. To address this, context-aware decoding strategies have been proposed to amplify the influence of relevant context, but they usually do not explicitly enforce faithfulness to the context. In this work, we introduce Confidence-guided Copy-based Decoding for Legal Text Generation (CoCoLex), a decoding strategy that dynamically interpolates the model-produced vocabulary distribution with a distribution derived from copying from the context. CoCoLex encourages direct copying based on the model’s confidence, ensuring greater fidelity to the source. Experimental results on five legal benchmarks demonstrate that CoCoLex outperforms existing context-aware decoding methods, particularly in long-form generation tasks.
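As a rough illustration of interpolating a model distribution with a copy distribution, the sketch below mixes the two with a weight derived from the model's confidence. The abstract does not give the exact formulas, so everything here is an assumption: confidence is taken as one minus normalized entropy, lower confidence is assumed to mean more copying, and the copy distribution is simply the relative frequency of in-vocabulary tokens in the context. All function names are hypothetical.

```python
import math


def normalized_entropy(probs):
    """Entropy of a distribution divided by its maximum possible value."""
    if len(probs) < 2:
        return 0.0
    entropy = -sum(p * math.log(p) for p in probs.values() if p > 0)
    return entropy / math.log(len(probs))


def copy_distribution(context_tokens, vocab):
    """Assumed copy distribution: relative frequency of vocab tokens in the context."""
    counts = {token: 0 for token in vocab}
    for token in context_tokens:
        if token in counts:
            counts[token] += 1
    total = sum(counts.values())
    if total == 0:
        raise ValueError("context contains no in-vocabulary tokens")
    return {token: count / total for token, count in counts.items()}


def interpolate(model_probs, copy_probs):
    """Mix the two distributions; the copy weight grows as confidence drops (assumption)."""
    confidence = 1.0 - normalized_entropy(model_probs)
    copy_weight = 1.0 - confidence
    return {
        token: (1.0 - copy_weight) * model_probs[token] + copy_weight * copy_probs[token]
        for token in model_probs
    }


vocab = ["court", "statute", "appeal", "tort"]
model_probs = {"court": 0.4, "statute": 0.3, "appeal": 0.2, "tort": 0.1}
copy_probs = copy_distribution(["the", "statute", "of", "appeal", "statute"], vocab)
print(interpolate(model_probs, copy_probs))
```

Since both inputs are probability distributions and the weights sum to one, the mixture is itself a valid distribution that can be sampled from or decoded greedily.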
pdf
bib
abs
Beyond N-Grams: Rethinking Evaluation Metrics and Strategies for Multilingual Abstractive Summarization
Itai Mondshine
|
Tzuf Paz-Argaman
|
Reut Tsarfaty
Automatic n-gram-based metrics such as ROUGE are widely used for evaluating generative tasks such as summarization. While these metrics are considered indicative (even if imperfect) of human evaluation for English, their suitability for other languages remains unclear. To address this, in this paper we systematically examine evaluation metrics for generation (both n-gram-based and neural-based) to assess their effectiveness across languages and tasks. Specifically, we design a large-scale evaluation suite across eight languages from four typological families (agglutinative, isolating, low-fusional, and high-fusional), covering both low- and high-resource languages, to analyze their correlations with human judgments. Our findings highlight the sensitivity of the evaluation metric to the language type at hand. For example, for fusional languages, n-gram-based metrics demonstrate a lower correlation with human assessments, compared to isolating and agglutinative languages. We also demonstrate that tokenization considerations can significantly mitigate this for fusional languages with rich morphology, up to reversing such negative correlations. Additionally, we show that neural-based metrics specifically trained for evaluation, such as COMET, consistently outperform other neural metrics and correlate better than n-gram metrics with human judgments in low-resource languages. Overall, our analysis highlights the limitations of n-gram metrics for fusional languages and advocates for investment in neural-based metrics trained for evaluation tasks.
pdf
bib
abs
CC-Tuning: A Cross-Lingual Connection Mechanism for Improving Joint Multilingual Supervised Fine-Tuning
Yangfan Ye
|
Xiaocheng Feng
|
Zekun Yuan
|
Xiachong Feng
|
Libo Qin
|
Lei Huang
|
Weitao Ma
|
Yichong Huang
|
Zhirui Zhang
|
Yunfei Lu
|
Xiaohui Yan
|
Duyu Tang
|
Dandan Tu
|
Bing Qin
Current large language models (LLMs) often exhibit imbalanced multilingual capabilities due to their English-centric training corpora. To address this, existing fine-tuning approaches operating at the data-level (e.g., through data augmentation or distillation) typically introduce implicit cross-lingual alignment, overlooking the potential for more profound, latent-level cross-lingual interactions. In this work, we propose CC-Tuning, a novel multilingual fine-tuning paradigm that explicitly establishes a cross-lingual connection mechanism at the latent level. During training, CC-Tuning fuses the feed forward activations from both English and non-English inputs, enabling the model to benefit from both linguistic resources. This process is facilitated with a trainable Decision Maker that identifies beneficial activations. Furthermore, during inference, a Transform Matrix is utilized to simulate the cross-lingual connection under monolingual setting through representation transformation. Our experiments on six benchmarks covering 22 languages show that CC-Tuning outperforms vanilla SFT and offers a strong latent-level alternative to data-level augmentation methods. Further analysis also highlights the practicality of CC-Tuning and the potential of latent-level cross-lingual interactions in advancing the multilingual performance of LLMs.
pdf
bib
abs
SConU: Selective Conformal Uncertainty in Large Language Models
Zhiyuan Wang
|
Qingni Wang
|
Yue Zhang
|
Tianlong Chen
|
Xiaofeng Zhu
|
Xiaoshuang Shi
|
Kaidi Xu
As large language models are increasingly utilized in real-world applications, guarantees of task-specific metrics are essential for their reliable deployment. Previous studies have introduced various criteria of conformal uncertainty grounded in split conformal prediction, which offer user-specified correctness coverage. However, existing frameworks often fail to identify uncertainty data outliers that violate the exchangeability assumption, leading to unbounded miscoverage rates and unactionable prediction sets. In this paper, we propose a novel approach termed Selective Conformal Uncertainty (SConU), which, for the first time, implements significance tests by developing two conformal p-values that are instrumental in determining whether a given sample deviates from the uncertainty distribution of the calibration set at a specific manageable risk level. Our approach not only facilitates rigorous management of miscoverage rates across both single-domain and interdisciplinary contexts, but also enhances the efficiency of predictions. Furthermore, we comprehensively analyze the components of the conformal procedures, aiming to approximate conditional coverage, particularly in high-stakes question-answering tasks.
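For readers unfamiliar with conformal p-values, the following minimal sketch shows the standard textbook construction for testing whether a test sample looks like an outlier relative to a calibration set. It is not the paper's method: the abstract does not specify its two p-values, and the function names and the use of a scalar uncertainty score per sample are assumptions for illustration.

```python
def conformal_p_value(calibration_scores, test_score):
    """Standard conformal p-value: the (smoothed) fraction of calibration
    uncertainty scores that are at least as large as the test score."""
    n = len(calibration_scores)
    at_least_as_extreme = sum(1 for score in calibration_scores if score >= test_score)
    return (1 + at_least_as_extreme) / (n + 1)


def is_outlier(calibration_scores, test_score, risk_level):
    """Flag the test sample when its p-value falls at or below the risk level."""
    return conformal_p_value(calibration_scores, test_score) <= risk_level


# Hypothetical uncertainty scores for nine calibration questions.
calibration = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
print(conformal_p_value(calibration, 0.45))           # typical sample
print(is_outlier(calibration, 0.95, risk_level=0.1))  # unusually uncertain sample
```

Under exchangeability of the calibration and test scores, a p-value built this way is at most `risk_level` with probability no greater than `risk_level`, which is what makes thresholding it a valid significance test.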
pdf
bib
abs
MegaPairs: Massive Data Synthesis for Universal Multimodal Retrieval
Junjie Zhou
|
Yongping Xiong
|
Zheng Liu
|
Ze Liu
|
Shitao Xiao
|
Yueze Wang
|
Bo Zhao
|
Chen Jason Zhang
|
Defu Lian
Despite the rapidly growing demand for multimodal retrieval, progress in this field remains severely constrained by a lack of training data. In this paper, we introduce MegaPairs, a novel data synthesis method that leverages vision language models (VLMs) and open-domain images, together with a massive synthetic dataset generated from this method. Our empirical analysis shows that MegaPairs generates high-quality data, enabling the multimodal retriever to significantly outperform the baseline model trained on 70× more data from existing datasets. Moreover, since MegaPairs solely relies on general image corpora and open-source VLMs, it can be easily scaled up, enabling continuous improvements in retrieval performance. In this stage, we produced more than 26 million training instances and trained several models of varying sizes using this data. These new models achieve state-of-the-art zero-shot performance across 4 popular composed image retrieval (CIR) benchmarks and the highest overall performance on the 36 datasets provided by MMEB. They also demonstrate notable performance improvements with additional downstream fine-tuning. Our code, synthesized dataset, and pre-trained models are publicly available at https://github.com/VectorSpaceLab/MegaPairs.
pdf
bib
abs
When GPT Spills the Tea: Comprehensive Assessment of Knowledge File Leakage in GPTs
Xinyue Shen
|
Yun Shen
|
Michael Backes
|
Yang Zhang
Knowledge files have been widely used in large language model (LLM)-powered agents, such as GPTs, to improve response quality. However, concerns over the potential leakage of knowledge files have grown significantly. Existing studies demonstrate that adversarial prompts can induce GPTs to leak knowledge file content. Yet, it remains uncertain whether additional leakage vectors exist, particularly given the complex data flows across clients, servers, and databases in GPTs. In this paper, we present a comprehensive risk assessment of knowledge file leakage, leveraging a novel workflow inspired by Data Security Posture Management (DSPM). Through the analysis of 651,022 GPT metadata, 11,820 flows, and 1,466 responses, we identify five leakage vectors: metadata, GPT initialization, retrieval, sandboxed execution environments, and prompts. These vectors enable adversaries to extract sensitive knowledge file data such as titles, content, types, and sizes. Notably, the activation of the built-in tool Code Interpreter leads to a privilege escalation vulnerability, enabling adversaries to directly download original knowledge files with a 95.95% success rate. Further analysis reveals that 28.80% of leaked files are copyrighted, including digital copies from major publishers and internal materials from a listed company. In the end, we provide actionable solutions for GPT builders and platform providers to secure the GPT data supply chain.
pdf
bib
abs
UniCodec: Unified Audio Codec with Single Domain-Adaptive Codebook
Yidi Jiang
|
Qian Chen
|
Shengpeng Ji
|
Yu Xi
|
Wen Wang
|
Chong Zhang
|
Xianghu Yue
|
ShiLiang Zhang
|
Haizhou Li
The emergence of audio language models is empowered by neural audio codecs, which establish critical mappings between continuous waveforms and discrete tokens compatible with language model paradigms. The evolutionary trends from multi-layer residual vector quantizer to single-layer quantizer are beneficial for language-autoregressive decoding. However, the capability to handle multi-domain audio signals through a single codebook remains constrained by inter-domain distribution discrepancies. In this work, we introduce UniCodec, a unified audio codec with a single codebook to support multi-domain audio data, including speech, music, and sound. To achieve this, we propose a partitioned domain-adaptive codebook method based on a domain Mixture-of-Experts strategy to capture the distinct characteristics of each audio domain. Furthermore, to enrich the semantic density of the codec without auxiliary modules, we propose a self-supervised mask prediction modeling approach. Comprehensive objective and subjective evaluations demonstrate that UniCodec achieves excellent audio reconstruction performance across the three audio domains, outperforming existing unified neural codecs with a single codebook, and even surpasses state-of-the-art domain-specific codecs on both acoustic and semantic representation capabilities.
pdf
bib
abs
KERL: Knowledge-Enhanced Personalized Recipe Recommendation using Large Language Models
Fnu Mohbat
|
Mohammed J Zaki
Recent advances in large language models (LLMs) and the abundance of food data have resulted in studies to improve food understanding using LLMs. Despite several recommendation systems utilizing LLMs and Knowledge Graphs (KGs), there has been limited research on integrating food-related KGs with LLMs. We introduce KERL, a unified system that leverages food KGs and LLMs to provide personalized food recommendations and generate recipes with associated micro-nutritional information. Given a natural language question, KERL extracts entities and retrieves subgraphs from the KG, which are then fed into the LLM as context to select the recipes that satisfy the constraints. Next, our system generates the cooking steps and nutritional information for each recipe. To evaluate our approach, we also develop a benchmark dataset by curating recipe-related questions, combined with constraints and personal preferences. Through extensive experiments, we show that our proposed KG-augmented LLM significantly outperforms existing approaches, offering a complete and coherent solution for food recommendation, recipe generation, and nutritional analysis. Our code and benchmark datasets are publicly available at https://github.com/mohbattharani/KERL.
pdf
bib
abs
Multilingual Arbitration: Optimizing Data Pools to Accelerate Multilingual Progress
Ayomide Odumakinde
|
Daniel D’souza
|
Pat Verga
|
Beyza Ermis
|
Sara Hooker
Synthetic data has driven recent state-of-the-art advancements, but reliance on a single oracle teacher model can lead to model collapse and bias propagation. These issues are particularly severe in multilingual settings, where no single model excels across all languages. In this study, we propose multilingual arbitration, which exploits performance variations among multiple models for each language. By strategically routing samples through a diverse set of models, each with unique strengths, we mitigate these challenges and enhance multilingual performance. Extensive experiments with state-of-the-art models demonstrate that our approach significantly surpasses single-teacher distillation, achieving up to 80% win rates over proprietary and open-weight models like Gemma 2, Llama 3.1, and Mistral v0.3, with the largest improvements in low-resource languages.
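The routing idea in the abstract above can be sketched as a per-language lookup: each sample is sent to whichever teacher scores best on that sample's language. This is only a minimal sketch of one possible routing rule; the teacher names, the score table, and the function name are hypothetical and do not come from the paper.

```python
def route_by_language(samples, scores):
    """Assign each (language, prompt) sample to the teacher model with the
    highest score for that language.

    samples: list of (language, prompt) pairs
    scores:  {language: {model_name: score}}, a hypothetical per-language
             quality table for the candidate teachers
    """
    routed = []
    for language, prompt in samples:
        per_model = scores[language]
        best_model = max(per_model, key=per_model.get)
        routed.append((prompt, best_model))
    return routed


# Hypothetical scores: no single teacher is best for every language.
scores = {
    "yo": {"teacher_a": 0.41, "teacher_b": 0.58},
    "fr": {"teacher_a": 0.83, "teacher_b": 0.79},
}
samples = [("yo", "prompt 1"), ("fr", "prompt 2")]
print(route_by_language(samples, scores))
```

A fixed score table is the simplest choice; a learned router or a reward-model-based selector would slot into the same place as the `max` call.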
pdf
bib
abs
Controlled Low-Rank Adaptation with Subspace Regularization for Continued Training on Large Language Models
Yuheng Lu
|
Bingshuo Qian
|
Caixia Yuan
|
Huixing Jiang
|
Xiaojie Wang
Large language models (LLMs) exhibit remarkable capabilities in natural language processing but face catastrophic forgetting when learning new tasks, where adaptation to a new domain leads to a substantial decline in performance on previous tasks. In this paper, we propose Controlled LoRA (CLoRA), a subspace regularization method on the LoRA structure. Aiming to reduce the scale of output change while introducing minimal constraint on model capacity, CLoRA imposes constraints on the direction of the updating matrix’s null space. Experimental results on one-stage LLM finetuning tasks and continual learning settings highlight the superiority of CLoRA as an effective parameter-efficient finetuning method that mitigates catastrophic forgetting. Further investigation of model parameters indicates that CLoRA effectively balances the trade-off between model capacity and degree of forgetting. The code for implementing CLoRA will be publicly available.
pdf
bib
abs
Chinese SimpleQA: A Chinese Factuality Evaluation for Large Language Models
Yancheng He
|
Shilong Li
|
Jiaheng Liu
|
Yingshui Tan
|
Weixun Wang
|
Hui Huang
|
Xingyuan Bu
|
Hangyu Guo
|
Chengwei Hu
|
Boren Zheng
|
Zhuoran Lin
|
Dekai Sun
|
Zhicheng Zheng
|
Wenbo Su
|
Bo Zheng
New LLM benchmarks are important to align with the rapid development of Large Language Models (LLMs). In this work, we present Chinese SimpleQA, the first comprehensive Chinese benchmark to evaluate the factuality ability of LLMs to answer short questions, and Chinese SimpleQA mainly has five properties (i.e., Chinese, Diverse, High-quality, Static, Easy-to-evaluate). Specifically, first, we focus on the Chinese language over 6 major topics with 99 diverse subtopics. Second, we conduct a comprehensive quality control process to achieve high-quality questions and answers, where the reference answers are static and cannot be changed over time. Third, following SimpleQA, the questions and answers are very short, and the grading process is easy-to-evaluate. Based on Chinese SimpleQA, we perform a comprehensive evaluation of the factuality abilities of existing LLMs. Finally, we hope that Chinese SimpleQA could guide the developers to better understand the Chinese factuality abilities of their models and facilitate the growth of LLMs.
pdf
bib
abs
PVP: An Image Dataset for Personalized Visual Persuasion with Persuasion Strategies, Viewer Characteristics, and Persuasiveness Ratings
Junseo Kim
|
Jongwook Han
|
Dongmin Choi
|
Jongwook Yoon
|
Eun-Ju Lee
|
Yohan Jo
Visual persuasion, which uses visual elements to influence cognition and behaviors, is crucial in fields such as advertising and political communication. With recent advancements in artificial intelligence, there is growing potential to develop persuasive systems that automatically generate persuasive images tailored to individuals. However, a significant bottleneck in this area is the lack of comprehensive datasets that connect the persuasiveness of images with the personal information about those who evaluated the images. To address this gap and facilitate technological advancements in personalized visual persuasion, we release the Personalized Visual Persuasion (PVP) dataset, comprising 28,454 persuasive images across 596 messages and 9 persuasion strategies. Importantly, the PVP dataset provides persuasiveness scores of images evaluated by 2,521 human annotators, along with their demographic and psychological characteristics (personality traits and values). We demonstrate the utility of our dataset by developing a persuasive image generator and an automated evaluator, and establish benchmark baselines. Our experiments reveal that incorporating psychological characteristics enhances the generation and evaluation of persuasive images, providing valuable insights for personalized visual persuasion.
pdf
bib
abs
Any Information Is Just Worth One Single Screenshot: Unifying Search With Visualized Information Retrieval
Zheng Liu
|
Ze Liu
|
Zhengyang Liang
|
Junjie Zhou
|
Shitao Xiao
|
Chao Gao
|
Chen Jason Zhang
|
Defu Lian
With the popularity of multimodal techniques, there is growing interest in acquiring useful information in visual forms. In this work, we formally define an emerging IR paradigm called Visualized Information Retrieval, or Vis-IR, where multimodal information, such as texts, images, tables and charts, is jointly represented by a unified visual format called Screenshots, for various retrieval applications. We further make three key contributions for Vis-IR. First, we create VIRA (Vis-IR Aggregation), a large-scale dataset comprising a vast collection of screenshots from diverse sources, carefully curated into captioned and question-answer formats. Second, we develop UniSE (Universal Screenshot Embeddings), a family of retrieval models that enable screenshots to query or be queried across arbitrary data modalities. Finally, we construct MVRB (Massive Visualized IR Benchmark), a comprehensive benchmark covering a variety of task forms and application scenarios. Through extensive evaluations on MVRB, we highlight the deficiencies of existing multimodal retrievers and the substantial improvements made by UniSE. Our data, model and benchmark have been made publicly available, which lays a solid foundation for this emerging field.
pdf
bib
abs
Tunable LLM-based Proactive Recommendation Agent
Mingze Wang
|
Chongming Gao
|
Wenjie Wang
|
Yangyang Li
|
Fuli Feng
Recommender systems are indispensable on various digital platforms. However, traditional methods often reinforce existing user interests, which leads to echo chambers and limits diversity. Proactive Recommendation Systems (PRS) aim to address this issue by cultivating users’ latent interests through multi-step recommendations. Despite advancements, challenges persist, particularly in optimizing long-term rewards and adapting to real-time user feedback. In this study, we propose an LLM-based Actor-Critic Agent framework to enhance PRS. This framework utilizes the LLM-based agent to adjust recommendations in real time based on feedback and employs agent-tuning methods to optimize long-term rewards using three proposed reward functions. Extensive experiments validate the significant superiority of this framework over existing methods by optimizing long-term rewards and dynamically evolving with user feedback.
pdf
bib
abs
AgentRM: Enhancing Agent Generalization with Reward Modeling
Yu Xia
|
Jingru Fan
|
Weize Chen
|
Siyu Yan
|
Xin Cong
|
Zhong Zhang
|
Yaxi Lu
|
Yankai Lin
|
Zhiyuan Liu
|
Maosong Sun
Existing LLM-based agents have achieved strong performance on held-in tasks, but their generalizability to unseen tasks remains poor. Hence, some recent work focuses on fine-tuning the policy model with more diverse tasks to improve the generalizability. In this work, we find that finetuning a reward model to guide the policy model is more robust than directly finetuning the policy model. Based on this finding, we propose AgentRM, an 8B generalizable reward model, to guide the policy model for effective test-time search. We comprehensively investigate three approaches to construct the reward model, including explicit reward modeling, implicit reward modeling and LLM-as-a-judge. We then use AgentRM to guide the answer generation with Best-of-N sampling and beam search. We show that AgentRM is robust to paraphrasings of task instructions and can generalize to unseen tasks that require novel optimal behavior. Through extensive evaluation across nine tasks spanning four categories, AgentRM enhances the non-finetuned 8B policy model by 8.8 points on average, surpassing the top general agent by 4.0. Moreover, it demonstrates weak-to-strong generalization, yielding greater improvement on more powerful policy models. As for the specializability, AgentRM can also boost a finetuned policy model and outperform the top specialized agent by 11.4 on three held-in tasks. Further analysis verifies its effectiveness in test-time scaling. We release the code and data at https://github.com/thunlp/AgentRM.
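Best-of-N sampling, as mentioned in the abstract above, can be sketched in a few lines: the policy proposes N candidates and a reward function picks the highest-scoring one. The sketch below is generic and assumes nothing about AgentRM itself; `toy_reward` is a hypothetical stand-in for a learned reward model.

```python
from typing import Callable, List


def best_of_n(candidates: List[str], reward_fn: Callable[[str], float]) -> str:
    """Return the candidate with the highest reward (first one wins ties)."""
    if not candidates:
        raise ValueError("best_of_n needs at least one candidate")
    return max(candidates, key=reward_fn)


def toy_reward(answer: str) -> float:
    """Hypothetical reward: counts occurrences of the word 'done'.
    A real setup would call a trained reward model here."""
    return float(answer.split().count("done"))


# Hypothetical trajectories sampled from a policy model.
samples = ["open door", "open door then done", "done done"]
print(best_of_n(samples, toy_reward))
```

Because the policy model is untouched and only the scorer changes, the same selection loop can be reused with a different policy, which is consistent with the abstract's point about guiding rather than finetuning the policy.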
pdf
bib
abs
From Outcomes to Processes: Guiding PRM Learning from ORM for Inference-Time Alignment
Bin Xie
|
Bingbing Xu
|
Yige Yuan
|
Shengmao Zhu
|
Huawei Shen
Inference-time alignment methods have gained significant attention for their efficiency and effectiveness in aligning large language models (LLMs) with human preferences. However, existing dominant approaches, reward-guided search (RGS), suffer from a critical granularity mismatch: reward models (RMs) are trained on complete responses but applied to incomplete sequences during generation, leading to inconsistent scoring and suboptimal alignment. To combat the challenge, we argue that an ideal RM should satisfy two objectives: Score Consistency, ensuring coherent evaluation across partial and complete responses, and Preference Consistency, aligning partial sequence assessments with human preferences. To achieve these, we propose SPRM, a novel dual-consistency framework integrating score consistency-based and preference consistency-based partial evaluation modules, which leverage the Bradley-Terry model and entropy-based reweighting to predict cumulative rewards and prioritize human-aligned sequences. Extensive experiments on dialogue, summarization, and reasoning tasks demonstrate the effectiveness of SPRM, significantly reducing granularity discrepancies by up to 11.7 on TL;DR Summarization and achieving a 3.6%–10.3% improvement in GPT-4 evaluation scores across all tasks. Code is publicly available at https://github.com/xiebin23/SPRM.
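For readers unfamiliar with the Bradley-Terry model named in the abstract above, the sketch below shows its standard form: the probability that response A is preferred over response B is the logistic function of the reward difference, and the usual training loss is the negative log of that probability. This is the generic formulation only, not SPRM's partial-evaluation modules; the function names are chosen here for illustration.

```python
import math


def bt_preference_prob(reward_a: float, reward_b: float) -> float:
    """Bradley-Terry probability that A is preferred over B:
    sigmoid(reward_a - reward_b)."""
    return 1.0 / (1.0 + math.exp(-(reward_a - reward_b)))


def bt_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Negative log-likelihood of the observed preference."""
    return -math.log(bt_preference_prob(reward_chosen, reward_rejected))


print(bt_preference_prob(1.0, 1.0))  # equal rewards give an even preference
print(bt_loss(3.0, 0.0) < bt_loss(0.0, 3.0))  # correct ordering costs less
```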
pdf
bib
abs
Segment-Based Attention Masking for GPTs
Shahar Katz
|
Liran Ringel
|
Yaniv Romano
|
Lior Wolf
Causal masking is a fundamental component in Generative Pre-Trained Transformer (GPT) models, playing a crucial role during training. Although GPTs can process the entire user prompt at once, the causal masking is applied to all input tokens step-by-step, mimicking the generation process. This imposes an unnecessary constraint during the initial “prefill” phase when the model processes the input prompt and generates the internal representations before producing any output tokens. In this work, attention is masked based on the known block structure at the prefill phase, followed by the conventional token-by-token autoregressive process after that. For example, in a typical chat prompt, the system prompt is treated as one block, and the user prompt as the next one. Each of these is treated as a unit for the purpose of masking, such that the first tokens in each block can access the subsequent tokens in a non-causal manner. Then, the model answer is generated in the conventional causal manner. The Segment-by-Segment scheme entails no additional computational overhead. When integrated using a lightweight fine-tuning into already trained models such as Llama and Qwen, MAS quickly increases models’ performances.
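One way to picture the prefill-phase masking described in the abstract above is as a boolean matrix built from per-token segment ids: a token may attend to every token in its own block, including later ones, and to all earlier blocks. The sketch below encodes that reading; it assumes segment ids are non-decreasing along the prompt, and the function name and example ids are hypothetical rather than taken from the paper's code.

```python
def segment_prefill_mask(segment_ids):
    """Build mask[i][j] = True when query token i may attend to key token j.

    Within a block attention is bidirectional; across blocks it stays causal,
    so a token never sees a later block. Assumes segment_ids is non-decreasing
    (e.g. 0 for the system prompt, 1 for the user prompt).
    """
    n = len(segment_ids)
    return [
        [segment_ids[j] <= segment_ids[i] for j in range(n)]
        for i in range(n)
    ]


# Hypothetical prompt: two system-prompt tokens, then three user-prompt tokens.
mask = segment_prefill_mask([0, 0, 1, 1, 1])
for row in mask:
    print("".join("1" if allowed else "0" for allowed in row))
```

After prefill, generated tokens would fall back to the ordinary lower-triangular causal mask, as the abstract states.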
pdf
bib
abs
Cramming 1568 Tokens into a Single Vector and Back Again: Exploring the Limits of Embedding Space Capacity
Yuri Kuratov
|
Mikhail Arkhipov
|
Aydar Bulatov
|
Mikhail Burtsev
A range of recent works addresses the problem of compressing a sequence of tokens into a shorter sequence of real-valued vectors to be used as inputs instead of token embeddings or key-value cache. These approaches are focused on reducing the amount of compute in existing language models rather than minimizing the number of bits needed to store text. Despite relying on powerful models as encoders, the maximum attainable lossless compression ratio is typically not higher than x10. This fact is highly intriguing because, in theory, the maximum information capacity of large real-valued vectors is far beyond the presented rates even for 16-bit precision and a modest vector size. In this work, we explore the limits of compression by replacing the encoder with a per-sample optimization procedure. We show that vectors with compression ratios up to x1500 exist, which highlights a gap of two orders of magnitude between existing and practically attainable solutions. Furthermore, we empirically show that the compression limits are determined not by the length of the input but by the amount of uncertainty to be reduced, namely, the cross-entropy loss on this sequence without any conditioning. The obtained limits highlight the substantial gap between the theoretical capacity of input embeddings and their practical utilization, suggesting significant room for optimization in model design.
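The capacity argument in the abstract above can be made concrete with back-of-the-envelope arithmetic: a vector with d values at b bits each stores d·b bits, and a token drawn uniformly from a vocabulary of size V costs log2(V) bits. The sketch below divides one by the other. The dimension, precision, and vocabulary size are hypothetical values picked for round numbers, not figures from the paper.

```python
import math


def vector_capacity_bits(dim: int, bits_per_value: int) -> int:
    """Raw storage of one vector: dim values at bits_per_value bits each."""
    return dim * bits_per_value


def max_tokens_uniform(dim: int, bits_per_value: int, vocab_size: int) -> float:
    """Upper bound on tokens per vector if every token costs log2(vocab_size)
    bits, i.e. a uniform code with no language-model prior."""
    bits_per_token = math.log2(vocab_size)
    return vector_capacity_bits(dim, bits_per_value) / bits_per_token


# Hypothetical setup: 4096-dimensional vector, 16-bit values, 2**17-token vocabulary.
print(vector_capacity_bits(4096, 16))
print(max_tokens_uniform(4096, 16, 2**17))
```

Under these assumed numbers a single vector could in principle hold a few thousand tokens, which is the sense in which a x10 ratio sits far below the theoretical ceiling.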
pdf
bib
abs
Bi-Tuning with Collaborative Information for Controllable LLM-based Sequential Recommendation
Xinyu Zhang
|
Linmei Hu
|
Luhao Zhang
|
Wentao Cheng
|
Yashen Wang
|
Ge Shi
|
Chong Feng
|
Liqiang Nie
Sequential recommender systems, which leverage historical interactions to deliver targeted recommendations, have been significantly advanced by large language models (LLMs). However, LLM-based generative sequential recommendation often faces two key challenges: the lack of collaborative knowledge and the limited controllability over the generated content. In this paper, we propose a simple Bi-Tuning framework with collaborative information for controllable Large Language Model-based Sequential Recommendation (Laser). Specifically, Bi-Tuning works through incorporating learnable virtual tokens at both the prefix and suffix of the input text, where the prefix tokens enable the adaptation of LLMs with collaborative information, while the suffix token transforms the LLM output into item/user embeddings for similarity comparison, thereby facilitating controllable recommendations. Furthermore, we introduce an MoE-based querying transformer that selectively activates experts to extract relevant information from varying collaborative signals of frozen ID-based recommenders into the prefix, coupled with a multi-task loss function incorporating the MoE load-balancing objective. Finally, a two-phase training strategy is employed to progressively obtain high-quality item and user embeddings through the learnable suffix. Experiments on real-world datasets show that Laser effectively adapts LLMs for sequential recommendation, outperforming state-of-the-art baselines.
pdf
bib
abs
A Modular Approach for Clinical SLMs Driven by Synthetic Data with Pre-Instruction Tuning, Model Merging, and Clinical-Tasks Alignment
Jean-Philippe Corbeil
|
Amin Dada
|
Jean-Michel Attendu
|
Asma Ben Abacha
|
Alessandro Sordoni
|
Lucas Caccia
|
Francois Beaulieu
|
Thomas Lin
|
Jens Kleesiek
|
Paul Vozila
High computation costs and latency of large language models such as GPT-4 have limited their deployment in clinical settings. Small language models (SLMs) offer a cost-effective alternative, but their limited capacity requires biomedical domain adaptation, which remains challenging. An additional bottleneck is the unavailability and high sensitivity of clinical data. To address these challenges, we propose a novel framework for adapting SLMs into high-performing clinical models. We introduce the MediPhi collection of 3.8B-parameter SLMs developed with our novel framework: pre-instruction tuning of experts on relevant medical and clinical corpora (PMC, Medical Guideline, MedWiki, etc.), model merging, and clinical-tasks alignment. To cover most clinical tasks, we extended the CLUE benchmark to CLUE+, doubling its size. Our expert models deliver relative improvements on this benchmark over the base model without any task-specific fine-tuning: 64.3% on medical entities, 49.5% on radiology reports, and 44% on ICD-10 coding (outperforming GPT-4-0125 by 14%). We unify the expert models into MediPhi via model merging, preserving gains across benchmarks. Furthermore, we built the MediFlow collection, a synthetic dataset of 2.5 million high-quality instructions on 14 medical NLP tasks, 98 fine-grained document types, and JSON format support. Alignment of MediPhi using supervised fine-tuning and direct preference optimization achieves further gains of 18.9% on average.
pdf
bib
abs
DIVE into MoE: Diversity-Enhanced Reconstruction of Large Language Models from Dense into Mixture-of-Experts
Yuchen Feng
|
Bowen Shen
|
Naibin Gu
|
Jiaxuan Zhao
|
Peng Fu
|
Zheng Lin
|
Weiping Wang
Large language models (LLMs) with the Mixture-of-Experts (MoE) architecture achieve high cost-efficiency by selectively activating a subset of the parameters. Despite the inference efficiency of MoE LLMs, the training of extensive experts from scratch incurs substantial overhead, whereas reconstructing a dense LLM into an MoE LLM significantly reduces the training budget. However, existing reconstruction methods often overlook the diversity among experts, leading to potential redundancy. In this paper, we observe that a specific LLM exhibits notable diversity after being pruned on different calibration datasets, based on which we present a Diversity-Enhanced reconstruction method named DIVE. The recipe of DIVE includes domain affinity mining, pruning-based expert reconstruction, and efficient retraining. Specifically, the reconstruction includes pruning and reassembly of the feed-forward network (FFN) module. After reconstruction, we efficiently retrain the model on routers, experts and normalization modules. We implement DIVE on Llama-style LLMs with open-source training corpora. Experiments show that DIVE achieves training efficiency with minimal accuracy trade-offs, outperforming existing pruning and MoE reconstruction methods with the same number of activated parameters. Code is available at: https://github.com/yuchenblah/DIVE.
pdf
bib
abs
DAC: A Dynamic Attention-aware Approach for Task-Agnostic Prompt Compression
Yi Zhao
|
Zuchao Li
|
Hai Zhao
|
Baoyuan Qi
|
Liu Guoming
Task-agnostic prompt compression leverages the redundancy in natural language to reduce computational overhead and enhance information density within prompts, especially in long-context scenarios. Existing methods predominantly rely on information entropy as the metric to compress lexical units, aiming to achieve minimal information loss. However, these approaches overlook two critical aspects: (i) the importance of attention-critical tokens at the algorithmic level, and (ii) shifts in information entropy during the compression process. Motivated by these challenges, we propose a dynamic attention-aware approach for task-agnostic prompt compression (DAC). This approach effectively integrates entropy and attention information, dynamically sensing entropy shifts during compression to achieve fine-grained prompt compression. Extensive experiments across various domains, including LongBench, GSM8K, and BBH, show that DAC consistently yields robust and substantial improvements across a diverse range of tasks and LLMs, offering compelling evidence of its efficacy.
pdf
bib
abs
Computation Mechanism Behind LLM Position Generalization
Chi Han
|
Heng Ji
Most written natural languages are composed of sequences of words and sentences. Similar to humans, large language models (LLMs) exhibit flexibility in handling textual positions - a phenomenon we term Position Generalization. They can understand texts with position perturbations and generalize to longer texts than those encountered during training with the latest techniques. These phenomena suggest that LLMs handle positions in a tolerant manner, but how LLMs computationally process positional relevance remains largely unexplored. In this work, we show how LLMs enforce certain computational mechanisms to allow for the aforementioned tolerance in position perturbations. Despite the complex design of the self-attention mechanism, in this work, LLMs are revealed to learn a counterintuitive disentanglement of attention logits, where their values show a 0.959 linear correlation with an approximation of the arithmetic sum of positional relevance and semantic importance. Furthermore, we identify a prevalent pattern in intermediate features that enables this effect, suggesting that it is a learned behavior rather than a natural result of the model architecture. Based on these findings, we provide computational explanations and criteria for the aforementioned position flexibilities observed in LLMs.
pdf
bib
abs
IPO: Your Language Model is Secretly a Preference Classifier
Shivank Garg
|
Ayush Singh
|
Shweta Singh
|
Paras Chopra
Reinforcement learning from human feedback (RLHF) has emerged as the primary method for aligning large language models (LLMs) with human preferences. While it enables LLMs to achieve human-level alignment, it often incurs significant computational and financial costs due to its reliance on training external reward models or human-labeled preferences. In this work, we propose Implicit Preference Optimization (IPO), an alternative approach that leverages generative LLMs as preference classifiers, thereby reducing the dependence on external human feedback or reward models to obtain preferences. We conduct a comprehensive evaluation on the preference classification ability of LLMs using RewardBench, assessing models across different sizes, architectures, and training levels to validate our hypothesis. Furthermore, we investigate the self-improvement capabilities of LLMs by generating multiple responses for a given instruction and employing the model itself as a preference classifier for Direct Preference Optimization (DPO)-based training. Our findings demonstrate that models trained through IPO achieve performance comparable to those utilizing state-of-the-art reward models for obtaining preferences.
pdf
bib
abs
Reversal of Thought: Enhancing Large Language Models with Preference-Guided Reverse Reasoning Warm-up
Jiahao Yuan
|
Dehui Du
|
Hao Zhang
|
Zixiang Di
|
Usman Naseem
Large language models (LLMs) have shown remarkable performance in reasoning tasks but face limitations in mathematical and complex logical reasoning. Existing methods to improve LLMs’ logical capabilities either involve traceable or verifiable logical sequences that generate more reliable responses by constructing logical structures yet increase computational costs, or introduce rigid logic template rules, reducing flexibility. In this paper, we propose Reversal of Thought (RoT), a plug-and-play and cost-effective reasoning framework designed to enhance the logical reasoning abilities of LLMs during the warm-up phase prior to batch inference. RoT utilizes a Preference-Guided Reverse Reasoning warm-up strategy, which integrates logical symbols for pseudocode planning through meta-cognitive mechanisms and pairwise preference self-evaluation to generate task-specific prompts solely through demonstrations, aligning with LLMs’ cognitive preferences shaped by RLHF. Through reverse reasoning, we utilize a Cognitive Preference Manager to assess knowledge boundaries and further expand LLMs’ reasoning capabilities by aggregating solution logic for known tasks and stylistic templates for unknown tasks. Experiments across various tasks demonstrate that RoT surpasses existing baselines in both reasoning accuracy and efficiency.
pdf
bib
abs
Déjà Vu? Decoding Repeated Reading from Eye Movements
Yoav Meiri
|
Omer Shubi
|
Cfir Avraham Hadar
|
Ariel Kreisberg Nitzav
|
Yevgeni Berzak
Be it your favorite novel, a newswire article, a cooking recipe or an academic paper – in many daily situations we read the same text more than once. In this work, we ask whether it is possible to automatically determine whether the reader has previously encountered a text based on their eye movement patterns during reading. We introduce two variants of this task and address them using both feature-based and neural models. We further introduce a general strategy for enhancing these models with machine generated simulations of eye movements from a cognitive model. Finally, we present an analysis of model performance which on the one hand yields insights on the information used by the models, and on the other hand leverages predictive modeling as an analytic tool for better characterization of the role of memory in repeated reading. Our work advances the understanding of the extent and manner in which eye movements in reading capture memory effects from prior text exposure, and paves the way for future applications that involve predictive modeling of repeated reading.
pdf
bib
abs
LLMs can be easily Confused by Instructional Distractions
Yerin Hwang
|
Yongil Kim
|
Jahyun Koo
|
Taegwan Kang
|
Hyunkyung Bae
|
Kyomin Jung
Despite the fact that large language models (LLMs) show exceptional skill in instruction following tasks, this strength can turn into a vulnerability when the models are required to disregard certain instructions. Instruction following tasks typically involve a clear task description and input text containing the target data to be processed. However, when the input itself resembles an instruction, confusion may arise, even if there is explicit prompting to distinguish between the task instruction and the input. We refer to this phenomenon as instructional distraction. In this paper, we introduce a novel benchmark, named DIM-Bench, specifically designed to assess LLMs’ performance under instructional distraction. The benchmark categorizes real-world instances of instructional distraction and evaluates LLMs across four instruction tasks: proofreading, rewriting, translation, and style transfer, alongside five input tasks: reasoning, code generation, mathematical reasoning, bias detection, and question answering. Our experimental results reveal that even the most advanced LLMs are susceptible to instructional distraction, often failing to accurately follow user intent in such cases.
pdf
bib
abs
PlanGenLLMs: A Modern Survey of LLM Planning Capabilities
Hui Wei
|
Zihao Zhang
|
Shenghua He
|
Tian Xia
|
Shijia Pan
|
Fei Liu
LLMs have immense potential for generating plans, transforming an initial world state into a desired goal state. A large body of research has explored the use of LLMs for various planning tasks, from web navigation to travel planning and database querying. However, many of these systems are tailored to specific problems, making it challenging to compare them or determine the best approach for new tasks. There is also a lack of clear and consistent evaluation criteria. Our survey aims to offer a comprehensive overview of current LLM planners to fill this gap. It builds on foundational work by Kartam and Wilkins (1990) and examines six key performance criteria: completeness, executability, optimality, representation, generalization, and efficiency. For each, we provide a thorough analysis of representative works and highlight their strengths and weaknesses. Our paper also identifies crucial future directions, making it a valuable resource for both practitioners and newcomers interested in leveraging LLM planning to support agentic workflows.
pdf
bib
abs
IAM: Efficient Inference through Attention Mapping between Different-scale LLMs
Yi Zhao
|
Zuchao Li
|
Hai Zhao
LLMs encounter significant challenges in resource consumption nowadays, especially with long contexts. Despite extensive efforts dedicated to enhancing inference efficiency, these methods primarily exploit internal sparsity within the models, without leveraging external information for optimization. We identify the high similarity of attention matrices across different-scale LLMs, which offers a novel perspective for optimization. We first conduct a comprehensive analysis of how to measure similarity, how to select mapping layers, and whether the mapping is consistent. Based on these insights, we introduce the IAM framework, which achieves dual benefits of accelerated attention computation and reduced KV cache usage by performing attention mapping between small and large LLMs. Our experimental results demonstrate that IAM can accelerate prefill by 15% and reduce KV cache usage by 22.1% without appreciably sacrificing performance. Experiments on different series of models show the generalizability of IAM. Importantly, it is also orthogonal to many existing KV cache optimization methods, making it a versatile addition to the current toolkit for enhancing LLM efficiency.
pdf
bib
abs
nvAgent: Automated Data Visualization from Natural Language via Collaborative Agent Workflow
Geliang Ouyang
|
Jingyao Chen
|
Zhihe Nie
|
Yi Gui
|
Yao Wan
|
Hongyu Zhang
|
Dongping Chen
*Natural Language to Visualization* (NL2Vis) seeks to convert natural-language descriptions into visual representations of given tables, empowering users to derive insights from large-scale data. Recent advancements in *Large Language Models* (LLMs) show promise in automating code generation to transform tabular data into accessible visualizations. However, they often struggle with complex queries that require reasoning across multiple tables. To address this limitation, we propose a collaborative agent workflow, termed **nvAgent**, for NL2Vis. Specifically, **nvAgent** comprises three agents: a processor agent for database processing and context filtering, a composer agent for planning visualization generation, and a validator agent for code translation and output verification. Comprehensive evaluations on the new VisEval benchmark demonstrate that **nvAgent** consistently surpasses state-of-the-art baselines, achieving a 7.88% improvement in single-table and a 9.23% improvement in multi-table scenarios. Qualitative analyses further highlight that **nvAgent** maintains nearly a 20% performance margin over previous models, underscoring its capacity to produce high-quality visual representations from complex, heterogeneous data sources. All datasets and source code are available at: [https://github.com/geliang0114/nvAgent](https://github.com/geliang0114/nvAgent).
pdf
bib
abs
ZIPA: A family of efficient models for multilingual phone recognition
Jian Zhu
|
Farhan Samir
|
Eleanor Chodroff
|
David R. Mortensen
We present ZIPA, a family of efficient speech models that advances the state-of-the-art performance of crosslinguistic phone recognition. We first curated IPA PACK++, a large-scale multilingual speech corpus with 17,000+ hours of normalized phone transcriptions and a novel evaluation set capturing unseen languages and sociophonetic variation. ZIPA, including transducer (ZIPA-T) and CTC-based (ZIPA-CR) variants, leverages the efficient Zipformer backbones and outperforms existing phone recognition systems with much fewer parameters. Further scaling via noisy student training on 11,000+ hours of pseudo-labeled multilingual data yields further improvement. While ZIPA achieves strong performance on benchmarks, error analysis reveals persistent limitations in modeling sociophonetic diversity, underscoring challenges for future research.
pdf
bib
abs
GRACE: A Granular Benchmark for Evaluating Model Calibration against Human Calibration
Yoo Yeon Sung
|
Eve Fleisig
|
Yu Hou
|
Ishan Upadhyay
|
Jordan Lee Boyd-Graber
Language models are often miscalibrated, leading to confidently incorrect answers. We introduce GRACE, a benchmark for language model calibration that incorporates comparison with human calibration. GRACE consists of question-answer pairs, in which each question contains a series of clues that gradually become easier, all leading to the same answer; models must answer correctly as early as possible as the clues are revealed. This setting permits granular measurement of model calibration based on how early, accurately, and confidently a model answers. After collecting these questions, we host live human vs. model competitions to gather 1,749 data points on human and model teams’ timing, accuracy, and confidence. We propose a metric, CalScore, that uses GRACE to analyze model calibration errors and identify types of model miscalibration that differ from human behavior. We find that although humans are less accurate than models, humans are generally better calibrated. Since state-of-the-art models struggle on GRACE, it effectively evaluates progress on improving model calibration.
pdf
bib
abs
Dynamic Evaluation with Cognitive Reasoning for Multi-turn Safety of Large Language Models
Lanxue Zhang
|
Yanan Cao
|
Yuqiang Xie
|
Fang Fang
|
Yangxi Li
The rapid advancement of Large Language Models (LLMs) poses significant challenges for safety evaluation. Current static datasets struggle to identify emerging vulnerabilities due to three limitations: (1) they risk being exposed in model training data, leading to evaluation bias; (2) their limited prompt diversity fails to capture real-world application scenarios; (3) they are limited in their ability to provide human-like multi-turn interactions. To address these limitations, we propose a dynamic evaluation framework, CogSafe, for comprehensive and automated multi-turn safety assessment of LLMs. We introduce CogSafe based on cognitive theories to simulate the real chatting process. To enhance assessment diversity, we introduce scenario simulation and strategy decision to guide the dynamic generation, enabling coverage of application situations. Furthermore, we incorporate the cognitive process to simulate multi-turn dialogues that reflect the cognitive dynamics of real-world interactions. Extensive experiments demonstrate the scalability and effectiveness of our framework, which has been applied to evaluate the safety of widely used LLMs.
pdf
bib
abs
From Tools to Teammates: Evaluating LLMs in Multi-Session Coding Interactions
Nathanaël Carraz Rakotonirina
|
Mohammed Hamdy
|
Jon Ander Campos
|
Lucas Weber
|
Alberto Testoni
|
Marzieh Fadaee
|
Sandro Pezzelle
|
Marco Del Tredici
Large Language Models (LLMs) are increasingly used in working environments for a wide range of tasks, excelling at solving individual problems in isolation. However, are they also able to effectively collaborate over long-term interactions? To investigate this, we introduce MemoryCode, a synthetic multi-session dataset designed to test LLMs’ ability to track and execute simple coding instructions amid irrelevant information, simulating a realistic setting. While all the models we tested handle isolated instructions well, even the performance of state-of-the-art models like GPT-4o deteriorates when instructions are spread across sessions. Our analysis suggests this is due to their failure to retrieve and integrate information over long interaction chains. Our results highlight a fundamental limitation of current LLMs, restricting their ability to collaborate effectively in long interactions.
pdf
bib
abs
Guiding not Forcing: Enhancing the Transferability of Jailbreaking Attacks on LLMs via Removing Superfluous Constraints
Junxiao Yang
|
Zhexin Zhang
|
Shiyao Cui
|
Hongning Wang
|
Minlie Huang
Jailbreaking attacks can effectively induce unsafe behaviors in Large Language Models (LLMs); however, the transferability of these attacks across different models remains limited. This study aims to understand and enhance the transferability of gradient-based jailbreaking methods, which are among the standard approaches for attacking white-box models. Through a detailed analysis of the optimization process, we introduce a novel conceptual framework to elucidate transferability and identify superfluous constraints—specifically, the response pattern constraint and the token tail constraint—as significant barriers to improved transferability. Removing these unnecessary constraints substantially enhances the transferability and controllability of gradient-based attacks. Evaluated on Llama-3-8B-Instruct as the source model, our method increases the overall Transfer Attack Success Rate (T-ASR) across a set of target models with varying safety levels from 18.4% to 50.3%, while also improving the stability and controllability of jailbreak behaviors on both source and target models.
pdf
bib
abs
Multilingual Text-to-Image Generation Magnifies Gender Stereotypes
Felix Friedrich
|
Katharina Hämmerl
|
Patrick Schramowski
|
Manuel Brack
|
Jindřich Libovický
|
Alexander Fraser
|
Kristian Kersting
Text-to-image (T2I) generation models have achieved great results in image quality, flexibility, and text alignment, leading to widespread use. Through improvements in multilingual abilities, a larger community can access this technology. Yet, we show that multilingual models suffer from substantial gender bias. Furthermore, the expectation that results should be similar across languages does not hold. We introduce MAGBIG, a controlled benchmark designed to study gender bias in multilingual T2I models, and use it to assess the impact of multilingualism on gender bias. To this end, we construct a set of multilingual prompts that offers a carefully controlled setting accounting for the complex grammatical differences influencing gender across languages. Our results show strong gender biases and notable language-specific differences across models. While we explore prompt engineering strategies to mitigate these biases, we find them largely ineffective and sometimes even detrimental to text-to-image alignment. Our analysis highlights the need for research on diverse language representations and greater control over bias in T2I models.
pdf
bib
abs
Adversarial Alignment with Anchor Dragging Drift (A3D2): Multimodal Domain Adaptation with Partially Shifted Modalities
Jun Sun
|
Xinxin Zhang
|
Simin Hong
|
Jian Zhu
|
Lingfang Zeng
Multimodal learning has celebrated remarkable success across diverse areas, yet faces the challenge of prohibitively expensive data collection and annotation when adapting models to new environments. In this context, domain adaptation has gained growing popularity as a technique for knowledge transfer, which, however, remains underexplored in multimodal settings compared with unimodal ones. This paper investigates multimodal domain adaptation, focusing on a practical partially shifting scenario where some modalities (referred to as anchors) remain domain-stable, while others (referred to as drifts) undergo a domain shift. We propose a bi-alignment scheme to simultaneously perform drift-drift and anchor-drift matching. The former is achieved through adversarial learning, aligning the representations of the drifts across source and target domains; the latter corresponds to an “anchor dragging drift” strategy, which matches the distributions of the drifts and anchors within the target domain using the optimal transport (OT) method. The overall design principle features Adversarial Alignment with Anchor Dragging Drift, abbreviated as A3D2, for multimodal domain adaptation with partially shifted modalities. Comprehensive empirical results verify the effectiveness of the proposed approach, and demonstrate that A3D2 achieves superior performance compared with state-of-the-art approaches. The code is available at: https://github.com/sunjunaimer/A3D2.git.
pdf
bib
abs
A Reality Check on Context Utilisation for Retrieval-Augmented Generation
Lovisa Hagström
|
Sara Vera Marjanovic
|
Haeun Yu
|
Arnav Arora
|
Christina Lioma
|
Maria Maistro
|
Pepa Atanasova
|
Isabelle Augenstein
Retrieval-augmented generation (RAG) helps address the limitations of parametric knowledge embedded within a language model (LM). In real-world settings, retrieved information can vary in complexity, yet most investigations of LM utilisation of context have been limited to synthetic text. We introduce DRUID (Dataset of Retrieved Unreliable, Insufficient and Difficult-to-understand contexts) with real-world queries and contexts manually annotated for stance. The dataset is based on the prototypical task of automated claim verification, for which automated retrieval of real-world evidence is crucial. We compare DRUID to synthetic datasets (CounterFact, ConflictQA) and find that artificial datasets often fail to represent the complexity and diversity of realistically retrieved context. We show that synthetic datasets exaggerate context characteristics rare in real retrieved data, which leads to inflated context utilisation results, as measured by our novel ACU score. Moreover, while previous work has mainly focused on singleton context characteristics to explain context utilisation, correlations between singleton context properties and ACU on DRUID are surprisingly small compared to other properties related to context source. Overall, our work underscores the need for real-world aligned context utilisation studies to represent and improve performance in real-world RAG settings.
pdf
bib
abs
CU-MAM: Coherence-Driven Unified Macro-Structures for Argument Mining
Debela Gemechu
|
Chris Reed
Argument Mining (AM) involves the automatic identification of argument structure in natural language. Traditional AM methods rely on micro-structural features derived from the internal properties of individual Argumentative Discourse Units (ADUs). However, argument structure is shaped by a macro-structure capturing the functional interdependence among ADUs. This macro-structure consists of segments, where each segment contains ADUs that fulfill specific roles to maintain coherence within the segment (**local coherence**) and across segments (**global coherence**). This paper presents an approach that models macro-structure, capturing both local and global coherence to identify argument structures. Experiments on heterogeneous datasets demonstrate superior performance in both in-dataset and cross-dataset evaluations. The cross-dataset evaluation shows that macro-structure enhances transferability to unseen datasets.
pdf
bib
abs
Safer or Luckier? LLMs as Safety Evaluators Are Not Robust to Artifacts
Hongyu Chen
|
Seraphina Goldfarb-Tarrant
Large Language Models (LLMs) are increasingly employed as automated evaluators to assess the safety of generated content, yet their reliability in this role remains uncertain. This study evaluates a diverse set of 11 LLM judge models across critical safety domains, examining three key aspects: self-consistency in repeated judging tasks, alignment with human judgments, and susceptibility to input artifacts such as apologetic or verbose phrasing. Our findings reveal that biases in LLM judges can significantly distort the final verdict on which content source is safer, undermining the validity of comparative evaluations. Notably, apologetic language artifacts alone can skew evaluator preferences by up to 98%. Contrary to expectations, larger models do not consistently exhibit greater robustness, while smaller models sometimes show higher resistance to specific artifacts. To mitigate LLM evaluator robustness issues, we investigate jury-based evaluations aggregating decisions from multiple models. Although this approach both improves robustness and enhances alignment to human judgements, artifact sensitivity persists even with the best jury configurations. These results highlight the urgent need for diversified, artifact-resistant methodologies to ensure reliable safety assessments.
pdf
bib
abs
Text-to-ES Bench: A Comprehensive Benchmark for Converting Natural Language to Elasticsearch Query
DonggeXue DonggeXue
|
Zhili Pu
|
Zhentao Xia
|
Hongli Sun
|
Ruihui Hou
|
Guangya Yu
|
Yupian Lin
|
Yongqi Fan
|
Jingping Liu
|
Tong Ruan
Elasticsearch (ES) is a distributed RESTful search engine optimized for large-scale and long-text search scenarios. Recent research on text-to-Query has explored using large language models (LLMs) to convert user query intent to executable code, making it an increasingly popular research topic. To our knowledge, we are the first to introduce the novel semantic parsing task text-to-ES. To bridge the gap between LLMs and ES, we leverage LLMs and employ domain experts to generate ES query bodies, which are written in a Domain-Specific Language (DSL), along with the corresponding post-processing code to support multi-index ES queries. Consequently, we propose the text-to-ES benchmark that consists of two datasets: Large Elasticsearch Dataset (LED), containing 26,207 text-ES pairs derived from a 224.9GB schema-free database, and ElasticSearch (BirdES) with 10,926 pairs sourced from the Bird dataset on a 33.4GB schema-fixed database. Compared with fourteen advanced LLMs and six code-based LLMs, the model we trained outperformed DeepSeek-R1 by 15.64% on the LED dataset, setting a new state-of-the-art, and achieved 78% of DeepSeek-R1’s performance on the BirdES dataset. Additionally, we provide in-depth experimental analyses and suggest future research directions for this task. Our datasets are available at https://huggingface.co/datasets/Barry1915/Text-to-ES.
pdf
bib
abs
AlignDistil: Token-Level Language Model Alignment as Adaptive Policy Distillation
Songming Zhang
|
Xue Zhang
|
Tong Zhang
|
Bojie Hu
|
Yufeng Chen
|
Jinan Xu
In modern large language models (LLMs), LLM alignment is of crucial importance and is typically achieved through methods such as reinforcement learning from human feedback (RLHF) and direct preference optimization (DPO). However, in most existing methods for LLM alignment, all tokens in the response are optimized using a sparse, response-level reward or preference annotation. Ignoring token-level rewards may erroneously punish high-quality tokens or encourage low-quality tokens, resulting in suboptimal performance and slow convergence speed. To address this issue, we propose AlignDistil, an RLHF-equivalent distillation method for token-level reward optimization. Specifically, we introduce the reward learned by DPO into the RLHF objective and theoretically prove the equivalence between this objective and a token-level distillation process, where the teacher distribution linearly combines the logits from the DPO model and a reference model. On this basis, we further bridge the accuracy gap between the reward from the DPO model and the pure reward model, by building a contrastive DPO reward with a normal and a reverse DPO model. Moreover, to avoid under- and over-optimization on different tokens, we design a token adaptive logit extrapolation mechanism to construct an appropriate teacher distribution for each token. Experimental results demonstrate the superiority of our AlignDistil over existing methods and showcase fast convergence due to its token-level distributional reward optimization.
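As a rough illustration of the idea that a teacher distribution can be formed by linearly combining the logits of a DPO model and a reference model, the sketch below mixes two logit vectors for a single token position and normalizes the result with a softmax. The function names, the `weight` parameter, and the specific form `weight * dpo + (1 - weight) * ref` are assumptions made for illustration; the abstract does not give the paper's exact coefficients or its adaptive extrapolation rule.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def teacher_distribution(dpo_logits, ref_logits, weight):
    """Hypothetical teacher for one token position.

    Linearly combines the two logit vectors and normalizes them.
    `weight` is an illustrative assumption, not the paper's coefficient;
    values above 1 extrapolate past the DPO model, away from the reference.
    """
    combined = [weight * d + (1.0 - weight) * r
                for d, r in zip(dpo_logits, ref_logits)]
    return softmax(combined)

# Toy vocabulary of three tokens.
dpo = [2.0, 0.5, -1.0]
ref = [1.0, 1.0, 0.0]
print(teacher_distribution(dpo, ref, weight=1.5))
```

With `weight=0` the teacher reduces to the reference distribution and with `weight=1` to the DPO distribution, which makes the role of the mixing coefficient easy to inspect on toy inputs.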
pdf
bib
abs
DARS: Dynamic Action Re-Sampling to Enhance Coding Agent Performance by Adaptive Tree Traversal
Vaibhav Aggarwal
|
Ojasv Kamal
|
Abhinav Japesh
|
Zhijing Jin
|
Bernhard Schölkopf
Large Language Models (LLMs) have revolutionized various domains, including natural language processing, data analysis, and software development, by enabling automation. In software engineering, LLM-powered coding agents have garnered significant attention due to their potential to automate complex development tasks, assist in debugging, and enhance productivity. However, existing approaches often struggle with sub-optimal decision-making, requiring either extensive manual intervention or inefficient compute scaling strategies. To improve coding agent performance, we present Dynamic Action Re-Sampling (DARS), a novel inference-time compute scaling approach for coding agents that is faster and more effective at recovering from sub-optimal decisions compared to baselines. While traditional agents either follow linear trajectories or rely on random sampling for scaling compute, DARS branches out a trajectory at certain key decision points by taking an alternative action given the history of the trajectory and the execution feedback of the previous attempt from that point. We evaluate our approach on the SWE-Bench Lite benchmark, demonstrating that this scaling strategy achieves a pass@k score of 55% with Claude 3.5 Sonnet V2. Our framework achieves a pass@1 rate of 47%, outperforming state-of-the-art (SOTA) open-source frameworks.
pdf
bib
abs
Steering off Course: Reliability Challenges in Steering Language Models
Patrick Queiroz Da Silva
|
Hari Sethuraman
|
Dheeraj Rajagopal
|
Hannaneh Hajishirzi
|
Sachin Kumar
Steering methods for language models (LMs) have gained traction as lightweight alternatives to fine-tuning, enabling targeted modifications to model activations. However, prior studies primarily report results on a few models, leaving critical gaps in understanding the robustness of these methods. In this work, we systematically examine three prominent steering methods—DoLa, function vectors, and task vectors. In contrast to the original studies, which evaluated a handful of models, we test up to 36 models belonging to 14 families with sizes ranging from 1.5B to 70B parameters. Our experiments reveal substantial variability in the effectiveness of the steering approaches, with a large number of models showing no improvement and at times degradation in steering performance. Our analysis reveals fundamental flaws in the assumptions underlying these methods, challenging their reliability as scalable steering solutions.
pdf
bib
abs
Impartial Multi-task Representation Learning via Variance-invariant Probabilistic Decoding
Dou Hu
|
Lingwei Wei
|
Wei Zhou
|
Songlin Hu
Multi-task learning (MTL) enhances efficiency by sharing representations across tasks, but task dissimilarities often cause partial learning, where some tasks dominate while others are neglected. Existing methods mainly focus on balancing loss or gradients but fail to fundamentally address this issue due to the representation discrepancy in latent space. In this paper, we propose variance-invariant probabilistic decoding for multi-task learning (VIP-MTL), a framework that ensures impartial learning by harmonizing representation spaces across tasks. VIP-MTL decodes shared representations into task-specific probabilistic distributions and applies variance normalization to constrain these distributions to a consistent scale. Experiments on two language benchmarks show that VIP-MTL outperforms 12 representative methods under the same multi-task settings, especially in heterogeneous task combinations and data-constrained scenarios. Further analysis shows that VIP-MTL is robust to sampling distributions, efficient in optimization, and scale-invariant to task losses. Additionally, the learned task-specific representations are more informative, enhancing the language understanding abilities of pre-trained language models under the multi-task paradigm.
pdf
bib
abs
If Eleanor Rigby Had Met ChatGPT: A Study on Loneliness in a Post-LLM World
Adrian de Wynter
**Warning: this paper discusses content related, but not limited to, violence, sex, and suicide.** Loneliness, or the lack of fulfilling relationships, significantly impacts a person’s mental and physical well-being and is prevalent worldwide. Previous research suggests that large language models (LLMs) may help mitigate loneliness. However, we argue that the use of widespread LLMs in services like ChatGPT is more prevalent, and riskier, as they are not designed for this purpose. To explore this, we analysed user interactions with ChatGPT outside of its marketed use as a task-oriented assistant. In dialogues classified as lonely, users frequently (37%) sought advice or validation, and received good engagement. However, ChatGPT failed in sensitive scenarios, like responding appropriately to suicidal ideation or trauma. We also observed a 35% higher incidence of toxic content, with women being 22× more likely to be targeted than men. Our findings underscore ethical and legal questions about this technology, and note risks like radicalisation or further isolation. We conclude with recommendations to research and industry to address loneliness.
pdf
bib
abs
Integrating Audio, Visual, and Semantic Information for Enhanced Multimodal Speaker Diarization on Multi-party Conversation
Luyao Cheng
|
Hui Wang
|
Chong Deng
|
Siqi Zheng
|
Yafeng Chen
|
Rongjie Huang
|
Qinglin Zhang
|
Qian Chen
|
Xihao Li
|
Wen Wang
Speaker diarization aims to segment an audio stream into homogeneous partitions based on speaker identity, playing a crucial role in speech comprehension and analysis. Mainstream speaker diarization systems rely only on acoustic information, making the task particularly challenging in complex acoustic environments in real-world applications. Recently, significant efforts have been devoted to audio-visual or audio-semantic multimodal modeling to enhance speaker diarization performance; however, these approaches still struggle to address the complexities of speaker diarization on spontaneous and unstructured multi-party conversations. To fully exploit meaningful dialogue patterns, we propose a novel multimodal approach that jointly utilizes audio, visual, and semantic cues to enhance speaker diarization. Our approach structures visual cues among active speakers and semantic cues in spoken content into a cohesive format known as pairwise constraints, and employs a semi-supervised clustering technique based on pairwise constrained propagation. Extensive experiments conducted on multiple multimodal datasets demonstrate that our approach effectively integrates audio-visual-semantic information into the clustering process for acoustic speaker embeddings and consistently outperforms state-of-the-art speaker diarization methods, while largely preserving the overall system framework.
pdf
bib
abs
Vulnerability of LLMs to Vertically Aligned Text Manipulations
Zhecheng Li
|
Yiwei Wang
|
Bryan Hooi
|
Yujun Cai
|
Zhen Xiong
|
Nanyun Peng
|
Kai-Wei Chang
Vertical text input is commonly encountered in various real-world applications, such as mathematical computations and word-based Sudoku puzzles. While current large language models (LLMs) have excelled in natural language tasks, they remain vulnerable to variations in text formatting. Recent research demonstrates that modifying input formats, such as vertically aligning words for encoder-based models, can substantially lower accuracy in text classification tasks. While easily understood by humans, these inputs can significantly mislead models, posing a potential risk of bypassing detection in real-world scenarios involving harmful or sensitive information. With the expanding application of LLMs, a crucial question arises: Do decoder-based LLMs exhibit similar vulnerabilities to vertically formatted text input? In this paper, we investigate the impact of vertical text input on the performance of various LLMs across multiple text classification datasets and analyze the underlying causes. Our findings are as follows: (i) Vertical text input significantly degrades the accuracy of LLMs in text classification tasks. (ii) Chain of Thought (CoT) reasoning does not help LLMs recognize vertical input or mitigate its vulnerability, but few-shot learning with careful analysis does. (iii) We explore the underlying cause of the vulnerability by analyzing the inherent issues in tokenization and attention matrices.
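To make the notion of vertically aligned text concrete, here is a minimal sketch that writes each word top to bottom in its own column. The function name `verticalize` and the exact layout (one space between columns, shorter words padded with spaces) are illustrative assumptions; the abstract does not specify the formatting used in the paper's experiments.

```python
from itertools import zip_longest

def verticalize(words):
    """Write each word top-to-bottom in its own column.

    Columns are separated by one space and shorter words are padded
    with spaces (an assumed layout, chosen only for illustration).
    """
    rows = zip_longest(*words, fillvalue=" ")
    return "\n".join(" ".join(row).rstrip() for row in rows)

print(verticalize(["bad", "movie"]))
# b m
# a o
# d v
#   i
#   e
```

A human reader recovers "bad movie" by reading down each column, whereas a tokenizer sees the characters row by row, interleaved across words.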
pdf
bib
abs
AutoMixer: Checkpoint Artifacts as Automatic Data Mixers
Ernie Chang
|
Yang Li
|
Patrick Huber
|
Vish Vogeti
|
David Kant
|
Yangyang Shi
|
Vikas Chandra
In language model training, it is desirable to equip models with capabilities from various tasks. However, it is not clear how to directly obtain the right data mixtures for these capabilities, as the relationship between data and tasks is difficult to model. In this work, we observe that checkpoint models exhibit emerging capabilities at different points in the training trajectory. Often, the training process saves checkpoints as artifacts that are under-utilized as a source of in-training data signals. We identify these artifact models based on their respective capabilities on the benchmarks and leverage them as data mixers by using their aggregated first-order influence approximation over source data. We demonstrate on eight reasoning benchmarks that the proposed framework shows significant improvements in the pretraining setting, with accuracy increases of up to 1.93%. Overall, this demonstrates the potential of checkpoint models to enhance data quality and optimize data mixtures.
pdf
bib
abs
Generalized Attention Flow: Feature Attribution for Transformer Models via Maximum Flow
Behrooz Azarkhalili
|
Maxwell W. Libbrecht
This paper introduces Generalized Attention Flow (GAF), a novel feature attribution method for Transformer-based models to address the limitations of current approaches. By extending Attention Flow and replacing attention weights with the generalized Information Tensor, GAF integrates attention weights, their gradients, the maximum flow problem, and the barrier method to enhance the performance of feature attributions. The proposed method exhibits key theoretical properties and mitigates the shortcomings of prior techniques that rely solely on simple aggregation of attention weights. Our comprehensive benchmarking on sequence classification tasks demonstrates that a specific variant of GAF consistently outperforms state-of-the-art feature attribution methods in most evaluation settings, providing a more reliable interpretation of Transformer model outputs.
pdf
bib
abs
Beyond Prompting: An Efficient Embedding Framework for Open-Domain Question Answering
Zhanghao Hu
|
Hanqi Yan
|
Qinglin Zhu
|
Zhenyi Shen
|
Yulan He
|
Lin Gui
Large language models (LLMs) have recently pushed open-domain question answering (ODQA) to new frontiers. However, prevailing retriever–reader pipelines often depend on multiple rounds of prompt-level instructions, leading to high computational overhead, instability, and suboptimal retrieval coverage. In this paper, we propose EmbQA, an embedding-level framework that alleviates these shortcomings by enhancing both the retriever and the reader. Specifically, we refine query representations via lightweight linear layers under an unsupervised contrastive learning objective, thereby reordering retrieved passages to highlight those most likely to contain correct answers. Additionally, we introduce an exploratory embedding that broadens the model’s latent semantic space to diversify candidate generation and employs an entropy-based selection mechanism to choose the most confident answer automatically. Extensive experiments across three open-source LLMs, three retrieval methods, and four ODQA benchmarks demonstrate that EmbQA substantially outperforms recent baselines in both accuracy and efficiency.
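The entropy-based selection step can be pictured with a small sketch: among several candidate answers, keep the one whose token distributions have the lowest average Shannon entropy. How candidates are scored in the paper is not stated in the abstract, so the mean per-token entropy, the `select_most_confident` helper, and the toy candidates below are all assumptions for illustration only.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of one probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def select_most_confident(candidates):
    """Return the answer whose per-token distributions have the lowest
    mean entropy.

    `candidates` maps an answer string to a list of per-token probability
    distributions (a hypothetical input format).
    """
    def mean_entropy(dists):
        return sum(entropy(d) for d in dists) / len(dists)
    return min(candidates, key=lambda answer: mean_entropy(candidates[answer]))

# Two toy candidates, each generated with a single token.
candidates = {
    "Paris": [[0.9, 0.1]],   # peaked distribution, low entropy
    "Lyon": [[0.5, 0.5]],    # flat distribution, high entropy
}
print(select_most_confident(candidates))  # Paris
```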
pdf
bib
abs
AIR-Bench: Automated Heterogeneous Information Retrieval Benchmark
Jianlyu Chen
|
Nan Wang
|
Chaofan Li
|
Bo Wang
|
Shitao Xiao
|
Han Xiao
|
Hao Liao
|
Defu Lian
|
Zheng Liu
Evaluation plays a crucial role in the advancement of information retrieval (IR) models. However, current benchmarks, which are based on predefined domains and human-labeled data, face limitations in addressing evaluation needs for emerging domains both cost-effectively and efficiently. To address this challenge, we propose the Automated Heterogeneous Information Retrieval Benchmark (AIR-Bench). AIR-Bench is distinguished by three key features: 1) Automated. The testing data in AIR-Bench is automatically generated by large language models (LLMs) without human intervention. 2) Heterogeneous. The testing data in AIR-Bench is generated with respect to diverse tasks, domains and languages. 3) Dynamic. The domains and languages covered by AIR-Bench are constantly augmented to provide an increasingly comprehensive evaluation benchmark for community developers. We develop a reliable and robust data generation pipeline to automatically create diverse and high-quality evaluation datasets based on real-world corpora. Our findings demonstrate that the generated testing data in AIR-Bench aligns well with human-labeled testing data, making AIR-Bench a dependable benchmark for evaluating IR models. The resources in AIR-Bench are publicly available at https://github.com/AIR-Bench/AIR-Bench.
pdf
bib
abs
We-Math: Does Your Large Multimodal Model Achieve Human-like Mathematical Reasoning?
Runqi Qiao
|
Qiuna Tan
|
Guanting Dong
|
MinhuiWu MinhuiWu
|
Chong Sun
|
Xiaoshuai Song
|
Jiapeng Wang
|
Zhuoma GongQue
|
Shanglin Lei
|
YiFan Zhang
|
Zhe Wei
|
Miaoxuan Zhang
|
Runfeng Qiao
|
Xiao Zong
|
Yida Xu
|
Peiqing Yang
|
Zhimin Bao
|
Muxi Diao
|
Chen Li
|
Honggang Zhang
Visual mathematical reasoning, as a fundamental visual reasoning ability, has received widespread attention from the Large Multimodal Models (LMMs) community. Existing benchmarks focus mainly on end-to-end performance but neglect the underlying principles of knowledge acquisition and generalization. Instead, we introduce WE-MATH, the first benchmark specifically designed to explore the problem-solving principles. We meticulously collect 6.5K visual math problems and decompose them into 10.9K step-level questions for evaluation, spanning 5 layers of knowledge granularity and 67 hierarchical knowledge concepts. Specifically, we decompose composite problems into sub-problems according to the required knowledge concepts and introduce a novel four-dimensional metric to hierarchically assess inherent issues in LMMs’ reasoning process. With WE-MATH, we conduct a thorough evaluation of existing LMMs in visual mathematical reasoning and provide comprehensive analysis and insight for future development. We anticipate that WE-MATH will open new pathways for advancements in visual mathematical reasoning for LMMs. Data and code are available at https://github.com/We-Math/We-Math.
pdf
bib
abs
Modeling the Evolution of English Noun Compounds with Feature-Rich Diachronic Compositionality Prediction
Filip Miletić
|
Sabine Schulte Im Walde
We analyze the evolution of English noun compounds, which we represent as vectors of time-specific values. We implement a wide array of methods to create a rich set of features, using them to classify compounds for present-day compositionality and to assess the informativeness of the corresponding linguistic patterns. Our best results use BERT – reflecting the similarity of compounds and sentence contexts – and we further capture relevant and complementary information across approaches. Leveraging these feature differences, we find that the development of low-compositional meanings is reflected by a parallel drop in compositionality and sustained semantic change. The same distinction is echoed in transformer processing: compositionality estimates require far less contextualization than semantic change estimates.
pdf
bib
abs
What’s the Difference? Supporting Users in Identifying the Effects of Prompt and Model Changes Through Token Patterns
Michael A. Hedderich
|
Anyi Wang
|
Raoyuan Zhao
|
Florian Eichin
|
Jonas Fischer
|
Barbara Plank
Prompt engineering for large language models is challenging, as even small prompt perturbations or model changes can significantly impact the generated output texts. Existing evaluation methods of LLM outputs, either automated metrics or human evaluation, have limitations, such as providing limited insights or being labor-intensive. We propose Spotlight, a new approach that combines both automation and human analysis. Based on data mining techniques, we automatically distinguish between random (decoding) variations and systematic differences in language model outputs. This process provides token patterns that describe the systematic differences and guide the user in manually analyzing the effects of their prompts and changes in models efficiently. We create three benchmarks to quantitatively test the reliability of token pattern extraction methods and demonstrate that our approach provides new insights into established prompt data. From a human-centric perspective, through demonstration studies and a user study, we show that our token pattern approach helps users understand the systematic differences of language model outputs. We are further able to discover relevant differences caused by prompt and model changes (e.g. related to gender or culture), thus supporting the prompt engineering process and human-centric model behavior research.
pdf
bib
abs
V-Oracle: Making Progressive Reasoning in Deciphering Oracle Bones for You and Me
Runqi Qiao
|
Qiuna Tan
|
Guanting Dong
|
MinhuiWu MinhuiWu
|
Jiapeng Wang
|
YiFan Zhang
|
Zhuoma GongQue
|
Chong Sun
|
Yida Xu
|
Yadong Xue
|
Ye Tian
|
Zhimin Bao
|
Lan Yang
|
Chen Li
|
Honggang Zhang
Oracle Bone Script (OBS) is a vital treasure of human civilization, rich in insights from ancient societies. However, the evolution of written language over millennia complicates its decipherment. In this paper, we propose V-Oracle, an innovative framework that utilizes Large Multi-modal Models (LMMs) for interpreting OBS. V-Oracle applies principles of pictographic character formation and frames the task as a visual question-answering (VQA) problem, establishing a multi-step reasoning chain. It proposes a multi-dimensional data augmentation for synthesizing high-quality OBS samples, and also implements a multi-phase oracle alignment tuning to improve LMMs’ visual reasoning capabilities. Moreover, to bridge the evaluation gap in the OBS field, we further introduce Oracle-Bench, a comprehensive benchmark that emphasizes process-oriented assessment and incorporates both standard and out-of-distribution setups for realistic evaluation. Extensive experimental results demonstrate the effectiveness of our method in providing quantitative analyses and superior deciphering capability.
pdf
bib
abs
Unveiling Cultural Blind Spots: Analyzing the Limitations of mLLMs in Procedural Text Comprehension
Amir Hossein Yari
|
Fajri Koto
Despite the impressive performance of multilingual large language models (mLLMs) in various natural language processing tasks, their ability to understand procedural texts, particularly those with culture-specific content, remains largely unexplored. Texts describing cultural procedures, including rituals, traditional craftsmanship, and social etiquette, require an inherent understanding of cultural context, presenting a significant challenge for mLLMs. In this work, we introduce CAPTex, a benchmark designed to evaluate mLLMs’ ability to process and reason over culturally diverse procedural texts in multiple languages. Using a range of evaluation methods, we find that (1) mLLMs struggle with culturally contextualized procedural content, particularly in low-resource languages; (2) performance varies across cultural domains, with some proving more difficult than others; and (3) models perform better on multiple-choice tasks presented in conversational formats than on direct questions. These results highlight the current limitations of mLLMs and emphasize the need for culturally informed benchmarks like CAPTex to support more accurate and inclusive language understanding.
pdf
bib
abs
Improving Language and Modality Transfer in Translation by Character-level Modeling
Ioannis Tsiamas
|
David Dale
|
Marta R. Costa-jussà
Current translation systems, despite being highly multilingual, cover only 5% of the world’s languages. Expanding language coverage to the long-tail of low-resource languages requires data-efficient methods that rely on cross-lingual and cross-modal knowledge transfer. To this end, we propose a character-based approach to improve adaptability to new languages and modalities. Our method leverages SONAR, a multilingual fixed-size embedding space with different modules for encoding and decoding. We use a teacher-student approach with parallel translation data to obtain a character-level encoder. Then, using ASR data, we train a lightweight adapter to connect a massively multilingual CTC ASR model (MMS), to the character-level encoder, potentially enabling speech translation from 1,000+ languages. Experimental results in text translation for 75 languages on FLORES+ demonstrate that our character-based approach can achieve better language transfer than traditional subword-based models, especially outperforming them in low-resource settings, and demonstrating better zero-shot generalizability to unseen languages. Our speech adaptation, maximizing knowledge transfer from the text modality, achieves state-of-the-art results in speech-to-text translation on the FLEURS benchmark on 33 languages, surpassing previous supervised and cascade models, albeit being a zero-shot model with minimal supervision from ASR data.
pdf
bib
abs
DialUp! Modeling the Language Continuum by Adapting Models to Dialects and Dialects to Models
Niyati Bafna
|
Emily Chang
|
Nathaniel Romney Robinson
|
David R. Mortensen
|
Kenton Murray
|
David Yarowsky
|
Hale Sirin
Most of the world’s languages and dialects are low-resource, and lack support in mainstream machine translation (MT) models. However, many of them have a closely-related high-resource language (HRL) neighbor, and differ in linguistically regular ways from it. This underscores the importance of model robustness to dialectal variation and cross-lingual generalization to the HRL dialect continuum. We present DialUp, consisting of a training-time technique for adapting a pretrained model to dialectal data (M–>D), and an inference-time intervention adapting dialectal data to the model expertise (D–>M). M–>D induces model robustness to potentially unseen and unknown dialects by exposure to synthetic data exemplifying linguistic mechanisms of dialectal variation, whereas D–>M treats dialectal divergence for known target dialects. These methods show considerable performance gains for several dialects from four language families, and modest gains for two other language families. We also conduct feature and error analyses, which show that language varieties with low baseline MT performance are more likely to benefit from these approaches.
pdf
bib
abs
AutoMixAlign: Adaptive Data Mixing for Multi-Task Preference Optimization in LLMs
Nicholas E. Corrado
|
Julian Katz-Samuels
|
Adithya M Devraj
|
Hyokun Yun
|
Chao Zhang
|
Yi Xu
|
Yi Pan
|
Bing Yin
|
Trishul Chilimbi
When aligning large language models (LLMs), their performance across various tasks (such as being helpful, harmless, and honest) is heavily influenced by the composition of the training data. However, it is difficult to determine what mixture of data should be used to produce a model with strong performance across all tasks. Existing approaches rely on large ablation studies, heuristics, or human intuition, though these can be prohibitively expensive and suboptimal. We study this problem in the context of preference optimization via DPO and propose a novel and theoretically justified algorithm, AutoMixAlign (AMA), that adaptively mixes datasets during LLM training to balance performance across multiple tasks. AMA first trains specialist models for each task to determine losses that correspond to strong task performance. Next, AMA trains a generalist model using a novel minimax optimization that prioritizes tasks for which generalist model losses are furthest from specialist model losses. We introduce two algorithms to optimize this problem: (1) AMA-R adaptively reweights the objective to prioritize tasks, and (2) AMA-S adaptively adjusts how much data is sampled from each task to prioritize tasks. Both algorithms achieve a convergence rate of O(1/√T) in the convex case. AMA-R’s convergence result immediately follows from Sagawa et al., 2019, and we provide a convergence proof for AMA-S using techniques from online learning such as EXP3 (Auer et al., 2002). We evaluate AMA on several multitask alignment setups, and observe that AMA outperforms the standard alignment approach, which simply optimizes the total loss across all tasks, and also outperforms model-merging methods.
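The reweighting idea behind AMA-R can be sketched as follows. The abstract does not give the update rule, so this sketch assumes an exponentiated-gradient step on task weights, driven by the excess of the generalist loss over the specialist loss and clipped at zero; the function names, the clipping, and the step size are illustrative assumptions rather than the authors' algorithm.

```python
import math
from typing import List


def update_task_weights(
    weights: List[float],
    generalist_losses: List[float],
    specialist_losses: List[float],
    step_size: float = 1.0,
) -> List[float]:
    """One exponentiated-gradient step: upweight tasks with large excess loss."""
    # Excess loss: how far the generalist is from the specialist reference
    # (clipping at zero is an assumption of this sketch).
    excess = [max(g - s, 0.0) for g, s in zip(generalist_losses, specialist_losses)]
    unnormalized = [w * math.exp(step_size * e) for w, e in zip(weights, excess)]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


def weighted_objective(
    weights: List[float],
    generalist_losses: List[float],
    specialist_losses: List[float],
) -> float:
    """Weighted sum of excess losses that the generalist would minimize."""
    return sum(
        w * max(g - s, 0.0)
        for w, g, s in zip(weights, generalist_losses, specialist_losses)
    )


# Hypothetical two-task example: task 0 lags its specialist, task 1 matches it.
weights = update_task_weights([0.5, 0.5], [1.0, 0.5], [0.5, 0.5])
```

In a training loop the weight update and a gradient step on the weighted objective would alternate, so tasks that remain far from their specialist reference receive progressively more weight.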
pdf
bib
abs
Modeling Complex Semantics Relation with Contrastively Fine-Tuned Relational Encoders
Naïm Es-sebbani
|
Esteban Marquer
|
Zied Bouraoui
Modeling relationships between concepts and entities is essential for many applications. While Large Language Models (LLMs) capture relational and commonsense knowledge effectively, they are computationally expensive and often underperform in tasks requiring efficient relational encoding, such as relation induction, extraction, and information retrieval. Despite advancements in learning relational embeddings, existing methods often fail to capture nuanced representations and the rich semantics needed for high-quality embeddings. In this work, we propose different relational encoders designed to capture diverse relational aspects and semantic properties of entity pairs. Although several datasets exist for training such encoders, they often rely on structured knowledge bases or predefined schemas, which primarily encode simple and static relations. To overcome this limitation, we also introduce a novel dataset generation method leveraging LLMs to create a diverse spectrum of relationships. Our experiments demonstrate the effectiveness of our proposed encoders and the benefits of our generated dataset.
pdf
bib
abs
Error-driven Data-efficient Large Multimodal Model Tuning
Barry Menglong Yao
|
Qifan Wang
|
Lifu Huang
Large Multimodal Models (LMMs) have demonstrated impressive performance across numerous academic benchmarks. However, fine-tuning still remains essential to achieve satisfactory performance on downstream tasks, while the task-specific tuning samples are usually not readily available or expensive and time-consuming to obtain. To address this, we propose an error-driven data-efficient tuning framework that aims to efficiently adapt generic LMMs to newly emerging tasks without requiring extensive task-specific training samples. In our approach, a generic LMM, acting as a student model, is first evaluated on a small validation set of the target task, and then a more powerful model, acting as a teacher model, identifies the erroneous steps within the student model’s reasoning steps and analyzes its capability gaps from fully addressing the target task. Based on these gaps, targeted training samples are further retrieved from existing task-agnostic datasets to tune the student model and tailor it to the target task. We perform extensive experiments across three different training data scales and seven tasks, demonstrating that our training paradigm significantly and efficiently improves LMM’s performance on downstream tasks, achieving an average performance boost of 7.01%.
pdf
bib
abs
Planning with Diffusion Models for Target-Oriented Dialogue Systems
Hanwen Du
|
Bo Peng
|
Xia Ning
Target-Oriented Dialogue (TOD) remains a significant challenge in the LLM era, where strategic dialogue planning is crucial for directing conversations toward specific targets. However, existing dialogue planning methods generate dialogue plans in a step-by-step sequential manner, and may suffer from compounding errors and myopic actions. To address these limitations, we introduce a novel dialogue planning framework, DiffTOD, which leverages diffusion models to enable non-sequential dialogue planning. DiffTOD formulates dialogue planning as a trajectory generation problem with conditional guidance, and leverages a diffusion language model to estimate the likelihood of the dialogue trajectory. To optimize the dialogue action strategies, DiffTOD introduces three tailored guidance mechanisms for different target types, offering flexible guidance toward diverse TOD targets at test time. Extensive experiments across three diverse TOD settings show that DiffTOD can effectively perform non-myopic lookahead exploration and optimize action strategies over a long horizon through non-sequential dialogue planning, and demonstrates strong flexibility across complex and diverse dialogue scenarios. Our code and data are accessible through https://github.com/ninglab/DiffTOD.
pdf
bib
abs
Interactive and Expressive Code-Augmented Planning with Large Language Models
Anthony Zhe Liu
|
Xinhe Wang
|
Jacob Sansom
|
Yao Fu
|
Jongwook Choi
|
Sungryull Sohn
|
Jaekyeom Kim
|
Honglak Lee
Large Language Models (LLMs) demonstrate strong abilities in common-sense reasoning and interactive decision-making, but often struggle with complex, long-horizon planning tasks. Recent techniques have sought to structure LLM outputs using control flow and code to improve planning performance. However, code-based approaches can be error-prone and insufficient for handling ambiguous or unstructured data. To address these challenges, we propose REPL-Plan, an LLM planning approach that is fully code-expressive (it can utilize all the benefits of code) while also being dynamic (it can flexibly adapt from errors and use the LLM for soft reasoning). In REPL-Plan, an LLM solves tasks by interacting with a Read-Eval-Print Loop (REPL), which iteratively executes and evaluates code, similar to language shells or interactive code notebooks, allowing the model to flexibly correct errors and handle tasks dynamically. We demonstrate that REPL-Plan achieves strong results across various planning domains compared to previous methods.
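The interaction pattern described above, where a model proposes code, a REPL executes it, and the printed output or error is fed back, can be sketched in a few lines. The scripted policy below is a stand-in for an LLM, and every name here is a hypothetical illustration rather than the REPL-Plan implementation.

```python
import contextlib
import io
import traceback
from typing import Callable, Dict, List


def repl_step(code: str, namespace: Dict[str, object]) -> str:
    """Execute one snippet in a persistent namespace; return its output or error text."""
    buffer = io.StringIO()
    try:
        with contextlib.redirect_stdout(buffer):
            exec(code, namespace)
    except Exception:
        # The error text becomes the observation, so the next proposal can react to it.
        return buffer.getvalue() + traceback.format_exc(limit=1)
    return buffer.getvalue()


def run_repl(propose: Callable[[List[str]], str], max_turns: int = 5) -> List[str]:
    """Alternate between proposing code and observing the REPL's response."""
    namespace: Dict[str, object] = {}
    observations: List[str] = []
    for _ in range(max_turns):
        code = propose(observations)
        if code == "DONE":
            break
        observations.append(repl_step(code, namespace))
    return observations


def scripted_policy(observations: List[str]) -> str:
    """Stub standing in for an LLM: makes a typo, then corrects it."""
    script = ["total = sum(range(4))", "print(totl)", "print(total)"]
    return script[len(observations)] if len(observations) < len(script) else "DONE"


log = run_repl(scripted_policy)
```

Because the namespace persists across turns, variables defined in earlier snippets remain available, and an exception does not end the session. Executing model-written code with `exec` is unsafe outside a sandbox.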
pdf
bib
abs
Synergistic Weak-Strong Collaboration by Aligning Preferences
Yizhu Jiao
|
Xuchao Zhang
|
Zhaoyang Wang
|
Yubo Ma
|
Zhun Deng
|
Rujia Wang
|
Chetan Bansal
|
Saravan Rajmohan
|
Jiawei Han
|
Huaxiu Yao
Current Large Language Models excel in general reasoning yet struggle with specialized tasks requiring proprietary or domain-specific knowledge. Fine-tuning large models for every niche application is often infeasible due to black-box constraints and high computational overhead. To address this, we propose a collaborative framework that pairs a specialized weak model with a general strong model. The weak model, tailored to specific domains, produces initial drafts and background information, while the strong model leverages its advanced reasoning to refine these drafts, extending LLMs’ capabilities to critical yet specialized tasks. To optimize this collaboration, we introduce collaborative feedback to fine-tune the weak model, which quantifies the influence of the weak model’s contributions in the collaboration procedure and establishes preference pairs to guide preference tuning of the weak model. We validate our framework through experiments on three domains. We find that the collaboration significantly outperforms each model alone by leveraging complementary strengths. Moreover, aligning the weak model with the collaborative preference further enhances overall performance.
pdf
bib
abs
Understanding Silent Data Corruption in LLM Training
Jeffrey Jian Ma
|
Hengzhi Pei
|
Leonard Lausen
|
George Karypis
As the scale of training large language models (LLMs) increases, one emergent failure is silent data corruption (SDC), where hardware produces incorrect computations without explicit failure signals. In this work, we are the first to investigate the impact of real-world SDCs on LLM training by comparing model training between healthy production nodes and unhealthy nodes exhibiting SDCs. With help from a cloud computing platform, we access the unhealthy nodes that were swept out from production by automated fleet management. Using deterministic execution via the XLA compiler and our proposed synchronization mechanisms, we isolate and analyze the impact of SDC errors on these nodes at three levels: at each submodule computation, at a single optimizer step, and at a training period. Our results reveal that the impact of SDCs on computation varies across unhealthy nodes. Although in most cases the perturbations from SDCs on submodule computation and gradients are relatively small, SDCs can lead models to converge to different optima with different weights and even cause spikes in the training loss. Our analysis sheds light on further understanding and mitigating the impact of SDCs.
pdf
bib
abs
Align-SLM: Textless Spoken Language Models with Reinforcement Learning from AI Feedback
Guan-Ting Lin
|
Prashanth Gurunath Shivakumar
|
Aditya Gourav
|
Yile Gu
|
Ankur Gandhe
|
Hung-yi Lee
|
Ivan Bulyko
While textless Spoken Language Models (SLMs) have shown potential in end-to-end speech-to-speech modeling, they still lag behind text-based Large Language Models (LLMs) in terms of semantic coherence and relevance. This work introduces the Align-SLM framework, which leverages preference optimization inspired by Reinforcement Learning with Human Feedback (RLHF) to enhance the semantic understanding of SLMs. Our approach generates multiple speech continuations from a given prompt and uses LLM-based semantic metrics to create preference data for Direct Preference Optimization (DPO). We evaluate the framework using ZeroSpeech 2021 benchmarks for lexical and syntactic modeling, the spoken version of the StoryCloze dataset for semantic coherence, and other speech generation metrics, including the GPT4-o score and human evaluation. Experimental results show that our method achieves the state-of-the-art performance of SLMs for most benchmarks, highlighting the importance of preference optimization to improve the semantics of SLMs.
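The step that turns scored continuations into preference data for DPO can be sketched as below. The sketch assumes that each continuation already has a scalar semantic score (for example, from an LLM judging its transcript) and that a pair is kept only when the best and worst scores differ by a minimum gap; the function name, the gap threshold, and the toy data are illustrative assumptions.

```python
from typing import Dict, List, Optional


def build_preference_pair(
    prompt: str,
    continuations: List[str],
    scores: List[float],
    min_gap: float = 1.0,
) -> Optional[Dict[str, str]]:
    """Pick the highest- and lowest-scored continuations as chosen/rejected."""
    ranked = sorted(zip(scores, continuations), key=lambda pair: pair[0])
    (low_score, rejected), (high_score, chosen) = ranked[0], ranked[-1]
    # Skip prompts whose continuations are too similar in quality to be informative.
    if high_score - low_score < min_gap:
        return None
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}


# Hypothetical identifiers for one spoken prompt and three sampled continuations.
pair = build_preference_pair(
    "prompt_001",
    ["continuation_a", "continuation_b", "continuation_c"],
    [2.0, 4.5, 1.0],
)
```

The resulting dictionaries follow the prompt/chosen/rejected layout commonly used for DPO training data.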
pdf
bib
abs
Can LLMs Help Uncover Insights about LLMs? A Large-Scale, Evolving Literature Analysis of Frontier LLMs
Jungsoo Park
|
Junmo Kang
|
Gabriel Stanovsky
|
Alan Ritter
The surge of LLM studies makes synthesizing their findings challenging. Analysis of experimental results from literature can uncover important trends across studies, but the time-consuming nature of manual data extraction limits its use. Our study presents a semi-automated approach for literature analysis that accelerates data extraction using LLMs. It automatically identifies relevant arXiv papers, extracts experimental results and related attributes, and organizes them into a structured dataset, LLMEvalDB. We then conduct an automated literature analysis of frontier LLMs, reducing the effort of paper surveying and data extraction by more than 93% compared to manual approaches. We validate LLMEvalDB by showing that it reproduces key findings from a recent manual analysis of Chain-of-Thought (CoT) reasoning and also uncovers new insights that go beyond it, showing, for example, that in-context examples benefit coding & multimodal tasks but offer limited gains in math reasoning tasks compared to zero-shot CoT. Our automatically updatable dataset enables continuous tracking of target models by extracting evaluation studies as new data becomes available. Through LLMEvalDB and empirical analysis, we provide insights into LLMs while facilitating ongoing literature analyses of their behavior.
pdf
bib
abs
BIG5-CHAT: Shaping LLM Personalities Through Training on Human-Grounded Data
Wenkai Li
|
Jiarui Liu
|
Andy Liu
|
Xuhui Zhou
|
Mona T. Diab
|
Maarten Sap
In this work, we tackle the challenge of embedding realistic human personality traits into LLMs. Previous approaches have primarily focused on prompt-based methods that describe the behavior associated with the desired personality traits, suffering from realism and validity issues. To address these limitations, we introduce BIG5-CHAT, a large-scale dataset containing 100,000 dialogues designed to ground models in how humans express their personality in text. Leveraging this dataset, we explore Supervised Fine-Tuning and Direct Preference Optimization as training-based methods to align LLMs more naturally with human personality patterns. Our methods outperform prompting on personality assessments such as BFI and IPIP-NEO, with trait correlations more closely matching human data. Furthermore, our experiments reveal that models trained to exhibit higher conscientiousness, higher agreeableness, lower extraversion, and lower neuroticism display better performance on reasoning tasks, aligning with psychological findings on how these traits impact human cognitive performance. To our knowledge, this work is the first comprehensive study to demonstrate how training-based methods can shape LLM personalities through learning from real human behaviors.
pdf
bib
abs
Deep Temporal Reasoning in Video Language Models: A Cross-Linguistic Evaluation of Action Duration and Completion through Perfect Times
Olga Loginova
|
Sofía Ortega Loguinova
Human perception of events is intrinsically tied to distinguishing between completed (perfect and telic) and ongoing (durative) actions, a process mediated by both linguistic structure and visual cues. In this work, we introduce the Perfect Times dataset, a novel, quadrilingual (English, Italian, Russian, and Japanese) multiple-choice question-answering benchmark designed to assess video-language models (VLMs) on temporal reasoning. By pairing everyday activity videos with event completion labels and perfectivity-tailored distractors, our dataset probes whether models truly comprehend temporal dynamics or merely latch onto superficial markers. Experimental results indicate that state-of-the-art models, despite their success on text-based tasks, struggle to mirror human-like temporal and causal reasoning grounded in video. This study underscores the necessity of integrating deep multimodal cues to capture the nuances of action duration and completion within temporal and causal video dynamics, setting a new standard for evaluating and advancing temporal reasoning in VLMs.
pdf
bib
abs
Amplifying Trans and Nonbinary Voices: A Community-Centred Harm Taxonomy for LLMs
Eddie L. Ungless
|
Sunipa Dev
|
Cynthia L. Bennett
|
Rebecca Gulotta
|
Jasmijn Bastings
|
Remi Denton
We explore large language model (LLM) responses that may negatively impact the transgender and nonbinary (TGNB) community and introduce the Transing Transformers Toolkit, T3, which provides resources for identifying such harmful response behaviors. The heart of T3 is a community-centred taxonomy of harms, developed in collaboration with the TGNB community, which we complement with, amongst other guidance, suggested heuristics for evaluation. To develop the taxonomy, we adopted a multi-method approach that included surveys and focus groups with community experts. The contribution highlights the importance of community-centred approaches in mitigating harm, and outlines pathways for LLM developers to improve how their models handle TGNB-related topics.
pdf
bib
abs
Enhancing Human Evaluation in Machine Translation with Comparative Judgement
Yixiao Song
|
Parker Riley
|
Daniel Deutsch
|
Markus Freitag
Human evaluation is crucial for assessing rapidly evolving language models but is influenced by annotator proficiency and task design. This study explores the integration of comparative judgment into human annotation for machine translation (MT) and evaluates three annotation setups: point-wise Multidimensional Quality Metrics (MQM), side-by-side (S×S) MQM, and its simplified version S×S relative ranking (RR). In MQM, annotators mark error spans with categories and severity levels. S×S MQM extends MQM to pairwise error annotation for two translations of the same input, while S×S RR focuses on selecting the better output without labeling errors. Key findings are: (1) the S×S settings achieve higher inter-annotator agreement than MQM; (2) S×S MQM enhances inter-translation error marking consistency compared to MQM by, on average, 38.5% for explicitly compared MT systems and 19.5% for others; (3) all annotation settings return stable system rankings, with S×S RR offering a more efficient alternative to (S×S) MQM; (4) the S×S settings highlight subtle errors overlooked in MQM without altering absolute system evaluations. To spur further research, we will release the triply annotated datasets comprising 377 ZhEn and 104 EnDe annotation examples, each covering 10 systems.
pdf
bib
abs
Infogen: Generating Complex Statistical Infographics from Documents
Akash Ghosh
|
Aparna Garimella
|
Pritika Ramu
|
Sambaran Bandyopadhyay
|
Sriparna Saha
Statistical infographics are powerful tools that simplify complex data into visually engaging and easy-to-understand formats. Despite advancements in AI, particularly with LLMs, existing efforts have been limited to generating simple charts, with no prior work addressing the creation of complex infographics from text-heavy documents that demand a deep understanding of the content. We address this gap by introducing the task of generating statistical infographics composed of multiple sub-charts (e.g., line, bar, pie) that are contextually accurate, insightful, and visually aligned. To achieve this, we define infographic metadata, that includes its title and textual insights, along with sub-chart-specific details such as their corresponding data, alignment, etc. We also present Infodat, the first benchmark dataset for text-to-infographic metadata generation, where each sample links a document to its metadata. We propose Infogen, a two-stage framework where fine-tuned LLMs first generate metadata, which is then converted into infographic code. Extensive evaluations on Infodat demonstrate that Infogen achieves state-of-the-art performance, outperforming both closed and open-source LLMs in text-to-statistical infographic generation.
pdf
bib
abs
Partial Colexifications Improve Concept Embeddings
Arne Rubehn
|
Johann-Mattis List
While the embedding of words has revolutionized the field of Natural Language Processing, the embedding of concepts has received much less attention so far. A dense and meaningful representation of concepts, however, could prove useful for several tasks in computational linguistics, especially those involving cross-linguistic data or sparse data from low-resource languages. The first methods proposed so far embed concepts from automatically constructed colexification networks. While these approaches depart from automatically inferred polysemies, attested across a larger number of languages, they are restricted to the word level, ignoring lexical relations that would only hold for parts of the words in a given language. Building on recently introduced methods for the inference of partial colexifications, we show how they can be used to improve concept embeddings in meaningful ways. The learned embeddings are evaluated against lexical similarity ratings, recorded instances of semantic shift, and word association data. We show that in all evaluation tasks, the inclusion of partial colexifications leads to improved concept representations and better results. Our results further show that the learned embeddings are able to capture and represent different semantic relationships between concepts.
pdf
bib
abs
Improved Unbiased Watermark for Large Language Models
Ruibo Chen
|
Yihan Wu
|
Junfeng Guo
|
Heng Huang
As artificial intelligence surpasses human capabilities in text generation, the necessity to authenticate the origins of AI-generated content has become paramount. Unbiased watermarks offer a powerful solution by embedding statistical signals into language model-generated text without distorting the quality. In this paper, we introduce MCmark, a family of unbiased, Multi-Channel-based watermarks. MCmark works by partitioning the model’s vocabulary into segments and promoting token probabilities within a selected segment based on a watermark key. We demonstrate that MCmark not only preserves the original distribution of the language model but also offers significant improvements in detectability and robustness over existing unbiased watermarks. Our experiments with widely-used language models demonstrate an improvement in detectability of over 10% using MCmark, compared to existing state-of-the-art unbiased watermarks. This advancement underscores MCmark’s potential in enhancing the practical application of watermarking in AI-generated texts.
pdf
bib
abs
MaCP: Minimal yet Mighty Adaptation via Hierarchical Cosine Projection
Yixian Shen
|
Qi Bi
|
Jia-hong Huang
|
Hongyi Zhu
|
Andy D. Pimentel
|
Anuj Pathania
We present a new adaptation method MaCP, Minimal yet Mighty adaptive Cosine Projection, that achieves exceptional performance while requiring minimal parameters and memory for fine-tuning large foundation models. Its general idea is to exploit the superior energy compaction and decorrelation properties of cosine projection to improve both model efficiency and accuracy. Specifically, it projects the weight change from the low-rank adaptation into the discrete cosine space. Then, the weight change is partitioned over different levels of the discrete cosine spectrum, and each partition’s most critical frequency components are selected. Extensive experiments demonstrate the effectiveness of MaCP across a wide range of single-modality tasks, including natural language understanding, natural language generation, text summarization, as well as multi-modality tasks such as image classification and video understanding. MaCP consistently delivers superior accuracy, significantly reduced computational complexity, and lower memory requirements compared to existing alternatives.
pdf
bib
abs
Multi-Attribute Steering of Language Models via Targeted Intervention
Duy Nguyen
|
Archiki Prasad
|
Elias Stengel-Eskin
|
Mohit Bansal
Inference-time intervention (ITI) has emerged as a promising method for steering large language model (LLM) behavior in a particular direction (e.g., improving helpfulness) by intervening on token representations without costly updates to the LLM’s parameters. However, existing ITI approaches fail to scale to multi-attribute settings with conflicts, such as enhancing helpfulness while also reducing toxicity. To address this, we introduce Multi-Attribute Targeted Steering (MAT-Steer), a novel steering framework designed for selective token-level intervention across multiple attributes. We achieve this by learning steering vectors using an alignment objective that shifts the model’s internal representations of undesirable outputs closer to those of desirable ones while enforcing sparsity and orthogonality among vectors for different attributes, thereby reducing inter-attribute conflicts. We evaluate MAT-Steer in two distinct settings: (i) on question answering (QA) tasks where we balance attributes like truthfulness, bias, and toxicity; (ii) on generative tasks where we simultaneously improve attributes like helpfulness, correctness, and coherence. MAT-Steer outperforms existing ITI and parameter-efficient fine-tuning approaches across both task types (e.g., average 3% accuracy gain across QA tasks and 55.82% win rate against the best ITI baseline).
pdf
bib
abs
AdaptAgent: Adapting Multimodal Web Agents with Few-Shot Learning from Human Demonstrations
Gaurav Verma
|
Rachneet Kaur
|
Nishan Srishankar
|
Zhen Zeng
|
Tucker Balch
|
Manuela Veloso
State-of-the-art multimodal web agents, powered by Multimodal Large Language Models (MLLMs), can autonomously execute many web tasks by processing user instructions and interacting with graphical user interfaces (GUIs). Current strategies for building web agents rely on (i) the generalizability of underlying MLLMs and their steerability via prompting, and (ii) large-scale fine-tuning of MLLMs on web-related tasks. However, web agents still struggle to automate tasks on unseen websites and domains, limiting their applicability to enterprise-specific and proprietary platforms. Beyond generalization from large-scale pre-training and fine-tuning, we propose building agents for few-shot adaptability using human demonstrations. We introduce the AdaptAgent framework that enables both proprietary and open-weights multimodal web agents to adapt to new websites and domains using few human demonstrations (up to 2). Our experiments on two popular benchmarks — Mind2Web & VisualWebArena — show that using in-context demonstrations (for proprietary models) or meta-adaptation demonstrations (for meta-learned open-weights models) boosts task success rate by 3.36% to 7.21% over non-adapted state-of-the-art models, corresponding to a relative increase of 21.03% to 65.75%. Furthermore, our additional analyses (a) show the effectiveness of multimodal demonstrations over text-only ones, (b) illuminate how different meta-learning data selection strategies influence the agent’s generalization, and (c) demonstrate how the number of few-shot examples affects the web agent’s success rate. Our results offer a complementary axis for developing widely applicable multimodal web agents beyond large-scale pre-training and fine-tuning, emphasizing few-shot adaptability.
pdf
bib
abs
Can LLMs Identify Critical Limitations within Scientific Research? A Systematic Evaluation on AI Research Papers
Zhijian Xu
|
Yilun Zhao
|
Manasi Patwardhan
|
Lovekesh Vig
|
Arman Cohan
Peer review is fundamental to scientific research, but the growing volume of publications has intensified the challenges of this expertise-intensive process. While LLMs show promise in various scientific tasks, their potential to assist with peer review, particularly in identifying paper limitations, remains understudied. We first present a comprehensive taxonomy of limitation types in scientific research, with a focus on AI. Guided by this taxonomy, we present LimitGen, the first comprehensive benchmark for evaluating LLMs’ capability to support early-stage feedback and complement human peer review. Our benchmark consists of two subsets: LimitGen-Syn, a synthetic dataset carefully created through controlled perturbations of high-quality papers, and LimitGen-Human, a collection of real human-written limitations. To improve the ability of LLM systems to identify limitations, we augment them with literature retrieval, which is essential for grounding the identified limitations in prior scientific findings. Our approach enhances the capabilities of LLM systems to generate limitations in research papers, enabling them to provide more concrete and constructive feedback.
pdf
bib
abs
On the Acquisition of Shared Grammatical Representations in Bilingual Language Models
Catherine Arnett
|
Tyler A. Chang
|
James A. Michaelov
|
Ben Bergen
Crosslingual transfer is crucial to contemporary language models’ multilingual capabilities, but how it occurs is not well understood. We ask what happens to a monolingual language model when it begins to be trained on a second language. Specifically, we train small bilingual models for which we control the amount of data for each language and the order of language exposure. To find evidence of shared multilingual representations, we turn to structural priming, a method used to study grammatical representations in humans. We first replicate previous crosslingual structural priming results and find that after controlling for training data quantity and language exposure, there are asymmetrical effects across language pairs and directions. We argue that this asymmetry may shape hypotheses about human structural priming effects. We also find that structural priming effects are less robust for less similar language pairs, highlighting potential limitations of crosslingual transfer learning and shared representations for typologically diverse languages.
pdf
bib
abs
Using Shapley interactions to understand how models use structure
Divyansh Singhvi
|
Diganta Misra
|
Andrej Erkelens
|
Raghav Jain
|
Isabel Papadimitriou
|
Naomi Saphra
Language is an intricately structured system, and a key goal of NLP interpretability is to provide methodological insights for understanding how language models internally represent this structure. In this paper, we use Shapley Taylor interaction indices (STII) to examine how language and speech models internally relate and structure their inputs. Pairwise Shapley interactions give us an attribution measure of how much two inputs work together to influence model outputs beyond what linearly adding their independent influences would predict, providing a view into how models encode structural interactions between inputs. We relate the interaction patterns in models to three underlying linguistic structures: syntactic structure, non-compositional semantics, and phonetic interaction. We find that autoregressive text models encode interactions that correlate with the syntactic proximity of inputs, and that both autoregressive and masked models encode nonlinear interactions in idiomatic phrases with non-compositional semantics. Our speech results show that inputs are more entangled for pairs where a neighboring consonant is likely to influence a vowel or approximant, showing that models encode the phonetic interaction needed for extracting discrete phonemic representations.
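To make the notion of a pairwise interaction concrete, the following is a minimal sketch on a toy set function, not the authors' implementation. The helper name `pairwise_interaction` and the toy value functions are hypothetical, and the averaging weights follow the order-2 Shapley-Taylor definition as it is commonly stated, which should be treated as an assumption here. The core quantity is the discrete second difference f(S ∪ {i, j}) − f(S ∪ {i}) − f(S ∪ {j}) + f(S), which is zero whenever the two inputs contribute purely additively.

```python
from itertools import combinations
from math import comb


def pairwise_interaction(value_fn, n, i, j):
    """Order-2 Shapley-Taylor style interaction between inputs i and j.

    value_fn maps a frozenset of "present" input indices to a number.
    The weighting (2/n) * sum_T delta_ij(T) / C(n-1, |T|) is an assumption
    based on the commonly stated order-2 Shapley-Taylor definition.
    """
    others = [k for k in range(n) if k not in (i, j)]
    total = 0.0
    for size in range(len(others) + 1):
        for subset in combinations(others, size):
            s = frozenset(subset)
            # Discrete second difference: joint effect minus the two solo effects.
            delta = (
                value_fn(s | {i, j})
                - value_fn(s | {i})
                - value_fn(s | {j})
                + value_fn(s)
            )
            total += delta / comb(n - 1, size)
    return 2.0 / n * total


# Hypothetical toy "models" over three inputs.
WEIGHTS = [1.0, 2.0, 3.0]


def additive(present):
    """Each input contributes independently."""
    return sum(WEIGHTS[k] for k in present)


def with_synergy(present):
    """Inputs 0 and 1 add an extra 2.0 only when both are present."""
    bonus = 2.0 if {0, 1} <= present else 0.0
    return additive(present) + bonus


additive_01 = pairwise_interaction(additive, 3, 0, 1)
synergy_01 = pairwise_interaction(with_synergy, 3, 0, 1)
synergy_02 = pairwise_interaction(with_synergy, 3, 0, 2)
```

In this toy setting the additive function yields no interaction for any pair, while the synergy term shows up only for the pair that carries it. Applying the same idea to a real model requires a masking or ablation scheme for "absent" inputs, which is outside this sketch.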
pdf
bib
abs
Adversarial Tokenization
Renato Geh
|
Zilei Shao
|
Guy Van Den Broeck
Current LLM pipelines account for only one possible tokenization for a given string, ignoring exponentially many alternative tokenizations during training and inference. For example, the Llama3 standard tokenization of penguin is [p,enguin], yet [peng,uin] is another perfectly valid alternative. In this paper, we show that despite LLMs being trained solely on one tokenization, they still retain semantic understanding of other tokenizations, raising questions about their implications in LLM safety. Put succinctly, we answer the following question: can we adversarially tokenize an obviously malicious string to evade safety and alignment restrictions? We show that not only is adversarial tokenization an effective yet previously neglected axis of attack, but it is also competitive against existing state-of-the-art adversarial approaches without changing the text of the harmful request. We empirically validate this exploit across three state-of-the-art LLMs and adversarial datasets, revealing a previously unknown vulnerability in subword models.
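The claim that a string admits many valid tokenizations can be illustrated with a small enumeration. This is a minimal sketch with a hypothetical toy vocabulary, not the Llama3 vocabulary and not the authors' attack procedure; the function name `all_tokenizations` is likewise an assumption made for illustration.

```python
from functools import lru_cache


def all_tokenizations(text, vocab):
    """Return every way to split `text` into tokens drawn from `vocab`."""
    vocab = frozenset(vocab)

    @lru_cache(maxsize=None)
    def split_from(start):
        # One way to tokenize the empty suffix: the empty sequence.
        if start == len(text):
            return ((),)
        results = []
        for end in range(start + 1, len(text) + 1):
            piece = text[start:end]
            if piece in vocab:
                for rest in split_from(end):
                    results.append((piece,) + rest)
        return tuple(results)

    return [list(tokens) for tokens in split_from(0)]


# Hypothetical toy vocabulary chosen so that "penguin" has several segmentations.
toy_vocab = {"p", "enguin", "peng", "uin", "pen", "guin"}
tokenizations = all_tokenizations("penguin", toy_vocab)
```

With this toy vocabulary the enumeration contains both [p, enguin] and [peng, uin] from the abstract, plus one further split. With a real subword vocabulary that includes single characters, the number of segmentations grows quickly with string length, so exhaustive enumeration is only practical for short strings.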
pdf
bib
abs
Classifying Unreliable Narrators with Large Language Models
Anneliese Brei
|
Katharine Henry
|
Abhisheik Sharma
|
Shashank Srivastava
|
Snigdha Chaturvedi
Often when we interact with a first-person account of events, we consider whether or not the narrator, the primary speaker of the text, is reliable. In this paper, we propose using computational methods to identify unreliable narrators, i.e. those who unintentionally misrepresent information. Borrowing literary theory from narratology to define different types of unreliable narrators based on a variety of textual phenomena, we present TUNa, a human-annotated dataset of narratives from multiple domains, including blog posts, subreddit posts, hotel reviews, and works of literature. We define classification tasks for intra-narrational, inter-narrational, and inter-textual unreliabilities and analyze the performance of popular open-weight and proprietary LLMs for each. We propose learning from literature to perform unreliable narrator classification on real-world text data. To this end, we experiment with few-shot, fine-tuning, and curriculum learning settings. Our results show that this task is very challenging, and there is potential for using LLMs to identify unreliable narrators. We release our expert-annotated dataset and code at https://github.com/adbrei/unreliable-narrators and invite future research in this area.
pdf
bib
abs
ConceptCarve: Dynamic Realization of Evidence
Eylon Caplan
|
Dan Goldwasser
Finding evidence for human opinion and behavior at scale is a challenging task, often requiring an understanding of sophisticated thought patterns among vast online communities found on social media. For example, studying how ‘gun ownership’ is related to the perception of ‘Freedom’ requires a retrieval system that can operate at scale over social media posts while dealing with two key challenges: (1) identifying instances of abstract concepts, and (2) handling the fact that these concepts can be instantiated differently across different communities. To address these, we introduce ConceptCarve, an evidence retrieval framework that utilizes traditional retrievers and LLMs to dynamically characterize the search space during retrieval. Our experiments show that ConceptCarve surpasses traditional retrieval systems in finding evidence within a social media community. It also produces an interpretable representation of the evidence for that community, which we use to qualitatively analyze complex thought patterns that manifest differently across the communities.
pdf
bib
abs
QQSUM: A Novel Task and Model of Quantitative Query-Focused Summarization for Review-based Product Question Answering
An Quang Tang
|
Xiuzhen Zhang
|
Minh Ngoc Dinh
|
Zhuang Li
Review-based Product Question Answering (PQA) allows e-commerce platforms to automatically address customer queries by leveraging insights from user reviews. However, existing PQA systems generate answers with only a single perspective, failing to capture the diversity of customer opinions. In this paper, we introduce a novel task, Quantitative Query-Focused Summarization (QQSUM), which aims to summarize diverse customer opinions into representative Key Points (KPs) and quantify their prevalence to effectively answer user queries. While Retrieval-Augmented Generation (RAG) shows promise for PQA, its generated answers still fall short of capturing the full diversity of viewpoints. To tackle this challenge, our model QQSUM-RAG, which extends RAG, employs few-shot learning to jointly train a KP-oriented retriever and a KP summary generator, enabling KP-based summaries that capture diverse and representative opinions. Experimental results demonstrate that QQSUM-RAG achieves superior performance compared to state-of-the-art RAG baselines in both textual quality and quantification accuracy of opinions. Our source code is available at: https://github.com/antangrocket1312/QQSUMM
pdf
bib
abs
Navigating Rifts in Human-LLM Grounding: Study and Benchmark
Omar Shaikh
|
Hussein Mozannar
|
Gagan Bansal
|
Adam Fourney
|
Eric Horvitz
Language models excel at following instructions but often struggle with the collaborative aspects of conversation that humans naturally employ. This limitation in grounding—the process by which conversation participants establish mutual understanding—can lead to outcomes ranging from frustrated users to serious consequences in high-stakes scenarios. To systematically study grounding challenges in human-LLM interactions, we analyze logs from three human-assistant datasets: WildChat, MultiWOZ, and Bing Chat. We develop a taxonomy of grounding acts and build models to annotate and forecast grounding behavior. Our findings reveal significant differences in human-human and human-LLM grounding: LLMs were three times less likely to initiate clarification and sixteen times less likely to provide follow-up requests than humans. Additionally, we find that early grounding failures predict later interaction breakdowns. Building on these insights, we introduce Rifts, a benchmark derived from publicly available LLM interaction data containing situations where LLMs fail to initiate grounding. We note that current frontier models perform poorly on Rifts, highlighting the need to reconsider how we train and prompt LLMs for human interaction. To this end, we develop a preliminary intervention aimed at mitigating grounding failures.
pdf
bib
abs
Substance over Style: Evaluating Proactive Conversational Coaching Agents
Vidya Srinivas
|
Xuhai Xu
|
Xin Liu
|
Kumar Ayush
|
Isaac Galatzer-Levy
|
Shwetak Patel
|
Daniel McDuff
|
Tim Althoff
While NLP research has made strides in conversational tasks, many approaches focus on single-turn responses with well-defined objectives or evaluation criteria. In contrast, coaching presents unique challenges: initially undefined goals that evolve through multi-turn interactions, subjective evaluation criteria, and mixed-initiative dialogue. In this work, we describe and implement five multi-turn coaching agents that exhibit distinct conversational styles, and evaluate them through a user study, collecting first-person feedback on 155 conversations. We find that users highly value core functionality, and that stylistic components in the absence of core components are viewed negatively. By comparing user feedback with third-person evaluations from health experts and an LM, we reveal significant misalignment across evaluation approaches. Our findings provide insights into the design and evaluation of conversational coaching agents and contribute toward improving human-centered NLP applications.
pdf
bib
abs
Open-World Planning via Lifted Regression with LLM-Inferred Affordances for Embodied Agents
Xiaotian Liu
|
Ali Pesaranghader
|
Hanze Li
|
Punyaphat Sukcharoenchaikul
|
Jaehong Kim
|
Tanmana Sadhu
|
Hyejeong Jeon
|
Scott Sanner
Open-world planning with incomplete knowledge is crucial for real-world embodied AI tasks. Despite that, existing LLM-based planners struggle with long chains of sequential reasoning, while symbolic planners face combinatorial explosion of states and actions for complex domains due to reliance on grounding. To address these deficiencies, we introduce LLM-Regress, an open-world planning approach integrating lifted regression with LLM-generated affordances. LLM-Regress generates sound and complete plans in a compact lifted form, avoiding exhaustive enumeration of irrelevant states and actions. Additionally, it makes efficient use of LLMs to infer goal-related objects and affordances without the need to predefine all possible objects and affordances. We conduct extensive experiments on three benchmarks and show that LLM-Regress significantly outperforms state-of-the-art LLM planners and a grounded planner using LLM-generated affordances.
pdf
bib
abs
(RSA)²: A Rhetorical-Strategy-Aware Rational Speech Act Framework for Figurative Language Understanding
Cesare Spinoso-Di Piano
|
David Eric Austin
|
Pablo Piantanida
|
Jackie CK Cheung
Figurative language (e.g., irony, hyperbole, understatement) is ubiquitous in human communication, resulting in utterances where the literal and the intended meanings do not match. The Rational Speech Act (RSA) framework, which explicitly models speaker intentions, is the most widespread theory of probabilistic pragmatics, but existing implementations are either unable to account for figurative expressions or require modeling the implicit motivations for using figurative language (e.g., to express joy or annoyance) in a setting-specific way. In this paper, we introduce the Rhetorical-Strategy-Aware RSA (RSA)² framework which models figurative language use by considering a speaker’s employed rhetorical strategy. We show that (RSA)² enables human-compatible interpretations of non-literal utterances without modeling a speaker’s motivations for being non-literal. Combined with LLMs, it achieves state-of-the-art performance on the ironic split of PragMega+, a new irony interpretation dataset introduced in this study.
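For readers unfamiliar with the baseline, the following is a minimal sketch of the vanilla Rational Speech Act recursion (literal listener, pragmatic speaker, pragmatic listener) on a standard scalar-implicature toy example. It does not implement the rhetorical-strategy-aware extension described in the abstract; the lexicon, the uniform prior, the rationality parameter `alpha`, the absence of utterance costs, and all function names are assumptions chosen for illustration.

```python
def normalize(scores):
    """Scale non-negative scores so they sum to 1 (all zeros stay zero)."""
    total = sum(scores.values())
    return {k: (v / total if total > 0 else 0.0) for k, v in scores.items()}


def literal_listener(lexicon, prior):
    """L0(m | u) is proportional to truth(u, m) * P(m)."""
    return {
        u: normalize({m: prior[m] * float(truth[m]) for m in prior})
        for u, truth in lexicon.items()
    }


def pragmatic_speaker(l0, meanings, alpha=1.0):
    """S1(u | m) is proportional to L0(m | u) ** alpha (no utterance cost)."""
    return {m: normalize({u: l0[u][m] ** alpha for u in l0}) for m in meanings}


def pragmatic_listener(s1, prior, utterances):
    """L1(m | u) is proportional to S1(u | m) * P(m)."""
    return {u: normalize({m: prior[m] * s1[m][u] for m in prior}) for u in utterances}


# Toy scalar-implicature setup (hypothetical): "some" is literally true of both
# meanings, "all" is true only when everything holds.
lexicon = {
    "some": {"some_not_all": True, "all": True},
    "all": {"some_not_all": False, "all": True},
}
prior = {"some_not_all": 0.5, "all": 0.5}

l0 = literal_listener(lexicon, prior)
s1 = pragmatic_speaker(l0, list(prior), alpha=1.0)
l1 = pragmatic_listener(s1, prior, list(lexicon))
```

Here the literal listener is indifferent between the two meanings on hearing "some", while the pragmatic listener shifts toward "some but not all" because a speaker who meant "all" would more likely have said so. Vanilla RSA of this kind only sharpens literal meanings, which is the limitation for figurative utterances that the abstract refers to.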
pdf
bib
abs
SYNTHIA: Novel Concept Design with Affordance Composition
Hyeonjeong Ha
|
Xiaomeng Jin
|
Jeonghwan Kim
|
Jiateng Liu
|
Zhenhailong Wang
|
Khanh Duy Nguyen
|
Ansel Blume
|
Nanyun Peng
|
Kai-Wei Chang
|
Heng Ji
Text-to-image (T2I) models enable rapid concept design, making them widely used in AI-driven design. While recent studies focus on generating semantic and stylistic variations of given design concepts, functional coherence, i.e., the integration of multiple affordances into a single coherent concept, remains largely overlooked. In this paper, we introduce SYNTHIA, a framework for generating novel, functionally coherent designs based on desired affordances. Our approach leverages a hierarchical concept ontology that decomposes concepts into parts and affordances, serving as a crucial building block for functionally coherent design. We also develop a curriculum learning scheme based on our ontology that contrastively fine-tunes T2I models to progressively learn affordance composition while maintaining visual novelty. To elaborate, we (i) gradually increase affordance distance, guiding models from basic concept-affordance association to complex affordance compositions that integrate parts of distinct affordances into a single, coherent form, and (ii) enforce visual novelty by employing contrastive objectives to push learned representations away from existing concepts. Experimental results show that SYNTHIA outperforms state-of-the-art T2I models, demonstrating absolute gains of 25.1% and 14.7% for novelty and functional coherence in human evaluation, respectively.
pdf
bib
abs
Consistent Client Simulation for Motivational Interviewing-based Counseling
Yizhe Yang
|
Palakorn Achananuparp
|
Heyan Huang
|
Jing Jiang
|
Nicholas Gabriel Lim
|
Cameron Tan Shi Ern
|
Phey Ling Kit
|
Jenny Giam Xiuhui
|
John Pinto
|
Ee-Peng Lim
Simulating human clients in mental health counseling is crucial for training and evaluating counselors (whether human or simulated) in a scalable manner. Nevertheless, past research on client simulation did not focus on complex conversation tasks such as mental health counseling. In these tasks, the challenge is to ensure that the client’s actions (i.e., interactions with the counselor) are consistent with its stipulated profiles and negative behavior settings. In this paper, we propose a novel framework that supports consistent client simulation for mental health counseling. Our framework tracks the mental state of a simulated client, controls its state transitions, and generates for each state behaviors consistent with the client’s motivation, beliefs, preferred plan to change, and receptivity. By varying the client profile and receptivity, we demonstrate that consistent simulated clients for different counseling scenarios can be effectively created. Both our automatic and expert evaluations on the generated counseling sessions also show that our client simulation method achieves higher consistency than previous methods.
pdf
bib
abs
AUTALIC: A Dataset for Anti-AUTistic Ableist Language In Context
Naba Rizvi
|
Harper Strickland
|
Daniel Gitelman
|
Alexis Morales Flores
|
Tristan Cooper
|
Aekta Kallepalli
|
Akshat Alurkar
|
Haaset Owens
|
Saleha Ahmedi
|
Isha Khirwadkar
|
Imani N. S. Munyaka
|
Nedjma Ousidhoum
As our awareness of autism and ableism continues to increase, so does our understanding of ableist language towards autistic people. Such language poses a significant challenge in NLP research due to its subtle and context-dependent nature. Yet, detecting anti-autistic ableist language remains underexplored, with existing NLP tools often failing to capture its nuanced expressions. We present AUTALIC, the first dataset dedicated to the detection of anti-autistic ableist language in context, addressing a significant gap in the field. AUTALIC comprises 2,400 autism-related sentences collected from Reddit, accompanied by surrounding context, and annotated by trained experts with backgrounds in neurodiversity. Our comprehensive evaluation reveals that current language models, including state-of-the-art LLMs, struggle to reliably identify anti-autistic ableism and diverge from human judgments, underscoring their limitations in this domain. We publicly release our dataset along with the individual annotations, providing an essential resource for developing more inclusive and context-aware NLP systems that better reflect diverse perspectives.
pdf
bib
abs
Structural Reasoning Improves Molecular Understanding of LLM
Yunhui Jang
|
Jaehyung Kim
|
Sungsoo Ahn
Recently, large language models (LLMs) have shown significant progress, approaching human perception levels. In this work, we demonstrate that despite these advances, LLMs still struggle to reason using molecular structural information. This gap is critical because many molecular properties, including functional groups, depend heavily on such structural details. To address this limitation, we propose an approach that sketches molecular structures for reasoning. Specifically, we introduce the Molecular Structural Reasoning (MSR) framework to enhance the understanding of LLMs by explicitly incorporating the key structural features. We present two frameworks for scenarios where the target molecule is known or unknown. We verify that our MSR improves molecular understanding through extensive experiments.
pdf
bib
abs
CAMI: A Counselor Agent Supporting Motivational Interviewing through State Inference and Topic Exploration
Yizhe Yang
|
Palakorn Achananuparp
|
Heyan Huang
|
Jing Jiang
|
Phey Ling Kit
|
Nicholas Gabriel Lim
|
Cameron Tan Shi Ern
|
Ee-Peng Lim
Conversational counselor agents have become essential tools for addressing the rising demand for scalable and accessible mental health support. This paper introduces CAMI, a novel automated counselor agent grounded in Motivational Interviewing (MI) – a client-centered counseling approach designed to address ambivalence and facilitate behavior change. CAMI employs a novel STAR framework, consisting of client’s state inference, motivation topic exploration, and response generation modules, leveraging large language models (LLMs). These components work together to evoke change talk, aligning with MI principles and improving counseling outcomes for diverse clients. We evaluate CAMI’s performance through both automated and expert evaluations, utilizing simulated clients to assess MI skill competency, client’s state inference accuracy, topic exploration proficiency, and overall counseling success. Results show that CAMI not only outperforms several state-of-the-art methods but also shows more realistic counselor-like behavior. Additionally, our ablation study underscores the critical roles of state inference and topic exploration in achieving this performance.
pdf
bib
abs
Know You First and Be You Better: Modeling Human-Like User Simulators via Implicit Profiles
Kuang Wang
|
Xianfei Li
|
Shenghao Yang
|
Li Zhou
|
Feng Jiang
|
Haizhou Li
User simulators are crucial for replicating human interactions with dialogue systems, supporting both collaborative training and automatic evaluation, especially for large language models (LLMs). However, current role-playing methods face challenges such as a lack of utterance-level authenticity and user-level diversity, often hindered by role confusion and dependence on predefined profiles of well-known figures. In contrast, direct simulation focuses solely on text, neglecting implicit user traits like personality and conversation-level consistency. To address these issues, we introduce the User Simulator with Implicit Profiles (USP), a framework that infers implicit user profiles from human-machine interactions to simulate personalized and realistic dialogues. We first develop an LLM-driven extractor with a comprehensive profile schema, then refine the simulation using conditional supervised fine-tuning and reinforcement learning with cycle consistency, optimizing at both the utterance and conversation levels. Finally, a diverse profile sampler captures the distribution of real-world user profiles. Experimental results show that USP outperforms strong baselines in terms of authenticity and diversity while maintaining comparable consistency. Additionally, using USP to evaluate LLMs in dynamic multi-turn settings aligns well with mainstream benchmarks, demonstrating its effectiveness in real-world applications.
pdf
bib
abs
Targeted Syntactic Evaluation for Grammatical Error Correction
Aomi Koyama
|
Masato Mita
|
Su-Youn Yoon
|
Yasufumi Takama
|
Mamoru Komachi
Language learners encounter a wide range of grammar items across the beginner, intermediate, and advanced levels. To develop grammatical error correction (GEC) models effectively, it is crucial to identify which grammar items are easier or more challenging for models to correct. However, conventional benchmarks based on learner-produced texts are insufficient for conducting detailed evaluations of GEC model performance across a wide range of grammar items due to biases in their distribution. To address this issue, we propose a new evaluation paradigm that assesses GEC models using minimal pairs of ungrammatical and grammatical sentences for each grammar item. As the first benchmark within this paradigm, we introduce the CEFR-based Targeted Syntactic Evaluation Dataset for Grammatical Error Correction (CTSEG), which complements existing English benchmarks by enabling fine-grained analyses previously unattainable with conventional datasets. Using CTSEG, we evaluate three mainstream types of English GEC models: sequence-to-sequence models, sequence tagging models, and prompt-based models. The results indicate that while current models perform well on beginner-level grammar items, their performance deteriorates substantially for intermediate and advanced items.
pdf
bib
abs
VF-Eval: Evaluating Multimodal LLMs for Generating Feedback on AIGC Videos
Tingyu Song
|
Tongyan Hu
|
Guo Gan
|
Yilun Zhao
Recently, multimodal large language models (MLLMs) have been extensively explored in video question answering. However, most existing assessments focus on natural videos, overlooking synthetic videos (e.g., AI-generated content). Meanwhile, some works in video generation rely on MLLMs to evaluate the quality of generated videos, but the capabilities of MLLMs on AIGC videos remain largely underexplored. To address this, we propose a new benchmark, VF-Eval, which introduces four tasks (coherence validation, error awareness, error type detection, and reasoning evaluation) to comprehensively evaluate the abilities of MLLMs on AIGC videos. We evaluate 13 frontier MLLMs on VF-Eval and find that even the best-performing model, GPT-4.1, struggles to achieve consistently good performance across all tasks. This highlights the challenging nature of our benchmark. Additionally, to investigate the practical applications of VF-Eval in improving video generation, we design a re-prompt pipeline, demonstrating that aligning MLLMs more closely with human feedback can benefit video generation.
pdf
bib
abs
Language Model Fine-Tuning on Scaled Survey Data for Predicting Distributions of Public Opinions
Joseph Suh
|
Erfan Jahanparast
|
Suhong Moon
|
Minwoo Kang
|
Serina Chang
Large language models (LLMs) present novel opportunities in public opinion research by predicting survey responses in advance during the early stages of survey design. Prior methods steer LLMs via descriptions of subpopulations as LLMs’ input prompt, yet such prompt engineering approaches have struggled to faithfully predict the distribution of survey responses from human subjects. In this work, we propose directly fine-tuning LLMs to predict response distributions by leveraging unique structural characteristics of survey data. To enable fine-tuning, we curate SubPOP, a significantly scaled dataset of 3,362 questions and 70K subpopulation-response pairs from well-established public opinion surveys. We show that fine-tuning on SubPOP greatly improves the match between LLM predictions and human responses across various subpopulations, reducing the LLM-human gap by up to 46% compared to baselines, and achieves strong generalization to unseen surveys and subpopulations. Our findings highlight the potential of survey-based fine-tuning to improve opinion prediction for diverse, real-world subpopulations and therefore enable more efficient survey designs.
pdf
bib
abs
TESS 2: A Large-Scale Generalist Diffusion Language Model
Jaesung Tae
|
Hamish Ivison
|
Sachin Kumar
|
Arman Cohan
We introduce TESS 2, a general instruction-following diffusion language model that outperforms contemporary instruction-tuned diffusion models, as well as matches and sometimes exceeds strong autoregressive (AR) models. We train TESS 2 by first adapting a strong AR model via continued pretraining with a diffusion loss and then performing further instruction tuning. We find that adaptation training as well as the choice of the base model is crucial for training good instruction-following diffusion models. We further propose reward guidance, a novel and modular inference-time guidance procedure to align model outputs without needing to train the underlying model. Finally, we show that TESS 2 further improves with increased inference-time compute, highlighting the utility of diffusion LMs in having fine-grained controllability over the amount of compute used at inference time.
pdf
bib
abs
KatFishNet: Detecting LLM-Generated Korean Text through Linguistic Feature Analysis
Shinwoo Park
|
Shubin Kim
|
Do-Kyung Kim
|
Yo-Sub Han
The rapid advancement of large language models (LLMs) increases the difficulty of distinguishing between human-written and LLM-generated text. Detecting LLM-generated text is crucial for upholding academic integrity, preventing plagiarism, protecting copyrights, and ensuring ethical research practices. Most prior studies on detecting LLM-generated text focus primarily on English text. However, languages with distinct morphological and syntactic characteristics require specialized detection approaches. Their unique structures and usage patterns hinder the direct application of methods primarily designed for English. Among such languages, we focus on Korean, which has relatively flexible spacing rules, a rich morphological system, and less frequent comma usage compared to English. We introduce KatFish, the first benchmark dataset for detecting LLM-generated Korean text. The dataset consists of text written by humans and generated by four LLMs across three genres. By examining spacing patterns, part-of-speech diversity, and comma usage, we illuminate the linguistic differences between human-written and LLM-generated Korean text. Building on these observations, we propose KatFishNet, a detection method specifically designed for the Korean language. KatFishNet achieves an average of 19.78% higher AUC-ROC compared to the best-performing existing detection method. Our code and data are available at https://github.com/Shinwoo-Park/katfishnet.
pdf
bib
abs
Uncovering the Impact of Chain-of-Thought Reasoning for Direct Preference Optimization: Lessons from Text-to-SQL
Hanbing Liu
|
Haoyang Li
|
Xiaokang Zhang
|
Ruotong Chen
|
Haiyong Xu
|
Tian Tian
|
Qi Qi
|
Jing Zhang
Direct Preference Optimization (DPO) has proven effective in complex reasoning tasks like math word problems and code generation. However, when applied to Text-to-SQL datasets, it often fails to improve performance and can even degrade it. Our investigation reveals the root cause: unlike math and code tasks, which naturally integrate Chain-of-Thought (CoT) reasoning with DPO, Text-to-SQL datasets typically include only final answers (gold SQL queries) without detailed CoT solutions. By augmenting Text-to-SQL datasets with synthetic CoT solutions, we achieve, for the first time, consistent and significant performance improvements using DPO. Our analysis shows that CoT reasoning is crucial for unlocking DPO’s potential, as it mitigates reward hacking, strengthens discriminative capabilities, and improves scalability. These findings offer valuable insights for building more robust Text-to-SQL models. To support further research, we publicly release the code and CoT-enhanced datasets: https://github.com/RUCKBReasoning/DPO_Text2SQL.
pdf
bib
abs
On Generalization across Measurement Systems: LLMs Entail More Test-Time Compute for Underrepresented Cultures
Minh Duc Bui
|
Kyung Eun Park
|
Goran Glavaš
|
Fabian David Schmidt
|
Katharina Von Der Wense
Measurement systems (e.g., currencies) differ across cultures, but the conversions between them are well defined so that humans can state facts using any measurement system of their choice. Being available to users from diverse cultural backgrounds, Large Language Models (LLMs) should also be able to provide accurate information irrespective of the measurement system at hand. Using newly compiled datasets we test if this is truly the case for seven open-source LLMs, addressing three key research questions: (RQ1) What is the default system used by LLMs for each type of measurement? (RQ2) Do LLMs’ answers and their accuracy vary across different measurement systems? (RQ3) Can LLMs mitigate potential challenges w.r.t. underrepresented systems via reasoning? Our findings show that LLMs default to the measurement system predominantly used in the data. Additionally, we observe considerable instability and variance in performance across different measurement systems. While this instability can in part be mitigated by employing reasoning methods such as chain-of-thought (CoT), this implies longer responses and thereby significantly increases test-time compute (and inference costs), marginalizing users from cultural backgrounds that use underrepresented measurement systems.
pdf
bib
abs
CORDIAL: Can Multimodal Large Language Models Effectively Understand Coherence Relationships?
Aashish Anantha Ramakrishnan
|
Aadarsh Anantha Ramakrishnan
|
Dongwon Lee
Multimodal Large Language Models (MLLMs) are renowned for their superior instruction-following and reasoning capabilities across diverse problem domains. However, existing benchmarks primarily focus on assessing factual and logical correctness in downstream tasks, with limited emphasis on evaluating MLLMs’ ability to interpret pragmatic cues and intermodal relationships. To address this gap, we assess the competency of MLLMs in performing Multimodal Discourse Analysis (MDA) using Coherence Relations. Our benchmark, CORDIAL, encompasses a broad spectrum of Coherence Relations across 3 different discourse domains at varying levels of granularity. Through our experiments on 10+ MLLMs employing different prompting strategies, we show that even top models like Gemini 1.5 Pro and GPT-4o fail to match the performance of simple classifier-based baselines. This study emphasizes the need to move beyond similarity-based metrics and adopt a discourse-driven framework for evaluating MLLMs, providing a more nuanced assessment of their capabilities. The benchmark and code are available at: https://aashish2000.github.io/CORDIAL/.
pdf
bib
abs
Veracity Bias and Beyond: Uncovering LLMs’ Hidden Beliefs in Problem-Solving Reasoning
Yue Zhou
|
Barbara Di Eugenio
Despite LLMs’ explicit alignment against demographic stereotypes, they have been shown to exhibit biases under various social contexts. In this work, we find that LLMs exhibit concerning biases in how they associate solution veracity with demographics. Through experiments across five human value-aligned LLMs on mathematics, coding, commonsense, and writing problems, we reveal two forms of such veracity biases: Attribution Bias, where models disproportionately attribute correct solutions to certain demographic groups, and Evaluation Bias, where models’ assessment of identical solutions varies based on perceived demographic authorship. Our results show pervasive biases: LLMs consistently attribute fewer correct solutions and more incorrect ones to African-American groups in math and coding, while Asian authorships are least preferred in writing evaluation. In additional studies, we show LLMs automatically assign racially stereotypical colors to demographic groups in visualization code, suggesting these biases are deeply embedded in models’ reasoning processes. Our findings indicate that demographic bias extends beyond surface-level stereotypes and social context provocations, raising concerns about LLMs’ deployment in educational and evaluation settings.
pdf
bib
abs
Optimal Transport-Based Token Weighting scheme for Enhanced Preference Optimization
Meng Li
|
Guangda Huzhang
|
Haibo Zhang
|
Xiting Wang
|
Anxiang Zeng
Direct Preference Optimization (DPO) has emerged as a promising framework for aligning Large Language Models (LLMs) with human preferences by directly optimizing the log-likelihood difference between chosen and rejected responses. However, existing methods assign equal importance to all tokens in the response, while humans focus on more meaningful parts. This leads to suboptimal preference optimization, as irrelevant or noisy tokens disproportionately influence DPO loss. To address this limitation, we propose Optimal Transport-based token weighting scheme for enhancing direct Preference Optimization (OTPO). By emphasizing semantically meaningful token pairs and de-emphasizing less relevant ones, our method introduces a context-aware token weighting scheme that yields a more contrastive reward difference estimate. This adaptive weighting enhances reward stability, improves interpretability, and ensures that preference optimization focuses on meaningful differences between responses. Extensive experiments have validated OTPO’s effectiveness in improving instruction-following ability across various settings.
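To make the idea of token weighting concrete, here is a minimal sketch of a DPO-style loss in which each token's log-likelihood ratio is multiplied by a weight. It is an illustration only, not the OTPO method: the weights are supplied by hand (the abstract derives them from optimal transport, which is not reproduced here), and the function names, the dictionary layout, and `beta=0.1` are assumptions made for this example. With uniform weights the loss reduces to the standard DPO objective.

```python
import math

def sequence_logratio(policy_logps, ref_logps, weights=None):
    """Weighted sum over tokens of log pi(token) - log pi_ref(token)."""
    if weights is None:
        # Uniform weights: every token counts equally (standard DPO).
        weights = [1.0] * len(policy_logps)
    return sum(w * (p - r) for w, p, r in zip(weights, policy_logps, ref_logps))

def dpo_loss(chosen, rejected, beta=0.1):
    """chosen / rejected: dicts with 'policy', 'ref' and optional 'weights' lists.

    The dict layout and beta value are hypothetical choices for this sketch.
    """
    margin = beta * (
        sequence_logratio(chosen["policy"], chosen["ref"], chosen.get("weights"))
        - sequence_logratio(rejected["policy"], rejected["ref"], rejected.get("weights"))
    )
    # -log(sigmoid(margin)) written as log(1 + exp(-margin)).
    return math.log1p(math.exp(-margin))

# Toy per-token log-probabilities (made-up numbers).
chosen = {"policy": [-1.0, -0.5, -2.0], "ref": [-1.2, -0.7, -1.0]}
rejected = {"policy": [-1.5, -1.0], "ref": [-1.0, -0.8]}

uniform_loss = dpo_loss(chosen, rejected)

# Down-weight the third chosen token, treating it as noisy (hand-picked weight).
weighted_chosen = dict(chosen, weights=[1.0, 1.0, 0.1])
weighted_loss = dpo_loss(weighted_chosen, rejected)

print(uniform_loss, weighted_loss)
```

In this toy case, down-weighting the one token with a large negative log-ratio widens the reward margin, so the weighted loss is lower than the uniform one.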
pdf
bib
abs
LLM Meets Scene Graph: Can Large Language Models Understand and Generate Scene Graphs? A Benchmark and Empirical Study
Dongil Yang
|
Minjin Kim
|
Sunghwan Kim
|
Beong-woo Kwak
|
Minjun Park
|
Jinseok Hong
|
Woontack Woo
|
Jinyoung Yeo
The remarkable reasoning and generalization capabilities of Large Language Models (LLMs) have paved the way for their expanding applications in embodied AI, robotics, and other real-world tasks. To effectively support these applications, grounding in spatial and temporal understanding in multimodal environments is essential. To this end, recent works have leveraged scene graphs, a structured representation that encodes entities, attributes, and their relationships in a scene. However, a comprehensive evaluation of LLMs’ ability to utilize scene graphs remains limited. In this work, we introduce Text-Scene Graph (TSG) Bench, a benchmark designed to systematically assess LLMs’ ability to (1) understand scene graphs and (2) generate them from textual narratives. With TSG Bench we evaluate 11 LLMs and reveal that, while models perform well on scene graph understanding, they struggle with scene graph generation, particularly for complex narratives. Our analysis indicates that these models fail to effectively decompose discrete scenes from a complex narrative, leading to a bottleneck when generating scene graphs. These findings underscore the need for improved methodologies in scene graph generation and provide valuable insights for future research. The demonstration of our benchmark is available at https://tsg-bench.netlify.app. Additionally, our code and evaluation data are publicly available at https://github.com/docworlds/tsg-bench.
pdf
bib
abs
Beyond Frameworks: Unpacking Collaboration Strategies in Multi-Agent Systems
Haochun Wang
|
Sendong Zhao
|
Jingbo Wang
|
Zewen Qiang
|
Bing Qin
|
Ting Liu
Multi-agent collaboration has emerged as a pivotal paradigm for addressing complex, distributed tasks in large language model (LLM)-driven applications. While prior research has focused on high-level architectural frameworks, the granular mechanisms governing agents—critical to performance and scalability—remain underexplored. This study systematically investigates four dimensions of collaboration strategies: (1) agent governance, (2) participation control, (3) interaction dynamics, and (4) dialogue history management. Through rigorous experimentation under two context-dependent scenarios—Distributed Evidence Integration (DEI) and Structured Evidence Synthesis (SES)—we quantify the impact of these strategies on both task accuracy and computational efficiency. Our findings reveal that centralized governance, instructor-led participation, ordered interaction patterns, and instructor-curated context summarization collectively optimize the trade-off between decision quality and resource utilization with the support of the proposed Token-Accuracy Ratio (TAR). This work establishes a foundation for designing adaptive, scalable multi-agent systems, shifting the focus from structural novelty to strategic interaction mechanics.
pdf
bib
abs
The Invisible Hand: Unveiling Provider Bias in Large Language Models for Code Generation
Xiaoyu Zhang
|
Juan Zhai
|
Shiqing Ma
|
Qingshuang Bao
|
Weipeng Jiang
|
Qian Wang
|
Chao Shen
|
Yang Liu
Large Language Models (LLMs) have emerged as the new recommendation engines, surpassing traditional methods in both capability and scope, particularly in code generation. In this paper, we reveal a novel provider bias in LLMs: without explicit directives, these models show systematic preferences for services from specific providers in their recommendations (e.g., favoring Google Cloud over Microsoft Azure). To systematically investigate this bias, we develop an automated pipeline to construct the dataset, incorporating 6 distinct coding task categories and 30 real-world application scenarios. Leveraging this dataset, we conduct the first comprehensive empirical study of provider bias in LLM code generation across seven state-of-the-art LLMs, utilizing approximately 500 million tokens (equivalent to $5,000+ in computational costs). Our findings reveal that LLMs exhibit significant provider preferences, predominantly favoring services from Google and Amazon, and can autonomously modify input code to incorporate their preferred providers without users’ requests. Such a bias holds far-reaching implications for market dynamics and societal equilibrium, potentially contributing to digital monopolies. It may also deceive users and violate their expectations, leading to various consequences. We call on the academic community to recognize this emerging issue and develop effective evaluation and mitigation methods to uphold AI security and fairness.
pdf
bib
abs
K/DA: Automated Data Generation Pipeline for Detoxifying Implicitly Offensive Language in Korean
Minkyeong Jeon
|
Hyemin Jeong
|
Yerang Kim
|
Jiyoung Kim
|
Jae Hyeon Cho
|
Byung-Jun Lee
Language detoxification involves removing toxicity from offensive language. While a neutral-toxic paired dataset provides a straightforward approach for training detoxification models, creating such datasets presents several challenges: i) the need for human annotation to build paired data, and ii) the rapid evolution of offensive terms, rendering static datasets quickly outdated. To tackle these challenges, we introduce an automated paired data generation pipeline, called K/DA. This pipeline is designed to generate offensive language with implicit offensiveness and trend-aligned slang, making the resulting dataset suitable for detoxification model training. We demonstrate that the dataset generated by K/DA exhibits high pair consistency and greater implicit offensiveness compared to existing Korean datasets, and also demonstrates applicability to other languages. Furthermore, it enables effective training of a high-performing detoxification model with simple instruction fine-tuning.
pdf
bib
abs
THOR-MoE: Hierarchical Task-Guided and Context-Responsive Routing for Neural Machine Translation
Yunlong Liang
|
Fandong Meng
|
Jie Zhou
The sparse Mixture-of-Experts (MoE) has achieved significant progress for neural machine translation (NMT). However, there exist two limitations in current MoE solutions which may lead to sub-optimal performance: 1) they directly inject the task knowledge of NMT into the MoE (e.g., domain/linguistics-specific knowledge), which is generally unavailable in practical applications, and they neglect the naturally grouped domain/linguistic properties; 2) the expert selection depends only on the localized token representation without considering the context, which fully grasps the state of each token in a global view. To address the above limitations, we propose THOR-MoE, which arms the MoE with hierarchical task-guided and context-responsive routing policies. Specifically, it 1) first predicts the domain/language label and then extracts a mixed domain/language representation to allocate task-level experts in a hierarchical manner; 2) injects the context information to enhance the token routing from the pre-selected task-level expert set, which can help each token to be accurately routed to more specialized and suitable experts. Extensive experiments on multi-domain translation and multilingual translation benchmarks with different architectures consistently demonstrate the superior performance of THOR-MoE. Additionally, THOR-MoE operates as a plug-and-play module compatible with existing Top-(CITATION) or Top-(CITATION) routing schemes, ensuring broad applicability across diverse MoE architectures. For instance, compared with vanilla Top-(CITATION) routing, the context-aware manner can achieve an average improvement of 0.75 BLEU with less than 22% activated parameters on multi-domain translation tasks.
pdf
bib
abs
Neuron Empirical Gradient: Discovering and Quantifying Neurons’ Global Linear Controllability
Xin Zhao
|
Zehui Jiang
|
Naoki Yoshinaga
While feed-forward neurons in pre-trained language models (PLMs) can encode knowledge, past research targeted a small subset of neurons that heavily influence outputs. This leaves the broader role of neuron activations unclear, limiting progress in areas like knowledge editing. We uncover a global linear relationship between neuron activations and outputs using neuron interventions on a knowledge probing dataset. The gradient of this linear relationship, which we call the neuron empirical gradient (NEG), captures how changes in activations affect predictions. To compute NEG efficiently, we propose NeurGrad, enabling large-scale analysis of neuron behavior in PLMs. We also show that NEG effectively captures language skills across diverse prompts through skill neuron probing. Experiments on MCEval8k, a multi-genre multiple-choice knowledge benchmark, support NEG’s ability to represent model knowledge. Further analysis highlights the key properties of NEG-based skill representation: efficiency, robustness, flexibility, and interdependency. Code and data are released.
pdf
bib
abs
Can Third Parties Read Our Emotions?
Jiayi Li
|
Yingfan Zhou
|
Pranav Narayanan Venkit
|
Halima Binte Islam
|
Sneha Arya
|
Shomir Wilson
|
Sarah Rajtmajer
Natural Language Processing tasks that aim to infer an author’s private states, e.g., emotions and opinions, from their written text, typically rely on datasets annotated by third-party annotators. However, the assumption that third-party annotators can accurately capture authors’ private states remains largely unexamined. In this study, we present human subjects experiments on emotion recognition tasks that directly compare third-party annotations with first-party (author-provided) emotion labels. Our findings reveal significant limitations in third-party annotations, whether provided by human annotators or large language models (LLMs), in faithfully representing authors’ private states. However, LLMs outperform human annotators nearly across the board. We further explore methods to improve third-party annotation quality. We find that demographic similarity between first-party authors and third-party human annotators enhances annotation performance, while incorporating first-party demographic information into prompts leads to a marginal but statistically significant improvement in LLMs’ performance. We introduce a framework for evaluating the limitations of third-party annotations and call for refined annotation practices to accurately represent and model authors’ private states.
pdf
bib
abs
OZSpeech: One-step Zero-shot Speech Synthesis with Learned-Prior-Conditioned Flow Matching
Nghia Huynh Nguyen Hieu
|
Ngoc Son Nguyen
|
Huynh Nguyen Dang
|
Thieu Vo
|
Truong-Son Hy
|
Van Nguyen
Text-to-speech (TTS) systems have seen significant advancements in recent years, driven by improvements in deep learning and neural network architectures. Viewing the output speech as a data distribution, previous approaches often employ traditional speech representations, such as waveforms or spectrograms, within the Flow Matching framework. However, these methods have limitations, including overlooking various speech attributes and incurring high computational costs due to additional constraints introduced during training. To address these challenges, we introduce OZSpeech, the first TTS method to explore optimal transport conditional flow matching with one-step sampling and a learned prior as the condition, effectively disregarding preceding states and reducing the number of sampling steps. Our approach operates on disentangled, factorized components of speech in token format, enabling accurate modeling of each speech attribute, which enhances the TTS system’s ability to precisely clone the prompt speech. Experimental results show that our method achieves promising performance over existing methods in content accuracy, naturalness, prosody generation, and speaker style preservation. Audio samples are available at our demo page https://ozspeech.github.io/OZSpeech_Web/.
pdf
bib
abs
World Modeling Makes a Better Planner: Dual Preference Optimization for Embodied Task Planning
Siyin Wang
|
Zhaoye Fei
|
Qinyuan Cheng
|
Shiduo Zhang
|
Panpan Cai
|
Jinlan Fu
|
Xipeng Qiu
Recent advances in large vision-language models (LVLMs) have shown promise for embodied task planning, yet they struggle with fundamental challenges like dependency constraints and efficiency. Existing approaches either solely optimize action selection or directly leverage pre-trained models as world models during inference, overlooking the benefits of learning to model the world as a way to enhance planning capabilities. We propose Dual Preference Optimization (D2PO), a new learning framework that jointly optimizes state prediction and action selection through preference learning, enabling LVLMs to understand environment dynamics for better planning. To automatically collect trajectories and stepwise preference data without human annotation, we introduce a tree search mechanism for extensive exploration via trial-and-error. Extensive experiments on VoTa-Bench demonstrate that our D2PO-based method significantly outperforms existing methods and GPT-4o when applied to Qwen2-VL (7B), LLaVA-1.6 (7B), and LLaMA-3.2 (11B), achieving superior task success rates with more efficient execution paths.
pdf
bib
abs
JailbreakRadar: Comprehensive Assessment of Jailbreak Attacks Against LLMs
Junjie Chu
|
Yugeng Liu
|
Ziqing Yang
|
Xinyue Shen
|
Michael Backes
|
Yang Zhang
Jailbreak attacks aim to bypass the LLMs’ safeguards. While researchers have proposed different jailbreak attacks in depth, they have done so in isolation, either with unaligned settings or comparing a limited range of methods. To fill this gap, we present a large-scale evaluation of various jailbreak attacks. We collect 17 representative jailbreak attacks, summarize their features, and establish a novel jailbreak attack taxonomy. Then we conduct comprehensive measurement and ablation studies across nine aligned LLMs on 160 forbidden questions from 16 violation categories. Also, we test jailbreak attacks under eight advanced defenses. Based on our taxonomy and experiments, we identify some important patterns; for example, heuristic-based attacks can achieve high attack success rates but are easily mitigated by defenses. Our study offers valuable insights for future research on jailbreak attacks and defenses and serves as a benchmark tool for researchers and practitioners to evaluate them effectively.
pdf
bib
abs
CogniBench: A Legal-inspired Framework and Dataset for Assessing Cognitive Faithfulness of Large Language Models
Xiaqiang Tang
|
Jian Li
|
Keyu Hu
|
Nan Du
|
Xiaolong Li
|
Xi Zhang
|
Weigao Sun
|
Sihong Xie
Faithfulness hallucinations are claims generated by a Large Language Model (LLM) not supported by contexts provided to the LLM. Lacking assessment standards, existing benchmarks focus on “factual statements” that rephrase source materials while overlooking “cognitive statements” that involve making inferences from the given context. Consequently, evaluating and detecting the hallucination of cognitive statements remains challenging. Inspired by how evidence is assessed in the legal domain, we design a rigorous framework to assess different levels of faithfulness of cognitive statements and introduce the CogniBench dataset where we reveal insightful statistics. To keep pace with rapidly evolving LLMs, we further develop an automatic annotation pipeline that scales easily across different models. This results in a large-scale CogniBench-L dataset, which facilitates training accurate detectors for both factual and cognitive hallucinations. We release our model and datasets at: https://github.com/FUTUREEEEEE/CogniBench
pdf
bib
abs
Neural Incompatibility: The Unbridgeable Gap of Cross-Scale Parametric Knowledge Transfer in Large Language Models
Yuqiao Tan
|
Shizhu He
|
Kang Liu
|
Jun Zhao
Large Language Models (LLMs) offer a transparent brain with accessible parameters that encode extensive knowledge, which can be analyzed, located and transferred. Consequently, a key research challenge is to transcend traditional knowledge transfer paradigms rooted in symbolic language and achieve genuine Parametric Knowledge Transfer (PKT). Significantly, exploring effective methods for transferring knowledge across LLMs of different scales through parameters presents an intriguing and valuable research direction. In this paper, we first demonstrate that alignment in parametric space is the fundamental prerequisite for successful cross-scale PKT. We redefine the previously explored knowledge transfer as Post-Align PKT (PostPKT), which utilizes extracted parameters for LoRA initialization and requires subsequent fine-tuning for alignment. Hence, to reduce the cost of further fine-tuning, we introduce a novel Pre-Align PKT (PrePKT) paradigm and propose a solution called LaTen (Locate-Then-Align) that aligns the parametric spaces of LLMs across scales using only a few training steps, without subsequent training. Comprehensive experiments on four benchmarks demonstrate that both PostPKT and PrePKT face challenges in achieving consistently stable transfer. Through in-depth analysis, we identify Neural Incompatibility as the ethological and parametric structural differences between LLMs of varying scales, presenting fundamental challenges to achieving effective PKT. These findings provide fresh insights into the parametric architectures of LLMs and highlight promising directions for future research on efficient PKT. Our code is available at https://github.com/Trae1ounG/Neural_Incompatibility.
pdf
bib
abs
Enhancing Mathematical Reasoning in LLMs by Stepwise Correction
Zhenyu Wu
|
Qingkai Zeng
|
Zhihan Zhang
|
Zhaoxuan Tan
|
Chao Shen
|
Meng Jiang
Best-of-N decoding methods instruct large language models (LLMs) to generate multiple solutions, score each using a scoring function, and select the highest-scoring one as the final answer to mathematical reasoning problems. However, this repeated independent process often leads to the same mistakes, so the selected solution can still be incorrect. We propose a novel prompting method named Stepwise Correction (StepCo) that helps LLMs identify and revise incorrect steps in their generated reasoning paths. It iterates verification and revision phases that employ a process-supervised verifier. The verify-then-revise process not only improves answer correctness but also reduces token consumption, because fewer paths need to be generated. With StepCo, a series of LLMs demonstrate exceptional performance. Notably, using GPT-4o as the backend LLM, StepCo achieves an average accuracy of 94.1 across eight datasets, significantly outperforming the state-of-the-art Best-of-N method by +2.4, while reducing token consumption by 77.8%. Our implementation is made publicly available at https://wzy6642.github.io/stepco.github.io.
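The contrast between Best-of-N selection and a verify-then-revise loop can be sketched on a toy arithmetic task. This is a hedged illustration under stated assumptions, not the StepCo implementation: the step format, the exact-arithmetic verifier, and `toy_reviser` (a stub standing in for an LLM revision call) are all hypothetical names introduced for this example.

```python
import operator
from typing import Callable, List, Optional, Tuple

Step = Tuple[int, str, int, int]  # (a, op, b, claimed_result)
Path = List[Step]
OPS = {"+": operator.add, "*": operator.mul}

def step_ok(step: Step) -> bool:
    """Toy verifier for one step: recompute the arithmetic exactly."""
    a, op, b, claimed = step
    return OPS[op](a, b) == claimed

def first_bad_step(path: Path) -> Optional[int]:
    """Index of the first incorrect step, or None if every step passes."""
    for i, step in enumerate(path):
        if not step_ok(step):
            return i
    return None

def best_of_n(generate: Callable[[], Path],
              score: Callable[[Path], float], n: int) -> Path:
    """Sample n independent paths and keep the highest-scoring one."""
    return max((generate() for _ in range(n)), key=score)

def verify_then_revise(path: Path,
                       revise_from: Callable[[Path], Path],
                       max_rounds: int = 3) -> Path:
    """Keep the verified prefix and regenerate from the first flagged step."""
    for _ in range(max_rounds):
        bad = first_bad_step(path)
        if bad is None:
            break
        prefix = path[:bad]
        path = prefix + revise_from(prefix)
    return path

# Hypothetical stand-in for an LLM revision call: continue from a reference path.
REFERENCE: Path = [(2, "+", 3, 5), (5, "*", 4, 20)]

def toy_reviser(prefix: Path) -> Path:
    return REFERENCE[len(prefix):]

draft: Path = [(2, "+", 3, 5), (5, "*", 4, 25)]  # second step is wrong
print(verify_then_revise(draft, toy_reviser))
```

The revision loop reuses the verified first step and regenerates only from the flagged position, whereas Best-of-N would resample whole paths; that reuse is the intuition behind the lower token consumption described in the abstract.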
pdf
bib
abs
PsyDial: A Large-scale Long-term Conversational Dataset for Mental Health Support
Huachuan Qiu
|
Zhenzhong Lan
Dialogue systems for mental health counseling aim to alleviate client distress and assist individuals in navigating personal challenges. Developing effective conversational agents for psychotherapy requires access to high-quality, real-world, long-term client-counselor interaction data, which is difficult to obtain due to privacy concerns. Although removing personally identifiable information is feasible, this process is labor-intensive. To address these challenges, we propose a novel privacy-preserving data reconstruction method that reconstructs real-world client-counselor dialogues while mitigating privacy concerns. We apply the RMRR (Retrieve, Mask, Reconstruct, Refine) method, which facilitates the creation of the privacy-preserving PsyDial dataset, with an average of 37.8 turns per dialogue. Extensive analysis demonstrates that PsyDial effectively reduces privacy risks while maintaining dialogue diversity and conversational exchange. To fairly and reliably evaluate the performance of models fine-tuned on our dataset, we manually collect 101 dialogues from professional counseling books. Experimental results show that models fine-tuned on PsyDial achieve improved psychological counseling performance, outperforming various baseline models. A user study involving counseling experts further reveals that our LLM-based counselor provides higher-quality responses. Code, data, and models are available at https://github.com/qiuhuachuan/PsyDial, serving as valuable resources for future advancements in AI psychotherapy.
pdf
bib
abs
Enhancing Goal-oriented Proactive Dialogue Systems via Consistency Reflection and Correction
Didi Zhang
|
Yaxin Fan
|
Peifeng Li
|
Qiaoming Zhu
Goal-oriented proactive dialogue systems are designed to guide user conversations seamlessly towards specific objectives by planning a goal-oriented path. However, previous research has focused predominantly on optimizing these paths while neglecting the inconsistencies that may arise between generated responses and dialogue contexts, including user profiles, dialogue history, domain knowledge, and subgoals. To address this issue, we introduce a model-agnostic two-stage Consistency Reflection and Correction (CRC) framework. Specifically, in the consistency reflection stage, the model is prompted to reflect on the discrepancies between generated responses and dialogue contexts, identifying inconsistencies and suggesting possible corrections. In the consistency correction stage, the model generates responses that are more consistent with the dialogue context based on these reflection results. We conducted experiments on various model architectures with different parameter sizes, including encoder-decoder models (BART, T5) and decoder-only models (GPT-2, DialoGPT, Phi3, Mistral and LLaMA3), and the experimental results on three datasets demonstrate that our CRC framework significantly improves the consistency between generated responses and dialogue contexts.
pdf
bib
abs
Exclusion of Thought: Mitigating Cognitive Load in Large Language Models for Enhanced Reasoning in Multiple-Choice Tasks
Qihang Fu
|
Yongbin Qin
|
Ruizhang Huang
|
Yanping Chen
|
Yulin Zhou
|
Lintao Long
Multiple-choice questions (MCQs) are a widely used and vital assessment format for evaluating large language models (LLMs). This study reveals that LLMs are susceptible to “cognitive load” caused by distractor options in MCQs, leading to excessive attention to distractors and consequent vacillation between correct and incorrect options. To mitigate this cognitive burden, we introduce a novel reasoning prompt strategy, called EoT, which effectively reduces cognitive load by steering the model’s attention away from erroneous options. This enables the model to focus more effectively on reasonable answers. Additionally, by documenting the elimination process, EoT enhances the transparency and interpretability of the model’s reasoning. Experimental results demonstrate that EoT, as a plug-and-play approach, significantly reduces cognitive load and improves performance, showcasing its potential to enhance both the accuracy and interpretability of LLMs.
pdf
bib
abs
Registering Source Tokens to Target Language Spaces in Multilingual Neural Machine Translation
Zhi Qu
|
Yiran Wang
|
Jiannan Mao
|
Chenchen Ding
|
Hideki Tanaka
|
Masao Utiyama
|
Taro Watanabe
The multilingual neural machine translation (MNMT) aims for arbitrary translations across multiple languages. Although MNMT-specific models trained on parallel data offer low costs in training and deployment, their performance consistently lags behind that of large language models (LLMs). In this work, we introduce registering, a novel method that enables a small MNMT-specific model to compete with LLMs. Specifically, we insert a set of artificial tokens specifying the target language, called registers, into the input sequence between the source and target tokens. By modifying the attention mask, the target token generation only pays attention to the activation of registers, representing the source tokens in the target language space. Experiments on EC-40, a large-scale benchmark, show that our method advances the state-of-the-art of MNMT. We further pre-train two models, namely MITRE (multilingual translation with registers), by 9.3 billion sentence pairs across 24 languages collected from public corpora. One of them, MITRE-913M, outperforms NLLB-3.3B, achieves comparable performance with commercial LLMs, and shows strong adaptability in fine-tuning. Finally, we open-source our models to facilitate further research and development in MNMT: https://github.com/zhiqu22/mitre.
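The masking idea in this abstract can be illustrated with a minimal sketch. It assumes a causal, decoder-only layout of `[source | registers | target]`; the function name `build_register_mask` and the exact visibility rules for source and register positions are illustrative assumptions, not the authors' implementation. The only property taken from the abstract is that target positions attend to registers rather than to source tokens.

```python
def build_register_mask(n_src, n_reg, n_tgt):
    """Return mask[i][j] == True when position i may attend to position j.

    Assumed layout (hypothetical): source tokens, then register tokens,
    then target tokens, all under a causal (no look-ahead) constraint.
    """
    n = n_src + n_reg + n_tgt
    reg_start = n_src            # index of the first register token
    tgt_start = n_src + n_reg    # index of the first target token
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):   # causal: only current and earlier positions
            if i >= tgt_start and j < reg_start:
                continue         # target positions never see source tokens
            mask[i][j] = True
    return mask


# Example: 3 source tokens, 2 registers, 2 target tokens.
mask = build_register_mask(3, 2, 2)
```

In this sketch the registers still read the source tokens, so source information can reach the target only through the register activations.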
pdf
bib
abs
VisuoThink: Empowering LVLM Reasoning with Multimodal Tree Search
Yikun Wang
|
Siyin Wang
|
Qinyuan Cheng
|
Zhaoye Fei
|
Liang Ding
|
Qipeng Guo
|
Dacheng Tao
|
Xipeng Qiu
Recent advancements in Large Vision-Language Models have showcased remarkable capabilities. However, they often falter when confronted with complex reasoning tasks that humans typically address through visual aids and deliberate, step-by-step thinking. While existing methods have explored text-based slow thinking or rudimentary visual assistance, they fall short of capturing the intricate, interleaved nature of human visual-verbal reasoning processes. To overcome these limitations and inspired by the mechanisms of slow thinking in human cognition, we introduce VisuoThink, a novel framework that seamlessly integrates visuospatial and linguistic domains. VisuoThink facilitates multimodal slow thinking by enabling progressive visual-textual reasoning and incorporates test-time scaling through look-ahead tree search. Extensive experiments demonstrate that VisuoThink significantly enhances reasoning capabilities via inference-time scaling, even without fine-tuning, achieving state-of-the-art performance in tasks involving geometry and spatial reasoning.
pdf
bib
abs
Automated CAD Modeling Sequence Generation from Text Descriptions via Transformer-Based Large Language Models
JianXing Liao
|
Junyan Xu
|
Yatao Sun
|
Maowen Tang
|
Sicheng He
|
Jingxian Liao
|
Shui Yu
|
Yun Li
|
Xiaohong Guan
Designing complex computer-aided design (CAD) models is often time-consuming due to challenges such as computational inefficiency and the difficulty of generating precise models. We propose a novel language-guided framework for industrial design automation to address these issues, integrating large language models (LLMs) with computer-automated design (CAutoD). Through this framework, CAD models are automatically generated from parameters and appearance descriptions, supporting the automation of design tasks during the detailed CAD design phase. Our approach introduces three key innovations: (1) a semi-automated data annotation pipeline that leverages LLMs and vision-language large models (VLLMs) to generate high-quality parameters and appearance descriptions; (2) a Transformer-based CAD generator (TCADGen) that predicts modeling sequences via dual-channel feature aggregation; (3) an enhanced CAD modeling generation model, called CADLLM, that is designed to refine the generated sequences by incorporating the confidence scores from TCADGen. Experimental results demonstrate that the proposed approach outperforms traditional methods in both accuracy and efficiency, providing a powerful tool for automating industrial workflows and generating complex CAD models from textual prompts. The code is available at https://jianxliao.github.io/cadllm-page/
pdf
bib
abs
LED-Merging: Mitigating Safety-Utility Conflicts in Model Merging with Location-Election-Disjoint
Qianli Ma
|
Dongrui Liu
|
Qian Chen
|
Linfeng Zhang
|
Jing Shao
Fine-tuning pre-trained Large Language Models (LLMs) for specialized tasks incurs substantial computational and data costs. While model merging offers a training-free solution to integrate multiple task-specific models, existing methods suffer from safety-utility conflicts where enhanced general capabilities degrade safety safeguards. We identify two root causes: neuron misidentification due to simplistic parameter magnitude-based selection, and cross-task neuron interference during merging. To address these challenges, we propose LED-Merging, a three-stage framework that Locates task-specific neurons via gradient-based attribution, dynamically Elects critical neurons through multi-model importance fusion, and Disjoints conflicting updates through parameter isolation. Extensive experiments on Llama-3-8B, Mistral-7B, and Llama2-13B demonstrate that LED-Merging effectively reduces harmful response rates, showing a 31.4% decrease on Llama-3-8B-Instruct on HarmBench, while simultaneously preserving 95% of utility performance, such as achieving 52.39% accuracy on GSM8K. LED-Merging resolves safety-utility conflicts and provides a lightweight, training-free paradigm for constructing reliable multi-task LLMs. Code is available at https://github.com/MqLeet/LED-Merging
pdf
bib
abs
Dolphin: Moving Towards Closed-loop Auto-research through Thinking, Practice, and Feedback
Jiakang Yuan
|
Xiangchao Yan
|
Bo Zhang
|
Tao Chen
|
Botian Shi
|
Wanli Ouyang
|
Yu Qiao
|
Lei Bai
|
Bowen Zhou
The scientific research paradigm is undergoing a profound transformation owing to the development of Artificial Intelligence (AI). Recent works demonstrate that various AI-assisted research methods can largely improve research efficiency by improving data analysis, accelerating computation, and fostering novel idea generation. To further move towards the ultimate goal (i.e., automatic scientific research), in this paper, we introduce Dolphin, a closed-loop LLM-driven framework to enhance the automation level of scientific research. Dolphin first generates novel ideas based on feedback from previous experiments and relevant papers ranked by the topic and task attributes. Then, the generated ideas can be implemented using a code template refined and debugged with the designed exception-traceback-guided local code structure. Finally, Dolphin automatically analyzes the results of each idea and feeds the results back to the next round of idea generation. Experiments are conducted on the benchmark datasets of different topics and a subset of MLE-bench. Results show that Dolphin can continuously improve the performance of the input topic in a loop. We highlight that Dolphin can automatically propose methods that are comparable to the state-of-the-art in some tasks such as 3D point classification.
pdf
bib
abs
PerSphere: A Comprehensive Framework for Multi-Faceted Perspective Retrieval and Summarization
Yun Luo
|
Yingjie Li
|
Xiangkun Hu
|
Qinglin Qi
|
Fang Guo
|
Qipeng Guo
|
Zheng Zhang
|
Yue Zhang
As online platforms and recommendation algorithms evolve, people are increasingly trapped in echo chambers, leading to biased understandings of various issues. To combat this issue, we have introduced PerSphere, a benchmark designed to facilitate multi-faceted perspective retrieval and summarization, thus breaking free from these information silos. For each query within PerSphere, there are two opposing claims, each supported by distinct, non-overlapping perspectives drawn from one or more documents. Our goal is to accurately summarize these documents, aligning the summaries with the respective claims and their underlying perspectives. This task is structured as a two-step end-to-end pipeline that includes comprehensive document retrieval and multi-faceted summarization. Furthermore, we propose a set of metrics to evaluate the comprehensiveness of the retrieval and summarization content. Experimental results on various counterparts for the pipeline show that recent models struggle with such a complex task. Analysis shows that the main challenge lies in long context and perspective extraction, and we propose a simple but effective multi-agent summarization system, offering a promising solution to enhance performance on PerSphere.
pdf
bib
abs
Prompt-Guided Internal States for Hallucination Detection of Large Language Models
Fujie Zhang
|
Peiqi Yu
|
Biao Yi
|
Baolei Zhang
|
Tong Li
|
Zheli Liu
Large Language Models (LLMs) have demonstrated remarkable capabilities across a variety of tasks in different domains. However, they sometimes generate responses that are logically coherent but factually incorrect or misleading, which is known as LLM hallucinations. Data-driven supervised methods train hallucination detectors by leveraging the internal states of LLMs, but detectors trained on specific domains often struggle to generalize well to other domains. In this paper, we aim to enhance the cross-domain performance of supervised detectors with only in-domain data. We propose a novel framework, prompt-guided internal states for hallucination detection of LLMs, namely PRISM. By utilizing appropriate prompts to guide changes to the structure related to text truthfulness in LLMs’ internal states, we make this structure more salient and consistent across texts from different domains. We integrated our framework with existing hallucination detection methods and conducted experiments on datasets from different domains. The experimental results indicate that our framework significantly enhances the cross-domain generalization of existing hallucination detection methods.
pdf
bib
abs
Typology-Guided Adaptation in Multilingual Models
Ndapa Nakashole
Multilingual models often treat language diversity as a problem of data imbalance, overlooking structural variation. We introduce the *Morphological Index* (MoI), a typologically grounded metric that quantifies how strongly a language relies on surface morphology for noun classification. Building on MoI, we propose *MoI-MoE*, a Mixture of Experts model that routes inputs based on morphological structure. Evaluated on 10 Bantu languages—a large, morphologically rich and underrepresented family—MoI-MoE outperforms strong baselines, improving Swahili accuracy by 14 points on noun class recognition while maintaining performance on morphology-rich languages like Zulu. These findings highlight typological structure as a practical and interpretable signal for multilingual model adaptation.
pdf
bib
abs
Don’t Erase, Inform! Detecting and Contextualizing Harmful Language in Cultural Heritage Collections
Orfeas Menis Mastromichalakis
|
Jason Liartis
|
Kristina Rose
|
Antoine Isaac
|
Giorgos Stamou
Cultural Heritage (CH) data hold invaluable knowledge, reflecting the history, traditions, and identities of societies, and shaping our understanding of the past and present. However, many CH collections contain outdated or offensive descriptions that reflect historical biases. CH Institutions (CHIs) face significant challenges in curating these data due to the vast scale and complexity of the task. To address this, we develop an AI-powered tool that detects offensive terms in CH metadata and provides contextual insights into their historical background and contemporary perception. We leverage a multilingual vocabulary co-created with marginalized communities, researchers, and CH professionals, along with traditional NLP techniques and Large Language Models (LLMs). Available as a standalone web app and integrated with major CH platforms, the tool has processed over 7.9 million records, contextualizing the contentious terms detected in their metadata. Rather than erasing these terms, our approach seeks to inform, making biases visible and providing actionable insights for creating more inclusive and accessible CH collections.
pdf
bib
abs
ECLM: Entity Level Language Model for Spoken Language Understanding with Chain of Intent
Shangjian Yin
|
Peijie Huang
|
JiaTian Chen
|
Haojing Huang
|
Yuhong Xu
Large Language Models (LLMs) have demonstrated impressive capabilities in language generation and general task performance. However, their application to spoken language understanding (SLU) remains challenging, particularly for token-level tasks, where the autoregressive nature of LLMs often leads to misalignment issues. They also struggle to capture nuanced interrelations in semantic-level tasks through direct fine-tuning alone. To address these challenges, we propose the Entity-level Language Model (ECLM) framework, which reformulates slot-filling as an entity recognition task and introduces a novel concept, Chain of Intent, to enable step-by-step multi-intent recognition. Experimental results show that ECLM significantly outperforms strong baselines such as Uni-MIS, achieving gains of 3.7% on MixATIS and 3.1% on MixSNIPS. Compared to standard supervised fine-tuning of LLMs, ECLM further achieves improvements of 8.5% and 21.2% on these datasets, respectively. Our code is available at https://github.com/SJY8460/ECLM.
pdf
bib
abs
FaithfulRAG: Fact-Level Conflict Modeling for Context-Faithful Retrieval-Augmented Generation
Qinggang Zhang
|
Zhishang Xiang
|
Yilin Xiao
|
Le Wang
|
Junhui Li
|
Xinrun Wang
|
Jinsong Su
Large language models (LLMs) augmented with retrieval systems have demonstrated significant potential in handling knowledge-intensive tasks. However, these models often struggle with unfaithfulness issues, generating outputs that either ignore the retrieved context or inconsistently blend it with the LLM’s parametric knowledge. This issue is particularly severe in cases of knowledge conflict, where the retrieved context conflicts with the model’s parametric knowledge. While existing faithful RAG approaches enforce strict context adherence through well-designed prompts or modified decoding strategies, our analysis reveals a critical limitation: they achieve faithfulness by forcibly suppressing the model’s parametric knowledge, which undermines the model’s internal knowledge structure and increases the risk of misinterpreting the context. To this end, this paper proposes FaithfulRAG, a novel framework that resolves knowledge conflicts by explicitly modeling discrepancies between the model’s parametric knowledge and retrieved context. Specifically, FaithfulRAG identifies conflicting knowledge at the fact level and designs a self-thinking process, allowing LLMs to reason about and integrate conflicting facts before generating responses. Extensive experiments demonstrate that our method outperforms state-of-the-art methods. The code is available at https://github.com/DeepLearnXMU/Faithful-RAG.
pdf
bib
abs
Knowledge Image Matters: Improving Knowledge-Based Visual Reasoning with Multi-Image Large Language Models
Guanghui Ye
|
Huan Zhao
|
Zhixue Zhao
|
Xupeng Zha
|
Yang Liu
|
Zhihua Jiang
We revisit knowledge-based visual reasoning (KB-VR) in light of modern advances in multimodal large language models (MLLMs), and make the following contributions: (i) We propose Visual Knowledge Card (VKC) – a novel image that incorporates not only internal visual knowledge (e.g., scene-aware information) detected from the raw image, but also external world knowledge (e.g., attribute or object knowledge) produced by a knowledge generator; (ii) We present VKC-based Multi-Image Reasoning (VKC-MIR) – a four-stage pipeline which harnesses a state-of-the-art scene perception engine to construct an initial VKC (Stage-1), a powerful LLM to generate relevant domain knowledge (Stage-2), an excellent image editing toolkit to introduce generated knowledge into the updated VKC (Stage-3), and finally, an emerging multi-image MLLM to solve the VKC-enhanced task (Stage-4). By performing experiments on three popular KB-VR benchmarks, our approach achieves new state-of-the-art results compared to previous top-performing models.
pdf
bib
abs
Evaluating Personalized Tool-Augmented LLMs from the Perspectives of Personalization and Proactivity
Yupu Hao
|
Pengfei Cao
|
Zhuoran Jin
|
Huanxuan Liao
|
Yubo Chen
|
Kang Liu
|
Jun Zhao
Personalized tool utilization is essential for aligning large language models (LLMs) with user preferences in interaction scenarios with various tools. However, most current benchmarks focus on either the personalization of text generation or direct tool use, without considering both. In this work, we introduce ETAPP, a novel benchmark for evaluating personalized tool invocation, which establishes a sandbox environment and a comprehensive dataset of 800 test cases covering diverse user profiles. To improve the accuracy of our evaluation, we propose a key-point-based LLM evaluation method, mitigating biases in the LLM-as-a-judge system by manually annotating key points for each test case and providing them to the LLM as a reference. Additionally, we evaluate strong LLMs and provide an in-depth analysis. Furthermore, we investigate the impact of different tool-invoking strategies on LLMs’ personalization performance and the effects of fine-tuning in our task. The effectiveness of our preference-setting and key-point-based evaluation method is also validated. Our findings offer insights into improving personalized LLM agents. Our code is available at https://github.com/hypasd-art/ETAPP.
pdf
bib
abs
GUICourse: From General Vision Language Model to Versatile GUI Agent
Wentong Chen
|
Junbo Cui
|
Jinyi Hu
|
Yujia Qin
|
Junjie Fang
|
Yue Zhao
|
Chongyi Wang
|
Jun Liu
|
Guirong Chen
|
Yupeng Huo
|
Yuan Yao
|
Yankai Lin
|
Zhiyuan Liu
|
Maosong Sun
Utilizing Graphic User Interfaces (GUIs) for human-computer interaction is essential for accessing various digital tools. Recent advancements in Vision Language Models (VLMs) reveal significant potential for developing versatile agents that assist humans in navigating GUIs. However, current VLMs face challenges related to fundamental abilities, such as OCR and grounding, as well as a lack of knowledge about GUI elements’ functionalities and control methods. These limitations hinder their effectiveness as practical GUI agents. To address these challenges, we introduce GUICourse, a series of datasets for training visual-based GUI agents using general VLMs. First, we enhance the OCR and grounding capabilities of VLMs using the GUIEnv dataset. Next, we enrich the GUI knowledge of VLMs using the GUIAct and GUIChat datasets. Our experiments demonstrate that even a small-sized GUI agent (with 3.1 billion parameters) performs effectively on both single-step and multi-step GUI tasks. We further finetune our GUI agents on other GUI tasks with different action spaces (AITW and Mind2Web), and the results show that our agents are better than their baseline VLMs. Additionally, we analyze the impact of OCR and grounding capabilities through an ablation study, revealing a positive correlation with GUI navigation ability.
pdf
bib
abs
Evaluating Visual and Cultural Interpretation: The K-Viscuit Benchmark with Human-VLM Collaboration
ChaeHun Park
|
Yujin Baek
|
Jaeseok Kim
|
Yu-Jung Heo
|
Du-Seong Chang
|
Jaegul Choo
To create culturally inclusive vision-language models (VLMs), developing a benchmark that tests their ability to address culturally relevant questions is essential. Existing approaches typically rely on human annotators, making the process labor-intensive and creating a cognitive burden in generating diverse questions. To address this, we propose a semi-automated framework for constructing cultural VLM benchmarks, specifically targeting multiple-choice QA. This framework combines human-VLM collaboration, where VLMs generate questions based on guidelines, a small set of annotated examples, and relevant knowledge, followed by a verification process by native speakers. We demonstrate the effectiveness of this framework through the creation of K-Viscuit, a dataset focused on Korean culture. Our experiments on this dataset reveal that open-source models lag behind proprietary ones in understanding Korean culture, highlighting key areas for improvement. We also present a series of further analyses, including human evaluation, augmenting VLMs with external knowledge, and the evaluation beyond multiple-choice QA. Our dataset is available at https://huggingface.co/datasets/ddehun/k-viscuit.
pdf
bib
abs
Maximizing the Effectiveness of Larger BERT Models for Compression
Wen-Shu Fan
|
Su Lu
|
Shangyu Xing
|
Xin-Chun Li
|
De-Chuan Zhan
Knowledge distillation (KD) is a widely used approach for BERT compression, where a larger BERT model serves as a teacher to transfer knowledge to a smaller student model. Prior work has found that distilling from a larger, better-performing BERT can yield a worse student than distilling from a smaller BERT. In this paper, we investigate the limitations of existing KD methods for larger BERT models. Through Canonical Correlation Analysis, we identify that these methods fail to fully exploit the potential advantages of larger teachers. To address this, we propose an improved distillation approach that effectively enhances knowledge transfer. Comprehensive experiments demonstrate the effectiveness of our method in enabling larger BERT models to distill knowledge more efficiently.
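For readers unfamiliar with the teacher-student setup mentioned here, the following is a minimal sketch of the standard temperature-softened distillation loss (KL divergence between teacher and student output distributions). It is the generic textbook formulation, not the improved approach this paper proposes; the function names and the default temperature are illustrative assumptions.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kd_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions.

    The temperature**2 factor is the usual scaling that keeps gradient
    magnitudes comparable across temperatures.
    """
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = sum(
        t * math.log(t / s)
        for t, s in zip(p_teacher, p_student)
        if t > 0
    )
    return kl * temperature ** 2


# The loss is zero when the student matches the teacher and positive otherwise.
same = kd_loss([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
different = kd_loss([3.0, 2.0, 1.0], [1.0, 2.0, 3.0])
```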
pdf
bib
abs
Can LLMs Reason About Program Semantics? A Comprehensive Evaluation of LLMs on Formal Specification Inference
Thanh Le-Cong
|
Bach Le
|
Toby Murray
Large Language Models (LLMs) are increasingly being used to automate programming tasks. However, the capabilities of LLMs in reasoning about program semantics are still inadequately studied, leaving substantial potential for further exploration. This paper introduces FormalBench, a comprehensive benchmark designed to evaluate the reasoning abilities of Large Language Models (LLMs) on program semantics. Specifically, it utilizes the task of synthesizing formal program specifications as a proxy measure for assessing the semantic reasoning of LLMs. This task requires both comprehensive reasoning over all possible program executions and the generation of precise, syntactically correct expressions that adhere to formal syntax and semantics. Using this benchmark, we evaluated the ability of LLMs to synthesize consistent and complete specifications. Our findings show that LLMs perform well with simple control flows but struggle with more complex structures, especially loops, even with advanced prompting. Additionally, LLMs exhibit limited robustness against semantic-preserving transformations. We also highlight common failure patterns and design self-repair prompts, improving success rates by 25%. FormalBench is packaged as an executable library and has been released at https://github.com/thanhlecongg/FormalBench/.
pdf
bib
abs
HACo-Det: A Study Towards Fine-Grained Machine-Generated Text Detection under Human-AI Coauthoring
Zhixiong Su
|
Yichen Wang
|
Herun Wan
|
Zhaohan Zhang
|
Minnan Luo
The misuse of large language models (LLMs) poses potential risks, motivating the development of machine-generated text (MGT) detection. Existing literature primarily concentrates on binary, document-level detection, thereby neglecting texts that are composed jointly by human and LLM contributions. Hence, this paper explores the possibility of fine-grained MGT detection under human-AI coauthoring. We suggest fine-grained detectors can pave pathways toward coauthored text detection with a numeric AI ratio. Specifically, we propose a dataset, HACo-Det, which produces human-AI coauthored texts via an automatic pipeline with word-level attribution labels. We retrofit seven prevailing document-level detectors to generalize them to word-level detection. Then we evaluate these detectors on HACo-Det on both word- and sentence-level detection tasks. Empirical results show that metric-based methods struggle to conduct fine-grained detection with a 0.462 average F1 score, while finetuned models show superior performance and better generalization across domains. However, we argue that fine-grained co-authored text detection is far from solved. We further analyze factors influencing performance, e.g., context window, and highlight the limitations of current methods, pointing to potential avenues for improvement.
pdf
bib
abs
IndicSynth: A Large-Scale Multilingual Synthetic Speech Dataset for Low-Resource Indian Languages
Divya V Sharma
|
Vijval Ekbote
|
Anubha Gupta
Recent advances in synthetic speech generation technology have facilitated the generation of high-quality synthetic (fake) speech that emulates human voices. These technologies pose a threat of misuse for identity theft and the spread of misinformation. Consequently, the misuse of such powerful technologies necessitates the development of robust and generalizable audio deepfake detection (ADD) and anti-spoofing models. However, such models are often linguistically biased. Consequently, the models trained on datasets in one language exhibit a low accuracy when evaluated on out-of-domain languages. Such biases reduce the usability of these models and highlight the urgent need for multilingual synthetic speech datasets for bias mitigation research. However, most available datasets are in English or Chinese. The dearth of multilingual synthetic datasets hinders multilingual ADD and anti-spoofing research. Furthermore, the problem intensifies in countries with rich linguistic diversity, such as India. Therefore, we introduce IndicSynth, which contains 4,000 hours of synthetic speech from 989 target speakers, including 456 females and 533 males for 12 low-resourced Indian languages. The dataset includes rich metadata covering gender details and target speaker identifiers. Experimental results demonstrate that IndicSynth is a valuable contribution to multilingual ADD and anti-spoofing research. The dataset can be accessed from https://github.com/vdivyas/IndicSynth.
pdf
bib
abs
Reinforced IR: A Self-Boosting Framework For Domain-Adapted Information Retrieval
Chaofan Li
|
Jianlyu Chen
|
Yingxia Shao
|
Chaozhuo Li
|
Quanqing Xu
|
Defu Lian
|
Zheng Liu
While retrieval techniques are widely used in practice, they still face significant challenges in cross-domain scenarios. Recently, generation-augmented methods have emerged as a promising solution to this problem. These methods enhance raw queries by incorporating additional information from an LLM-based generator, facilitating more direct retrieval of relevant documents. However, existing methods struggle with highly specialized situations that require extensive domain expertise. To address this problem, we present Reinforced-IR, a novel approach that jointly adapts a pre-trained retriever and generator for precise cross-domain retrieval. A key innovation of Reinforced-IR is its Self-Boosting framework, which enables retriever and generator to learn from each other’s feedback. Specifically, the generator is reinforced to generate query augmentations that enhance the retriever’s performance, while the retriever is trained to better discriminate the relevant documents identified by the generator. This iterative process allows the end-to-end retrieval performance to be progressively optimized using an unlabeled corpus from the target domain. In our experiment, Reinforced-IR outperforms existing domain adaptation methods by a large margin, leading to substantial improvements in retrieval quality across a wide range of application scenarios. We have publicly released our code at this repo.
pdf
bib
abs
CoIR: A Comprehensive Benchmark for Code Information Retrieval Models
Xiangyang Li
|
Kuicai Dong
|
Yi Quan Lee
|
Wei Xia
|
Hao Zhang
|
Xinyi Dai
|
Yasheng Wang
|
Ruiming Tang
Despite the substantial success of Information Retrieval (IR) in various NLP tasks, most IR systems predominantly handle queries and corpora in natural language, neglecting the domain of code retrieval. Code retrieval is critically important yet remains under-explored, with existing methods and benchmarks inadequately representing the diversity of code in various domains and tasks. Moreover, many models have begun to overfit existing leaderboards, limiting their generalizability and real-world applicability. Addressing this gap, we present CoIR (**Co**de **I**nformation **R**etrieval Benchmark), a robust and comprehensive benchmark specifically designed to assess code retrieval capabilities. CoIR comprises ten meticulously curated code datasets, spanning eight distinctive retrieval tasks across seven diverse domains. We first discuss the construction of CoIR and its diverse dataset composition. Further, we evaluate ten widely used retrieval models using CoIR, uncovering significant difficulties in performing code retrieval tasks even with state-of-the-art systems. CoIR also introduces a simple yet effective Python framework, which additionally defines various advanced modes to help researchers evaluate their models. It shares the same data schema as other popular benchmarks like MTEB and BEIR, enabling seamless cross-benchmark evaluations. Through CoIR, we aim to invigorate research in the code retrieval domain, providing a versatile benchmarking tool that encourages further development and exploration of code retrieval systems.
pdf
bib
abs
Enhancing Multimodal Retrieval via Complementary Information Extraction and Alignment
Delong Zeng
|
Yuexiang Xie
|
Yaliang Li
|
Ying Shen
Multimodal retrieval has emerged as a promising yet challenging research direction in recent years. Most existing studies in multimodal retrieval focus on capturing information in multimodal data that is similar to their paired texts, but often ignore the complementary information contained in multimodal data. In this study, we propose CIEA, a novel multimodal retrieval approach that employs Complementary Information Extraction and Alignment, which transforms both text and images in documents into a unified latent space and features a complementary information extractor designed to identify and preserve differences in the image representations. We optimize CIEA using two complementary contrastive losses to ensure semantic integrity and effectively capture the complementary information contained in images. Extensive experiments demonstrate the effectiveness of CIEA, which achieves significant improvements over both divide-and-conquer models and universal dense retrieval models. We provide an ablation study, further discussions, and case studies to highlight the advancements achieved by CIEA. To promote further research in the community, we have released the source code at https://github.com/zengdlong/CIEA.
pdf
bib
abs
JoPA: Explaining Large Language Model’s Generation via Joint Prompt Attribution
Yurui Chang
|
Bochuan Cao
|
Yujia Wang
|
Jinghui Chen
|
Lu Lin
Large Language Models (LLMs) have demonstrated impressive performance in complex text generation tasks. However, the contribution of the input prompt to the generated content still remains obscure to humans, underscoring the necessity of understanding the causality between input and output pairs. Existing works that provide prompt-specific explanations often confine the model output to classification or next-word prediction. The few initial attempts to explain the entire language generation often treat input prompt texts independently, ignoring their combinatorial effects on the follow-up generation. In this study, we introduce a counterfactual explanation framework based on joint prompt attribution, JoPA, which aims to explain how a few prompt texts collaboratively influence the LLM’s complete generation. In particular, we formulate the task of prompt attribution for generation interpretation as a combinatorial optimization problem, and introduce a probabilistic algorithm to search for the causal input combination in the discrete space. We define and utilize multiple metrics to evaluate the produced explanations, demonstrating both the faithfulness and efficiency of our framework.
pdf
bib
abs
Proxy-Driven Robust Multimodal Sentiment Analysis with Incomplete Data
Aoqiang Zhu
|
Min Hu
|
Xiaohua Wang
|
Jiaoyun Yang
|
Yiming Tang
|
Ning An
Multimodal Sentiment Analysis (MSA) with incomplete data has gained significant attention recently. Existing studies focus on optimizing model structures to handle modality missingness, but models still face challenges in robustness when dealing with uncertain missingness. To this end, we propose a data-centric robust multimodal sentiment analysis method, Proxy-Driven Robust Multimodal Fusion (P-RMF). First, we map unimodal data to the latent space of Gaussian distributions to capture core features and structure, thereby learning stable modality representations. Then, we combine the quantified inherent modality uncertainty to learn a stable multimodal joint representation (i.e., proxy modality), which is further enhanced through multi-layer dynamic cross-modal injection to increase its diversity. Extensive experimental results show that P-RMF outperforms existing models in noise resistance and achieves state-of-the-art performance on multiple benchmark datasets. Code will be available at https://github.com/***/P-RMF.
pdf
bib
abs
Not All Terms Matter: Recall-Oriented Adaptive Learning for PLM-aided Query Expansion in Open-Domain Question Answering
Xinran Chen
|
Ben He
|
Xuanang Chen
|
Le Sun
The effectiveness of open-domain question answering (ODQA), particularly in systems employing a retriever-reader architecture, depends on the ability to recall relevant documents, a critical step that enables the reader to accurately extract answers. To enhance this retrieval phase, current query expansion (QE) techniques leverage pre-trained language models (PLMs) to mitigate word mismatches and improve the recall of relevant documents. Despite their advancements, these techniques often treat all expanded terms uniformly, which can lead to less-than-optimal retrieval outcomes. In response, we propose a novel Recall-oriented Adaptive Learning (ReAL) method, which iteratively adjusts the importance weights of QE terms based on their relevance, thereby refining term distinction and enhancing the separation of relevant terms. Specifically, ReAL employs a similarity-based model to classify documents into pseudo-relevant and pseudo-irrelevant sets, and then optimizes term weights via two tailored loss functions to maximize the scoring gap between them. Experiments on four ODQA datasets and five QE methods show that ReAL consistently enhances retrieval accuracy and overall end-to-end QA performance, providing a robust and efficient solution for improving QE strategies in ODQA scenarios.
pdf
bib
abs
A Mutual Information Perspective on Knowledge Graph Embedding
Jiang Li
|
Xiangdong Su
|
Zehua Duo
|
Tian Lan
|
Xiaotao Guo
|
Guanglai Gao
Knowledge graph embedding techniques have emerged as a critical approach for addressing the issue of missing relations in knowledge graphs. However, existing methods often suffer from limitations, including high intra-group similarity, loss of semantic information, and insufficient inference capability, particularly in complex relation patterns such as 1-N and N-1 relations. To address these challenges, we introduce a novel KGE framework that leverages mutual information maximization to improve the semantic representation of entities and relations. By maximizing the mutual information between different components of triples, such as (h, r) and t, or (r, t) and h, the proposed method improves the model’s ability to preserve semantic dependencies while maintaining the relational structure of the knowledge graph. Extensive experiments on benchmark datasets demonstrate the effectiveness of our approach, with consistent performance improvements across various baseline models. Additionally, visualization analyses and case studies demonstrate the improved ability of the MI framework to capture complex relation patterns.
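As an illustration of what maximizing mutual information between (h, r) and t can look like in practice, the sketch below uses the InfoNCE lower bound, one common estimator. This is an assumption made for illustration only: the abstract does not state which estimator the authors use, and the names `info_nce` and `mi_lower_bound` and the toy vectors are hypothetical.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def info_nce(queries, targets, temperature=1.0):
    """Mean InfoNCE loss over a batch.

    queries[i] stands for an encoded (h, r) pair and targets[i] for its true
    tail t; every other target in the batch acts as a negative sample.
    """
    total = 0.0
    for i, q in enumerate(queries):
        logits = [dot(q, t) / temperature for t in targets]
        m = max(logits)  # subtract the max for a stable log-sum-exp
        log_z = m + math.log(sum(math.exp(x - m) for x in logits))
        total += log_z - logits[i]  # negative log-softmax of the positive
    return total / len(queries)


def mi_lower_bound(queries, targets, temperature=1.0):
    """InfoNCE bound: I(query; target) >= log(N) - loss, with N the batch size."""
    return math.log(len(queries)) - info_nce(queries, targets, temperature)


# Toy batch: each query points toward its own target.
queries = [[5.0, 0.0], [0.0, 5.0]]
targets = [[1.0, 0.0], [0.0, 1.0]]
aligned_loss = info_nce(queries, targets)
shuffled_loss = info_nce(queries, targets[::-1])
```

Minimizing this loss raises the bound, so well-aligned pairs (as in `aligned_loss`) give a tighter estimate than mismatched ones (as in `shuffled_loss`).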
pdf
bib
abs
Aligned but Blind: Alignment Increases Implicit Bias by Reducing Awareness of Race
Lihao Sun
|
Chengzhi Mao
|
Valentin Hofmann
|
Xuechunzi Bai
Although value-aligned language models (LMs) appear unbiased in explicit bias evaluations, they often exhibit stereotypes in implicit word association tasks, raising concerns about their fair usage. We investigate the mechanisms behind this discrepancy and find that alignment surprisingly amplifies implicit bias in model outputs. Specifically, we show that aligned LMs, unlike their unaligned counterparts, overlook racial concepts in early internal representations when the context is ambiguous. Not representing race likely fails to activate safety guardrails, leading to unintended biases. Inspired by this insight, we propose a new bias mitigation strategy that works by incentivizing the representation of racial concepts in the early model layers. In contrast to conventional mitigation methods of machine unlearning, our interventions find that steering the model to be more aware of racial concepts effectively mitigates implicit bias. Similar to race blindness in humans, ignoring racial nuances can inadvertently perpetuate subtle biases in LMs.
pdf
bib
abs
IOPO: Empowering LLMs with Complex Instruction Following via Input-Output Preference Optimization
Xinghua Zhang
|
Haiyang Yu
|
Cheng Fu
|
Fei Huang
|
Yongbin Li
In the realm of large language models (LLMs), the ability of models to accurately follow instructions is paramount as more agents and applications leverage LLMs for construction, where the complexity of instructions is rapidly increasing. However, on the one hand, there is only a limited amount of complex instruction evaluation data; on the other hand, there are no dedicated algorithms to improve the ability to follow complex instructions. To this end, this paper introduces Trace, a benchmark for improving and evaluating the complex instruction-following ability, which consists of 120K training data and 1K evaluation data. Furthermore, we propose the IOPO (Input-Output Preference Optimization) alignment method, which takes both input and output preference pairs into consideration, where LLMs not only rapidly align with response preferences but also meticulously explore the instruction preferences. Extensive experiments on both in-domain and out-of-domain datasets confirm the effectiveness of IOPO, showing 8.15%, 2.18% improvements on in-domain data and 5.91%, 2.83% on out-of-domain data compared to SFT and DPO respectively. Our code and dataset are released at https://anonymous.4open.science/r/Code7-34A5.
pdf
bib
abs
ProMALex: Progressive Modular Adapters for Multi-Jurisdictional Legal Language Modeling
Santosh T.y.s.s
|
Mohamed Hesham Elganayni
This paper addresses the challenge of adapting language models to the jurisdiction-specific nature of legal corpora. Existing approaches (training separate models for each jurisdiction or using a single shared model) either fail to leverage common legal principles beneficial for low-resource settings or risk negative interference from conflicting jurisdictional interpretations. To overcome these limitations, we propose a parameter-efficient framework, ProMALex, that first derives hierarchical relationships across jurisdictions and progressively inserts adapter modules across model layers based on jurisdictional similarity. This design allows modules in lower layers to be shared across jurisdictions, capturing common legal principles, while higher layers specialize through jurisdiction-specific adapters. Experimental results on two legal language modeling benchmarks demonstrate that ProMALex outperforms both fully shared and jurisdiction-specific models.
pdf
bib
abs
Flipping Knowledge Distillation: Leveraging Small Models’ Expertise to Enhance LLMs in Text Matching
Mingzhe Li
|
Jing Xiang
|
Qishen Zhang
|
Kaiyang Wan
|
Xiuying Chen
Knowledge distillation typically involves transferring knowledge from a Large Language Model (LLM) to a Smaller Language Model (SLM). However, in tasks like text matching, smaller fine-tuned models often produce more effective domain-specific representations as they focus on optimizing the similarity between input pairs. To combine the specialized strengths of small models with the rich semantic understanding of LLMs, we propose a flipped knowledge distillation paradigm, where the LLM learns from the SLM. To bridge the architectural gap between commonly used decoder-only LLMs and the encoder-based frameworks of smaller models, we reinterpret LLMs as encoder-decoder models using LoRA. In this setup, the encoder generates compressed text representations, while the decoder transforms them into the output space. During training, the encoder produces text representations and computes their similarities, which are then aligned with the similarity scores produced by the teacher model. We achieve this alignment using our proposed Margin-aware Contrastive Learning (MCL) approach. MCL ensures accurate similarity for both positive and negative pairs, while also adaptively handling differences within positive and negative samples. We validate the effectiveness of our approach on financial and healthcare benchmarks as well as real-world online applications. Our model has been fully deployed in an online application environment, demonstrating its practical utility.
pdf
bib
abs
Disentangling Language and Culture for Evaluating Multilingual Large Language Models
Jiahao Ying
|
Wei Tang
|
Yiran Zhao
|
Yixin Cao
|
Yu Rong
|
Wenxuan Zhang
This paper introduces a Dual Evaluation Framework to comprehensively assess the multilingual capabilities of LLMs. By decomposing the evaluation along the dimensions of linguistic medium and cultural context, this framework enables a nuanced analysis of LLMs’ ability to process questions within both native and cross-cultural contexts cross-lingually. Extensive evaluations are conducted on a wide range of models, revealing a notable “Cultural-Linguistic Synergy” phenomenon, where models exhibit better performance when questions are culturally aligned with the language. This phenomenon is further explored through interpretability probing, which shows that a higher proportion of specific neurons are activated in a language’s cultural context. This activation proportion could serve as a potential indicator for evaluating multilingual performance during model training. Our findings challenge the prevailing notion that LLMs, primarily trained on English data, perform uniformly across languages and highlight the necessity of culturally and linguistically aware model evaluations.
pdf
bib
abs
Detecting Sockpuppetry on Wikipedia Using Meta-Learning
Luc Raszewski
|
Christine de Kock
Malicious sockpuppet detection on Wikipedia is critical to preserving access to reliable information on the internet and preventing the spread of disinformation. Prior machine learning approaches rely on stylistic and meta-data features, but do not prioritise adaptability to author-specific behaviours. As a result, they struggle to effectively model the behaviour of specific sockpuppet-groups, especially when text data is limited. To address this, we propose the application of meta-learning, a machine learning technique designed to improve performance in data-scarce settings by training models across multiple tasks. Meta-learning optimises a model for rapid adaptation to the writing style of a new sockpuppet-group. Our results show that meta-learning significantly enhances the precision of predictions compared to pre-trained models, marking an advancement in combating sockpuppetry on open editing platforms. We release an updated dataset of sockpuppet investigations to foster future research in both sockpuppetry and meta-learning fields.
pdf
bib
abs
Diversity-oriented Data Augmentation with Large Language Models
Zaitian Wang
|
Jinghan Zhang
|
Xinhao Zhang
|
Kunpeng Liu
|
Pengfei Wang
|
Yuanchun Zhou
Data augmentation is an essential technique in natural language processing (NLP) for enriching training datasets by generating diverse samples. This process is crucial for improving the robustness and generalization capabilities of NLP models. However, a significant challenge remains: Insufficient Attention to Sample Distribution Diversity. Most existing methods focus on increasing the sample numbers while neglecting the sample distribution diversity, which can lead to model overfitting. In response, we explore data augmentation’s impact on dataset diversity and propose a Diversity-oriented data Augmentation framework (DoAug). Specifically, we utilize a diversity-oriented fine-tuning approach to train a large language model (LLM) as a diverse paraphraser, which is capable of augmenting textual datasets by generating diversified paraphrases. Then, we apply the LLM paraphraser to a selected coreset of highly informative samples and integrate the paraphrases with the original data to create a more diverse augmented dataset. Finally, we conduct extensive experiments on 12 real-world textual datasets. The results show that our fine-tuned LLM augmenter improves diversity while preserving label consistency, thereby enhancing the robustness and performance of downstream tasks. Specifically, it achieves an average performance gain of 10.52%, surpassing the runner-up baseline by more than three percentage points.
pdf
bib
abs
CoreEval: Automatically Building Contamination-Resilient Datasets with Real-World Knowledge toward Reliable LLM Evaluation
Jingqian Zhao
|
Bingbing Wang
|
Geng Tu
|
Yice Zhang
|
Qianlong Wang
|
Bin Liang
|
Jing Li
|
Ruifeng Xu
Data contamination poses a significant challenge to the fairness of LLM evaluations in natural language processing tasks by inadvertently exposing models to test data during training. Current studies mitigate this issue by modifying existing datasets or generating new ones from freshly collected information. However, these methods fall short of ensuring contamination-resilient evaluation, as they fail to fully eliminate pre-existing knowledge from models or preserve the semantic complexity of the original datasets. To address these limitations, we propose CoreEval, a Contamination-resilient Evaluation strategy for automatically updating data with real-world knowledge. This approach begins by extracting entity relationships from the original data and leveraging the GDELT database to retrieve relevant and up-to-date knowledge. The retrieved knowledge is then recontextualized and integrated with the original data, which is refined and restructured to ensure semantic coherence and enhanced task relevance. Ultimately, a robust data reflection mechanism is applied in a Chain-of-Thought manner to iteratively verify and refine labels, ensuring consistency between the updated and original datasets. Extensive experiments on updated datasets validate the robustness of CoreEval, demonstrating its effectiveness in mitigating performance overestimation caused by data contamination.
pdf
bib
abs
RiOT: Efficient Prompt Refinement with Residual Optimization Tree
Chenyi Zhou
|
Zhengyan Shi
|
Yuan Yao
|
Lei Liang
|
Huajun Chen
|
Qiang Zhang
Recent advancements in large language models (LLMs) have highlighted their potential across a variety of tasks, but their performance still heavily relies on the design of effective prompts. Existing methods for automatic prompt optimization face two challenges: lack of diversity, which limits the exploration of valuable and innovative directions, and semantic drift, where optimizations for one task can degrade performance in others. To address these issues, we propose Residual Optimization Tree (RiOT), a novel framework for automatic prompt optimization. RiOT iteratively refines prompts through text gradients, generating multiple semantically diverse candidates at each step, and selects the best prompt using perplexity. Additionally, RiOT incorporates the text residual connection to mitigate semantic drift by selectively retaining beneficial content across optimization iterations. A tree structure efficiently manages the optimization process, ensuring scalability and flexibility. Extensive experiments across five benchmarks (covering commonsense, mathematical, logical, temporal, and semantic reasoning) demonstrate that RiOT outperforms both previous prompt optimization methods and manual prompting. Code will be released.
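The perplexity-based selection step mentioned above can be sketched as follows. This is a minimal illustration, not the RiOT implementation: it assumes that each candidate prompt comes with per-token natural-log probabilities from some language model and that the candidate with the lowest perplexity is preferred. The function names and the numbers are hypothetical.

```python
import math


def perplexity(token_logprobs):
    """Perplexity = exp(-mean of the per-token natural-log probabilities)."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def select_prompt(candidates):
    """Return the candidate prompt whose token log-probs give the lowest perplexity.

    candidates maps a prompt string to a list of token log-probabilities.
    Choosing the minimum is an assumption of this sketch.
    """
    return min(candidates, key=lambda prompt: perplexity(candidates[prompt]))


# Made-up log-probabilities for two hypothetical candidate prompts.
candidates = {
    "Think step by step, then answer.": [math.log(0.5)] * 4,
    "Answer quickly.": [math.log(0.25)] * 3,
}
best = select_prompt(candidates)
```

Here the first candidate assigns probability 0.5 to every token and the second 0.25, so the first has the lower perplexity and is selected.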
pdf
bib
abs
Caution for the Environment: Multimodal LLM Agents are Susceptible to Environmental Distractions
Xinbei Ma
|
Yiting Wang
|
Yao Yao
|
Tongxin Yuan
|
Aston Zhang
|
Zhuosheng Zhang
|
Hai Zhao
This paper investigates the faithfulness of multimodal large language model (MLLM) agents in a graphical user interface (GUI) environment, aiming to address the research question of whether multimodal GUI agents can be distracted by environmental context. A general scenario is proposed where both the user and the agent are benign, and the environment, while not malicious, contains unrelated content. A wide range of MLLMs are evaluated as GUI agents using a simulated dataset, following three working patterns with different levels of perception. Experimental results reveal that even the most powerful models, whether generalist agents or specialist GUI agents, are susceptible to distractions. While recent studies predominantly focus on the helpfulness of agents, our findings first indicate that these agents are prone to environmental distractions. Furthermore, we implement an adversarial environment injection and analyze the approach to improve faithfulness, calling for a collective focus on this important topic.
pdf
bib
abs
Automatic Evaluation for Text-to-image Generation: Task-decomposed Framework, Distilled Training, and Meta-evaluation Benchmark
Rong-Cheng Tu
|
Zi-Ao Ma
|
Tian Lan
|
Yuehao Zhao
|
Heyan Huang
|
Xian-Ling Mao
Driven by the remarkable progress in diffusion models, text-to-image generation has achieved substantial advancements, underscoring the urgent need for robust automatic quality assessment. This task is inherently complex, requiring evaluations that range from object presence and attribute correctness to relational consistency and visual fidelity. Consequently, current state-of-the-art MLLM-based approaches often rely on powerful commercial models such as GPT-4o, which offer superior reasoning and instruction-following capabilities but are not universally accessible. In contrast, while open-source MLLMs demonstrate promising skills in vision and language understanding, they underperform in comprehensive image quality assessment. To address these challenges, we propose a task decomposition evaluation framework based on GPT-4o to automatically construct a specialized training dataset, breaking down the multifaceted evaluation process into simpler sub-tasks and thus reducing learning complexity. Building on this dataset, we design novel training strategies to distill GPT-4o’s evaluation capabilities into a 7B open-source MLLM, MiniCPM-V-2.6, enabling it to better follow instructions across diverse assessment criteria. Furthermore, to reliably and comprehensively assess prior works and our proposed model, we manually annotate a meta-evaluation benchmark that includes chain-of-thought explanations alongside quality scores for generated images. Experimental results demonstrate that our distilled open-source MLLM significantly outperforms the current state-of-the-art GPT-4o-based baseline, VIEScore, with over 4.6% improvement in Spearman and Kendall correlations with human judgments.
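For readers unfamiliar with the meta-evaluation metrics named above, the sketch below computes Spearman's rho and Kendall's tau-b between human scores and automatic scores in plain Python. It illustrates the standard definitions only; the scores are invented, and the choice of the tau-b variant is an assumption, since the abstract does not say which Kendall variant is reported.

```python
import math


def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman's rho: Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(x), average_ranks(y))


def kendall_tau_b(x, y):
    """Kendall's tau-b: (concordant - discordant) pairs, corrected for ties."""
    concordant = discordant = ties_x = ties_y = 0
    n = len(x)
    for i in range(n):
        for j in range(i + 1, n):
            dx, dy = x[i] - x[j], y[i] - y[j]
            if dx == 0:
                ties_x += 1
            if dy == 0:
                ties_y += 1
            if dx * dy > 0:
                concordant += 1
            elif dx * dy < 0:
                discordant += 1
    pairs = n * (n - 1) // 2
    return (concordant - discordant) / math.sqrt((pairs - ties_x) * (pairs - ties_y))


human_scores = [1, 2, 3, 4, 5]   # invented human quality ratings
metric_scores = [1, 3, 2, 4, 5]  # invented automatic ratings, one swap
rho = spearman(human_scores, metric_scores)
tau = kendall_tau_b(human_scores, metric_scores)
```

With a single swapped pair among five items, rho is about 0.9 and tau about 0.8, which shows that Kendall's tau penalizes each discordant pair more directly than Spearman's rho.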
pdf
bib
abs
Mitigating Lost-in-Retrieval Problems in Retrieval Augmented Multi-Hop Question Answering
Rongzhi Zhu
|
Xiangyu Liu
|
Zequn Sun
|
Yiwei Wang
|
Wei Hu
In this paper, we identify a critical problem, “lost-in-retrieval”, in retrieval-augmented multi-hop question answering (QA): the key entities are missed in LLMs’ sub-question decomposition. “Lost-in-retrieval” significantly degrades the retrieval performance, which disrupts the reasoning chain and leads to incorrect answers. To resolve this problem, we propose a progressive retrieval and rewriting method, namely ChainRAG, which sequentially handles each sub-question by completing missing key entities and retrieving relevant sentences from a sentence graph for answer generation. Each step in our retrieval and rewriting process builds upon the previous one, creating a seamless chain that leads to accurate retrieval and answers. Finally, all retrieved sentences and sub-question answers are integrated to generate a comprehensive answer to the original question. We evaluate ChainRAG on three multi-hop QA datasets (MuSiQue, 2Wiki, and HotpotQA) using three large language models: GPT4o-mini, Qwen2.5-72B, and GLM-4-Plus. Empirical results demonstrate that ChainRAG consistently outperforms baselines in both effectiveness and efficiency.
pdf
bib
abs
TableLoRA: Low-rank Adaptation on Table Structure Understanding for Large Language Models
Xinyi He
|
Yihao Liu
|
Mengyu Zhou
|
Yeye He
|
Haoyu Dong
|
Shi Han
|
Zejian Yuan
|
Dongmei Zhang
Tabular data are crucial in many fields, and their understanding by large language models (LLMs) under a high-parameter-efficiency paradigm is important. However, directly applying parameter-efficient fine-tuning (PEFT) techniques to tabular tasks presents significant challenges, particularly in terms of better table serialization and the representation of two-dimensional structured information within a one-dimensional sequence. To address this, we propose TableLoRA, a module designed to improve LLMs’ understanding of table structure during PEFT. It incorporates special tokens for serializing tables with a special token encoder and uses 2D LoRA to encode low-rank information on cell positions. Experiments on four tabular-related datasets demonstrate that TableLoRA consistently outperforms vanilla LoRA and surpasses various table encoding methods tested in control experiments. These findings reveal that TableLoRA, as a table-specific LoRA, enhances the ability of LLMs to process tabular data effectively, especially in low-parameter settings, demonstrating its potential as a robust solution for handling table-related tasks.
pdf
bib
abs
Condor: Enhance LLM Alignment with Knowledge-Driven Data Synthesis and Refinement
Maosongcao Maosongcao
|
Taolin Zhang
|
Mo Li
|
Chuyu Zhang
|
Yunxin Liu
|
Conghui He
|
Haodong Duan
|
Songyang Zhang
|
Kai Chen
The quality of Supervised Fine-Tuning (SFT) data plays a critical role in enhancing the conversational capabilities of Large Language Models (LLMs). However, the availability of high-quality human-annotated SFT data has become a significant bottleneck for LLMs, necessitating a greater reliance on synthetic training data. In this work, we introduce Condor, a two-stage synthetic data generation framework that incorporates World Knowledge Trees and Self-Reflection Refinement to produce high-quality SFT data at scale. Our experimental results demonstrate that a base model fine-tuned on only 20K Condor-generated samples achieves superior performance compared to an instruct model trained with RLHF. The additional refinement stage in Condor further enables iterative self-improvement for LLMs at various scales (up to 72B), validating the effectiveness of our approach. Furthermore, our investigation into the scaling of synthetic data in post-training reveals substantial unexplored potential for performance improvements, opening promising avenues for future research.
pdf
bib
abs
CulFiT: A Fine-grained Cultural-aware LLM Training Paradigm via Multilingual Critique Data Synthesis
Ruixiang Feng
|
Shen Gao
|
Xiuying Chen
|
Lisi Chen
|
Shuo Shang
Large Language Models (LLMs) have demonstrated remarkable capabilities across various tasks, yet they often exhibit a specific cultural bias, neglecting the values and linguistic diversity of low-resource regions. This cultural bias not only undermines universal equality but also risks reinforcing stereotypes and perpetuating discrimination. To address this, we propose CulFiT, a novel culturally-aware training paradigm that leverages multilingual data and fine-grained reward modeling to enhance cultural sensitivity and inclusivity. Our approach synthesizes diverse cultural-related questions, constructs critique data in multiple culturally relevant languages, and employs fine-grained rewards to decompose cultural texts into verifiable knowledge units for interpretable evaluation. We also introduce GlobalOpinionQA, a multilingual open-ended question-answering dataset designed to evaluate culturally-aware responses in a global context. Extensive experiments on three existing benchmarks and our GlobalOpinionQA demonstrate that CulFiT achieves state-of-the-art open-source model performance in cultural alignment and general reasoning.
pdf
bib
abs
Decoding Knowledge Attribution in Mixture-of-Experts: A Framework of Basic-Refinement Collaboration and Efficiency Analysis
Junzhuo Li
|
Bo Wang
|
Xiuze Zhou
|
Peijie Jiang
|
Jia Liu
|
Xuming Hu
The interpretability of Mixture-of-Experts (MoE) models, especially those with heterogeneous designs, remains underexplored. Existing attribution methods for dense models fail to capture dynamic routing-expert interactions in sparse MoE architectures. To address this issue, we propose a cross-level attribution algorithm to analyze sparse MoE architectures (Qwen 1.5-MoE, OLMoE, Mixtral-8x7B) against dense models (Qwen 1.5-7B, Llama-7B, Mistral-7B). Results show MoE models achieve 31% higher per-layer efficiency via a “mid-activation, late-amplification” pattern: early layers screen experts, while late layers refine knowledge collaboratively. Ablation studies reveal a “basic-refinement” framework: shared experts handle general tasks (entity recognition), while routed experts specialize in domain-specific processing (geographic attributes). Semantic-driven routing is evidenced by strong correlations between attention heads and experts (r=0.68), enabling task-aware coordination. Notably, architectural depth dictates robustness: deep Qwen-MoE mitigates expert failures (e.g., 43% MRR drop in geographic tasks when blocking top-10 experts) through shared expert redundancy, whereas shallow OLMoE suffers severe degradation (76% drop). Task sensitivity further guides design: core-sensitive tasks (geography) require concentrated expertise, while distributed-tolerant tasks (object attributes) leverage broader participation. These insights advance MoE interpretability, offering principles to balance efficiency, specialization, and robustness.
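The MRR drops quoted above are relative changes in mean reciprocal rank. The sketch below shows how such a figure can be computed; it assumes the drop is measured relative to the unablated model, which the abstract does not state explicitly, and the rank lists are invented rather than taken from the paper.

```python
def mean_reciprocal_rank(ranks):
    """Mean of 1/rank over queries.

    ranks holds the 1-based rank of the correct answer for each query,
    or None when the correct answer was not produced (contributes 0).
    """
    return sum(0.0 if r is None else 1.0 / r for r in ranks) / len(ranks)


def relative_drop(before, after):
    """Fraction of the original score lost after an intervention."""
    return (before - after) / before


# Invented ranks for three queries, before and after blocking some experts.
ranks_full_model = [1, 2, 4]
ranks_experts_blocked = [2, 4, None]

mrr_before = mean_reciprocal_rank(ranks_full_model)
mrr_after = mean_reciprocal_rank(ranks_experts_blocked)
drop = relative_drop(mrr_before, mrr_after)
```

In this toy case the MRR falls from 7/12 to 1/4, a relative drop of 4/7 (about 57%).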
pdf
bib
abs
ChartLens: Fine-grained Visual Attribution in Charts
Manan Suri
|
Puneet Mathur
|
Nedim Lipka
|
Franck Dernoncourt
|
Ryan A. Rossi
|
Dinesh Manocha
The growing capabilities of multimodal large language models (MLLMs) have advanced tasks like chart understanding. However, these models often suffer from hallucinations, where generated text sequences conflict with the provided visual data. To address this, we introduce Post-Hoc Visual Attribution for Charts, which identifies fine-grained chart elements that validate a given chart-associated response. We propose ChartLens, a novel chart attribution algorithm that uses segmentation-based techniques to identify chart objects and employs set-of-marks prompting with MLLMs for fine-grained visual attribution. Additionally, we present ChartVA-Eval, a benchmark with synthetic and real-world charts from diverse domains like finance, policy, and economics, featuring fine-grained attribution annotations. Our evaluations show that ChartLens improves fine-grained attributions by 26-66%.
pdf
bib
abs
LESA: Learnable LLM Layer Scaling-Up
Yifei Yang
|
Zouying Cao
|
Xinbei Ma
|
Yao Yao
|
Zhi Chen
|
Libo Qin
|
Hai Zhao
Training Large Language Models (LLMs) from scratch requires immense computational resources, making it prohibitively expensive. Model scaling-up offers a promising solution by leveraging the parameters of smaller models to create larger ones. However, existing depth scaling-up methods rely on empirical heuristic rules for layer duplication, which result in poorer initialization and slower convergence during continual pre-training. We propose LESA, a novel learnable method for depth scaling-up. By concatenating parameters from each layer and applying Singular Value Decomposition, we uncover latent patterns between layers, suggesting that inter-layer parameters can be learned. LESA uses a neural network to predict the parameters inserted between adjacent layers, enabling better initialization and faster training. Experiments show that LESA outperforms existing baselines, achieving superior performance with less than half the computational cost during continual pre-training. Extensive analyses demonstrate its effectiveness across different model sizes and tasks.
pdf
bib
abs
MMRC: A Large-Scale Benchmark for Understanding Multimodal Large Language Model in Real-World Conversation
Haochen Xue
|
Feilong Tang
|
Ming Hu
|
Yexin Liu
|
Qidong Huang
|
Yulong Li
|
Chengzhi Liu
|
Zhongxing Xu
|
Chong Zhang
|
Chun-Mei Feng
|
Yutong Xie
|
Imran Razzak
|
Zongyuan Ge
|
Jionglong Su
|
Junjun He
|
Yu Qiao
Recent multimodal large language models (MLLMs) have demonstrated significant potential in open-ended conversation, generating more accurate and personalized responses. However, their abilities to memorize, recall, and reason in sustained interactions within real-world scenarios remain underexplored. This paper introduces MMRC, a Multi-Modal Real-world Conversation benchmark for evaluating six core open-ended abilities of MLLMs: information extraction, multi-turn reasoning, information update, image management, memory recall, and answer refusal. With data collected from real-world scenarios, MMRC comprises 5,120 conversations and 28,720 corresponding manually labeled questions, posing a significant challenge to existing MLLMs. Evaluations on 20 MLLMs in MMRC indicate an accuracy drop during open-ended interactions. We identify four common failure patterns: long-term memory degradation, inadequacies in updating factual knowledge, accumulated assumption of error propagation, and reluctance to “say no.” To mitigate these issues, we propose a simple yet effective NOTE-TAKING strategy, which can record key information from the conversation and remind the model during its responses, enhancing conversational capabilities. Experiments across six MLLMs demonstrate significant performance improvements.
pdf
bib
abs
Towards the Law of Capacity Gap in Distilling Language Models
Chen Zhang
|
Qiuchi Li
|
Dawei Song
|
Zheyu Ye
|
Yan Gao
|
Yao Hu
Language model (LM) distillation aims at distilling the knowledge in a large teacher LM to a small student one. As a critical issue facing LM distillation, a superior student often arises from a teacher of a relatively small scale instead of a larger one, especially in the presence of a substantial capacity gap between the teacher and student. This issue, often referred to as the curse of capacity gap, suggests that there is likely an optimal teacher yielding the best-performing student along the scaling course of the teacher. Consequently, distillation trials on teachers of a wide range of scales are called for to determine the optimal teacher, which becomes computationally intensive in the context of large LMs (LLMs). This paper addresses this critical bottleneck by providing the law of capacity gap induced from a preliminary study on distilling a broad range of small-scale (<3B) LMs, where the optimal teacher scale consistently grows linearly with the student scale across different model and data scales. By extending the law to LLM distillation on a larger scale (7B), we succeed in obtaining versatile LLMs that outperform a wide array of competitors.
pdf
bib
abs
WhiSPA: Semantically and Psychologically Aligned Whisper with Self-Supervised Contrastive and Student-Teacher Learning
Rajath Rao
|
Adithya V Ganesan
|
Oscar Kjell
|
Jonah Luby
|
Akshay Raghavan
|
Scott M. Feltman
|
Whitney Ringwald
|
Ryan L. Boyd
|
Benjamin J. Luft
|
Camilo J. Ruggero
|
Neville Ryant
|
Roman Kotov
|
H. Schwartz
Current speech encoding pipelines often rely on an additional text-based LM to get robust representations of human communication, even though SotA speech-to-text models often have an LM within. This work proposes an approach to improve the LM within an audio model such that the subsequent text-LM is unnecessary. We introduce **WhiSPA** (**Whi**sper with **S**emantic and **P**sychological **A**lignment), which leverages a novel audio training objective: contrastive loss with a language model embedding as a teacher. Using over 500k speech segments from mental health audio interviews, we evaluate the utility of aligning Whisper’s latent space with semantic representations from a text autoencoder (SBERT) and lexically derived embeddings of basic psychological dimensions: emotion and personality. Over self-supervised affective tasks and downstream psychological tasks, WhiSPA surpasses current speech encoders, achieving an average error reduction of 73.4% and 83.8%, respectively. WhiSPA demonstrates that it is not always necessary to run a subsequent text LM on speech-to-text output in order to get a rich psychological representation of human communication.
pdf
bib
abs
Keys to Robust Edits: From Theoretical Insights to Practical Advances
Jianhao Yan
|
Futing Wang
|
Yun Luo
|
Yafu Li
|
Yue Zhang
Large language models (LLMs) struggle with maintaining accurate knowledge due to conflicting/outdated parametric memories. While locate-and-edit methods address this, their reliance on models’ internal representations leads to robustness failures in long-context reasoning and paraphrased queries. We identify a fundamental limitation of locate-and-edit methods: existing semantic keys (for memory localization) cannot simultaneously satisfy robustness (context-invariant activation) and specificity (precise knowledge discrimination). Through theoretical error-bound analysis, we establish formal criteria for effective editing. Our solution introduces Robust Edit Pathway (REP), a plug-and-play module that: (1) disentangles editing keys from native model representations; (2) dynamically adjusts keys via contrastive learning to achieve robustness-specificity balance. Extensive experiments across various editing methods (ROME/MEMIT/R-ROME/EMMET), existing LLMs (LLaMA2, QWen, Mistral), and datasets (CounterFact, ZsRE) show that REP improves success rate over robustness tests by up to 66.4% while maintaining the success rate unaffected.
pdf
bib
abs
Boosting LLM’s Molecular Structure Elucidation with Knowledge Enhanced Tree Search Reasoning
Xiang Zhuang
|
Bin Wu
|
Jiyu Cui
|
Kehua Feng
|
Xiaotong Li
|
Huabin Xing
|
Keyan Ding
|
Qiang Zhang
|
Huajun Chen
Molecular structure elucidation involves deducing a molecule’s structure from various types of spectral data, which is crucial in chemical experimental analysis. While large language models (LLMs) have shown remarkable proficiency in analyzing and reasoning through complex tasks, they still encounter substantial challenges in molecular structure elucidation. We identify that these challenges largely stem from LLMs’ limited grasp of specialized chemical knowledge. In this work, we introduce a Knowledge-enhanced reasoning framework for Molecular Structure Elucidation (K-MSE), leveraging Monte Carlo Tree Search for test-time scaling as a plugin. Specifically, we construct an external molecular substructure knowledge base to extend the LLMs’ coverage of the chemical structure space. Furthermore, we design a specialized molecule-spectrum scorer to act as a reward model for the reasoning process, addressing the issue of inaccurate solution evaluation in LLMs. Experimental results show that our approach significantly boosts performance, particularly gaining more than 20% improvement on both GPT-4o-mini and GPT-4o.
pdf
bib
abs
MEMERAG: A Multilingual End-to-End Meta-Evaluation Benchmark for Retrieval Augmented Generation
María Andrea Cruz Blandón
|
Jayasimha Talur
|
Bruno Charron
|
Dong Liu
|
Saab Mansour
|
Marcello Federico
Automatic evaluation of retrieval augmented generation (RAG) systems relies on fine-grained dimensions like faithfulness and relevance, as judged by expert human annotators. Meta-evaluation benchmarks support the development of automatic evaluators that correlate well with human judgement. However, existing benchmarks predominantly focus on English or use translated data, which fails to capture cultural nuances. A native approach provides a better representation of the end user experience. In this work, we develop a Multilingual End-to-end Meta-Evaluation RAG benchmark MEMERAG. Our benchmark builds on the popular MIRACL dataset, using native-language questions and generating responses with diverse large language models (LLMs), which are then assessed by expert annotators for faithfulness and relevance. We describe our annotation process and show that it achieves high inter-annotator agreement. We then analyse the performance of the answer-generating LLMs across languages as per the human evaluators. Finally, we apply the dataset to our main use case, which is to benchmark multilingual automatic evaluators (LLM-as-a-judge). We show that our benchmark can reliably identify improvements offered by advanced prompting techniques and LLMs. We release our benchmark to support the community developing accurate evaluation methods for multilingual RAG systems.
pdf
bib
abs
The Role of Visual Modality in Multimodal Mathematical Reasoning: Challenges and Insights
Yufang Liu
|
Yao Du
|
Tao Ji
|
Jianing Wang
|
Yang Liu
|
Yuanbin Wu
|
Aimin Zhou
|
Mengdi Zhang
|
Xunliang Cai
Recent research has increasingly focused on multimodal mathematical reasoning, particularly emphasizing the creation of relevant datasets and benchmarks. Despite this, the role of visual information in reasoning has been underexplored. Our findings show that existing multimodal mathematical models minimally leverage visual information, and model performance remains largely unaffected by changes to or removal of images in the dataset. We attribute this to the dominance of textual information and answer options that inadvertently guide the model to correct answers. To improve evaluation methods, we introduce the HC-M3D dataset, specifically designed to require image reliance for problem-solving and to challenge models with similar, yet distinct, images that change the correct answer. In testing leading models, their failure to detect these subtle visual differences suggests limitations in current visual perception capabilities. Additionally, we observe that the common approach of improving general VQA capabilities by combining various types of image encoders does not contribute to math reasoning performance. This finding also presents a challenge to enhancing visual reliance during math reasoning.
pdf
bib
abs
The Essence of Contextual Understanding in Theory of Mind: A Study on Question Answering with Story Characters
Chulun Zhou
|
Qiujing Wang
|
Mo Yu
|
Xiaoqian Yue
|
Rui Lu
|
Jiangnan Li
|
Yifan Zhou
|
Shunchi Zhang
|
Jie Zhou
|
Wai Lam
Theory-of-Mind (ToM) is a fundamental psychological capability that allows humans to understand and interpret the mental states of others. Humans infer others’ thoughts by integrating causal cues and indirect clues from broad contextual information, often derived from past interactions. In other words, human ToM heavily relies on the understanding of the backgrounds and life stories of others. Unfortunately, this aspect is largely overlooked in existing benchmarks for evaluating machines’ ToM capabilities, due to their usage of short narratives without global context, especially the personal background of characters. In this paper, we verify the importance of comprehensive contextual understanding of personal backgrounds in ToM and assess the performance of LLMs in such complex scenarios. To achieve this, we introduce the CharToM-QA benchmark, comprising 1,035 ToM questions based on characters from classic novels. Our human study reveals a significant disparity in performance: the same group of educated participants performs dramatically better when they have read the novels compared to when they have not. In parallel, our experiments on state-of-the-art LLMs, including the very recent o1 and DeepSeek-R1 models, show that LLMs still perform notably worse than humans, despite having seen these stories during pre-training. This highlights the limitations of current LLMs in capturing the nuanced contextual information required for ToM reasoning.
pdf
bib
abs
S2R: Teaching LLMs to Self-verify and Self-correct via Reinforcement Learning
Ruotian Ma
|
Peisong Wang
|
Cheng Liu
|
Xingyan Liu
|
Jiaqi Chen
|
Bang Zhang
|
Xin Zhou
|
Nan Du
|
Jia Li
Recent studies have demonstrated the effectiveness of LLM test-time scaling. However, existing approaches to incentivize LLMs’ deep thinking abilities generally require large-scale data or significant training efforts. Meanwhile, it remains unclear how to improve the thinking abilities of less powerful base models. In this work, we introduce S2R, an efficient framework that enhances LLM reasoning by teaching models to self-verify and self-correct during inference. Specifically, we first initialize LLMs with iterative self-verification and self-correction behaviors through supervised fine-tuning on carefully curated data. The self-verification and self-correction skills are then further strengthened by outcome-level and process-level reinforcement learning with minimized resource requirements. Our results demonstrate that, with only 3.1k behavior initialization samples, Qwen2.5-math-7B achieves an accuracy improvement from 51.0% to 81.6%, outperforming models trained on an equivalent amount of long-CoT distilled data. We also discuss the effect of different RL strategies on enhancing LLMs’ deep reasoning. Extensive experiments and analysis based on three base models across both in-domain and out-of-domain benchmarks validate the effectiveness of S2R.
pdf
bib
abs
Advancing Collaborative Debates with Role Differentiation through Multi-Agent Reinforcement Learning
Haoran Li
|
Ziyi Su
|
Yun Xue
|
Zhiliang Tian
|
Yiping Song
|
Minlie Huang
Multi-agent collaborative tasks exhibit exceptional capabilities in natural language applications and generation. By prompting agents to assign clear roles, it is possible to facilitate cooperation and achieve complementary capabilities among LLMs. A common strategy involves adopting a relatively general role assignment mechanism, such as introducing a “judge” or a “summarizer”. However, these approaches lack task-specific role customization based on task characteristics. Another strategy involves decomposing the task based on domain knowledge and task characteristics, followed by assigning appropriate roles according to LLMs’ respective strengths, such as programmers and testers. However, in some given tasks, obtaining domain knowledge related to task characteristics and getting the strengths of different LLMs is hard. To solve these problems, we propose a Multi-LLM Cooperation (MLC) framework with automatic role assignment capabilities. The core idea of the MLC is to initialize role assignments randomly and then allow the role embeddings to be learned jointly with the downstream task. To capture the state transitions of multiple LLMs during turn-based speaking, the role embedding is sequence-aware. At the same time, to avoid role convergence, the role differentiation module in MLC encourages behavioral differentiation between LLMs while ensuring the LLM team consistency, guiding different LLMs to develop complementary strengths from the optimization level. Our experiments on seven datasets demonstrate that MLC significantly enhances collaboration and expertise, which collaboratively addresses multi-agent tasks.
pdf
bib
abs
Retrieval-Augmented Fine-Tuning With Preference Optimization For Visual Program Generation
Deokhyung Kang
|
Jeonghun Cho
|
Yejin Jeon
|
Sunbin Jang
|
Minsub Lee
|
Jawoon Cho
|
Gary Lee
Visual programming languages (VPLs) allow users to create programs through graphical interfaces, which results in easier accessibility and their widespread usage in various domains. To further enhance this accessibility, recent research has focused on generating VPL code from user instructions using large language models (LLMs). Specifically, by employing prompting-based methods, these studies have shown promising results. Nevertheless, such approaches can be less effective for industrial VPLs such as Ladder Diagram (LD). LD is a pivotal language used in industrial automation processes and involves extensive domain-specific configurations, which are difficult to capture in a single prompt. In this work, we demonstrate that training-based methods outperform prompting-based methods for LD generation accuracy, even with smaller backbone models. Building on these findings, we propose a two-stage training strategy to further enhance VPL generation. First, we employ retrieval-augmented fine-tuning to leverage the repetitive use of subroutines commonly seen in industrial VPLs. Second, we apply direct preference optimization (DPO) to further guide the model toward accurate outputs, using systematically generated preference pairs through graph editing operations. Extensive experiments on real-world LD data demonstrate that our approach improves program-level accuracy by over 10% compared to supervised fine-tuning, which highlights its potential to advance industrial automation.
pdf
bib
abs
STRICTA: Structured Reasoning in Critical Text Assessment for Peer Review and Beyond
Nils Dycke
|
Matej Zečević
|
Ilia Kuznetsov
|
Beatrix Suess
|
Kristian Kersting
|
Iryna Gurevych
Critical text assessment is at the core of many expert activities, such as fact-checking, peer review, and essay grading. Yet, existing work treats critical text assessment as a black box problem, limiting interpretability and human-AI collaboration. To close this gap, we introduce Structured Reasoning in Critical Text Assessment (STRICTA), a novel specification framework to model text assessment as an explicit, step-wise reasoning process. STRICTA breaks down the assessment into a graph of interconnected reasoning steps drawing on causality theory (Pearl, 1995). This graph is populated based on expert interaction data and used to study the assessment process and facilitate human-AI collaboration. We formally define STRICTA and apply it in a study on biomedical paper assessment, resulting in a dataset of over 4000 reasoning steps from roughly 40 biomedical experts on more than 20 papers. We use this dataset to empirically study expert reasoning in critical text assessment, and investigate if LLMs are able to imitate and support experts within these workflows. The resulting tools and datasets pave the way for studying collaborative expert-AI reasoning in text assessment, in peer review and beyond.
pdf
bib
abs
XDAC: XAI-Driven Detection and Attribution of LLM-Generated News Comments in Korean
Wooyoung Go
|
Hyoungshick Kim
|
Alice Oh
|
Yongdae Kim
Large language models (LLMs) generate human-like text, raising concerns about their misuse in creating deceptive content. Detecting LLM-generated comments (LGC) in online news is essential for preserving online discourse integrity and preventing opinion manipulation. However, effective detection faces two key challenges: the brevity and informality of news comments limit traditional methods, and the absence of a publicly available LGC dataset hinders model training, especially for languages other than English. To address these challenges, we propose a twofold approach. First, we develop an LGC generation framework to construct a high-quality dataset with diverse and complex examples. Second, we introduce XDAC (XAI-Driven Detection and Attribution of LLM-Generated Comments), a framework utilizing explainable AI, designed for the detection and attribution of short-form LGC in Korean news articles. XDAC leverages XAI to uncover distinguishing linguistic patterns at both token and character levels. We present the first large-scale benchmark dataset, comprising 1.3M human-written comments from Korean news platforms and 1M LLM-generated comments from 14 distinct models. XDAC outperforms existing methods, achieving a 98.5% F1 score in LGC detection with a relative improvement of 68.1%, and an 84.3% F1 score in attribution. To validate real-world applicability, we analyze 5.24M news comments from Naver, South Korea’s leading online news platform, identifying 27,029 potential LLM-generated comments.
pdf
bib
abs
CENTAUR: Bridging the Impossible Trinity of Privacy, Efficiency, and Performance in Privacy-Preserving Transformer Inference
Jinglong Luo
|
Guanzhong Chen
|
Yehong Zhang
|
Shiyu Liu
|
Hui Wang
|
Yue Yu
|
Xun Zhou
|
Yuan Qi
|
Zenglin Xu
With the growing deployment of pre-trained models like Transformers on cloud platforms, privacy concerns about model parameters and inference data are intensifying. Existing Privacy-Preserving Transformer Inference (PPTI) frameworks face the “impossible trinity” of balancing privacy, efficiency, and performance: Secure Multi-Party Computation (SMPC)-based approaches ensure strong privacy but suffer from high computational overhead and performance losses; conversely, permutation-based methods achieve near-plaintext efficiency and accuracy but compromise privacy by exposing sensitive model parameters and intermediate results. Bridging this gap with a single approach presents substantial challenges, motivating the introduction of CENTAUR, a groundbreaking PPTI framework that seamlessly integrates random permutations and SMPC to address the “impossible trinity”. By designing efficient PPTI algorithms tailored to the structural properties of Transformer models, CENTAUR achieves an unprecedented balance among privacy, efficiency, and performance. Our experiments demonstrate CENTAUR’s ability to resist diverse data reconstruction attacks, achieve plaintext-level inference accuracy, and boost inference speed by 5.0 to 30.4 times, unlocking new possibilities for secure and efficient AI deployment.
pdf
bib
abs
Silencing Empowerment, Allowing Bigotry: Auditing the Moderation of Hate Speech on Twitch
Prarabdh Shukla
|
Wei Yin Chong
|
Yash Patel
|
Brennan Schaffner
|
Danish Pruthi
|
Arjun Bhagoji
To meet the demands of content moderation, online platforms have resorted to automated systems. Newer forms of real-time engagement (e.g., users commenting on live streams) on platforms like Twitch exert additional pressures on the latency expected of such moderation systems. Despite their prevalence, relatively little is known about the effectiveness of these systems. In this paper, we conduct an audit of Twitch’s automated moderation tool (AutoMod) to investigate its effectiveness in flagging hateful content. For our audit, we create streaming accounts to act as siloed test beds, and interface with the live chat using Twitch’s APIs to send over 107,000 comments collated from 4 datasets. We measure AutoMod’s accuracy in flagging blatantly hateful content containing misogyny, racism, ableism and homophobia. Our experiments reveal that a large fraction of hateful messages, up to 94% on some datasets, bypass moderation. Contextual addition of slurs to these messages results in 100% removal, revealing AutoMod’s reliance on slurs as a hate signal. We also find that contrary to Twitch’s community guidelines, AutoMod blocks up to 89.5% of benign examples that use sensitive words in pedagogical or empowering contexts. Overall, our audit points to large gaps in AutoMod’s capabilities and underscores the importance for such systems to understand context effectively.
pdf
bib
abs
EdiText: Controllable Coarse-to-Fine Text Editing with Diffusion Language Models
Che Hyun Lee
|
Heeseung Kim
|
Jiheum Yeom
|
Sungroh Yoon
We propose EdiText, a controllable text editing method that modifies the reference text to desired attributes at various scales. We integrate an SDEdit-based editing technique that allows for broad adjustments in the degree of text editing. Additionally, we introduce a novel fine-level editing method based on self-conditioning, which allows subtle control of reference text. While being capable of editing on its own, this fine-grained method, integrated with the SDEdit approach, enables EdiText to make precise adjustments within the desired range. EdiText demonstrates its controllability to robustly adjust reference text at a broad range of levels across various tasks, including toxicity control and sentiment control.
pdf
bib
abs
TUMLU: A Unified and Native Language Understanding Benchmark for Turkic Languages
Jafar Isbarov
|
Arofat Akhundjanova
|
Mammad Hajili
|
Kavsar Huseynova
|
Dmitry Gaynullin
|
Anar Rzayev
|
Osman Tursun
|
Aizirek Turdubaeva
|
Ilshat Saetov
|
Rinat Kharisov
|
Saule Belginova
|
Ariana Kenbayeva
|
Amina Alisheva
|
Abdullatif Köksal
|
Samir Rustamov
|
Duygu Ataman
Being able to thoroughly assess massive multi-task language understanding (MMLU) capabilities is essential for advancing the applicability of multilingual language models. However, preparing such benchmarks in high-quality native language is often costly and therefore limits the representativeness of evaluation datasets. While recent efforts focused on building more inclusive MMLU benchmarks, these are conventionally built using machine translation from high-resource languages, which may introduce errors and fail to account for the linguistic and cultural intricacies of the target languages. In this paper, we address the lack of a native-language MMLU benchmark, especially in the under-represented Turkic language family, which has distinct morphosyntactic and cultural characteristics. We propose two benchmarks for Turkic language MMLU. TUMLU is a comprehensive, multilingual, and natively developed language understanding benchmark specifically designed for Turkic languages. It consists of middle- and high-school level questions spanning 11 academic subjects in Azerbaijani, Crimean Tatar, Karakalpak, Kazakh, Kyrgyz, Tatar, Turkish, Uyghur, and Uzbek. We also present TUMLU-mini, a more concise, balanced, and manually verified subset of the dataset. Using this dataset, we systematically evaluate a diverse range of open and proprietary multilingual large language models (LLMs), including Claude, Gemini, GPT, and LLaMA, offering an in-depth analysis of their performance across different languages, subjects, and alphabets. To promote further research and development in multilingual language understanding, we release TUMLU-mini and all corresponding evaluation scripts.
pdf
bib
abs
Look Both Ways and No Sink: Converting LLMs into Text Encoders without Training
Ziyong Lin
|
Haoyi Wu
|
Shu Wang
|
Kewei Tu
|
Zilong Zheng
|
Zixia Jia
Recent advancements have demonstrated the advantage of converting pretrained large language models into powerful text encoders by enabling bidirectional attention in transformer layers. However, existing methods often require extensive training on large-scale datasets, posing challenges in low-resource, domain-specific scenarios. In this work, we show that a pretrained large language model can be converted into a strong text encoder without additional training. We first conduct a comprehensive empirical study to investigate different conversion strategies and identify the impact of the attention sink phenomenon on the performance of converted encoder models. Based on our findings, we propose a novel approach that enables bidirectional attention and suppresses the attention sink phenomenon, resulting in superior performance. Extensive experiments on multiple domains demonstrate the effectiveness of our approach. Our work provides new insights into the training-free conversion of text encoders in low-resource scenarios and contributes to the advancement of domain-specific text representation generation. Our code is available at https://github.com/bigai-nlco/Look-Both-Ways-and-No-Sink.
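As a rough illustration of what "enabling bidirectional attention" means, the sketch below contrasts causally masked attention weights with unmasked ones on toy scores. It is a minimal sketch, not the authors' implementation: the function name `attention_weights` and the toy scores are assumptions made here, and the attention-sink suppression step is omitted because the abstract does not describe how it is done.

```python
import numpy as np

def attention_weights(scores, causal):
    """Row-wise softmax over raw attention scores, optionally causally masked.

    Hypothetical helper for illustration only; not taken from the paper's code.
    """
    scores = np.array(scores, dtype=float)
    if causal:
        # Block attention to future positions (strict upper triangle).
        future = np.triu(np.ones_like(scores, dtype=bool), k=1)
        scores[future] = -np.inf
    # Subtract the row maximum for numerical stability before exponentiating.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    return weights / weights.sum(axis=-1, keepdims=True)

# Toy scores: every query-key pair is equally compatible.
scores = np.zeros((3, 3))
causal = attention_weights(scores, causal=True)
bidirectional = attention_weights(scores, causal=False)
```

With the causal mask, the first token can attend only to itself, so its representation carries no information about later tokens; removing the mask lets every position attend to the whole sequence, which is the property text encoders typically rely on.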
pdf
bib
abs
A Statistical and Multi-Perspective Revisiting of the Membership Inference Attack in Large Language Models
Bowen Chen
|
Namgi Han
|
Yusuke Miyao
The lack of data transparency in Large Language Models (LLMs) has highlighted the importance of Membership Inference Attack (MIA), which differentiates trained (member) and untrained (non-member) data. Though MIA showed success in previous studies, recent research has reported near-random performance in different settings, highlighting a significant performance inconsistency. We assume that a single setting doesn’t represent the distribution of the vast corpora, causing members and non-members with different distributions to be sampled and leading to inconsistency. In this study, instead of a single setting, we statistically revisit MIA methods across various settings with thousands of experiments for each MIA method, along with studies of text features, embeddings, threshold decisions, and decoding dynamics of members and non-members. We found that (1) MIA performance improves with model size and varies with domains, while most methods do not statistically outperform baselines, (2) though MIA performance is generally low, a notable number of differentiable member and non-member outliers exist and vary across MIA methods, (3) deciding a threshold to separate members and non-members is an overlooked challenge, (4) text dissimilarity and long text benefit MIA performance, (5) whether a text is differentiable is reflected in the LLM embedding, and (6) members and non-members show different decoding dynamics.
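To make the threshold question in finding (3) concrete, here is a minimal sketch of a loss-threshold attack, a common MIA baseline that is not necessarily one of the methods studied in the paper. The function names, the toy losses, and the accuracy-based selection rule are all assumptions made for illustration.

```python
def predict_members(losses, threshold):
    """Flag a text as a training member when its loss is below the threshold."""
    return [loss < threshold for loss in losses]

def best_threshold(losses, is_member):
    """Pick the candidate threshold with the highest accuracy on labeled data.

    Hypothetical selection rule; it requires ground-truth membership labels,
    which are often unavailable in practice.
    """
    # Try each observed loss, plus one value above the maximum ("all members").
    candidates = sorted(set(losses)) + [max(losses) + 1.0]

    def accuracy(t):
        preds = predict_members(losses, t)
        return sum(p == m for p, m in zip(preds, is_member)) / len(losses)

    return max(candidates, key=accuracy)

# Toy per-text losses and membership labels, not taken from the paper.
losses = [1.2, 1.5, 2.8, 3.1]
is_member = [True, True, False, False]
threshold = best_threshold(losses, is_member)
predictions = predict_members(losses, threshold)
```

The sketch tunes the threshold on labeled examples; a threshold chosen this way on one domain or model may not transfer to another, which is one way the difficulty described in the abstract can arise.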
pdf
bib
abs
Around the World in 24 Hours: Probing LLM Knowledge of Time and Place
Carolin Holtermann
|
Paul Röttger
|
Anne Lauscher
Reasoning over time and space is essential for understanding our world. However, the abilities of language models in this area are largely unexplored as previous work has tested their abilities for logical reasoning in terms of time and space in isolation or only in simple or artificial environments. In this paper, we present the first evaluation of the ability of language models to jointly reason over time and space. To enable our analysis, we create GeoTemp, a dataset of 320k prompts covering 289 cities in 217 countries and 37 time zones. Using GeoTemp, we evaluate eight open chat models of three different model families for different combinations of temporal and geographic knowledge. We find that most models perform well on reasoning tasks involving only temporal knowledge and that overall performance improves with scale. However, performance remains constrained in tasks that require connecting temporal and geographical information. We do not find clear correlations of performance with specific geographic regions. Instead, we find a significant performance increase for location names with low model perplexity, suggesting their repeated occurrence during model training. We further demonstrate that their performance is heavily influenced by prompt formulation: a direct injection of geographical knowledge leads to performance gains, whereas, surprisingly, techniques like chain-of-thought prompting decrease performance on simpler tasks.
pdf
bib
abs
Mining the uncertainty patterns of humans and models in the annotation of moral foundations and human values
Neele Falk
|
Gabriella Lapesa
The NLP community has converged on considering disagreement in annotation (or human label variation, HLV) as a constitutive feature of subjective tasks. This paper makes a further step by investigating the relationship between HLV and model uncertainty, and the impact of linguistic features of the items on both. We focus on the identification of moral foundations (e.g., care, fairness, loyalty) and human values (e.g., be polite, be honest) in text. We select three standard datasets and proceed in two steps. First, we focus on HLV and analyze the linguistic features (complexity, polarity, pragmatic phenomena, lexical choices) that correlate with HLV. Next, we proceed to uncertainty and its relationship to HLV. We experiment with RoBERTa and Flan-T5 in a number of training setups and evaluation metrics that test the calibration of uncertainty to HLV and its relationship to performance beyond majority vote; next, we analyze the impact of linguistic features on uncertainty. We find that RoBERTa with soft loss is better calibrated to HLV, and we find alignment between calibrated models and humans in the features (textual complexity and polarity) triggering variation.
pdf
bib
abs
“What do you call a dog that is incontrovertibly true? Dogma”: Testing LLM Generalization through Humor
Alessio Cocchieri
|
Luca Ragazzi
|
Paolo Italiani
|
Giuseppe Tagliavini
|
Gianluca Moro
Humor, requiring creativity and contextual understanding, is a hallmark of human intelligence, showcasing adaptability across linguistic scenarios. While recent advances in large language models (LLMs) demonstrate strong reasoning on various benchmarks, it remains unclear whether they truly adapt to new tasks like humans (i.e., generalize) or merely replicate memorized content. To explore this, we introduce Phunny, a new humor-based question-answering benchmark designed to assess LLMs’ reasoning through carefully crafted puns. Our dataset is manually curated to ensure novelty and minimize data contamination, providing a robust evaluation of LLMs’ linguistic comprehension. Experiments on pun comprehension, resolution, and generation reveal that most LLMs struggle with generalization, even on simple tasks, consistently underperforming the human baseline. Additionally, our detailed error analysis provides valuable insights to guide future research.
pdf
bib
abs
Towards Harmonized Uncertainty Estimation for Large Language Models
Rui Li
|
Jing Long
|
Muge Qi
|
Heming Xia
|
Lei Sha
|
Peiyi Wang
|
Zhifang Sui
To facilitate robust and trustworthy deployment of large language models (LLMs), it is essential to quantify the reliability of their generations through uncertainty estimation. While recent efforts have made significant advancements by leveraging the internal logic and linguistic features of LLMs to estimate uncertainty scores, our empirical analysis highlights the pitfalls of these methods to strike a harmonized estimation between indication, balance, and calibration, which hinders their broader capability for accurate uncertainty estimation. To address this challenge, we propose CUE (Corrector for Uncertainty Estimation): a straightforward yet effective method that employs a lightweight model trained on data aligned with the target LLM’s performance to adjust uncertainty scores. Comprehensive experiments across diverse models and tasks demonstrate its effectiveness, which achieves consistent improvements of up to 60% over existing methods.
pdf
bib
abs
VITAL: A New Dataset for Benchmarking Pluralistic Alignment in Healthcare
Anudeex Shetty
|
Amin Beheshti
|
Mark Dras
|
Usman Naseem
Alignment techniques have become central to ensuring that Large Language Models (LLMs) generate outputs consistent with human values. However, existing alignment paradigms often model an averaged or monolithic preference, failing to account for the diversity of perspectives across cultures, demographics, and communities. This limitation is particularly critical in health-related scenarios, where plurality is essential due to the influence of culture, religion, personal values, and conflicting opinions. Despite progress in pluralistic alignment, no prior work has focused on health, likely due to the unavailability of publicly available datasets. To address this gap, we introduce VITAL, a new benchmark dataset comprising 13.1K value-laden situations and 5.4K multiple-choice questions focused on health, designed to assess and benchmark pluralistic alignment methodologies. Through extensive evaluation of eight LLMs of varying sizes, we demonstrate that existing pluralistic alignment techniques fall short in effectively accommodating diverse healthcare beliefs, underscoring the need for tailored AI alignment in specific domains. This work highlights the limitations of current approaches and lays the groundwork for developing health-specific alignment solutions.
pdf
bib
abs
Are We in the AI-Generated Text World Already? Quantifying and Monitoring AIGT on Social Media
Zhen Sun
|
Zongmin Zhang
|
Xinyue Shen
|
Ziyi Zhang
|
Yule Liu
|
Michael Backes
|
Yang Zhang
|
Xinlei He
Social media platforms are experiencing a growing presence of AI-Generated Texts (AIGTs). However, the misuse of AIGTs could have profound implications for public opinion, such as spreading misinformation and manipulating narratives. Despite its importance, it remains unclear how prevalent AIGTs are on social media. To address this gap, this paper aims to quantify and monitor the AIGTs on online social media platforms. We first collect a dataset (SM-D) with around 2.4M posts from 3 major social media platforms: Medium, Quora, and Reddit. Then, we construct a diverse dataset (AIGTBench) to train and evaluate AIGT detectors. AIGTBench combines popular open-source datasets and our AIGT datasets generated from social media texts by 12 LLMs, serving as a benchmark for evaluating mainstream detectors. With this setup, we identify the best-performing detector (OSM-Det). We then apply OSM-Det to SM-D to track AIGTs across social media platforms from January 2022 to October 2024, using the AI Attribution Rate (AAR) as the metric. Specifically, Medium and Quora exhibit marked increases in AAR, rising from 1.77% to 37.03% and 2.06% to 38.95%, respectively. In contrast, Reddit shows slower growth, with AAR increasing from 1.31% to 2.45% over the same period. Our further analysis indicates that AIGTs on social media differ from human-written texts across several dimensions, including linguistic patterns, topic distributions, engagement levels, and the follower distribution of authors. We envision our analysis and findings on AIGTs in social media can shed light on future research in this domain.
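The tracking step in the abstract above can be illustrated with a small sketch. It assumes that the AI Attribution Rate for a period is simply the share of posts in that period that a detector labels as AI-generated; the abstract does not give the exact definition, and the function name and the `(period, is_ai)` input format are hypothetical.

```python
from collections import defaultdict

def ai_attribution_rate(posts):
    """Return {period: share of posts flagged as AI-generated}.

    `posts` is an iterable of (period, is_ai) pairs, where `is_ai` is the
    boolean verdict of some detector. This input format is an assumption
    made for illustration, not the paper's data layout.
    """
    flagged = defaultdict(int)
    total = defaultdict(int)
    for period, is_ai in posts:
        total[period] += 1
        if is_ai:
            flagged[period] += 1
    return {period: flagged[period] / total[period] for period in total}

# Toy data: four posts in one month, two in another.
sample = [
    ("2022-01", False), ("2022-01", False), ("2022-01", True), ("2022-01", False),
    ("2024-10", True), ("2024-10", False),
]
rates = ai_attribution_rate(sample)
```

Computing the rate per period, rather than over the whole corpus, is what allows a trend such as the reported rise on Medium and Quora to be read off as a time series.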
pdf
bib
abs
From English to Second Language Mastery: Enhancing LLMs with Cross-Lingual Continued Instruction Tuning
Linjuan Wu
|
Hao-Ran Wei
|
Baosong Yang
|
Weiming Lu
Supervised Fine-Tuning (SFT) with translated instruction data effectively adapts Large Language Models (LLMs) from English to non-English languages. We introduce Cross-Lingual Continued Instruction Tuning (X-CIT), which fully leverages translation-based parallel instruction data to enhance cross-lingual adaptability. X-CIT emulates the human process of second language acquisition and is guided by Chomsky’s Principles and Parameters Theory. It first fine-tunes the LLM on English instruction data to establish foundational capabilities (i.e. Principles), then continues with target language translation and customized chat-instruction data to adjust “parameters” specific to the target language. This chat-instruction data captures alignment information in translated parallel data, guiding the model to initially think and respond in its native language before transitioning to the target language. To further mimic human learning progression, we incorporate Self-Paced Learning (SPL) during continued training, allowing the model to advance from simple to complex tasks. Implemented on Llama-2-7B across five languages, X-CIT was evaluated against three objective benchmarks and an LLM-as-a-judge benchmark, improving the strongest baseline by an average of 1.97% and 8.2% in these two benchmarks, respectively.
pdf
bib
abs
WET: Overcoming Paraphrasing Vulnerabilities in Embeddings-as-a-Service with Linear Transformation Watermarks
Anudeex Shetty
|
Qiongkai Xu
|
Jey Han Lau
Embeddings-as-a-Service (EaaS) is a service offered by large language model (LLM) developers to supply embeddings generated by LLMs. Previous research suggests that EaaS is prone to imitation attacks—attacks that clone the underlying EaaS model by training another model on the queried embeddings. As a result, EaaS watermarks are introduced to protect the intellectual property of EaaS providers. In this paper, we first show that existing EaaS watermarks can be removed by paraphrasing when attackers clone the model. Subsequently, we propose a novel watermarking technique that involves linearly transforming the embeddings, and show that it is empirically and theoretically robust against paraphrasing.
pdf
bib
abs
HoPE: A Novel Positional Encoding Without Long-Term Decay for Enhanced Context Awareness and Extrapolation
Yuhan Chen
|
Ang Lv
|
Jian Luan
|
Bin Wang
|
Wei Liu
Many positional encodings (PEs) are designed to exhibit long-term decay, based on an entrenched and long-standing inductive opinion: tokens farther away from the current position carry less relevant information. We argue that long-term decay is outdated in the era of LLMs, as LLMs are now applied to tasks demanding precise retrieval of in-context information from arbitrary positions. First, we present empirical analyses on various PEs, demonstrating that models inherently learn attention with only a local-decay pattern while forming a U-shape pattern globally, contradicting the principle of long-term decay. Furthermore, we conduct a detailed analysis of rotary position encoding (RoPE, a prevalent relative positional encoding in LLMs) and find that the U-shape attention is caused by some learned components, which are also the key factor limiting RoPE’s expressiveness and extrapolation. Inspired by these insights, we propose High-frequency rotary Position Encoding (HoPE). HoPE replaces the specific components in RoPE with position-independent ones, retaining only high-frequency signals, which also breaks the principle of long-term decay in theory. HoPE achieves two major advantages: (1) Without constraints imposed by long-term decay, contradictory factors that limit attention optimization are removed. Thus, the model’s context awareness is enhanced. (2) HoPE exhibits greater robustness to the out-of-distribution behavior in attention patterns during extrapolation. The effectiveness of HoPE is validated through extensive experiments and with a large language model of up to 3 billion parameters.
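For readers unfamiliar with the baseline, the sketch below shows standard RoPE only, not HoPE: each consecutive pair of dimensions is rotated by an angle proportional to the token position, so the query-key dot product depends only on the relative offset. The vectors and positions are made-up illustration values, and the modification that HoPE applies to specific components is not reproduced here.

```python
import math

def rope(vec, pos, base=10000.0):
    """Apply standard rotary position encoding to an even-length vector."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = base ** (-i / d)          # per-pair frequency
        angle = pos * theta               # rotation grows with position
        c, s = math.cos(angle), math.sin(angle)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

q = [0.3, -1.2, 0.7, 0.5]   # illustrative query vector
k = [1.0, 0.4, -0.6, 0.9]   # illustrative key vector

score_a = dot(rope(q, 5), rope(k, 2))       # relative offset 3
score_b = dot(rope(q, 105), rope(k, 102))   # same offset, shifted by 100
```

The two scores agree up to floating-point error, which is the relative-position property that makes RoPE a relative encoding. The low-frequency pairs (small `theta`) are the ones that change slowly with distance.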
pdf
bib
abs
One QuantLLM for ALL: Fine-tuning Quantized LLMs Once for Efficient Deployments
Ke Yi
|
Yuhui Xu
|
Heng Chang
|
Yuan Meng
|
Tong Zhang
|
Jia Li
Large Language Models (LLMs) have advanced rapidly but face significant memory demands. While quantization has shown promise for LLMs, current methods typically require lengthy training to alleviate the performance degradation from quantization loss. However, deploying LLMs across diverse scenarios with different resource constraints, e.g., servers and personal computers, requires repeated training per application, which amplifies the lengthy training problem. Given that, it is advantageous to train a once-for-all (OFA) supernet capable of yielding diverse optimal subnets for downstream applications through one-shot training. Nonetheless, the scale of current language models impedes efficiency and amplifies interference from weight sharing between subnets. We make an initial attempt to extend the once-for-all framework to large language models. Specifically, we decouple shared weights to eliminate the interference and incorporate Low-Rank adapters for training efficiency. Furthermore, we observe an imbalanced allocation of training resources under traditional uniform sampling. A non-parametric scheduler is introduced to adjust the sampling rate for each quantization configuration, achieving a more balanced allocation among subnets with varying demands. We validate the approach on the LLaMA2 family and Mistral on downstream evaluation, demonstrating high performance while significantly reducing deployment time across multiple scenarios.
pdf
bib
abs
Beyond Logits: Aligning Feature Dynamics for Effective Knowledge Distillation
Guoqiang Gong
|
Jiaxing Wang
|
Jin Xu
|
Deping Xiang
|
Zicheng Zhang
|
Leqi Shen
|
Yifeng Zhang
|
Junhua Shu
|
Zhaolong Xing
|
Zhen Chen
|
Pengzhang Liu
|
Ke Zhang
Knowledge distillation (KD) compresses large language models (LLMs), known as teacher models, into lightweight versions called student models, enabling efficient inference and downstream applications. However, prevailing approaches accomplish this by predominantly focusing on matching the final output distributions of student and teacher models. Drawing on the perspective that transformers can be viewed as discretizing ordinary differential equations (ODEs) on integer time steps (corresponding to layer indices), where intermediate features evolve across layers, we argue that effective KD requires aligning the entire feature dynamics between teacher and student models, which we call feature dynamics distillation (FDD). This alignment involves matching both the feature trajectory and its first-order derivative, rather than just the final states. Our approach extends the original KD objective with two additional loss terms: layer-wise feature KD, which matches the discretized feature trajectory, and layer feature delta KD, which matches first-order changes in features across adjacent layers. Extensive experiments on various tasks validate the effectiveness of our distillation method.
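The two extra loss terms described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it uses mean squared error as the distance, assumes student and teacher have the same number of layers and the same feature width, and omits the original logit-matching KD term. The function names and the weights `w_traj` and `w_delta` are hypothetical.

```python
def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def layer_deltas(feats):
    """First-order change of the features between adjacent layers."""
    return [[b - a for a, b in zip(f0, f1)] for f0, f1 in zip(feats, feats[1:])]

def feature_dynamics_loss(student_feats, teacher_feats, w_traj=1.0, w_delta=1.0):
    """Trajectory term plus delta term over per-layer feature vectors.

    Both arguments are lists with one feature vector per layer (assumed
    to be aligned layer by layer, which real models may not satisfy).
    """
    # Layer-wise feature term: match the feature trajectory itself.
    traj = sum(mse(s, t) for s, t in zip(student_feats, teacher_feats))
    traj /= len(student_feats)
    # Delta term: match how features change from one layer to the next.
    s_d, t_d = layer_deltas(student_feats), layer_deltas(teacher_feats)
    delta = sum(mse(s, t) for s, t in zip(s_d, t_d)) / len(s_d)
    return w_traj * traj + w_delta * delta

teacher = [[0.0, 0.0], [1.0, 2.0], [3.0, 3.0]]
student = [[1.0, 1.0], [2.0, 3.0], [4.0, 4.0]]   # teacher shifted by a constant
loss = feature_dynamics_loss(student, teacher)
```

In this toy case the student follows the teacher with a constant offset, so the delta term is zero and only the trajectory term contributes. That separation is the point of matching both the states and their first-order changes.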
pdf
bib
abs
Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention
Jingyang Yuan
|
Huazuo Gao
|
Damai Dai
|
Junyu Luo
|
Liang Zhao
|
Zhengyan Zhang
|
Zhenda Xie
|
Yuxing Wei
|
Lean Wang
|
Zhiping Xiao
|
Yuqing Wang
|
Chong Ruan
|
Ming Zhang
|
Wenfeng Liang
|
Wangding Zeng
Long-context modeling is crucial for next-generation language models, yet the high computational cost of standard attention mechanisms poses significant challenges. Sparse attention offers a promising direction for improving efficiency while maintaining model capabilities. We present NSA, a Natively trained Sparse Attention mechanism that integrates algorithmic innovations with hardware-aligned optimizations to achieve efficient long-context modeling. NSA employs a dynamic hierarchical sparse strategy, combining coarse-grained token compression with fine-grained token selection to preserve both global context awareness and local precision. Our approach advances sparse attention design with two key innovations: (1) We achieve substantial speedups through arithmetic intensity-balanced algorithm design, with implementation optimizations for modern hardware. (2) We enable end-to-end training, reducing pretraining computation without sacrificing model performance. As shown in Figure 1, the model pretrained with NSA matches or exceeds Full Attention models across general benchmarks, long-context tasks, and instruction-based reasoning. Meanwhile, NSA achieves substantial speedups over Full Attention on 64k-length sequences across decoding, forward propagation, and backward propagation, validating its efficiency throughout the model lifecycle.
pdf
bib
abs
DRAE: Dynamic Retrieval-Augmented Expert Networks for Lifelong Learning and Task Adaptation in Robotics
Yayu Long
|
Kewei Chen
|
Long Jin
|
Mingsheng Shang
We introduce Dynamic Retrieval-Augmented Expert Networks (DRAE), a groundbreaking architecture that addresses the challenges of lifelong learning, catastrophic forgetting, and task adaptation by combining the dynamic routing capabilities of Mixture-of-Experts (MoE); leveraging the knowledge-enhancement power of Retrieval-Augmented Generation (RAG); incorporating a novel hierarchical reinforcement learning (RL) framework; and coordinating through ReflexNet-SchemaPlanner-HyperOptima (RSHO). DRAE dynamically routes expert models via a sparse MoE gating mechanism, enabling efficient resource allocation while leveraging external knowledge through parametric retrieval (P-RAG) to augment the learning process. We propose a new RL framework with ReflexNet for low-level task execution, SchemaPlanner for symbolic reasoning, and HyperOptima for long-term context modeling, ensuring continuous adaptation and memory retention. Experimental results show that DRAE significantly outperforms baseline approaches in long-term task retention and knowledge reuse, achieving an average task success rate of 82.5% across a set of dynamic robotic manipulation tasks, compared to 74.2% for traditional MoE models. Furthermore, DRAE maintains an extremely low forgetting rate, outperforming state-of-the-art methods in catastrophic forgetting mitigation. These results demonstrate the effectiveness of our approach in enabling flexible, scalable, and efficient lifelong learning for robotics.
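The sparse gating idea mentioned above can be sketched with generic top-k Mixture-of-Experts routing. This is the textbook mechanism, not DRAE's own router: the experts are toy scalar functions, the router logits are made-up values, and retrieval, the RL hierarchy, and RSHO coordination are all left out.

```python
import math

def top_k_gate(logits, k=2):
    """Keep the k largest router logits and softmax over them only.

    Returns {expert_index: weight}; experts outside the top k get no
    weight and are never evaluated.
    """
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    m = max(logits[i] for i in top)                      # for numerical stability
    exps = {i: math.exp(logits[i] - m) for i in top}
    z = sum(exps.values())
    return {i: exps[i] / z for i in top}

def moe_forward(x, experts, router_logits, k=2):
    """Weighted sum of the outputs of the selected experts."""
    weights = top_k_gate(router_logits, k)
    return sum(w * experts[i](x) for i, w in weights.items())

# Hypothetical experts and router scores, for illustration only.
experts = [lambda x: x + 1, lambda x: 2 * x, lambda x: -x, lambda x: x * x]
router_logits = [0.1, 2.0, -1.0, 2.0]

gates = top_k_gate(router_logits, k=2)
y = moe_forward(3.0, experts, router_logits, k=2)
```

Because only `k` experts run per input, compute stays roughly constant as more experts are added, which is the resource-allocation benefit the abstract refers to.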
pdf
bib
abs
MT-RAIG: Novel Benchmark and Evaluation Framework for Retrieval-Augmented Insight Generation over Multiple Tables
Kwangwook Seo
|
Donguk Kwon
|
Dongha Lee
Recent advancements in table-based reasoning have expanded beyond factoid-level QA to address insight-level tasks, where systems should synthesize implicit knowledge in the table to provide explainable analyses. Although effective, existing studies remain confined to scenarios where a single gold table is given alongside the user query, failing to address cases where users seek comprehensive insights from multiple unknown tables. To bridge these gaps, we propose MT-RAIG Bench, designed to evaluate systems on Retrieval-Augmented Insight Generation over Multiple Tables. Additionally, to tackle the suboptimality of existing automatic evaluation methods in the table domain, we further introduce a fine-grained evaluation framework, MT-RAIG Eval, which achieves better alignment with human quality judgments on the generated insights. We conduct extensive experiments and reveal that even frontier LLMs still struggle with complex multi-table reasoning, establishing our MT-RAIG Bench as a challenging testbed for future research.
pdf
bib
abs
Enhancing Chain-of-Thought Reasoning with Critical Representation Fine-tuning
Chenxi Huang
|
Shaotian Yan
|
Liang Xie
|
Binbin Lin
|
Sinan Fan
|
Yue Xin
|
Deng Cai
|
Chen Shen
|
Jieping Ye
Representation Fine-tuning (ReFT), a recently proposed Parameter-Efficient Fine-Tuning (PEFT) method, has attracted widespread attention for significantly improving parameter efficiency by editing representation space alone. In this work, we investigate applying ReFT to complex reasoning tasks. However, directly using the native ReFT method, which modifies fixed representations at the beginning and end of each layer, yields suboptimal performance, as these fixed-position representations have uncertain impact on the outputs. We observe that, in complex reasoning tasks, there often exist certain critical representations. These representations either integrate significant information from preceding layers or regulate subsequent layer representations. Through layer-by-layer propagation, they exert a substantial influence on the final output. Naturally, fine-tuning these critical representations has the potential to greatly enhance reasoning performance. Building upon these insights, we propose Critical Representation Fine-Tuning (CRFT), a novel method that identifies and optimizes these critical representations through information flow analysis. CRFT operates within a supervised learning framework, dynamically optimizing critical representations in a low-rank linear subspace while freezing the base model. The effectiveness and efficiency of our method are validated across eight benchmarks for arithmetic and commonsense reasoning, using LLaMA and Mistral model families. Notably, our method improves the accuracy of LLaMA-2-7B and ReFT by 18.2 and 3.8, respectively, on GSM8K, while using only 0.016 of the model parameters, significantly less than other PEFT methods. Furthermore, our method also adapts effectively to few-shot settings, boosting one-shot accuracy by 16.4. Our work highlights the untapped potential of representation-level optimization for CoT reasoning, offering a lightweight yet powerful alternative to traditional PEFT methods.
pdf
bib
abs
Does the Emotional Understanding of LVLMs Vary Under High-Stress Environments and Across Different Demographic Attributes?
Jaewook Lee
|
Yeajin Jang
|
Oh-Woog Kwon
|
Harksoo Kim
According to psychological and neuroscientific research, a high-stress environment can restrict attentional resources and intensify negative affect, thereby impairing the ability to understand emotions. Furthermore, demographic attributes such as race, gender, and age group have been repeatedly reported to cause significant differences in emotional expression and recognition. This study is the first to systematically verify whether these psychological findings observed in humans also apply to the latest Large Vision Language Models (LVLMs). We constructed low-stress versus high-stress environments and generated an image dataset (a total of 540 images) that combines race, gender, and age group. Based on this, we applied the Pretend prompt technique to induce LVLMs to interpret others’ emotions from the standpoint of the assigned environment and persona. An analysis of the models’ emotional understanding ability, using EQ-Bench-based metrics, revealed that (1) under high-stress environments, the accuracy of emotion understanding significantly declined in most LVLMs, and (2) performance disparities were confirmed across race, gender, and age group. These findings suggest that the effects of high-stress and demographic attributes identified in human research may also be reflected in LVLMs.
pdf
bib
abs
S2WTM: Spherical Sliced-Wasserstein Autoencoder for Topic Modeling
Suman Adhya
|
Debarshi Kumar Sanyal
Modeling latent representations in a hyperspherical space has proven effective for capturing directional similarities in high-dimensional text data, benefiting topic modeling. Variational autoencoder-based neural topic models (VAE-NTMs) commonly adopt the von Mises-Fisher prior to encode hyperspherical structure. However, VAE-NTMs often suffer from posterior collapse, where the KL divergence term in the objective function highly diminishes, leading to ineffective latent representations. To mitigate this issue while modeling hyperspherical structure in the latent space, we propose the Spherical Sliced Wasserstein Autoencoder for Topic Modeling (S2WTM). S2WTM employs a prior distribution supported on the unit hypersphere and leverages the Spherical Sliced-Wasserstein distance to align the aggregated posterior distribution with the prior. Experimental results demonstrate that S2WTM outperforms state-of-the-art topic models, generating more coherent and diverse topics while improving performance on downstream tasks.
pdf
bib
abs
Learning to Look at the Other Side: A Semantic Probing Study of Word Embeddings in LLMs with Enabled Bidirectional Attention
Zhaoxin Feng
|
Jianfei Ma
|
Emmanuele Chersoni
|
Xiaojing Zhao
|
Xiaoyi Bao
Autoregressive Large Language Models (LLMs) demonstrate exceptional performance in language understanding and generation. However, their application in text embedding tasks has been relatively slow, along with the analysis of their semantic representation in probing tasks, due to the constraints of the unidirectional attention mechanism. This paper aims to explore whether such constraints can be overcome by enabling bidirectional attention in LLMs. We tested different variants of the Llama architecture through additional training steps, progressively enabling bidirectional attention and unsupervised/supervised contrastive learning. Our results show that bidirectional attention improves the LLMs’ ability to represent subsequent context but weakens their utilization of preceding context, while contrastive learning training can help to maintain both abilities.
pdf
bib
abs
Tracing and Dissecting How LLMs Recall Factual Knowledge for Real World Questions
Yiqun Wang
|
Chaoqun Wan
|
Sile Hu
|
Yonggang Zhang
|
Xiang Tian
|
Yaowu Chen
|
Xu Shen
|
Jieping Ye
Recent advancements in large language models (LLMs) have shown promising ability to perform commonsense reasoning, bringing machines closer to human-like understanding. However, deciphering the internal reasoning processes of LLMs remains challenging due to the complex interdependencies among generated tokens, especially in practical question-answering. In this study, we introduce a two-dimensional analysis framework, comprising token back-tracing and individual token decoding, to uncover how LLMs conduct factual knowledge recall. Through explanatory analysis of three typical reasoning datasets, we identify a consistent three-phase pattern: Subject Augmentation and Broadcasting, Object Retrieval and Reranking, and Conclusion Fusion and Generation. Our findings reveal that LLMs do not lack relevant knowledge but struggle to select the most accurate information based on context during the retrieval and reranking phase. Leveraging these findings, we apply representation engineering and selective fine-tuning to target specific modules responsible for retrieval and reranking errors. Experimental results show large improvements in response accuracy for both in-domain and out-of-domain settings, validating the soundness of our interpretation.
pdf
bib
abs
Employing Discourse Coherence Enhancement to Improve Cross-Document Event and Entity Coreference Resolution
Xinyu Chen
|
Peifeng Li
|
Qiaoming Zhu
Cross-Document Coreference Resolution (CDCR) aims to identify and group together mentions of a specific event or entity that occur across multiple documents. In contrast to within-document tasks, in which event and entity mentions are linked by rich and coherent contexts, cross-document mentions lack such critical contexts, which presents a significant challenge in establishing connections among them. To address this issue, we introduce a novel task, Cross-Document Discourse Coherence Enhancement (CD-DCE), to enhance the discourse coherence between two cross-document event or entity mentions. Specifically, CD-DCE first selects coherent texts and then adds them between two cross-document mentions to form a new coherent document. Subsequently, the coherent text is employed to represent the event or entity mentions and to resolve any coreferent mentions. Experimental results on three popular datasets demonstrate that our proposed method outperforms several state-of-the-art baselines.
pdf
bib
abs
Data Whisperer: Efficient Data Selection for Task-Specific LLM Fine-Tuning via Few-Shot In-Context Learning
Shaobo Wang
|
Xiangqi Jin
|
Ziming Wang
|
Jize Wang
|
Jiajun Zhang
|
Kaixin Li
|
Zichen Wen
|
Zhong Li
|
Conghui He
|
Xuming Hu
|
Linfeng Zhang
Fine-tuning large language models (LLMs) on task-specific data is essential for their effective deployment. As dataset sizes grow, efficiently selecting optimal subsets for training becomes crucial to balancing performance and computational costs. Traditional data selection methods often require fine-tuning a scoring model on the target dataset, which is time-consuming and resource-intensive, or rely on heuristics that fail to fully leverage the model’s predictive capabilities. To address these challenges, we propose Data Whisperer, an efficient, training-free, attention-based method that leverages few-shot in-context learning with the model to be fine-tuned. Comprehensive evaluations were conducted on both raw and synthetic datasets across diverse tasks and models. Notably, Data Whisperer achieves superior performance compared to the full GSM8K dataset on the Llama-3-8B-Instruct model, using just 10% of the data, and outperforms existing methods with a 3.1-point improvement and a 7.4× speedup.
pdf
bib
abs
Synthesizing Post-Training Data for LLMs through Multi-Agent Simulation
Shuo Tang
|
Xianghe Pang
|
Zexi Liu
|
Bohan Tang
|
Rui Ye
|
Tian Jin
|
Xiaowen Dong
|
Yanfeng Wang
|
Siheng Chen
Post-training is essential for enabling large language models (LLMs) to follow human instructions. However, its effectiveness depends on high-quality instruction data, which is challenging to obtain in the real world due to privacy concerns, data scarcity, and high annotation costs. To fill this gap, inspired by the recent success of using LLMs to simulate human society, we propose MATRIX, a multi-agent simulator that automatically generates diverse text-based scenarios, capturing a wide range of real-world human needs in a realistic and scalable manner. Leveraging these outputs, we introduce a novel scenario-driven instruction generator MATRIX-Gen for controllable and highly realistic data synthesis. Extensive experiments demonstrate that our framework effectively generates both general and domain-specific data. On AlpacaEval 2 and Arena-Hard benchmarks, Llama-3-8B-Base, post-trained on datasets synthesized by MATRIX-Gen with just 20K instruction-response pairs, outperforms Meta’s Llama-3-8B-Instruct model, which was trained on over 10M pairs.
pdf
bib
abs
SoftCoT: Soft Chain-of-Thought for Efficient Reasoning with LLMs
Yige Xu
|
Xu Guo
|
Zhiwei Zeng
|
Chunyan Miao
Chain-of-Thought (CoT) reasoning enables Large Language Models (LLMs) to solve complex reasoning tasks by generating intermediate reasoning steps. However, most existing approaches focus on hard token decoding, which constrains reasoning within the discrete vocabulary space and may not always be optimal. While recent efforts explore continuous-space reasoning, they often require full-model fine-tuning and suffer from catastrophic forgetting, limiting their applicability to state-of-the-art LLMs that already perform well in zero-shot settings with a proper instruction. To address this challenge, we propose a novel approach for continuous-space reasoning that does not require modifying the LLM. Specifically, we employ a lightweight fixed assistant model to speculatively generate instance-specific soft thought tokens as the initial chain of thoughts, which are then mapped into the LLM’s representation space via a trainable projection module. Experimental results on five reasoning benchmarks demonstrate that our method enhances LLM reasoning performance through supervised, parameter-efficient fine-tuning. Source code is available at https://github.com/xuyige/SoftCoT.
pdf
bib
abs
FCMR: Robust Evaluation of Financial Cross-Modal Multi-Hop Reasoning
Seunghee Kim
|
Changhyeon Kim
|
Taeuk Kim
Real-world decision-making often requires integrating and reasoning over information from multiple modalities. While recent multimodal large language models (MLLMs) have shown promise in such tasks, their ability to perform multi-hop reasoning across diverse sources remains insufficiently evaluated. Existing benchmarks, such as MMQA, face challenges due to (1) data contamination and (2) a lack of complex queries that necessitate operations across more than two modalities, hindering accurate performance assessment. To address this, we present Financial Cross-Modal Multi-Hop Reasoning (FCMR), a benchmark created to analyze the reasoning capabilities of MLLMs by urging them to combine information from textual reports, tables, and charts within the financial domain. FCMR is categorized into three difficulty levels—Easy, Medium, and Hard—facilitating a step-by-step evaluation. In particular, problems at the Hard level require precise cross-modal three-hop reasoning and are designed to prevent the disregard of any modality. Experiments on this new benchmark reveal that even state-of-the-art MLLMs struggle, with the best-performing model (Claude 3.5 Sonnet) achieving only 30.4% accuracy on the most challenging tier. We also conduct analysis to provide insights into the inner workings of the models, including the discovery of a critical bottleneck in the information retrieval phase.
pdf
bib
abs
Beyond Prompt Engineering: Robust Behavior Control in LLMs via Steering Target Atoms
Mengru Wang
|
Ziwen Xu
|
Shengyu Mao
|
Shumin Deng
|
Zhaopeng Tu
|
Huajun Chen
|
Ningyu Zhang
Precise control over language model generation is vital for ensuring both safety and reliability. Although prompt engineering and steering are commonly used to intervene in model behaviors, the vast number of parameters in models often results in highly intertwined internal representations. This interdependency can limit control precision and sometimes lead to unintended side effects. Recent research has explored the use of sparse autoencoders (SAE) to disentangle knowledge in high-dimensional spaces for steering. However, these applications have been limited to toy tasks owing to the nontrivial issue of locating “atomic knowledge components”. In this paper, we propose Steering Target Atoms (STA), a novel method that isolates and manipulates disentangled knowledge components to enhance safety. Comprehensive experiments demonstrate the effectiveness of our approach. Further analysis reveals that steering exhibits superior robustness and flexibility, particularly in adversarial scenarios. We also apply the steering strategy to the large reasoning model, confirming its effectiveness in precise reasoning control.
pdf
bib
abs
MobiLoRA: Accelerating LoRA-based LLM Inference on Mobile Devices via Context-aware KV Cache Optimization
Borui Li
|
Yitao Wang
|
Haoran Ma
|
Ligeng Chen
|
Jun Xiao
|
Shuai Wang
Deploying large language models (LLMs) with low-rank adaptation (LoRA) on mobile devices is promising due to their capability to complete diverse domain-specific tasks while ensuring privacy and accessibility. In this paper, we introduce MobiLoRA to accelerate LoRA-based LLM inference on mobile devices. MobiLoRA focuses on optimizing the key-value (KV) caches due to the limited computing and memory resources of mobile devices. The key insight of MobiLoRA lies in the utilization of two contexts for on-device LoRA serving: semantic-level contexts, such as prompts with shared prefixes, and system-level contexts, such as the application status (e.g., foreground or killed) of LLM requests. Specifically, for semantic-level contexts, MobiLoRA proposes similarity-aware delta encoding, which leverages token-wise similarity in KV caches across LoRA adapters for efficient storage and reuse. Furthermore, MobiLoRA advocates context-aware KV cache management to optimize cache retention and eviction considering the system-level contexts. We fully implement MobiLoRA and compare it with state-of-the-art LLM serving frameworks using real-world mobile device traces. Results show that MobiLoRA accelerates LoRA-based LLM inference by 57.6% on mobile devices.
pdf
bib
abs
Language Models Resist Alignment: Evidence From Data Compression
Jiaming Ji
|
Kaile Wang
|
Tianyi Alex Qiu
|
Boyuan Chen
|
Jiayi Zhou
|
Changye Li
|
Hantao Lou
|
Josef Dai
|
Yunhuai Liu
|
Yaodong Yang
Large language models (LLMs) may exhibit unintended or undesirable behaviors. Recent works have concentrated on aligning LLMs to mitigate harmful outputs. Despite these efforts, some anomalies indicate that even a well-conducted alignment process can be easily circumvented, whether intentionally or accidentally. Does alignment fine-tuning have robust effects on models, or are its impacts merely superficial? In this work, we make the first exploration of this phenomenon from both theoretical and empirical perspectives. Empirically, we demonstrate the elasticity of post-alignment models, i.e., the tendency to revert to the behavior distribution formed during the pre-training phase upon further fine-tuning. Leveraging compression theory, we formally deduce that fine-tuning disproportionately undermines alignment relative to pre-training, potentially by orders of magnitude. We validate the presence of elasticity through experiments on models of varying types and scales. Specifically, we find that model performance declines rapidly before reverting to the pre-training distribution, after which the rate of decline drops significantly. Furthermore, we reveal that elasticity positively correlates with increased model size and the expansion of pre-training data. Our findings underscore the need to address the inherent elasticity of LLMs to mitigate their resistance to alignment.
pdf
bib
abs
Beyond the Answer: Advancing Multi-Hop QA with Fine-Grained Graph Reasoning and Evaluation
Qichuan Liu
|
Chentao Zhang
|
Chenfeng Zheng
|
Guosheng Hu
|
Xiaodong Li
|
Zhihong Zhang
Recent advancements in large language models (LLMs) have significantly improved the performance of multi-hop question answering (MHQA) systems. Despite the success of MHQA systems, the evaluation of MHQA has not been deeply investigated. Existing evaluations mainly focus on comparing the final answers of the reasoning method with the given ground truths. We argue that the reasoning process should also be evaluated because a wrong reasoning process can also lead to correct final answers. Motivated by this, we propose a “Planner-Executor-Reasoner” (PER) architecture, which forms the core of the Plan-anchored Data Preprocessing (PER-DP) and the Plan-guided Multi-Hop QA (PER-QA). The former provides the ground truth of intermediate reasoning steps and final answers, and the latter produces them for a given reasoning method. Moreover, we design a fine-grained evaluation metric called Plan-aligned Stepwise Evaluation (PSE), which evaluates the intermediate reasoning steps from two aspects: planning and solving. Extensive experiments on ten types of questions demonstrate competitive reasoning performance and improved explainability of the MHQA system, and uncover issues such as “fortuitous reasoning continuance” and “latent reasoning suspension” in RAG-based MHQA systems. We also demonstrate the potential of our approach in data contamination scenarios.
pdf
bib
abs
Mamba Knockout for Unraveling Factual Information Flow
Nir Endy
|
Idan Daniel Grosbard
|
Yuval Ran-Milo
|
Yonatan Slutzky
|
Itay Tshuva
|
Raja Giryes
This paper investigates the flow of factual information in Mamba State-Space Model (SSM)-based language models. We rely on theoretical and empirical connections to Transformer-based architectures and their attention mechanisms. Exploiting this relationship, we adapt attentional interpretability techniques originally developed for Transformers, specifically the Attention Knockout methodology, to both Mamba-1 and Mamba-2. Using them, we trace how information is transmitted and localized across tokens and layers, revealing patterns of subject-token information emergence and layer-wise dynamics. Notably, some phenomena vary between Mamba models and Transformer-based models, while others appear universally across all models inspected, hinting that these may be inherent to LLMs in general. By further leveraging Mamba’s structured factorization, we disentangle how distinct “features” either enable token-to-token information exchange or enrich individual tokens, thus offering a unified lens to understand Mamba’s internal operations.
pdf
bib
abs
Small Changes, Big Impact: How Manipulating a Few Neurons Can Drastically Alter LLM Aggression
Jaewook Lee
|
Junseo Jang
|
Oh-Woog Kwon
|
Harksoo Kim
Recent remarkable advances in Large Language Models (LLMs) have led to innovations in various domains such as education, healthcare, and finance, while also raising serious concerns that they can be easily misused for malicious purposes. Most previous research has focused primarily on observing how jailbreak attack techniques bypass safety mechanisms like Reinforcement Learning from Human Feedback (RLHF). However, whether there are neurons within LLMs that directly govern aggression has not been sufficiently investigated. To fill this gap, this study identifies specific neurons (“aggression neurons”) closely related to the expression of aggression and systematically analyzes how manipulating them affects the model’s overall aggression. Specifically, using a large-scale synthetic text corpus (aggressive and non-aggressive), we measure the activation frequency of each neuron, then apply masking and activation techniques to quantitatively evaluate changes in aggression by layer and by manipulation ratio. Experimental results show that, in all models, manipulating only a small number of neurons can increase aggression by up to 33%, and the effect is even more extreme when aggression neurons are concentrated in certain layers. Moreover, even models of the same scale exhibit nonlinear changes in aggression patterns, suggesting that simple external safety measures alone may not be sufficient for complete defense.
pdf
bib
abs
Marco-o1 v2: Towards Widening The Distillation Bottleneck for Reasoning Models
Huifeng Yin
|
Yu Zhao
|
Minghao Wu
|
Xuanfan Ni
|
Bo Zeng
|
Huaiyu.wh Huaiyu.wh
|
Tianqi Shi
|
Liangying Shao
|
Chenyang Lyu
|
Longyue Wang
|
Weihua Luo
|
Kaifu Zhang
Large Reasoning Models (LRMs) such as OpenAI o1 and DeepSeek-R1 have shown remarkable reasoning capabilities by scaling test-time compute and generating long Chain-of-Thought (CoT). Distillation post-training on LRM-generated data is a straightforward yet effective method to enhance the reasoning abilities of smaller models, but faces a critical bottleneck: we found that distilled long CoT data poses learning difficulty for small models and leads to the inheritance of biases (i.e., formalistic long-time thinking) when using Supervised Fine-tuning (SFT) and Reinforcement Learning (RL) methods. To alleviate this bottleneck, we propose constructing data from scratch using Monte Carlo Tree Search (MCTS). We then exploit a set of CoT-aware approaches, including Thoughts Length Balance, Fine-grained DPO, and Joint Post-training Objective, to enhance SFT and RL on the MCTS data. We conduct evaluations on various benchmarks covering math (GSM8K, MATH, AIME), instruction-following (Multi-IF), and planning (Blocksworld). Results demonstrate that our CoT-aware approaches substantially improve the reasoning performance of distilled models compared to standard distilled models by reducing the hallucinations in long-time thinking.
pdf
bib
abs
Curiosity-Driven Reinforcement Learning from Human Feedback
Haoran Sun
|
Yekun Chai
|
Shuohuan Wang
|
Yu Sun
|
Hua Wu
|
Haifeng Wang
Reinforcement learning from human feedback (RLHF) has proven effective in aligning large language models (LLMs) with human preferences, but often at the cost of reduced output diversity. This trade-off between diversity and alignment quality remains a significant challenge. Drawing inspiration from curiosity-driven exploration in reinforcement learning, we introduce curiosity-driven RLHF (CD-RLHF), a framework that incorporates intrinsic rewards for novel states, alongside traditional sparse extrinsic rewards, to optimize both output diversity and alignment quality. We demonstrate the effectiveness of CD-RLHF through extensive experiments on a range of tasks, including text summarization and instruction following. Our approach achieves significant gains in diversity on multiple diversity-oriented metrics while maintaining alignment with human preferences comparable to standard RLHF. We will make our code publicly available.
pdf
bib
abs
T2A-Feedback: Improving Basic Capabilities of Text-to-Audio Generation via Fine-grained AI Feedback
Zehan Wang
|
Ke Lei
|
Chen Zhu
|
Jiawei Huang
|
Sashuai Zhou
|
Luping Liu
|
Xize Cheng
|
Shengpeng Ji
|
Zhenhui Ye
|
Tao Jin
|
Zhou Zhao
Text-to-audio (T2A) generation has achieved remarkable progress in generating a variety of audio outputs from language prompts. However, current state-of-the-art T2A models still struggle to satisfy human preferences for prompt-following and acoustic quality when generating complex multi-event audio. To improve the performance of the model in these high-level applications, we propose to enhance the basic capabilities of the model with AI feedback learning. First, we introduce fine-grained AI audio scoring pipelines to: 1) verify whether each event in the text prompt is present in the audio (Event Occurrence Score), 2) detect deviations in event sequences from the language description (Event Sequence Score), and 3) assess the overall acoustic and harmonic quality of the generated audio (Acoustic&Harmonic Quality). We evaluate these three automatic scoring pipelines and find that they correlate significantly better with human preferences than other evaluation metrics. This highlights their value as both feedback signals and evaluation metrics. Utilizing our robust scoring pipelines, we construct a large audio preference dataset, T2A-FeedBack, which contains 41k prompts and 249k audios, each accompanied by detailed scores. Moreover, we introduce T2A-EpicBench, a benchmark that focuses on long captions, multi-events, and story-telling scenarios, aiming to evaluate the advanced capabilities of T2A models. Finally, we demonstrate how T2A-FeedBack can enhance a current state-of-the-art audio model. With simple preference tuning, the audio generation model exhibits significant improvements in both simple (AudioCaps test set) and complex (T2A-EpicBench) scenarios.
pdf
bib
abs
CoE: A Clue of Emotion Framework for Emotion Recognition in Conversations
Zhiyu Shen
|
Yunhe Pang
|
Yanghui Rao
|
Jianxing Yu
Emotion Recognition in Conversations (ERC) is crucial for machines to understand dynamic human emotions. While Large Language Models (LLMs) show promise, their performance is often limited by challenges in interpreting complex conversational streams. We introduce a Clue of Emotion (CoE) framework, which progressively integrates key conversational clues to enhance the ERC task. Building on CoE, we implement a multi-stage auxiliary learning strategy that incorporates role-playing, speaker identification, and emotion reasoning tasks, each targeting different aspects of conversational emotion understanding and enhancing the model’s ability to interpret emotional contexts. Our experiments on EmoryNLP, MELD, and IEMOCAP demonstrate that CoE consistently outperforms state-of-the-art methods, achieving a 2.92% improvement on EmoryNLP. These results underscore the effectiveness of clues and multi-stage auxiliary learning for ERC, offering valuable insights for future research.
pdf
bib
abs
MPO: Multilingual Safety Alignment via Reward Gap Optimization
Weixiang Zhao
|
Yulin Hu
|
Yang Deng
|
Tongtong Wu
|
Wenxuan Zhang
|
Jiahe Guo
|
An Zhang
|
Yanyan Zhao
|
Bing Qin
|
Tat-Seng Chua
|
Ting Liu
Large language models (LLMs) have become increasingly central to AI applications worldwide, necessitating robust multilingual safety alignment to ensure secure deployment across diverse linguistic contexts. Existing preference learning methods for safety alignment, such as RLHF and DPO, are primarily monolingual and struggle with noisy multilingual data. To address these limitations, we introduce Multilingual reward gaP Optimization (MPO), a novel approach that leverages the well-aligned safety capabilities of the dominant language (e.g., English) to improve safety alignment across multiple languages. MPO directly minimizes the reward gap difference between the dominant language and target languages, effectively transferring safety capabilities while preserving the original strengths of the dominant language. Extensive experiments on three LLMs, LLaMA-3.1, Gemma-2 and Qwen2.5, validate MPO’s efficacy in multilingual safety alignment without degrading general multilingual utility.
pdf
bib
abs
QualiSpeech: A Speech Quality Assessment Dataset with Natural Language Reasoning and Descriptions
Siyin Wang
|
Wenyi Yu
|
Xianzhao Chen
|
Xiaohai Tian
|
Jun Zhang
|
Lu Lu
|
Yu Tsao
|
Junichi Yamagishi
|
Yuxuan Wang
|
Chao Zhang
This paper explores a novel perspective to speech quality assessment by leveraging natural language descriptions, offering richer, more nuanced insights than traditional numerical scoring methods. Natural language feedback provides instructive recommendations and detailed evaluations, yet existing datasets lack the comprehensive annotations needed for this approach. To bridge this gap, we introduce QualiSpeech, a comprehensive low-level speech quality assessment dataset encompassing 11 key aspects and detailed natural language comments that include reasoning and contextual insights. Additionally, we propose the QualiSpeech Benchmark to evaluate the low-level speech understanding capabilities of auditory large language models (LLMs). Experimental results demonstrate that finetuned auditory LLMs can reliably generate detailed descriptions of noise and distortion, effectively identifying their types and temporal characteristics. The results further highlight the potential for incorporating reasoning to enhance the accuracy and reliability of quality assessments. The dataset can be found at https://huggingface.co/datasets/tsinghua-ee/QualiSpeech.
pdf
bib
abs
On the Relation Between Fine-Tuning, Topological Properties, and Task Performance in Sense-Enhanced Embeddings
Deniz Ekin Yavas
|
Timothée Bernard
|
Benoit Crabbé
|
Laura Kallmeyer
Topological properties of embeddings, such as isotropy and uniformity, are closely linked to their expressiveness, and improving these properties enhances the embeddings’ ability to capture nuanced semantic distinctions. However, fine-tuning can reduce the expressiveness of the embeddings of language models. This study investigates the relation between fine-tuning, topology of the embedding space, and task performance in the context of sense knowledge enhancement, focusing on identifying the topological properties that contribute to the success of sense-enhanced embeddings. We experiment with two fine-tuning methods: *Supervised Contrastive Learning (SCL)* and *Supervised Predictive Learning (SPL)*. Our results show that SPL, the most standard approach, exhibits varying effectiveness depending on the language model and is inconsistent in producing successful sense-enhanced embeddings. In contrast, SCL achieves this consistently. Furthermore, while the embeddings with only increased *sense-alignment* show reduced task performance, those that also exhibit high *isotropy* and balance *uniformity* with *sense-alignment* achieve the best results. Additionally, our findings indicate that supervised and unsupervised tasks benefit from these topological properties to varying degrees.
pdf
bib
abs
Finding Needles in Images: Can Multi-modal LLMs Locate Fine Details?
Parth Thakkar
|
Ankush Agarwal
|
Prasad Kasu
|
Pulkit Bansal
|
Chaitanya Devaguptapu
While Multi-modal Large Language Models (MLLMs) have shown impressive capabilities in document understanding tasks, their ability to locate and reason about fine-grained details within complex documents remains understudied. Consider searching a restaurant menu for a specific nutritional detail or identifying a disclaimer in a lengthy newspaper article: tasks that demand careful attention to small but significant details within a broader narrative, akin to Finding Needles in Images (NiM). To address this gap, we introduce NiM-Benchmark, a carefully curated benchmark spanning diverse real-world documents including newspapers, menus, and lecture images, specifically designed to evaluate MLLMs’ capability in these intricate tasks. Building on this, we further propose Spot-IT, a simple yet effective approach that enhances MLLMs’ capability through intelligent patch selection and Gaussian attention, motivated by how humans zoom and focus when searching documents. Our extensive experiments reveal both the capabilities and limitations of current MLLMs in handling fine-grained document understanding tasks, while demonstrating the effectiveness of our approach. Spot-IT achieves significant improvements over baseline methods, particularly in scenarios requiring precise detail extraction from complex layouts.
pdf
bib
abs
Don’t Half-listen: Capturing Key-part Information in Continual Instruction Tuning
Yongquan He
|
Wenyuan Zhang
|
Xuancheng Huang
|
Peng Zhang
|
Lingxun Meng
|
Xiang Zhou
|
Ke Zeng
|
Xunliang Cai
Instruction tuning for large language models (LLMs) can drive them to produce results consistent with human goals in specific downstream tasks. However, the process of continual instruction tuning (CIT) for LLMs may bring about the catastrophic forgetting (CF) problem, where previously learned abilities are degraded. Recent methods try to alleviate the CF problem by modifying models or replaying data, which may only remember the surface-level pattern of instructions and get confused on held-out tasks. In this paper, we propose a novel continual instruction tuning method based on Key-part Information Gain (KPIG). Our method computes the information gain on masked parts to dynamically replay data and refine the training objective, which enables LLMs to capture task-aware information relevant to the correct response and alleviate overfitting to general descriptions in instructions. In addition, we propose two metrics, P-score and V-score, to measure the generalization and instruction-following abilities of LLMs. Experiments demonstrate our method achieves superior performance on both seen and held-out tasks.
pdf
bib
abs
Generating Plausible Distractors for Multiple-Choice Questions via Student Choice Prediction
Yooseop Lee
|
Suin Kim
|
Yohan Jo
In designing multiple-choice questions (MCQs) in education, creating plausible distractors is crucial for identifying students’ misconceptions and gaps in knowledge and accurately assessing their understanding. However, prior studies on distractor generation have not paid sufficient attention to enhancing the difficulty of distractors, resulting in reduced effectiveness of MCQs. This study presents a pipeline for training a model to generate distractors that are more likely to be selected by students. First, we train a pairwise ranker to reason about students’ misconceptions and assess the relative plausibility of two distractors. Using this model, we create a dataset of pairwise distractor ranks and then train a distractor generator via Direct Preference Optimization (DPO) to generate more plausible distractors. Experiments on computer science subjects (Python, DB, MLDL) demonstrate that our pairwise ranker effectively identifies students’ potential misunderstandings and achieves ranking accuracy comparable to human experts. Furthermore, our distractor generator outperforms several baselines in generating plausible distractors and produces questions with a higher item discrimination index (DI).
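The item discrimination index (DI) mentioned above is commonly computed by comparing how often high-scoring and low-scoring students answer an item correctly. The following is a minimal sketch of that classical upper-lower group formulation, not necessarily the exact procedure used in the paper; the function name `discrimination_index`, the 27% group fraction, and the toy data are assumptions made for illustration.

```python
def discrimination_index(total_scores, item_correct, fraction=0.27):
    """Classical upper-lower item discrimination index (illustrative sketch).

    total_scores: overall test score per student.
    item_correct: 1 if the student answered this item correctly, else 0.
    fraction: share of students in each comparison group (0.27 is a common
              convention, assumed here).
    """
    if len(total_scores) != len(item_correct):
        raise ValueError("inputs must have the same length")
    n = len(total_scores)
    group = max(1, round(n * fraction))
    # Rank students by total score, lowest first.
    order = sorted(range(n), key=lambda i: total_scores[i])
    lower, upper = order[:group], order[-group:]
    p_upper = sum(item_correct[i] for i in upper) / group
    p_lower = sum(item_correct[i] for i in lower) / group
    # Positive values mean stronger students answer the item correctly more often.
    return p_upper - p_lower


# Hypothetical toy data: ten students, only the top five answer the item correctly.
scores = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
correct = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
di = discrimination_index(scores, correct)
```

Under this formulation, an item that every student answers correctly has a DI of zero, which is why more plausible distractors can raise the index.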
pdf
bib
abs
Exploring Explanations Improves the Robustness of In-Context Learning
Ukyo Honda
|
Tatsushi Oka
In-context learning (ICL) has emerged as a successful paradigm for leveraging large language models (LLMs). However, it often struggles to generalize beyond the distribution of the provided demonstrations. A recent advancement in enhancing robustness is ICL with explanations (X-ICL), which improves prediction reliability by guiding LLMs to understand and articulate the reasoning behind correct labels. Building on this approach, we introduce an advanced framework that extends X-ICL by systematically exploring explanations for all possible labels (X2-ICL), thereby enabling more comprehensive and robust decision-making. Experimental results on multiple natural language understanding datasets validate the effectiveness of X2-ICL, demonstrating significantly improved robustness to out-of-distribution data compared to the existing ICL approaches.
pdf
bib
abs
Prediction Hubs are Context-Informed Frequent Tokens in LLMs
Beatrix Miranda Ginn Nielsen
|
Iuri Macocco
|
Marco Baroni
Hubness, the tendency for a few points to be among the nearest neighbours of a disproportionate number of other points, commonly arises when applying standard distance measures to high-dimensional data, often negatively impacting distance-based analysis. As autoregressive large language models (LLMs) operate on high-dimensional representations, we ask whether they are also affected by hubness. We first prove that the only large-scale representation comparison operation performed by LLMs, namely that between context and unembedding vectors to determine continuation probabilities, is not characterized by the concentration of distances phenomenon that typically causes the appearance of nuisance hubness. We then empirically show that this comparison still leads to a high degree of hubness, but the hubs in this case do not constitute a disturbance. They are rather the result of context-modulated frequent tokens often appearing in the pool of likely candidates for next token prediction. However, when other distances are used to compare LLM representations, we do not have the same theoretical guarantees, and, indeed, we see nuisance hubs appear. There are two main takeaways. First, hubness, while omnipresent in high-dimensional spaces, is not a negative property that needs to be mitigated when LLMs are being used for next token prediction. Second, when comparing representations from LLMs using Euclidean or cosine distance, there is a high risk of nuisance hubs and practitioners should use mitigation techniques if relevant.
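Hubness as defined above is usually quantified through the k-occurrence count: how many times each point appears among the k nearest neighbours of the other points, with the skewness of those counts serving as a summary statistic. The sketch below illustrates that measurement on random Gaussian vectors using Euclidean distance. It is an assumption-laden illustration, not the paper's experimental setup: the helper names, the data dimensions, and the choice of k are all hypothetical.

```python
import math
import random


def k_occurrence(points, k):
    """Count how often each point is among the k nearest neighbours of other points."""
    n = len(points)
    counts = [0] * n
    for i in range(n):
        # Euclidean distance from point i to every other point.
        dists = sorted(
            (math.dist(points[i], points[j]), j) for j in range(n) if j != i
        )
        for _, j in dists[:k]:
            counts[j] += 1
    return counts


def skewness(values):
    """Population skewness; larger positive values indicate a few dominant hubs."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    if var == 0:
        return 0.0
    return sum((v - mean) ** 3 for v in values) / n / var ** 1.5


# Hypothetical data: 200 random 50-dimensional Gaussian vectors.
rng = random.Random(0)
points = [[rng.gauss(0.0, 1.0) for _ in range(50)] for _ in range(200)]
counts = k_occurrence(points, k=10)
hub_skew = skewness(counts)
```

Replacing `math.dist` with a cosine-based distance would give the other comparison the abstract warns about; the counting logic stays the same.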
pdf
bib
abs
Capability Salience Vector: Fine-grained Alignment of Loss and Capabilities for Downstream Task Scaling Law
Qiming Ge
|
Shuhao Xing
|
Songyang Gao
|
Yunhua Zhou
|
Yicheng Zou
|
Songyang Zhang
|
Zhi Chen
|
Hang Yan
|
Qi Zhang
|
Qipeng Guo
|
Kai Chen
Scaling laws establish the relationship between training computation and validation loss, enabling researchers to effectively predict the loss trend of models across different levels of computation. However, a gap still remains between validation loss and the model’s downstream capabilities, making it nontrivial to apply scaling laws to direct performance prediction for downstream tasks. The loss typically represents a cumulative penalty for predicted tokens, which are implicitly considered to have equal importance. Nevertheless, our studies have shown evidence that when considering different training data distributions, we cannot directly model the relationship between downstream capability and computation or token loss. To bridge the gap between validation loss and downstream task capabilities, in this work, we introduce Capability Salience Vector, which decomposes the overall loss and assigns different importance weights to tokens to assess a specific meta-capability, aligning the validation loss with downstream task performance in terms of the model’s capabilities. Experiments on various popular benchmarks demonstrate that our proposed Capability Salience Vector could significantly improve the predictability of language model performance on downstream tasks.
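The contrast between a uniform token loss and a loss with per-token importance weights can be illustrated with a few lines of code. This is only a sketch of the general idea of a weighted token loss; the function name `salience_weighted_loss`, the normalization by the weight sum, and the example numbers are assumptions, and the abstract does not specify how the actual salience vector is obtained.

```python
def salience_weighted_loss(token_losses, salience):
    """Weighted average of per-token losses (illustrative sketch).

    token_losses: per-token negative log-likelihoods on a validation text.
    salience: non-negative importance weight per token (hypothetical values).
    Normalizing by the weight sum is an assumption made for this sketch.
    """
    if len(token_losses) != len(salience):
        raise ValueError("inputs must have the same length")
    total_weight = sum(salience)
    if total_weight <= 0:
        raise ValueError("salience weights must sum to a positive value")
    return sum(w * l for w, l in zip(salience, token_losses)) / total_weight


# Hypothetical per-token losses for a four-token sequence.
losses = [2.0, 0.5, 1.0, 3.0]
# Uniform weights recover the ordinary mean validation loss.
uniform = salience_weighted_loss(losses, [1.0, 1.0, 1.0, 1.0])
# Concentrating weight on the last two tokens changes the reported loss.
focused = salience_weighted_loss(losses, [0.0, 0.0, 1.0, 1.0])
```

With uniform weights the function reduces to the standard mean loss, which is the equal-importance assumption the abstract questions.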
pdf
bib
abs
CRUXEVAL-X: A Benchmark for Multilingual Code Reasoning, Understanding and Execution
Ruiyang Xu
|
Jialun Cao
|
Yaojie Lu
|
Ming Wen
|
Hongyu Lin
|
Xianpei Han
|
Ben He
|
Shing-Chi Cheung
|
Le Sun
Code benchmarks such as HumanEval are widely adopted to evaluate Large Language Models’ (LLMs) coding capabilities. However, there is a non-negligible programming language bias in existing code benchmarks: over 95% of code generation benchmarks are dominated by Python, leaving the LLMs’ capabilities in other programming languages such as Java and C/C++ unknown. Moreover, coding task bias is also crucial. Most benchmarks focus on code generation capability, while benchmarks for code reasoning (predicting the output given an input, and the input given an output), an essential coding capability, are insufficient. Yet, constructing multi-lingual benchmarks can be expensive and labor-intensive, and code from contest websites such as Leetcode suffers from data contamination during training. To fill this gap, we propose CRUXEVAL-X, a multi-lingual code reasoning benchmark that contains 19 programming languages. It comprises at least 600 subjects for each language, along with 19K content-consistent tests in total. In particular, the construction pipeline of CRUXEVAL-X works in a fully automated and test-guided manner, which iteratively generates and repairs based on execution feedback. Also, to cross language barriers (e.g., dynamic/static type systems in Python/C++), we formulated various transition rules between language pairs to facilitate translation. Our intensive evaluation of 24 representative LLMs reveals the correlation between language pairs. For example, TypeScript and JavaScript show a significant positive correlation, while Racket has less correlation with other languages. More interestingly, even a model trained solely on Python can achieve at most 34.4% Pass@1 in other languages, revealing the cross-language generalization of LLMs.
pdf
bib
abs
Graph of Records: Boosting Retrieval Augmented Generation for Long-context Summarization with Graphs
Haozhen Zhang
|
Tao Feng
|
Jiaxuan You
Retrieval-augmented generation (RAG) has revitalized Large Language Models (LLMs) by injecting non-parametric factual knowledge. Compared with long-context LLMs, RAG is considered an effective summarization tool in a more concise and lightweight manner, which can interact with LLMs multiple times using diverse queries to get comprehensive responses. However, the LLM-generated historical responses, which contain potentially insightful information, are largely neglected and discarded by existing approaches, leading to suboptimal results. In this paper, we propose graph of records (GoR), which leverages historical responses generated by LLMs to enhance RAG for long-context global summarization. Inspired by the retrieve-then-generate paradigm of RAG, we construct a graph by establishing an edge between the retrieved text chunks and the corresponding LLM-generated response. To further uncover the intricate correlations between them, GoR features a graph neural network and an elaborately designed BERTScore-based objective for self-supervised model training, enabling seamless supervision signal backpropagation between reference summaries and node embeddings. We comprehensively compare GoR with 12 baselines across four long-context summarization datasets, and the results indicate that our proposed method reaches the best performance (e.g., 15%, 8%, and 19% improvement over retrievers w.r.t. Rouge-L, Rouge-1, and Rouge-2 on the WCEP dataset). Extensive experiments further demonstrate the effectiveness of GoR.
pdf
bib
abs
Rubrik’s Cube: Testing a New Rubric for Evaluating Explanations on the CUBE dataset
Diana Galvan-Sosa
|
Gabrielle Gaudeau
|
Pride Kavumba
|
Yunmeng Li
|
Hongyi Gu
|
Zheng Yuan
|
Keisuke Sakaguchi
|
Paula Buttery
The performance and usability of Large Language Models (LLMs) are driving their use in explanation generation tasks. However, despite their widespread adoption, LLM explanations have been found to be unreliable, making it difficult for users to distinguish good from bad explanations. To address this issue, we present Rubrik’s CUBE, an education-inspired rubric and a dataset of 26k explanations, written and later quality-annotated using the rubric by both humans and six open- and closed-source LLMs. The CUBE dataset focuses on two reasoning and two language tasks, providing the necessary diversity for us to effectively test our proposed rubric. Using Rubrik, we find that explanations are influenced by both task and perceived difficulty. Low quality stems primarily from a lack of conciseness in LLM-generated explanations, rather than cohesion and word choice. The full dataset, rubric, and code are available at https://github.com/RubriksCube/rubriks_cube.
pdf
bib
abs
A Dual-Mind Framework for Strategic and Expressive Negotiation Agent
Yutong Liu
|
Lida Shi
|
Rui Song
|
Hao Xu
Negotiation agents need to influence the attitudes or intentions of users to reach a consensus. Strategy planning and expressive optimization are crucial aspects of effective negotiations. However, previous studies have typically focused on only one of these aspects, neglecting the fact that their combined synergistic effect can lead to better performance. Inspired by the dual-process theory in human cognition, we propose a Dual-Mind Negotiation Agent (DMNA) framework. This framework integrates an intuitive module for rapid, experience-based response and a deliberative module for slow expression optimization. The intuitive module is trained using Monte Carlo Tree Search (MCTS) and Direct Preference Optimization (DPO), enabling it to produce suitable strategic plans and expressions. The deliberative module employs a multifaceted reflexion mechanism to enhance the quality of expression. Experiments conducted on negotiation datasets confirm that DMNA achieves state-of-the-art results, demonstrating an enhancement in the negotiation ability of agents.
pdf
bib
abs
Ref-Long: Benchmarking the Long-context Referencing Capability of Long-context Language Models
Junjie Wu
|
Gefei Gu
|
Yanan Zheng
|
Dit-Yan Yeung
|
Arman Cohan
Long-context language models (LCLMs) have exhibited impressive capabilities in long-context understanding tasks. Among these, long-context referencing—a crucial task that requires LCLMs to attribute items of interest to specific parts of long-context data—remains underexplored. To bridge this gap, this paper proposes Referencing Evaluation for Long-context Language Models (Ref-Long), a novel benchmark designed to assess the long-context referencing capability of LCLMs. Specifically, Ref-Long requires LCLMs to identify the indexes of documents that reference a specific key, emphasizing contextual relationships between the key and the documents over simple retrieval. Based on the task design, we construct three subsets ranging from synthetic to realistic scenarios to form the Ref-Long benchmark. Experimental results of 13 LCLMs reveal significant shortcomings in long-context referencing, even among advanced models like GPT-4o. To further investigate these challenges, we conduct comprehensive analyses, including human evaluations, task format adjustments, fine-tuning experiments, and error analyses, leading to several key insights. Our data and code will be publicly released, and the data is also attached in the submission.
pdf
bib
abs
Revisiting Scaling Laws for Language Models: The Role of Data Quality and Training Strategies
Zhengyu Chen
|
Siqi Wang
|
Teng Xiao
|
Yudong Wang
|
Shiqi Chen
|
Xunliang Cai
|
Junxian He
|
Jingang Wang
Traditional scaling laws in natural language processing suggest that increasing model size and training data enhances performance. However, recent studies reveal deviations, particularly in large language models, where performance improvements decelerate—a phenomenon known as sub-scaling. This paper revisits these scaling laws by examining the impact of data quality and training strategies on model performance. Through extensive empirical analysis of over 400 models, we identify high data density and non-optimal resource allocation as key factors contributing to sub-scaling. High data density leads to diminishing returns due to redundant information, while optimal resource allocation is crucial for sustained performance improvements. We propose a sub-optimal scaling law that better predicts performance in sub-scaling regimes, highlighting the importance of data quality and diversity.
pdf
bib
abs
Limited Generalizability in Argument Mining: State-Of-The-Art Models Learn Datasets, Not Arguments
Marc Feger
|
Katarina Boland
|
Stefan Dietze
Identifying arguments is a necessary prerequisite for various tasks in automated discourse analysis, particularly within contexts such as political debates, online discussions, and scientific reasoning. In addition to theoretical advances in understanding the constitution of arguments, a significant body of research has emerged around practical argument mining, supported by a growing number of publicly available datasets. On these benchmarks, BERT-like transformers have consistently performed best, reinforcing the belief that such models are broadly applicable across diverse contexts of debate. This study offers the first large-scale re-evaluation of such state-of-the-art models, with a specific focus on their ability to generalize in identifying arguments. We evaluate four transformers, three standard and one enhanced with contrastive pre-training for better generalization, on 17 English sentence-level datasets as most relevant to the task. Our findings show that, to varying degrees, these models tend to rely on lexical shortcuts tied to content words, suggesting that apparent progress may often be driven by dataset-specific cues rather than true task alignment. While the models achieve strong results on familiar benchmarks, their performance drops markedly when applied to unseen datasets. Nonetheless, incorporating both task-specific pre-training and joint benchmark training proves effective in enhancing both robustness and generalization.
pdf
bib
abs
Enhancing Machine Translation with Self-Supervised Preference Data
Haoxiang Sun
|
Ruize Gao
|
Pei Zhang
|
Baosong Yang
|
Rui Wang
Model alignment methods like Direct Preference Optimization and Contrastive Preference Optimization have enhanced machine translation performance by leveraging preference data to enable models to reject suboptimal outputs. During preference data construction, previous approaches primarily rely on humans, strong models like GPT-4, or model self-sampling. In this study, we first explain the shortcomings of this practice. Then, we propose Self-Supervised Preference Optimization (SSPO), a novel framework which efficiently constructs translation preference data for iterative DPO training. Applying SSPO to 14B-parameter large language models (LLMs) achieves comparable or better performance than GPT-4o on FLORES and multi-domain test datasets. We release an augmented MQM dataset at https://github.com/sunny-sjtu/MQM-aug.
pdf
bib
abs
Unveil: Unified Visual-Textual Integration and Distillation for Multi-modal Document Retrieval
Hao Sun
|
Yingyan Hou
|
Jiayan Guo
|
Bo Wang
|
Chunyu Yang
|
Jinsong Ni
|
Yan Zhang
Document retrieval in real-world scenarios faces significant challenges due to diverse document formats and modalities. Traditional text-based approaches rely on tailored parsing techniques that disregard layout information and are prone to errors, while recent parsing-free visual methods often struggle to capture fine-grained textual semantics in text-rich scenarios. To address these limitations, we propose Unveil, a novel visual-textual embedding framework that effectively integrates textual and visual features for robust document representation. Through knowledge distillation, we transfer the semantic understanding capabilities from the visual-textual embedding model to a purely visual model, enabling efficient parsing-free retrieval while preserving semantic fidelity. Experimental results demonstrate that our visual-textual embedding method surpasses existing approaches, while knowledge distillation successfully bridges the performance gap between visual-textual and visual-only methods, improving both retrieval accuracy and efficiency.
pdf
bib
abs
Don’t Get Lost in the Trees: Streamlining LLM Reasoning by Overcoming Tree Search Exploration Pitfalls
Ante Wang
|
Linfeng Song
|
Ye Tian
|
Dian Yu
|
Haitao Mi
|
Xiangyu Duan
|
Zhaopeng Tu
|
Jinsong Su
|
Dong Yu
Recent advancements in tree search algorithms guided by verifiers have significantly enhanced the reasoning capabilities of large language models (LLMs), but at the cost of increased computational resources. In this work, we identify two key challenges contributing to this inefficiency: over-exploration due to redundant states with semantically equivalent content, and under-exploration caused by high variance in verifier scoring leading to frequent trajectory switching. To address these issues, we propose FETCH, an efficient tree search framework, which is a flexible, plug-and-play system compatible with various tree search algorithms. Our framework mitigates over-exploration by merging semantically similar states using agglomerative clustering of text embeddings obtained from a fine-tuned SimCSE model. To tackle under-exploration, we enhance verifiers by incorporating temporal difference learning with adjusted 𝜆-returns during training to reduce variance, and employing a verifier ensemble to aggregate scores during inference. Experiments on GSM8K, GSM-Plus, and MATH datasets demonstrate that our methods significantly improve reasoning accuracy and computational efficiency across four different tree search algorithms, paving the way for more practical applications of LLM-based reasoning. The code is available at https://github.com/DeepLearnXMU/Fetch.
pdf
bib
abs
MEXMA: Token-level objectives improve sentence representations
João Maria Janeiro
|
Benjamin Piwowarski
|
Patrick Gallinari
|
Loic Barrault
Cross-lingual sentence encoders (CLSE) create fixed-size sentence representations with aligned translations. Current pre-trained CLSE approaches use sentence-level objectives only. This can lead to loss of information, especially for tokens, which then degrades the sentence representation. We propose MEXMA, a novel approach that integrates both sentence-level and token-level objectives. The sentence representation in one language is used to predict masked tokens in another language, with both the sentence representation and *all tokens directly updating the encoder*. We show that adding token-level objectives greatly improves the sentence representation quality across several tasks. Our approach outperforms current pre-trained cross-lingual sentence encoders on bitext mining as well as several downstream tasks. We also analyse the information encoded in our tokens, and how the sentence representation is built from them.
pdf
bib
abs
Uncertainty-Aware Iterative Preference Optimization for Enhanced LLM Reasoning
Lei Li
|
Hehuan Liu
|
Yaxin Zhou
|
ZhaoYang Gui
|
Xudong Weng
|
Yi Yuan
|
Zheng Wei
|
Zang Li
Direct Preference Optimization (DPO) has recently emerged as an efficient and effective method for aligning large language models with human preferences. However, constructing high-quality preference datasets remains challenging, often necessitating expensive manual or powerful LM annotations. Additionally, standard DPO exhibits suboptimal performance in complex reasoning tasks, such as mathematical and code reasoning. In this paper, we introduce an approach to collect preference pairs through iterative sampling and execution feedback, tailored to the current learning state (e.g., well-learned, mis-learned, and unlearned) of the policy model. To alleviate the failures of DPO and improve its applicability in reasoning tasks, we propose an iterative uncertainty-aware preference optimization method that achieves fine-grained preference control by assessing model confidence. We validate our approach across three reasoning tasks, incorporating five established reasoning datasets and one self-curated dataset. Our experimental results demonstrate an overall improvement of 3.6% over the standard DPO method and show the model exhibits promising generalizability.
pdf
bib
abs
AgentDropout: Dynamic Agent Elimination for Token-Efficient and High-Performance LLM-Based Multi-Agent Collaboration
Zhexuan Wang
|
Yutong Wang
|
Xuebo Liu
|
Liang Ding
|
Miao Zhang
|
Jie Liu
|
Min Zhang
Multi-agent systems (MAS) based on large language models (LLMs) have demonstrated significant potential in collaborative problem-solving. However, they still face substantial challenges of low communication efficiency and suboptimal task performance, making the careful design of the agents’ communication topologies particularly important. Inspired by the management theory that roles in an efficient team are often dynamically adjusted, we propose AgentDropout, which identifies redundant agents and communication across different communication rounds by optimizing the adjacency matrices of the communication graphs and eliminates them to enhance both token efficiency and task performance. Compared to state-of-the-art methods, AgentDropout achieves an average reduction of 21.6% in prompt token consumption and 18.4% in completion token consumption, along with a performance improvement of 1.14 on the tasks. Furthermore, the extended experiments demonstrate that AgentDropout achieves notable domain transferability and structure robustness, revealing its reliability and effectiveness. We release our code at https://github.com/wangzx1219/AgentDropout.
pdf
bib
abs
Towards Dynamic Theory of Mind: Evaluating LLM Adaptation to Temporal Evolution of Human States
Yang Xiao
|
Jiashuo Wang
|
Qiancheng Xu
|
Changhe Song
|
Chunpu Xu
|
Yi Cheng
|
Wenjie Li
|
Pengfei Liu
As Large Language Models (LLMs) increasingly participate in human-AI interactions, evaluating their Theory of Mind (ToM) capabilities, particularly their ability to track dynamic mental states, becomes crucial. While existing benchmarks assess basic ToM abilities, they predominantly focus on static snapshots of mental states, overlooking the temporal evolution that characterizes real-world social interactions. We present **DynToM**, a novel benchmark specifically designed to evaluate LLMs’ ability to understand and track the temporal progression of mental states across interconnected scenarios. Through a systematic four-step framework, we generate 1,100 social contexts encompassing 5,500 scenarios and 78,100 questions, each validated for realism and quality. Our comprehensive evaluation of ten state-of-the-art LLMs reveals that they underperform humans by 44.7% on average, with performance degrading significantly when tracking and reasoning about the shift of mental states. This performance gap highlights fundamental limitations in current LLMs’ ability to model the dynamic nature of human mental states.
pdf
bib
abs
Marco-Bench-MIF: On Multilingual Instruction-Following Capability of Large Language Models
Bo Zeng
|
Chenyang Lyu
|
Sinuo Liu
|
Mingyan Zeng
|
Minghao Wu
|
Xuanfan Ni
|
Tianqi Shi
|
Yu Zhao
|
Yefeng Liu
|
Chenyu Zhu
|
Ruizhe Li
|
Jiahui Geng
|
Qing Li
|
Yu Tong
|
Longyue Wang
|
Weihua Luo
|
Kaifu Zhang
Instruction-following capability has become a major ability to be evaluated for Large Language Models. However, existing datasets, such as IFEval, are either predominantly monolingual and centered on English or simply machine translated to other languages, limiting their applicability in multilingual contexts. In this paper, we present a carefully curated extension of IFEval to a localized multilingual version named Marco-Bench-MIF, covering 30 languages with varying levels of localization. Our benchmark addresses linguistic constraints (e.g., modifying capitalization requirements for Chinese) and cultural references (e.g., substituting region-specific company names in prompts) via a hybrid pipeline combining translation with verification. Through comprehensive evaluation of 20+ LLMs on Marco-Bench-MIF, we find that: (1) there is a 25-35% accuracy gap between high- and low-resource languages, (2) model scale affects performance by 45-60%, yet script-specific challenges persist, and (3) machine-translated data underestimates accuracy by 7-22% compared with localized data. Our analysis identifies challenges in multilingual instruction following, including keyword consistency preservation and compositional constraint adherence across languages. Our Marco-Bench-MIF will be made publicly available to the community.
pdf
bib
abs
Representation Bending for Large Language Model Safety
Ashkan Yousefpour
|
Taeheon Kim
|
Ryan Sungmo Kwon
|
Seungbeen Lee
|
Wonje Jeung
|
Seungju Han
|
Alvin Wan
|
Harrison Ngan
|
Youngjae Yu
|
Jonghyun Choi
Large Language Models (LLMs) have emerged as powerful tools, but their inherent safety risks – ranging from harmful content generation to broader societal harms – pose significant challenges. These risks can be amplified by recent adversarial attacks, fine-tuning vulnerabilities, and the increasing deployment of LLMs in high-stakes environments. Existing safety-enhancing techniques, such as fine-tuning with human feedback or adversarial training, are still vulnerable as they address specific threats and often fail to generalize across unseen attacks, or require manual system-level defenses. This paper introduces RepBend, a novel approach that fundamentally disrupts the representations underlying harmful behaviors in LLMs, offering a scalable solution to enhance (potentially inherent) safety. RepBend brings the idea of activation steering – simple vector arithmetic for steering a model’s behavior during inference – to loss-based fine-tuning. Through extensive evaluation, RepBend achieves state-of-the-art performance, outperforming prior methods such as Circuit Breaker, RMU, and NPO, with up to 95% reduction in attack success rates across diverse jailbreak benchmarks, all with negligible reduction in model usability and general capabilities.
pdf
bib
abs
Analyzing LLMs’ Knowledge Boundary Cognition Across Languages Through the Lens of Internal Representations
Chenghao Xiao
|
Hou Pong Chan
|
Hao Zhang
|
Mahani Aljunied
|
Lidong Bing
|
Noura Al Moubayed
|
Yu Rong
While understanding the knowledge boundaries of LLMs is crucial to prevent hallucination, research on the knowledge boundaries of LLMs has predominantly focused on English. In this work, we present the first study to analyze how LLMs recognize knowledge boundaries across different languages by probing their internal representations when processing known and unknown questions in multiple languages. Our empirical studies reveal three key findings: 1) LLMs’ perceptions of knowledge boundaries are encoded in the middle to middle-upper layers across different languages; 2) Language differences in knowledge boundary perception follow a linear structure, which motivates our proposal of a training-free alignment method that effectively transfers knowledge boundary perception ability across languages, thereby helping reduce hallucination risk in low-resource languages; 3) Fine-tuning on bilingual question pair translation further enhances LLMs’ recognition of knowledge boundaries across languages. Given the absence of standard testbeds for cross-lingual knowledge boundary analysis, we construct a multilingual evaluation suite comprising three representative types of knowledge boundary data. Our code and datasets are publicly available at https://github.com/DAMO-NLP-SG/LLM-Multilingual-Knowledge-Boundaries.
pdf
bib
abs
Enhancing Retrieval-Augmented Generation via Evidence Tree Search
Hao Sun
|
Hengyi Cai
|
Yuchen Li
|
Xuanbo Fan
|
Xiaochi Wei
|
Shuaiqiang Wang
|
Yan Zhang
|
Dawei Yin
Retrieval-Augmented Generation (RAG) is widely used to enhance Large Language Models (LLMs) by grounding responses in external knowledge. However, in real-world applications, retrievers often return lengthy documents with redundant or irrelevant content, confusing downstream readers. While evidence retrieval aims to address this by extracting key information, it faces critical challenges: (1) inability to model synergistic inter-dependencies among evidence sentences, (2) lack of supervision for evaluating multi-sentence evidence quality, and (3) computational inefficiency in navigating exponentially growing search spaces of candidate evidence sets. To tackle these challenges, we propose ETS (Evidence Tree Search), a novel framework that reformulates evidence retrieval as a dynamic tree expansion process. Our approach first constructs an evidence tree where each path represents a candidate evidence set, explicitly modeling inter-sentence dependencies through context-aware node selection. We then leverage Monte Carlo Tree Search (MCTS) to efficiently assess evidence quality and introduce an Early-Terminating Beam Search strategy to efficiently accelerate the model inference. Extensive experiments on five datasets demonstrate that ETS significantly outperforms existing methods across different readers. Our code and datasets will be released to facilitate future research.
pdf
bib
abs
HalluLens: LLM Hallucination Benchmark
Yejin Bang
|
Ziwei Ji
|
Alan Schelten
|
Anthony Hartshorn
|
Tara Fowler
|
Cheng Zhang
|
Nicola Cancedda
|
Pascale Fung
Large language models (LLMs) often generate responses that deviate from user input or training data, a phenomenon known as “hallucination.” These hallucinations undermine user trust and hinder the adoption of generative AI systems. Addressing hallucinations is important for the advancement of LLMs. This paper introduces a comprehensive hallucination benchmark, HalluLens, incorporating both extrinsic and intrinsic evaluation tasks, built upon a clear taxonomy of hallucination. A major challenge in benchmarking hallucinations is the lack of a unified framework due to inconsistent definitions and categorizations. We disentangle LLM hallucination from “factuality” and propose a taxonomy distinguishing extrinsic and intrinsic hallucinations to promote consistency and facilitate research. We emphasize extrinsic hallucinations – where generated content deviates from training data – as they become increasingly relevant with LLM advancements. However, no benchmark is solely dedicated to extrinsic hallucinations. To address this gap, HalluLens introduces three new extrinsic tasks with dynamic test set generation to mitigate data leakage and ensure robustness. We release the codebase for the extrinsic hallucination benchmark.
pdf
bib
abs
DEEPER Insight into Your User: Directed Persona Refinement for Dynamic Persona Modeling
Aili Chen
|
Chengyu Du
|
Jiangjie Chen
|
Jinghan Xu
|
Yikai Zhang
|
Siyu Yuan
|
Zulong Chen
|
Liangyue Li
|
Yanghua Xiao
To advance personalized applications such as recommendation systems and user behavior prediction, recent research increasingly adopts large language models (LLMs) for human-readable persona modeling. In dynamic real-world scenarios, effective persona modeling necessitates leveraging streaming behavior data to continually optimize user personas. However, existing methods, whether regenerating personas or incrementally extending them with new behaviors, often fail to achieve sustained improvements in persona quality or future behavior prediction accuracy. To address this, we propose DEEPER, a novel approach for dynamic persona modeling that enables continual persona optimization. Specifically, we enhance the model’s direction-search capability through an iterative reinforcement learning framework, allowing it to automatically identify effective update directions and optimize personas using discrepancies between user behaviors and model predictions. Extensive experiments on dynamic persona modeling involving 4,800 users across 10 domains highlight DEEPER’s superior persona optimization capabilities, delivering a 32.2% average reduction in user behavior prediction error over four update rounds and outperforming the best baseline by 22.92%.
pdf
bib
abs
Asclepius: A Spectrum Evaluation Benchmark for Medical Multi-Modal Large Language Models
Jie Liu
|
Wenxuan Wang
|
Su Yihang
|
Jingyuan Huang
|
Yudi Zhang
|
Cheng-Yi Li
|
Wenting Chen
|
Xiaohan Xing
|
Kao-Jung Chang
|
Linlin Shen
|
Michael R. Lyu
The significant breakthroughs of Medical Multi-Modal Large Language Models (Med-MLLMs) are renovating modern healthcare with robust information synthesis and medical decision support. However, these models are often evaluated on benchmarks that are unsuitable for Med-MLLMs due to the intricate nature of real-world diagnostic frameworks, which encompass diverse medical specialties and involve complex clinical decisions. Thus, a clinically representative benchmark is highly desirable for credible Med-MLLM evaluation. To this end, we introduce Asclepius, a novel Med-MLLM benchmark that comprehensively assesses Med-MLLMs in terms of distinct medical specialties (cardiovascular, gastroenterology, etc.) and different diagnostic capacities (perception, disease analysis, etc.). Grounded in 3 proposed core principles, Asclepius ensures a comprehensive evaluation by encompassing 15 medical specialties, stratifying into 3 main categories and 8 sub-categories of clinical tasks, and avoiding overlap with the existing VQA dataset. We further provide an in-depth analysis of 6 Med-MLLMs and compare them with 3 human specialists, providing insights into their competencies and limitations in various medical contexts. Our work not only advances the understanding of Med-MLLMs’ capabilities but also sets a precedent for future evaluations and the safe deployment of these models in clinical environments.
pdf
bib
abs
InstructPart: Task-Oriented Part Segmentation with Instruction Reasoning
Zifu Wan
|
Yaqi Xie
|
Ce Zhang
|
Zhiqiu Lin
|
Zihan Wang
|
Simon Stepputtis
|
Deva Ramanan
|
Katia P. Sycara
Large multimodal foundation models, particularly in the domains of language and vision, have significantly advanced various tasks, including robotics, autonomous driving, information retrieval, and grounding. However, many of these models perceive objects as indivisible, overlooking the components that constitute them. Understanding these components and their associated affordances provides valuable insights into an object’s functionality, which is fundamental for performing a wide range of tasks. In this work, we introduce a novel real-world benchmark, InstructPart, comprising hand-labeled part segmentation annotations and task-oriented instructions to evaluate the performance of current models in understanding and executing part-level tasks within everyday contexts. Through our experiments, we demonstrate that task-oriented part segmentation remains a challenging problem, even for state-of-the-art Vision-Language Models (VLMs). In addition to our benchmark, we introduce a simple baseline that achieves a twofold performance improvement through fine-tuning with our dataset. With our dataset and benchmark, we aim to facilitate research on task-oriented part segmentation and enhance the applicability of VLMs across various domains, including robotics, virtual reality, information retrieval, and other related fields. Project website: https://zifuwan.github.io/InstructPart/.
pdf
bib
abs
GRaMPa: Subword Regularisation by Skewing Uniform Segmentation Distributions with an Efficient Path-counting Markov Model
Thomas Bauwens
|
David Kaczér
|
Miryam De Lhoneux
Stochastically sampling word segmentations from a subword tokeniser, also called subword regularisation, is a known way to increase robustness of language models to out-of-distribution inputs, such as text containing spelling errors. Recent work has observed that the usual augmentations that make popular deterministic subword tokenisers stochastic still cause only a handful of all possible segmentations to be sampled. It has been proposed to uniformly sample across these instead, through rejection sampling of paths in an unweighted segmentation graph. In this paper, we argue that uniformly random segmentation in turn skews the distributions of certain segmentational properties (e.g. token lengths and number of tokens produced) away from uniformity, which still ends up hiding meaningfully diverse tokenisations. We propose an alternative uniform sampler using the same segmentation graph, but weighted by counting the paths through it. Our sampling algorithm, GRaMPa, provides hyperparameters allowing sampled tokenisations to skew towards fewer, longer tokens. Furthermore, GRaMPa is single-pass, guaranteeing significantly better computational complexity than previous approaches relying on rejection sampling. We show experimentally that language models trained with GRaMPa outperform existing regularising tokenisers in a data-scarce setting on token-level tasks such as dependency parsing, especially with spelling errors present.
pdf
bib
abs
Evaluating the Evaluation of Diversity in Commonsense Generation
Tianhui Zhang
|
Bei Peng
|
Danushka Bollegala
In commonsense generation, given a set of input concepts, a model must generate a response that not only bears commonsense but also captures multiple diverse viewpoints. Numerous evaluation metrics based on form- and content-level overlap have been proposed in prior work for evaluating the diversity of a commonsense generation model. However, it remains unclear which metrics are best suited for evaluating the diversity in commonsense generation. To address this gap, we conduct a systematic meta-evaluation of diversity metrics for commonsense generation. We find that form-based diversity metrics tend to consistently overestimate the diversity in sentence sets, where even randomly generated sentences are assigned overly high diversity scores. We then use a Large Language Model (LLM) to create a novel dataset annotated for the diversity of sentences generated for a commonsense generation task, and use it to conduct a meta-evaluation of the existing diversity evaluation metrics. Our experimental results show that content-based diversity evaluation metrics consistently outperform the form-based counterparts, showing high correlations with the LLM-based ratings. We recommend that future work on commonsense generation should use content-based metrics for evaluating the diversity of their outputs.
pdf
bib
abs
Generate First, Then Sample: Enhancing Fake News Detection with LLM-Augmented Reinforced Sampling
Zhao Tong
|
Yimeng Gu
|
Huidong Liu
|
Qiang Liu
|
Shu Wu
|
Haichao Shi
|
Xiao-Yu Zhang
The spread of fake news on online platforms has long been a pressing concern. Considering this, extensive efforts have been made to develop fake news detectors. However, a major drawback of these models is their relatively low performance—lagging by more than 20%—in identifying *fake* news compared to *real* news, making them less suitable for practical deployment. This gap is likely due to an imbalance in the dataset and the model’s inadequate understanding of data distribution on the targeted platform. In this work, we focus on improving the model’s effectiveness in detecting *fake* news. To achieve this, we **first** adopt an LLM to **generate** fake news in three different styles, which are later incorporated into the training set to augment the representation of fake news. **Then**, we apply Reinforcement Learning to dynamically **sample** fake news, allowing the model to learn the optimal real-to-fake news ratio for training an effective fake news detector on the targeted platform. This approach allows our model to perform effectively even with a limited amount of annotated news data and consistently improve detection accuracy across different platforms. Experimental results demonstrate that our approach achieves state-of-the-art performance on two benchmark datasets, improving *fake* news detection performance by 24.02% and 11.06% respectively.
pdf
bib
abs
ChemActor: Enhancing Automated Extraction of Chemical Synthesis Actions with LLM-Generated Data
Yu Zhang
|
Ruijie Yu
|
Jidong Tian
|
Feng Zhu
|
Jiapeng Liu
|
Xiaokang Yang
|
Yaohui Jin
|
Yanyan Xu
With the increasing interest in robotic synthesis in the context of organic chemistry, the automated extraction of chemical procedures from literature is critical. However, this task remains challenging due to the inherent ambiguity of chemical language and the high cost of human annotation required for developing reliable computer-aided extraction protocols. Here, we present ChemActor, a fully fine-tuned large language model (LLM), as a chemical executor to convert between unstructured experimental procedures and structured action sequences. We propose a sequential LLM-generated data framework to address the challenges of insufficient and low-quality annotated data. This framework integrates a data selection module that selects data based on distribution divergence, with a general-purpose LLM, to generate machine-executable actions from a single molecule input. Additionally, we introduce a novel multi-round LLMs circle review metric, which reflects the model’s advanced understanding of chemical experimental procedures. Extensive experiments on reaction-to-description (R2D) and description-to-action (D2A) tasks demonstrate that ChemActor, augmented by LLM-generated data, achieves state-of-the-art performance, outperforming the baseline model by 10%. The code is available at: https://github.com/Zhanghahah/ChemActor.
pdf
bib
abs
Towards Fully Exploiting LLM Internal States to Enhance Knowledge Boundary Perception
Shiyu Ni
|
Keping Bi
|
Jiafeng Guo
|
Lulu Yu
|
Baolong Bi
|
Xueqi Cheng
Large language models (LLMs) exhibit impressive performance across diverse tasks but often struggle to accurately gauge their knowledge boundaries, leading to confident yet incorrect responses. This paper explores leveraging LLMs’ internal states to enhance their perception of knowledge boundaries from efficiency and risk perspectives. We investigate whether LLMs can estimate their confidence using internal states before response generation, potentially saving computational resources. Our experiments on datasets like Natural Questions, HotpotQA, and MMLU reveal that LLMs demonstrate significant pre-generation perception, which is further refined post-generation, with perception gaps remaining stable across varying conditions. To mitigate risks in critical domains, we introduce Consistency-based Confidence Calibration (C3), which assesses confidence consistency through question reformulation. C3 significantly improves LLMs’ ability to recognize their knowledge gaps, enhancing the unknown perception rate by 5.6% on NQ and 4.9% on HotpotQA. Our findings suggest that pre-generation confidence estimation can optimize efficiency, while C3 effectively controls output risks, advancing the reliability of LLMs in practical applications.
pdf
bib
abs
ALGEN: Few-shot Inversion Attacks on Textual Embeddings via Cross-Model Alignment and Generation
Yiyi Chen
|
Qiongkai Xu
|
Johannes Bjerva
With the growing popularity of Large Language Models (LLMs) and vector databases, private textual data is increasingly processed and stored as numerical embeddings. However, recent studies have proven that such embeddings are vulnerable to inversion attacks, where original text is reconstructed to reveal sensitive information. Previous research has largely assumed access to millions of sentences to train attack models, e.g., through data leakage or nearly unrestricted API access. With our method, a single data point is sufficient for a partially successful inversion attack. With as few as 1k data samples, performance reaches an optimum across a range of black-box encoders, without training on leaked data. We present a Few-shot Textual Embedding Inversion Attack using Cross-Model **AL**ignment and **GEN**eration (**ALGEN**), by aligning victim embeddings to the attack space and using a generative model to reconstruct text. We find that **ALGEN** attacks can be effectively transferred across domains and languages, revealing key information. We further examine a variety of defense mechanisms against **ALGEN**, and find that none are effective, highlighting the vulnerabilities posed by inversion attacks. By significantly lowering the cost of inversion and proving that embedding spaces can be aligned through one-step optimization, we establish a new textual embedding inversion paradigm with broader applications for embedding alignment in NLP.
pdf
bib
abs
Decoding on Graphs: Faithful and Sound Reasoning on Knowledge Graphs through Generation of Well-Formed Chains
Kun Li
|
Tianhua Zhang
|
Xixin Wu
|
Hongyin Luo
|
James R. Glass
|
Helen M. Meng
Knowledge Graphs (KGs) can serve as reliable knowledge sources for question answering (QA) due to their structured representation of knowledge. Existing research on the utilization of KGs for large language models (LLMs) predominantly relies on subgraph retrievers or iterative prompting, overlooking the potential synergy of LLMs’ step-wise reasoning capabilities and KGs’ structural nature. In this paper, we present DoG (Decoding on Graph), a novel framework that facilitates a deep synergy between LLMs and KGs. We first define a concept, well-formed chain, which consists of a sequence of interrelated fact triplets on the KGs, starting from question entities and leading to answers. We argue that this concept can serve as a principle for making faithful and sound reasoning for KGQA. To enable LLMs to generate well-formed chains, we propose graph-aware constrained decoding, in which a constraint derived from the topology of the KG regulates the decoding process of the LLMs. This constrained decoding method ensures the generation of well-formed chains while making full use of the step-wise reasoning capabilities of LLMs. Based on the above, DoG, a training-free approach, is able to provide faithful and sound reasoning trajectories grounded on the KGs. Experiments across various KGQA tasks with different background KGs demonstrate that DoG achieves superior and robust performance. DoG also shows general applicability with various open-source LLMs.
pdf
bib
abs
STaR-SQL: Self-Taught Reasoner for Text-to-SQL
Mingqian He
|
Yongliang Shen
|
Wenqi Zhang
|
Qiuying Peng
|
Jun Wang
|
Weiming Lu
Generating step-by-step “chain-of-thought” rationales has proven effective for improving the performance of large language models on complex reasoning tasks. However, applying such techniques to structured tasks, such as text-to-SQL, remains largely unexplored. In this paper, we introduce Self-Taught Reasoner for text-to-SQL (STaR-SQL), a novel approach that reframes SQL query generation as a reasoning-driven process. Our method prompts the LLM to produce detailed reasoning steps for SQL queries and fine-tunes it on rationales that lead to correct outcomes. Unlike traditional methods, STaR-SQL dedicates additional test-time computation to reasoning, thereby positioning LLMs as spontaneous reasoners rather than mere prompt-based agents. To further scale the inference process, we incorporate an outcome-supervised reward model (ORM) as a verifier, which enhances SQL query accuracy. Experimental results on the challenging Spider benchmark demonstrate that STaR-SQL significantly improves text-to-SQL performance, achieving an execution accuracy of 86.6%. This surpasses a few-shot baseline by 31.6% and a baseline fine-tuned to predict answers directly by 18.0%. Additionally, STaR-SQL outperforms agent-like prompting methods that leverage more powerful yet closed-source models such as GPT-4. These findings underscore the potential of reasoning-augmented training for structured tasks and open the door to extending self-improving reasoning models to text-to-SQL generation and beyond.
pdf
bib
abs
Fairness Beyond Performance: Revealing Reliability Disparities Across Groups in Legal NLP
Santosh T.y.s.s
|
Irtiza Chowdhury
Fairness in NLP must extend beyond performance parity to encompass equitable reliability across groups. This study exposes a critical blind spot: models often make less reliable or overconfident predictions for marginalized groups, even when overall performance appears fair. Using the FairLex benchmark as a case study in legal NLP, we systematically evaluate both performance and reliability disparities across demographic, regional, and legal attributes spanning four jurisdictions. We show that domain-specific pre-training consistently improves both performance and reliability, especially for underrepresented groups. However, common bias mitigation methods frequently worsen reliability disparities, revealing a trade-off not captured by performance metrics alone. Our results call for a rethinking of fairness in high-stakes NLP: to ensure equitable treatment, models must not only be accurate, but also reliably self-aware across all groups.
pdf
bib
abs
Beyond Similarity: A Gradient-based Graph Method for Instruction Tuning Data Selection
Yang Zhao
|
Li Du
|
Xiao Ding
|
Yangou Ouyang
|
Hepeng Wang
|
Kai Xiong
|
Jinglong Gao
|
Zhouhao Sun
|
Dongliang Xu
|
Qing Yang
|
Dongchen Li
|
Bing Qin
|
Ting Liu
Large language models (LLMs) have shown great potential across various industries due to their remarkable ability to generalize through instruction tuning. However, the limited availability of domain-specific data significantly hampers their performance on specialized tasks. While existing methods primarily focus on selecting training data from general datasets that are similar to the target domain, they often fail to consider the joint distribution of instructions, resulting in inefficient learning and suboptimal knowledge transfer. To address these challenges, we introduce **G2IS** (**G**radient-based **G**raph **I**nstruction **S**election), a novel method that constructs a mixed gradient-based instruction graph to capture the joint distribution and interdependencies among instructions. By accounting for the relationships between instructions, G2IS improves domain adaptation efficiency. Additionally, we propose a gradient walk algorithm to refine the data selection process, enhancing both training effectiveness and efficiency. Our experiments demonstrate that G2IS outperforms traditional methods across various domain adaptation tasks, yielding significant performance gains, particularly in complex, data-scarce scenarios. These results underscore the potential of G2IS in advancing the development of large, domain-specific models.
pdf
bib
abs
FastMCTS: A Simple Sampling Strategy for Data Synthesis
Peiji Li
|
Kai Lv
|
Yunfan Shao
|
Yichuan Ma
|
Linyang Li
|
Xiaoqing Zheng
|
Xipeng Qiu
|
Qipeng Guo
Synthetic high-quality multi-step reasoning data can significantly enhance the performance of large language models on various tasks. However, most existing methods rely on rejection sampling, which generates trajectories independently and suffers from inefficiency and imbalanced sampling across problems of varying difficulty. In this work, we introduce FastMCTS, an innovative data synthesis strategy inspired by Monte Carlo Tree Search. FastMCTS provides a more efficient sampling method for multi-step reasoning data, offering step-level evaluation signals and promoting balanced sampling across problems of different difficulty levels. Experiments on both English and Chinese reasoning datasets demonstrate that FastMCTS generates over 30% more correct reasoning paths compared to rejection sampling as the number of generated tokens scales up. Furthermore, under comparable synthetic data budgets, models trained on FastMCTS-generated data outperform those trained on rejection sampling data by 3.9% across multiple benchmarks. As a lightweight sampling strategy, FastMCTS offers a practical and efficient alternative for synthesizing high-quality reasoning data.
pdf
bib
abs
Dialogue-RAG: Enhancing Retrieval for LLMs via Node-Linking Utterance Rewriting
Qiwei Li
|
Teng Xiao
|
Zuchao Li
|
Ping Wang
|
Mengjia Shen
|
Hai Zhao
Large Language Models (LLMs) and Retrieval Augmented Generation (RAG) methods have demonstrated significant potential on tasks across multiple domains. However, ellipses and coreferences, which are common phenomena in dialogue, pose challenges to LLMs’ understanding and RAG’s retrieval accuracy. Previous work ignores the negative impact of such fuzzy data on RAG systems. We explore the capabilities of LLMs and RAG systems in dialogue scenarios and use Incomplete Utterance Rewriting (IUR) to complete the key information in dialogue to enhance retrieval. In addition, we propose a lightweight IUR model for query rewriting. It is an end-to-end framework for node linking and iterative inference, incorporating two newly proposed probing semantic features derived from generative pre-training. This framework treats IUR as a series of link decisions on the input sequence and the incrementally constructed rewriting outputs. To test the performance of RAG systems in multi-round dialogue scenarios, we construct a RAG dialogue dataset in English and Chinese, Dialogue-RAG-MULTI-v1.0. Experimental results show that utterance rewriting can effectively improve the retrieval and generation ability of RAG systems in dialogue scenes. Experiments on IUR tasks demonstrate the excellent performance of our lightweight IUR method.
pdf
bib
abs
Using Information Theory to Characterize Prosodic Typology: The Case of Tone, Pitch-Accent and Stress-Accent
Ethan Wilcox
|
Cui Ding
|
Giovanni Acampa
|
Tiago Pimentel
|
Alex Warstadt
|
Tamar I Regev
This paper argues that the relationship between lexical identity and prosody—one well-studied parameter of linguistic variation—can be characterized using information theory. We predict that languages that use prosody to make lexical distinctions should exhibit a higher mutual information between word identity and prosody, compared to languages that don’t. We test this hypothesis in the domain of pitch, which is used to make lexical distinctions in tonal languages, like Cantonese. We use a dataset of speakers reading sentences aloud in ten languages across five language families to estimate the mutual information between the text and their pitch curves. We find that, across languages, pitch curves display similar amounts of entropy. However, these curves are easier to predict given their associated text in the tonal languages, compared to pitch- and stress-accent languages, and thus the mutual information is higher in these languages, supporting our hypothesis. Our results support perspectives that view linguistic typology as gradient, rather than categorical.
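The quantity compared across languages is the mutual information I(W; P) = H(P) - H(P | W) between word identity W and prosody P. The sketch below computes it for a small discrete joint distribution; this is a toy assumption made for illustration, since the paper estimates these quantities from continuous pitch curves, and the two example tables are invented.

```python
import numpy as np

def entropy(p):
    """Shannon entropy in bits of a probability vector (zero entries are ignored)."""
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def mutual_information(joint):
    """I(W; P) = H(P) - H(P | W) for joint[i, j] = P(word = i, pitch = j)."""
    joint = joint / joint.sum()
    h_word = entropy(joint.sum(axis=1))
    h_pitch = entropy(joint.sum(axis=0))
    h_pitch_given_word = entropy(joint.ravel()) - h_word  # chain rule: H(W, P) - H(W)
    return h_pitch - h_pitch_given_word

# Hypothetical "tonal" case: pitch is fully determined by the word.
tonal = np.array([[0.5, 0.0],
                  [0.0, 0.5]])
# Hypothetical "non-tonal" case: pitch is independent of the word.
non_tonal = np.array([[0.25, 0.25],
                      [0.25, 0.25]])

mi_tonal = mutual_information(tonal)
mi_non_tonal = mutual_information(non_tonal)
```

Both toy tables have the same pitch entropy (1 bit), yet only the first has non-zero mutual information, mirroring the pattern the abstract reports: similar entropy across languages, but pitch that is easier to predict from text in tonal ones.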
pdf
bib
abs
Evaluating LLMs for Portuguese Sentence Simplification with Linguistic Insights
Arthur Mariano Rocha De Azevedo Scalercio
|
Elvis A. De Souza
|
Maria José Bocorny Finatto
|
Aline Paes
Sentence simplification (SS) focuses on adapting sentences to enhance their readability and accessibility. While large language models (LLMs) match task-specific baselines in English SS, their performance in Portuguese remains underexplored. This paper presents a comprehensive performance comparison of 26 state-of-the-art LLMs in Portuguese SS, alongside two simplification models trained explicitly for this task and language. They are evaluated under a one-shot setting across scientific, news, and government datasets. We benchmark the models with our newly introduced Gov-Lang-BR corpus (1,703 complex-simple sentence pairs from Brazilian government agencies) and two established datasets: PorSimplesSent and Museum-PT. Our investigation takes advantage of both automatic metrics and large-scale linguistic analysis to examine the transformations achieved by the LLMs. Furthermore, a qualitative assessment of selected generated outputs provides deeper insights into simplification quality. Our findings reveal that while open-source LLMs have achieved impressive results, closed-source LLMs continue to outperform them in Portuguese SS.
pdf
bib
abs
LaTIM: Measuring Latent Token-to-Token Interactions in Mamba Models
Hugo Pitorro
|
Marcos Vinicius Treviso
State space models (SSMs), such as Mamba, have emerged as an efficient alternative to transformers for long-context sequence modeling. However, despite their growing adoption, SSMs lack the interpretability tools that have been crucial for understanding and improving attention-based architectures. While recent efforts provide insights into Mamba’s internal mechanisms, they struggle to capture precise token-level interactions at the layer level, leaving gaps in understanding how Mamba selectively processes sequences across layers. In this work, we introduce LaTIM, a novel token-level decomposition method for both Mamba-1 and Mamba-2 that enables fine-grained interpretability. We extensively evaluate our method across diverse tasks, including machine translation, copying, and retrieval-based generation, demonstrating its effectiveness in revealing Mamba’s token-to-token interaction patterns.
pdf
bib
abs
Improving Low-Resource Morphological Inflection via Self-Supervised Objectives
Adam Wiemerslage
|
Katharina Von Der Wense
Self-supervised objectives have driven major advances in NLP by leveraging large-scale unlabeled data, but such resources are scarce for many of the world’s languages. Surprisingly, they have not been explored much for character-level tasks, where smaller amounts of data have the potential to be beneficial. We investigate the effectiveness of self-supervised auxiliary tasks for morphological inflection – a character-level task highly relevant for language documentation – in extremely low-resource settings, training encoder-decoder transformers for 19 languages and 13 auxiliary objectives. Autoencoding yields the best performance when unlabeled data is very limited, while character masked language modeling (CMLM) becomes more effective as data availability increases. Though objectives with stronger inductive biases influence model predictions intuitively, they rarely outperform standard CMLM. However, sampling masks based on known morpheme boundaries consistently improves performance, highlighting a promising direction for low-resource morphological modeling.
pdf
bib
abs
Don’t Reinvent the Wheel: Efficient Instruction-Following Text Embedding based on Guided Space Transformation
Yingchaojie Feng
|
Yiqun Sun
|
Yandong Sun
|
Minfeng Zhu
|
Qiang Huang
|
Anthony Kum Hoe Tung
|
Wei Chen
In this work, we investigate an important task named instruction-following text embedding, which generates dynamic text embeddings that adapt to user instructions, highlighting specific attributes of text. Despite recent advancements, existing approaches suffer from significant computational overhead, as they require re-encoding the entire corpus for each new instruction. To address this challenge, we propose GSTransform, a novel instruction-following text embedding framework based on Guided Space Transformation. Our key observation is that instruction-relevant information is inherently encoded in generic embeddings but remains underutilized. Instead of repeatedly encoding the corpus for each instruction, GSTransform is a lightweight transformation mechanism that adapts pre-computed embeddings in real time to align with user instructions, guided by a small amount of text data with instruction-focused label annotation. We conduct extensive experiments on three instruction-awareness downstream tasks across nine real-world datasets, demonstrating that GSTransform improves instruction-following text embedding quality over state-of-the-art methods while achieving dramatic speedups of 6~300× in real-time processing on large-scale datasets. The source code is available at https://github.com/YingchaojieFeng/GSTransform.
pdf
bib
abs
BOOKCOREF: Coreference Resolution at Book Scale
Giuliano Martinelli
|
Tommaso Bonomo
|
Pere-Lluís Huguet Cabot
|
Roberto Navigli
Coreference Resolution systems are typically evaluated on benchmarks containing small- to medium-scale documents. When it comes to evaluating long texts, however, existing benchmarks, such as LitBank, remain limited in length and do not adequately assess system capabilities at the book scale, i.e., when co-referring mentions span hundreds of thousands of tokens. To fill this gap, we first put forward a novel automatic pipeline that produces high-quality Coreference Resolution annotations on full narrative texts. Then, we adopt this pipeline to create the first book-scale coreference benchmark, BOOKCOREF, with an average document length of more than 200,000 tokens. We carry out a series of experiments showing the robustness of our automatic procedure and demonstrating the value of our resource, which enables current long-document coreference systems to gain up to +20 CoNLL-F1 points when evaluated on full books. Moreover, we report on the new challenges introduced by this unprecedented book-scale setting, highlighting that current models fail to deliver the same performance they achieve on smaller documents. We release our data and code to encourage research and development of new book-scale Coreference Resolution systems at https://github.com/sapienzanlp/bookcoref.
pdf
bib
abs
OMGM: Orchestrate Multiple Granularities and Modalities for Efficient Multimodal Retrieval
Wei Yang
|
Jingjing Fu
|
Rui Wang
|
Jinyu Wang
|
Lei Song
|
Jiang Bian
Vision-language retrieval-augmented generation (RAG) has become an effective approach for tackling Knowledge-Based Visual Question Answering (KB-VQA), which requires external knowledge beyond the visual content presented in images. The effectiveness of Vision-language RAG systems hinges on multimodal retrieval, which is inherently challenging due to the diverse modalities and knowledge granularities in both queries and knowledge bases. Existing methods have not fully tapped into the potential interplay between these elements. We propose a multimodal RAG system featuring a coarse-to-fine, multi-step retrieval that harmonizes multiple granularities and modalities to enhance efficacy. Our system begins with a broad initial search aligning knowledge granularity for cross-modal retrieval, followed by a multimodal fusion reranking to capture the nuanced multimodal information for top entity selection. A text reranker then filters out the most relevant fine-grained section for augmented generation. Extensive experiments on the InfoSeek and Encyclopedic-VQA benchmarks show our method achieves state-of-the-art retrieval performance and highly competitive answering results, underscoring its effectiveness in advancing KB-VQA systems. Our code can be found at https://github.com/ChaoLinAViy/OMGM.
pdf
bib
abs
Alleviating Hallucinations from Knowledge Misalignment in Large Language Models via Selective Abstention Learning
Lei Huang
|
Xiaocheng Feng
|
Weitao Ma
|
Yuchun Fan
|
Xiachong Feng
|
Yuxuan Gu
|
Yangfan Ye
|
Liang Zhao
|
Weihong Zhong
|
Baoxin Wang
|
Dayong Wu
|
Guoping Hu
|
Lingpeng Kong
|
Tong Xiao
|
Ting Liu
|
Bing Qin
Large language models (LLMs) are known to suffer from severe hallucination issues. One of the main causes lies in the knowledge misalignment between the pre-training stage and the supervised fine-tuning stage. The unfamiliar knowledge encountered during fine-tuning may encourage LLMs to generate facts that are not grounded in parametric knowledge. To address this, we propose Seal, a novel training objective with an abstention mechanism, in which the model learns to selectively reject tokens that misalign with the desired knowledge distribution via a special [REJ] token. This allows the model the option of acknowledging the insufficiency of knowledge rather than blindly assigning high probability to all ground-truth answers. We further propose a regularized decoding objective that penalizes uncertain predictions during inference by using the [REJ] probability learned during training. Extensive experiments on six short-form and long-form QA datasets with three LLMs of different sizes demonstrate that our method effectively alleviates hallucinations caused by knowledge misalignment. Further analysis highlights the adaptations of our method in answer refusal scenarios and its ability to effectively maintain the model’s instruction-following capabilities.
pdf
bib
abs
Retrospective Learning from Interactions
Zizhao Chen
|
Mustafa Omer Gul
|
Yiwei Chen
|
Gloria Geng
|
Anne Wu
|
Yoav Artzi
Multi-turn interactions between large language models (LLMs) and users naturally include implicit feedback signals. If an LLM responds in an unexpected way to an instruction, the user is likely to signal it by rephrasing the request, expressing frustration, or pivoting to an alternative task. Such signals are task-independent and occupy a relatively constrained subspace of language, allowing the LLM to identify them even if it fails on the actual task. We introduce ReSpect, a method to learn from such signals in past interactions via retrospection without additional annotations. We deploy ReSpect in a new multimodal interaction scenario, where humans instruct a multimodal LLM to solve an abstract reasoning task with a combinatorial solution space. Through thousands of interactions with humans, we show how ReSpect gradually improves task completion rate from 31% to 82%, all without any external annotation.
pdf
bib
abs
Personalized Generation In Large Model Era: A Survey
Yiyan Xu
|
Jinghao Zhang
|
Alireza Salemi
|
Xinting Hu
|
Wenjie Wang
|
Fuli Feng
|
Hamed Zamani
|
Xiangnan He
|
Tat-Seng Chua
In the era of large models, content generation is gradually shifting to Personalized Generation (PGen), tailoring content to individual preferences and needs. This paper presents the first comprehensive survey on PGen, investigating existing research in this rapidly growing field. We conceptualize PGen from a unified perspective, systematically formalizing its key components, core objectives, and abstract workflows. Based on this unified perspective, we propose a multi-level taxonomy, offering an in-depth review of technical advancements, commonly used datasets, and evaluation metrics across multiple modalities, personalized contexts, and tasks. Moreover, we envision the potential applications of PGen and highlight open challenges and promising directions for future exploration. By bridging PGen research across multiple modalities, this survey serves as a valuable resource for fostering knowledge sharing and interdisciplinary collaboration, ultimately contributing to a more personalized digital landscape.
pdf
bib
abs
Graph Counselor: Adaptive Graph Exploration via Multi-Agent Synergy to Enhance LLM Reasoning
Junqi Gao
|
Xiang Zou
|
Ying Ai
|
Dong Li
|
Yichen Niu
|
Biqing Qi
|
Jianxing Liu
Graph Retrieval Augmented Generation (GraphRAG) effectively enhances external knowledge integration capabilities by explicitly modeling knowledge relationships, thereby improving the factual accuracy and generation quality of Large Language Models (LLMs) in specialized domains. However, existing methods suffer from two inherent limitations: 1) Inefficient Information Aggregation: They rely on a single agent and fixed iterative patterns, making it difficult to adaptively capture multi-level textual, structural, and degree information within graph data. 2) Rigid Reasoning Mechanism: They employ preset reasoning schemes, which can neither dynamically adjust reasoning depth nor achieve precise semantic correction. To overcome these limitations, we propose Graph Counselor, a GraphRAG method based on multi-agent collaboration. This method uses the Adaptive Graph Information Extraction Module (AGIEM), where Planning, Thought, and Execution Agents work together to precisely model complex graph structures and dynamically adjust information extraction strategies, addressing the challenges of multi-level dependency modeling and adaptive reasoning depth. Additionally, the Self-Reflection with Multiple Perspectives (SR) module improves the accuracy and semantic consistency of reasoning results through self-reflection and backward reasoning mechanisms. Experiments demonstrate that Graph Counselor outperforms existing methods in multiple graph reasoning tasks, exhibiting higher reasoning accuracy and generalization ability. Our code is available at https://github.com/gjq100/Graph-Counselor.git.
pdf
bib
abs
SOTOPIA-Ω: Dynamic Strategy Injection Learning and Social Instruction Following Evaluation for Social Agents
Wenyuan Zhang
|
Tianyun Liu
|
Mengxiao Song
|
Xiaodong Li
|
Tingwen Liu
Despite the abundance of prior social strategies possessed by humans, there remains a paucity of research dedicated to their transfer and integration into social agents. Our proposed SOTOPIA-Ω framework aims to bridge this gap, with a particular focus on enhancing the social capabilities of language agents. This framework dynamically injects a variety of social strategies into expert agents, thereby automating the construction of a high-quality social dialogue training corpus. Additionally, we introduce the concept of Social Instruction Following (S-IF) and propose two new S-IF evaluation metrics that are complementary to social capability. We demonstrate that several 7B models trained on this high-quality corpus not only significantly surpass the expert agent (GPT-4) in achieving social goals but also improve in S-IF performance. Analysis and variant experiments validate the advantages of dynamic construction, which is especially effective at breaking the agent’s prolonged deadlock.
pdf
bib
abs
Can Language Models Replace Programmers for Coding? REPOCOD Says ‘Not Yet’
Shanchao Liang
|
Nan Jiang
|
Yiran Hu
|
Lin Tan
Recently, a number of repository-level code generation benchmarks (such as CoderEval, DevEval, RepoEval, RepoBench, and LongCode-Arena) have emerged to evaluate the capabilities of large language models (LLMs) beyond standalone benchmarks like HumanEval and MBPP. Thus, a natural question is: would LLMs perform as well on real-world coding tasks as they do on these benchmarks? Unfortunately, one cannot answer this question, since these benchmarks consist of short completions or synthetic examples, or focus on limited-scale repositories, and thus fail to represent real-world coding tasks. To address these challenges, we create RepoCod, a Python code-generation benchmark containing complex tasks with realistic dependencies in real-world large projects and appropriate metrics for evaluating source code. It includes 980 whole-function generation tasks from 11 popular projects, 50.8% of which require repository-level context. RepoCod includes 314 developer-written test cases per instance for better evaluation. We evaluate ten LLMs on RepoCod and find that none achieves more than 30% pass@1, indicating the necessity of building stronger LLMs that can help developers in real-world software development. In addition, we find that retrieval-augmented generation achieves better results than using target function dependencies as context.
pdf
bib
abs
Leveraging In-Context Learning for Political Bias Testing of LLMs
Patrick Haller
|
Jannis Vamvas
|
Rico Sennrich
|
Lena Ann Jäger
A growing body of work has been querying LLMs with political questions to evaluate their potential biases. However, this probing method has limited stability, making comparisons between models unreliable. In this paper, we argue that LLMs need more context. We propose a new probing task, Questionnaire Modeling (QM), that uses human survey data as in-context examples. We show that QM improves the stability of question-based bias evaluation, and demonstrate that it may be used to compare instruction-tuned models to their base versions. Experiments with LLMs of various sizes indicate that instruction tuning can indeed change the direction of bias. Furthermore, we observe a trend that larger models are able to leverage in-context examples more effectively, and generally exhibit smaller bias scores in QM. Data and code are publicly available.
pdf
bib
abs
ACORD: An Expert-Annotated Retrieval Dataset for Legal Contract Drafting
Steven H Wang
|
Maksim Zubkov
|
Kexin Fan
|
Sarah Harrell
|
Yuyang Sun
|
Wei Chen
|
Andreas Plesner
|
Roger Wattenhofer
Contract clause retrieval is foundational to contract drafting because lawyers rarely draft contracts from scratch; instead, they locate and revise the most relevant precedent clauses. We introduce the Atticus Clause Retrieval Dataset (ACORD), the first expert-annotated benchmark specifically designed for contract clause retrieval to support contract drafting tasks. ACORD focuses on complex contract clauses such as Limitation of Liability, Indemnification, Change of Control, and Most Favored Nation. It includes 114 queries and over 126,000 query-clause pairs, each ranked on a scale from 1 to 5 stars. The task is to find the most relevant precedent clauses to a query. The bi-encoder retriever paired with pointwise LLMs re-rankers shows promising results. However, substantial improvements are still needed to manage the complex legal work typically undertaken by lawyers effectively. As the first expert-annotated benchmark for contract clause retrieval, ACORD can serve as a valuable IR benchmark for the NLP community.
pdf
bib
abs
LLMs know their vulnerabilities: Uncover Safety Gaps through Natural Distribution Shifts
Qibing Ren
|
Hao Li
|
Dongrui Liu
|
Zhanxu Xie
|
Xiaoya Lu
|
Yu Qiao
|
Lei Sha
|
Junchi Yan
|
Lizhuang Ma
|
Jing Shao
Safety concerns in large language models (LLMs) have gained significant attention due to their exposure to potentially harmful data during pre-training. In this paper, we identify a new safety vulnerability in LLMs: their susceptibility to natural distribution shifts between attack prompts and original toxic prompts, where seemingly benign prompts, semantically related to harmful content, can bypass safety mechanisms. To explore this issue, we introduce a novel attack method, ActorBreaker, which identifies actors related to toxic prompts within pre-training distribution to craft multi-turn prompts that gradually lead LLMs to reveal unsafe content. ActorBreaker is grounded in Latour’s actor-network theory, encompassing both human and non-human actors to capture a broader range of vulnerabilities. Our experimental results demonstrate that ActorBreaker outperforms existing attack methods in terms of diversity, effectiveness, and efficiency across aligned LLMs. To address this vulnerability, we propose expanding safety training to cover a broader semantic space of toxic content. We thus construct a multi-turn safety dataset using ActorBreaker. Fine-tuning models on our dataset shows significant improvements in robustness, though with some trade-offs in utility. Code is available at https://github.com/AI45Lab/ActorAttack.
pdf
bib
abs
WAFFLE: Fine-tuning Multi-Modal Model for Automated Front-End Development
Shanchao Liang
|
Nan Jiang
|
Shangshu Qian
|
Lin Tan
Web development involves turning UI designs into functional webpages, which can be difficult for both beginners and experienced developers due to the complexity of HTML’s hierarchical structures and styles. While Large Language Models (LLMs) have shown promise in generating source code, two major challenges persist in UI-to-HTML code generation: (1) effectively representing HTML’s hierarchical structure for LLMs, and (2) bridging the gap between the visual nature of UI designs and the text-based format of HTML code. To tackle these challenges, we introduce Waffle, a new fine-tuning strategy that uses a structure-aware attention mechanism to improve LLMs’ understanding of HTML’s structure and a contrastive fine-tuning approach to align LLMs’ understanding of UI images and HTML code. Models fine-tuned with Waffle show up to 9.00 pp (percentage point) higher HTML match, 0.0982 higher CW-SSIM, 32.99 higher CLIP, and 27.12 pp higher LLEM on our new benchmark WebSight-Test and an existing benchmark Design2Code, outperforming current fine-tuning methods.
pdf
bib
abs
Math Neurosurgery: Isolating Language Models’ Math Reasoning Abilities Using Only Forward Passes
Bryan R Christ
|
Zachary Gottesman
|
Jonathan Kropko
|
Thomas Hartvigsen
Math reasoning is an active area of Large Language Model (LLM) research because it is a hallmark of artificial intelligence and has implications in several domains, including math education. However, few works have explored how math reasoning is encoded within LLM parameters and whether it is a skill that can be isolated within models. Doing so could allow targeted intervention to improve math performance without altering non-math behavior and foster understanding of how models encode math reasoning. We introduce Math Neurosurgery (MathNeuro), a computationally efficient method we use to isolate math-specific parameters in LLMs using only forward passes. MathNeuro builds on existing work by using weights and activations to calculate parameter importance, but isolates math-specific parameters by filtering out those important for general language tasks. By pruning the parameters MathNeuro identifies, we delete an LLM’s math reasoning ability without significantly impacting its general language ability. Scaling the identified parameters by a small constant improves a pretrained or instruction-tuned LLM’s performance by 4-17% on GSM8K and 5-35% on MATH while leaving non-math behavior unaltered. MathNeuro is also data efficient: most of its effectiveness holds when identifying math-specific parameters using a single sample. MathNeuro highlights the potential for future work to intervene on math-specific parameters.
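The selection idea can be sketched for a single linear layer. The sketch assumes importance is the product of weight magnitude and the L2 norm of the corresponding input activation (one common forward-pass-only score), and that "math-specific" means in the top-k for math inputs but not in the top-k for general inputs. The exact scoring and thresholds are not given in the abstract, so the function names, `k`, the random data, and the 1.1 scaling factor are all hypothetical.

```python
import numpy as np

def importance(weight, activations):
    """Score each weight as |W_ij| * ||x_j||_2; weight is (out, in), activations is (n, in)."""
    return np.abs(weight) * np.linalg.norm(activations, axis=0)

def top_k_mask(scores, k):
    """Boolean mask marking the k highest-scoring entries."""
    idx = np.argsort(scores, axis=None)[-k:]
    mask = np.zeros(scores.size, dtype=bool)
    mask[idx] = True
    return mask.reshape(scores.shape)

def math_specific_mask(weight, math_acts, general_acts, k):
    """Important for math inputs, but not among the important weights for general inputs."""
    math_top = top_k_mask(importance(weight, math_acts), k)
    general_top = top_k_mask(importance(weight, general_acts), k)
    return math_top & ~general_top

rng = np.random.default_rng(0)
weight = rng.normal(size=(4, 6))
math_acts = rng.normal(size=(10, 6))      # stand-in activations from math prompts
general_acts = rng.normal(size=(10, 6))   # stand-in activations from general prompts

mask = math_specific_mask(weight, math_acts, general_acts, k=5)
pruned = np.where(mask, 0.0, weight)            # pruning: zero out the selected weights
scaled = np.where(mask, weight * 1.1, weight)   # scaling by a small constant (value assumed)
```

Only forward-pass quantities (weights and activations) are needed, which is what makes this kind of scoring cheap relative to gradient-based attribution.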
pdf
bib
abs
Multiple LLM Agents Debate for Equitable Cultural Alignment
Dayeon Ki
|
Rachel Rudinger
|
Tianyi Zhou
|
Marine Carpuat
Large Language Models (LLMs) need to adapt their predictions to diverse cultural contexts to benefit diverse communities across the world. While previous efforts have focused on single-LLM, single-turn approaches, we propose to exploit the complementary strengths of multiple LLMs to promote cultural adaptability. We introduce a Multi-Agent Debate framework, where two LLM-based agents debate over a cultural scenario and collaboratively reach a final decision. We propose two variants: one where the LLM agents exclusively debate and another where they dynamically choose between self-reflection and debate during their turns. We evaluate these approaches on 7 open-weight LLMs (and 21 LLM combinations) using the NormAd-ETI benchmark for social etiquette norms in 75 countries. Experiments show that debate improves both overall accuracy and cultural group parity over single-LLM baselines. Notably, multi-agent debate enables relatively small LLMs (7-9B) to achieve accuracies comparable to those of a much larger model (27B parameters).
pdf
bib
abs
RefreshKV: Updating Small KV Cache During Long-form Generation
Fangyuan Xu
|
Tanya Goyal
|
Eunsol Choi
Generating long sequences of tokens given a long-context input is a very compute-intensive inference scenario for large language models (LLMs). One prominent inference speed-up approach is constructing a smaller key-value (KV) cache, relieving LLMs from computing attention over a long sequence of tokens. While such methods work well to generate short sequences, their performance degrades rapidly for long-form generation. Most KV compression happens once, prematurely removing tokens that can be useful later in the generation. We propose a new inference-time method, RefreshKV, that flexibly alternates between full context attention and attention over a subset of input tokens during generation. After each full attention step, we update the smaller KV cache based on the attention pattern over the entire input. Applying our method to off-the-shelf LLMs achieves comparable speedup to eviction-based methods while improving performance for various long-form generation tasks. Lastly, we show that continued pretraining with our inference setting brings further gains in performance.
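The alternation described in this abstract can be sketched as a toy decoding loop. This is an illustration under stated assumptions, not the RefreshKV implementation: the fixed refresh stride, the top-k selection rule, the synthetic attention pattern, and all function names are hypothetical.

```python
def top_k_tokens(attention, k):
    """Pick the k input positions with the highest attention, in input order."""
    ranked = sorted(range(len(attention)), key=lambda i: attention[i], reverse=True)
    return sorted(ranked[:k])


def toy_attention(step, n_inputs=6):
    """Hypothetical attention over the input whose peak drifts with the step."""
    peak = step % n_inputs
    return [1.0 / (1 + abs(i - peak)) for i in range(n_inputs)]


def generate(num_steps, stride, k, attention_fn):
    """Alternate full-attention steps with steps that use only a small cache."""
    small_cache = []
    trace = []
    for step in range(num_steps):
        if step % stride == 0:
            # Full attention step: look at the whole input, then refresh the cache.
            attention = attention_fn(step)
            small_cache = top_k_tokens(attention, k)
            trace.append(("full", list(small_cache)))
        else:
            # Cheap step: attend only to the positions kept in the small cache.
            trace.append(("partial", list(small_cache)))
    return trace


trace = generate(num_steps=6, stride=3, k=2, attention_fn=toy_attention)
```

Because the cache is rebuilt at every full-attention step, a position dropped early can re-enter later, which a one-time compression would not allow.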
pdf
bib
abs
SEA: Low-Resource Safety Alignment for Multimodal Large Language Models via Synthetic Embeddings
Weikai Lu
|
Hao Peng
|
Huiping Zhuang
|
Cen Chen
|
Ziqian Zeng
Multimodal Large Language Models (MLLMs) have serious security vulnerabilities. While safety alignment using multimodal datasets consisting of text and data of additional modalities can effectively enhance MLLM’s security, it is costly to construct these datasets. Existing low-resource security alignment methods, including textual alignment, have been found to struggle with the security risks posed by additional modalities. To address this, we propose Synthetic Embedding augmented safety Alignment (SEA), which optimizes embeddings of additional modality through gradient updates to expand textual datasets. This enables multimodal safety alignment training even when only textual data is available. Extensive experiments on image, video, and audio-based MLLMs demonstrate that SEA can synthesize a high-quality embedding on a single RTX3090 GPU within 24 seconds. SEA significantly improves the security of MLLMs when faced with threats from additional modalities. To assess the security risks introduced by video and audio, we also introduced a new benchmark called VA-SafetyBench. High attack success rates across multiple MLLMs validate its challenge. Our code and data will be available at https://github.com/ZeroNLP/SEA.
pdf
bib
abs
Chain-of-Reasoning: Towards Unified Mathematical Reasoning in Large Language Models via a Multi-Paradigm Perspective
Yiyao Yu
|
Yuxiang Zhang
|
Dongdong Zhang
|
Xiao Liang
|
Hengyuan Zhang
|
Xingxing Zhang
|
Mahmoud Khademi
|
Hany Hassan Awadalla
|
Junjie Wang
|
Yujiu Yang
|
Furu Wei
Large Language Models (LLMs) have made notable progress in mathematical reasoning, yet they often rely on single-paradigm reasoning that limits their effectiveness across diverse tasks. In this paper, we introduce Chain-of-Reasoning (CoR), a novel unified framework that integrates multiple reasoning paradigms, namely Natural Language Reasoning (NLR), Algorithmic Reasoning (AR), and Symbolic Reasoning (SR), to enable synergistic collaboration. CoR generates multiple potential answers using different reasoning paradigms and synthesizes them into a coherent final solution. We propose a Progressive Paradigm Training (PPT) strategy that allows models to progressively master these paradigms, culminating in the development of CoR-Math-7B. Experimental results demonstrate that CoR-Math-7B significantly outperforms current SOTA models, achieving up to a 41.0% absolute improvement over GPT-4o in theorem proving tasks and a 15% improvement over RL-based methods on the MATH benchmark in arithmetic tasks. These results show the enhanced mathematical comprehensive ability of our model, enabling zero-shot generalization across tasks. The code is available at https://github.com/microsoft/CoR.
pdf
bib
abs
Language Models Grow Less Humanlike beyond Phase Transition
Tatsuya Aoyama
|
Ethan Wilcox
LMs’ alignment with human reading behavior (i.e. psychometric predictive power; PPP) is known to improve during pretraining up to a tipping point, beyond which it either plateaus or degrades. Various factors, such as word frequency, recency bias in attention, and context size, have been theorized to affect PPP, yet there is no current account that explains why such a tipping point exists, and how it interacts with LMs’ pretraining dynamics more generally. We hypothesize that the underlying factor is a pretraining phase transition, characterized by the rapid emergence of specialized attention heads. We conduct a series of correlational and causal experiments to show that such a phase transition is responsible for the tipping point in PPP. We then show that, rather than producing attention patterns that contribute to the degradation in PPP, phase transitions alter the subsequent learning dynamics of the model, such that further training keeps damaging PPP.
pdf
bib
abs
PCoT: Persuasion-Augmented Chain of Thought for Detecting Fake News and Social Media Disinformation
Arkadiusz Modzelewski
|
Witold Sosnowski
|
Tiziano Labruna
|
Adam Wierzbicki
|
Giovanni Da San Martino
Disinformation detection is a key aspect of media literacy. Psychological studies have shown that knowledge of persuasive fallacies helps individuals detect disinformation. Inspired by these findings, we experimented with large language models (LLMs) to test whether infusing persuasion knowledge enhances disinformation detection. As a result, we introduce the Persuasion-Augmented Chain of Thought (PCoT), a novel approach that leverages persuasion to improve disinformation detection in zero-shot classification. We extensively evaluate PCoT on online news and social media posts. Moreover, we publish two novel, up-to-date disinformation datasets: EUDisinfo and MultiDis. These datasets enable the evaluation of PCoT on content entirely unseen by the LLMs used in our experiments, as the content was published after the models’ knowledge cutoffs. We show that, on average, PCoT outperforms competitive methods by 15% across five LLMs and five datasets. These findings highlight the value of persuasion in strengthening zero-shot disinformation detection.
pdf
bib
abs
Coordinating Chaos: A Structured Review of Linguistic Coordination Methodologies
Benjamin Roger Litterer
|
David Jurgens
|
Dallas Card
Linguistic coordination—a phenomenon where conversation partners end up having similar patterns of language use—has been established across a variety of contexts and for multiple linguistic features. However, the study of language coordination has been accompanied by a diverse and inconsistently applied set of measures and theoretical perspectives. This diversity has significant consequences, as replication studies have highlighted the brittleness of certain measures and called influential findings into question. While prior work has addressed specific modeling decisions and model types, linguistic coordination research has yet to fully examine, synthesize, and critique the space of modeling choices available. In this work, we present a framework to organize the linguistic coordination literature. Using this schema, we provide a high-level overview of the choices involved in the measurement process and synthesize relevant critiques. Based on both gaps and limitations surfaced from this review, we suggest directions for further exploration and evaluation. In doing so, we provide the clarity required for linguistic coordination research to arrive at interpretable and sound conclusions.
pdf
bib
abs
iNews: A Multimodal Dataset for Modeling Personalized Affective Responses to News
Tiancheng Hu
|
Nigel Collier
Understanding how individuals perceive and react to information is fundamental for advancing social and behavioral sciences and developing human-centered AI systems. Current approaches often lack the granular data needed to model these personalized responses, relying instead on aggregated labels that obscure the rich variability driven by individual differences. We introduce iNews, a novel large-scale dataset specifically designed to facilitate the modeling of personalized affective responses to news content. Our dataset comprises annotations from 291 demographically diverse UK participants across 2,899 multimodal Facebook news posts from major UK outlets, with an average of 5.18 annotators per sample. For each post, annotators provide multifaceted labels including valence, arousal, dominance, discrete emotions, content relevance judgments, sharing likelihood, and modality importance ratings. Crucially, we collect comprehensive annotator persona information covering demographics, personality, media trust, and consumption patterns, which explain 15.2% of annotation variance - substantially higher than existing NLP datasets. Incorporating this information yields a 7% accuracy gain in zero-shot prediction and remains beneficial even with 32-shot in-context learning.
pdf
bib
abs
Mind the Gesture: Evaluating AI Sensitivity to Culturally Offensive Non-Verbal Gestures
Akhila Yerukola
|
Saadia Gabriel
|
Nanyun Peng
|
Maarten Sap
Gestures are an integral part of non-verbal communication, with meanings that vary across cultures, and misinterpretations that can have serious social and diplomatic consequences. As AI systems become more integrated into global applications, ensuring they do not inadvertently perpetuate cultural offenses is critical. To this end, we introduce Multi-Cultural Set of Inappropriate Gestures and Nonverbal Signs (MC-SIGNS), a dataset of 288 gesture-country pairs annotated for offensiveness, cultural significance, and contextual factors across 25 gestures and 85 countries. Through systematic evaluation using MC-SIGNS, we uncover critical limitations: text-to-image (T2I) systems exhibit strong US-centric biases, performing better at detecting offensive gestures in US contexts than in non-US ones; large language models (LLMs) tend to over-flag gestures as offensive; and vision-language models (VLMs) default to US-based interpretations when responding to universal concepts like wishing someone luck, frequently suggesting culturally inappropriate gestures. These findings highlight the urgent need for culturally-aware AI safety mechanisms to ensure equitable global deployment of AI technologies.
pdf
bib
abs
500xCompressor: Generalized Prompt Compression for Large Language Models
Zongqian Li
|
Yixuan Su
|
Nigel Collier
Prompt compression is important for large language models (LLMs) to increase inference speed, reduce costs, and improve user experience. However, current methods face challenges such as low compression ratios and potential training-test overlap during evaluation. To address these issues, we propose 500xCompressor, a method that compresses natural language contexts into a minimum of one special token and demonstrates strong generalization ability. The 500xCompressor introduces approximately 0.3% additional parameters and achieves compression ratios ranging from 6x to 500x, achieving 27-90% reduction in calculations and 55-83% memory savings when generating 100-400 tokens for new and reused prompts at 500x compression, while retaining 70-74% (F1) and 77-84% (Exact Match) of the LLM capabilities compared to using non-compressed prompts. It is designed to compress any text, answer various types of questions, and can be utilized by the original LLM without requiring fine-tuning. Initially, 500xCompressor was pretrained on the ArxivCorpus, followed by fine-tuning on the ArxivQA dataset, and subsequently evaluated on strictly unseen and cross-domain question answering (QA) datasets. This study shows that KV values outperform embeddings in preserving information at high compression ratios. The highly compressive nature of natural language prompts, even for detailed information, suggests potential for future applications and the development of a new LLM language.
pdf
bib
abs
Estimating Privacy Leakage of Augmented Contextual Knowledge in Language Models
James Flemings
|
Bo Jiang
|
Wanrong Zhang
|
Zafar Takhirov
|
Murali Annavaram
Language models (LMs) rely on their parametric knowledge augmented with relevant contextual knowledge for certain tasks, such as question answering. However, the contextual knowledge can contain private information that may be leaked when answering queries, and estimating this privacy leakage is not well understood. A straightforward approach of directly comparing an LM’s output to the contexts can overestimate the privacy risk, since the LM’s parametric knowledge might already contain the augmented contextual knowledge. To this end, we introduce context influence, a metric that builds on differential privacy, a widely-adopted privacy notion, to estimate the privacy leakage of contextual knowledge during decoding. Our approach effectively measures how each subset of the context influences an LM’s response while separating the specific parametric knowledge of the LM. Using our context influence metric, we demonstrate that context privacy leakage occurs when contextual knowledge is out of distribution with respect to parametric knowledge. Moreover, we experimentally demonstrate how context influence properly attributes the privacy leakage to augmented contexts, and we evaluate how factors such as model size, context size, and generation position affect context privacy leakage. The practical implications of our results will inform practitioners of the privacy risk associated with augmented contextual knowledge.
pdf
bib
abs
Document-Level Event-Argument Data Augmentation for Challenging Role Types
Joseph Gatto
|
Omar Sharif
|
Parker Seegmiller
|
Sarah Masud Preum
Event Argument Extraction (EAE) is a daunting information extraction problem — with significant limitations in few-shot cross-domain (FSCD) settings. A common solution to FSCD modeling is data augmentation. Unfortunately, existing augmentation methods are not well-suited to a variety of real-world EAE contexts, including (i) modeling long documents (documents with over 10 sentences), and (ii) modeling challenging role types (i.e., event roles with little to no training data and semantically outlying roles). We introduce two novel LLM-powered data augmentation methods for generating extractive document-level EAE samples using zero in-domain training data. We validate the generalizability of our approach on four datasets — showing significant performance increases in low-resource settings. Our highest performing models provide a 13-pt increase in F1 score on zero-shot role extraction in FSCD evaluation.
pdf
bib
abs
Mapping the Podcast Ecosystem with the Structured Podcast Research Corpus
Benjamin Roger Litterer
|
David Jurgens
|
Dallas Card
Podcasts provide highly diverse content to a massive listener base through a unique on-demand modality. However, limited data has prevented large-scale computational analysis of the podcast ecosystem. To fill this gap, we introduce a massive dataset of over 1.1M podcast transcripts that is largely comprehensive of all English language podcasts available through public RSS feeds from May and June of 2020. This data is not limited to text, but includes metadata, inferred speaker roles, and audio features and speaker turns for a subset of 370K episodes. Using this data, we conduct a foundational investigation into the content, structure, and responsiveness of this ecosystem. Together, our data and analyses open the door to continued computational research of this popular and impactful medium.
pdf
bib
abs
Unravelling the Logic: Investigating the Generalisation of Transformers in Numerical Satisfiability Problems
Tharindu Madusanka
|
Marco Valentino
|
Iqra Zahid
|
Ian Pratt-Hartmann
|
Riza Batista-Navarro
Transformer models have achieved remarkable performance in many formal reasoning tasks. Nonetheless, the extent of their comprehension pertaining to logical semantics and rules of inference remains somewhat uncertain. Evaluating such understanding necessitates a rigorous examination of these models’ generalisation capacity to out-of-distribution data. In this study, we probe the generalisation prowess of Transformer models with respect to the hitherto unexplored domain of numerical satisfiability problems. Our investigation reveals that Transformers exhibit minimal scale and noise invariance, alongside limited vocabulary and number invariance. However, even when Transformer models experience a notable decline in performance on out-of-distribution test sets, they often still surpass the random baseline by a considerable margin.
pdf
bib
abs
The Nature of NLP: Analyzing Contributions in NLP Papers
Aniket Pramanick
|
Yufang Hou
|
Saif M. Mohammad
|
Iryna Gurevych
Natural Language Processing (NLP) is an established and dynamic field. Despite this, what constitutes NLP research remains debated. In this work, we address the question by quantitatively examining NLP research papers. We propose a taxonomy of research contributions and introduce _NLPContributions_, a dataset of nearly 2k NLP research paper abstracts, carefully annotated to identify scientific contributions and classify their types according to this taxonomy. We also introduce a novel task of automatically identifying contribution statements and classifying their types from research papers. We present experimental results for this task and apply our model to ~29k NLP research papers to analyze their contributions, aiding in the understanding of the nature of NLP research. We show that NLP research has taken a winding path — with the focus on language and human-centric studies being prominent in the 1970s and 80s, tapering off in the 1990s and 2000s, and starting to rise again since the late 2010s. Alongside this revival, we observe a steady rise in dataset and methodological contributions since the 1990s, such that today, on average, individual NLP papers contribute in more ways than ever before. Our dataset and analyses offer a powerful lens for tracing research trends and offer potential for generating informed, data-driven literature surveys.
pdf
bib
abs
GeLLM³O: Generalizing Large Language Models for Multi-property Molecule Optimization
Vishal Dey
|
Xiao Hu
|
Xia Ning
Despite recent advancements, most computational methods for molecule optimization are constrained to single- or double-property optimization tasks and suffer from poor scalability and generalizability to novel optimization tasks. Meanwhile, Large Language Models (LLMs) demonstrate remarkable out-of-domain generalizability to novel tasks. To demonstrate LLMs’ potential for molecule optimization, we introduce MuMOInstruct, the first high-quality instruction-tuning dataset specifically focused on multi-property molecule optimization tasks. Leveraging MuMOInstruct, we develop GeLLM³Os, a series of instruction-tuned LLMs for molecule optimization. Extensive evaluations across 5 in-domain and 5 out-of-domain tasks demonstrate that GeLLM³Os consistently outperform state-of-the-art baselines. GeLLM³Os also exhibit outstanding zero-shot generalization to unseen tasks, significantly outperforming powerful closed-source LLMs. Such strong generalizability demonstrates the tremendous potential of GeLLM³Os as foundational models for molecule optimization, thereby tackling novel optimization tasks without resource-intensive retraining. MuMOInstruct and code are accessible through https://github.com/ninglab/GeLLMO.
pdf
bib
abs
Follow-up Question Generation For Enhanced Patient-Provider Conversations
Joseph Gatto
|
Parker Seegmiller
|
Timothy E. Burdick
|
Inas S. Khayal
|
Sarah DeLozier
|
Sarah Masud Preum
Follow-up question generation is an essential feature of dialogue systems as it can reduce conversational ambiguity and enhance the modeling of complex interactions. Conversational contexts often pose core NLP challenges such as (i) extracting relevant information buried in fragmented data sources, and (ii) modeling parallel thought processes. These two challenges occur frequently in medical dialogue as a doctor asks questions based not only on patient utterances but also their prior EHR data and current diagnostic hypotheses. Asking medical questions in asynchronous conversations compounds these issues as doctors can only rely on static EHR information to motivate follow-up questions. To address these challenges, we introduce FollowupQ, a novel framework for enhancing asynchronous medical conversation. FollowupQ is a multi-agent framework that processes patient messages and EHR data to generate personalized follow-up questions, clarifying patient-reported medical conditions. FollowupQ reduces requisite provider follow-up communications by 34%. It also improves performance by 17% and 5% on real and synthetic data, respectively. We also release the first public dataset of asynchronous medical messages with linked EHR data alongside 2,300 follow-up questions written by clinical experts for the wider NLP research community.
pdf
bib
abs
Unveiling Privacy Risks in LLM Agent Memory
Bo Wang
|
Weiyi He
|
Shenglai Zeng
|
Zhen Xiang
|
Yue Xing
|
Jiliang Tang
|
Pengfei He
Large Language Model (LLM) agents have become increasingly prevalent across various real-world applications. They enhance decision-making by storing private user-agent interactions in the memory module for demonstrations, introducing new privacy risks for LLM agents. In this work, we systematically investigate the vulnerability of LLM agents to our proposed Memory EXTRaction Attack (MEXTRA) under a black-box setting. To extract private information from memory, we propose an effective attacking prompt design and an automated prompt generation method based on different levels of knowledge about the LLM agent. Experiments on two representative agents demonstrate the effectiveness of MEXTRA. Moreover, we explore key factors influencing memory leakage from both the agent designer’s and the attacker’s perspectives. Our findings highlight the urgent need for effective memory safeguards in LLM agent design and deployment.
pdf
bib
abs
Watching the Watchers: Exposing Gender Disparities in Machine Translation Quality Estimation
Emmanouil Zaranis
|
Giuseppe Attanasio
|
Sweta Agrawal
|
Andre Martins
Quality estimation (QE)—the automatic assessment of translation quality—has recently become crucial across several stages of the translation pipeline, from data curation to training and decoding. While QE metrics have been optimized to align with human judgments, whether they encode social biases has been largely overlooked. Biased QE risks favoring certain demographic groups over others, e.g., by exacerbating gaps in visibility and usability. This paper defines and investigates gender bias of QE metrics and discusses its downstream implications for machine translation (MT). Experiments with state-of-the-art QE metrics across multiple domains, datasets, and languages reveal significant bias. When a human entity’s gender in the source is undisclosed, masculine-inflected translations score higher than feminine-inflected ones, and gender-neutral translations are penalized. Even when contextual cues disambiguate gender, using context-aware QE metrics leads to more errors in selecting the correct translation inflection for feminine referents than for masculine ones. Moreover, a biased QE metric affects data filtering and quality-aware decoding. Our findings underscore the need for a renewed focus on developing and evaluating QE metrics centered on gender.
pdf
bib
abs
Language Constrained Multimodal Hyper Adapter For Many-to-Many Multimodal Summarization
Nayu Liu
|
Fanglong Yao
|
Haoran Luo
|
Yong Yang
|
Chen Tang
|
Bo Lv
Multimodal summarization (MS) combines text and visuals to generate summaries. Recently, many-to-many multimodal summarization (M3S) has garnered interest as it enables a unified model for multilingual and cross-lingual MS. Existing methods have made progress by facilitating the transfer of common multimodal summarization knowledge. However, prior M3S models that fully share parameters neglect language-specific knowledge learning, and potential interference between languages may limit the flexible adaptation of MS modes across different language combinations and hinder further collaborative improvements in joint M3S training. Based on this observation, we propose Language Constrained Multimodal Hyper Adapter (LCMHA) for M3S. LCMHA integrates language-specific multimodal adapters into multilingual pre-trained backbones via a language constrained hypernetwork, enabling relaxed parameter sharing that enhances language-specific learning while preserving shared MS knowledge learning. In addition, a language-regularized hypernetwork is designed to balance intra- and inter-language learning, generating language-specific adaptation weights and enhancing the retention of distinct language features through the regularization of generated parameters. Experimental results on the M3Sum benchmark show LCMHA’s effectiveness and scalability across multiple multilingual pre-trained backbones.
pdf
bib
abs
PRMBench: A Fine-grained and Challenging Benchmark for Process-Level Reward Models
Mingyang Song
|
Zhaochen Su
|
Xiaoye Qu
|
Jiawei Zhou
|
Yu Cheng
Process-level Reward Models (PRMs) are crucial for complex reasoning and decision-making tasks, where each intermediate step plays an important role in the reasoning process. Since language models are prone to various types of errors during the reasoning process, PRMs are required to possess nuanced capabilities for detecting various implicit error types in real-world scenarios. However, current benchmarks primarily focus on step correctness, failing to evaluate PRMs’ performance systematically. To address this gap, we introduce PRMBench, a process-level benchmark specifically designed to assess the fine-grained error detection capabilities of PRMs. PRMBench comprises 6,216 carefully designed problems and 83,456 step-level labels, evaluating models across multiple dimensions, including simplicity, soundness, and sensitivity. In our experiments on 25 models, spanning both open-source PRMs and closed-source large language models prompted as critic models, we uncover significant weaknesses in current PRMs. These findings underscore the challenges inherent in process-level evaluation and highlight key directions for future research, establishing PRMBench as a robust testbed for advancing research on PRM evaluation and development.
pdf
bib
abs
Efficient Ensemble for Fine-tuning Language Models on Multiple Datasets
Dongyue Li
|
Ziniu Zhang
|
Lu Wang
|
Hongyang R. Zhang
This paper develops an ensemble method for fine-tuning a language model on multiple datasets. Existing methods, such as quantized LoRA (QLoRA), are efficient when adapting to a single dataset. When training on multiple datasets of different tasks, a common setup in practice, it remains unclear how to design an efficient adaptation for fine-tuning language models. We propose to use an ensemble of multiple smaller adapters instead of a single adapter per task. We design an efficient algorithm that partitions n datasets into m groups, where m is typically much smaller than n in practice, and trains one adapter for each group before taking a weighted combination to form the ensemble. The algorithm leverages a first-order approximation property of low-rank adaptation to quickly obtain the fine-tuning performances of dataset combinations since methods like LoRA stay close to the base model. Hence, we use the gradients of the base model to estimate its behavior during fine-tuning. Empirically, this approximation holds with less than 1% error on models with up to 34 billion parameters, leading to an estimation of true fine-tuning performances under 5% error while speeding up computation compared to base fine-tuning by 105 times. When applied to fine-tune Llama and GPT models on ten text classification tasks, our approach provides up to 10% higher average test accuracy over QLoRA, with only 9% more FLOPs. On a Llama model with 34 billion parameters, an ensemble of QLoRA increases test accuracy by 3% compared to QLoRA, with only 8% more FLOPs.
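The first-order approximation mentioned in this abstract can be illustrated numerically. The sketch below is not the paper's algorithm: a two-parameter logistic model stands in for a language model, and the weights, update, and input are hypothetical values chosen only to show that a small update is well predicted from base-model gradients.

```python
import math


def model(w, x):
    """Toy logistic model standing in for a network's output."""
    return 1.0 / (1.0 + math.exp(-sum(wi * xi for wi, xi in zip(w, x))))


def gradient(w, x):
    """Gradient of the toy model's output with respect to its weights."""
    p = model(w, x)
    return [p * (1.0 - p) * xi for xi in x]


def first_order_estimate(w0, delta, x):
    """Estimate model(w0 + delta, x) using only base-model output and gradient."""
    g = gradient(w0, x)
    return model(w0, x) + sum(gi * di for gi, di in zip(g, delta))


w0 = [0.3, -0.2]        # base weights (hypothetical)
delta = [0.01, -0.02]   # small update, standing in for a low-rank adapter
x = [1.0, 2.0]          # one input example

exact = model([a + b for a, b in zip(w0, delta)], x)
approx = first_order_estimate(w0, delta, x)
error = abs(exact - approx)
```

The estimate needs no forward pass through the updated model, which is why gradients of the base model can be reused to score many candidate dataset groupings cheaply when updates stay close to the base weights.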
pdf
bib
abs
Library-Like Behavior In Language Models is Enhanced by Self-Referencing Causal Cycles
Munachiso S Nwadike
|
Zangir Iklassov
|
Toluwani Aremu
|
Tatsuya Hiraoka
|
Benjamin Heinzerling
|
Velibor Bojkovic
|
Hilal AlQuabeh
|
Martin Takáč
|
Kentaro Inui
We introduce the concept of the self-referencing causal cycle (abbreviated ReCall), a mechanism that enables large language models (LLMs) to bypass the limitations of unidirectional causality, which underlies a phenomenon known as the reversal curse. When an LLM is prompted with sequential data, it often fails to recall preceding context. For example, when we ask an LLM to recall the line preceding “O say does that star-spangled banner yet wave” in the U.S. National Anthem, it often fails to correctly return “Gave proof through the night that our flag was still there”; this is due to the reversal curse. It occurs because language models such as ChatGPT and Llama generate text based on preceding tokens, requiring facts to be learned and reproduced in a consistent token order. While the reversal curse is often viewed as a limitation, we offer evidence of an alternative view: it is not always an obstacle in practice. We find that ReCall is driven by what we designate as cycle tokens: sequences that connect different parts of the training data, enabling recall of preceding tokens from succeeding ones. Through rigorous probabilistic formalization and controlled experiments, we demonstrate how the cycles they induce influence a model’s ability to reproduce information. To facilitate reproducibility, we provide our code and experimental details at https://anonymous.4open.science/r/remember-B0B8/.
pdf
bib
abs
Shaping the Safety Boundaries: Understanding and Defending Against Jailbreaks in Large Language Models
Lang Gao
|
Jiahui Geng
|
Xiangliang Zhang
|
Preslav Nakov
|
Xiuying Chen
Jailbreaking in Large Language Models (LLMs) is a major security concern as it can deceive LLMs into generating harmful text. However, understanding of how jailbreaking works remains limited, hindering the development of effective defense strategies. To address this issue, we conduct a large-scale analysis of seven different jailbreak methods and identify that disagreements among methods stem from insufficient observation samples. We introduce the concept of a safety boundary and discover that jailbreaks shift harmful activations outside this boundary, where LLMs become less sensitive to harmful information. Our analysis reveals that low and middle layers play a critical role in these shifts, while deeper layers have a lesser impact. Building on these insights, we propose a novel defense mechanism called Activation Boundary Defense (ABD), which adaptively constrains activations within the safety boundary. To enhance its effectiveness, we use Bayesian optimization to selectively apply the defense to the low and middle layers. Experiments on several benchmark datasets demonstrate that ABD achieves an average Defense Success Rate (DSR) of over 98% against various jailbreak attacks, with less than a 2% impact on the model’s general capabilities.
pdf
bib
abs
ASPERA: A Simulated Environment to Evaluate Planning for Complex Action Execution
Alexandru Coca
|
Mark Gaynor
|
Zhenxing Zhang
|
Jianpeng Cheng
|
Bo-Hsiang Tseng
|
Peter Boothroyd
|
Hector Martinez Alonso
|
Diarmuid O Seaghdha
|
Anders Johannsen
This work evaluates the potential of large language models (LLMs) to power digital assistants capable of complex action execution. Such assistants rely on pre-trained programming knowledge to execute multi-step goals by composing objects and functions defined in assistant libraries into action execution programs. To achieve this, we develop ASPERA, a framework comprising an assistant library simulation and a human-assisted LLM data generation engine. Our engine allows developers to guide LLM generation of high-quality tasks consisting of complex user queries, simulation state and corresponding validation programs, tackling data availability and evaluation robustness challenges. Alongside the framework we release Asper-Bench, an evaluation dataset of 250 challenging tasks generated using ASPERA, which we use to show that program generation grounded in custom assistant libraries is a significant challenge to LLMs compared to dependency-free code generation.
pdf
bib
abs
ReflectDiffu: Reflect between Emotion-intent Contagion and Mimicry for Empathetic Response Generation via a RL-Diffusion Framework
Jiahao Yuan
|
Zixiang Di
|
Zhiqing Cui
|
Guisong Yang
|
Usman Naseem
Empathetic response generation necessitates the integration of emotional and intentional dynamics to foster meaningful interactions. Existing research either neglects the intricate interplay between emotion and intent, leading to suboptimal controllability of empathy, or resorts to large language models (LLMs), which incur significant computational overhead. In this paper, we introduce ReflectDiffu, a lightweight and comprehensive framework for empathetic response generation. This framework incorporates emotion contagion to augment emotional expressiveness and employs an emotion-reasoning mask to pinpoint critical emotional elements. Additionally, it integrates intent mimicry within reinforcement learning for refinement during diffusion. By harnessing an intent twice reflect mechanism of Exploring-Sampling-Correcting, ReflectDiffu adeptly translates emotional decision-making into precise intent actions, thereby addressing empathetic response misalignments stemming from emotional misrecognition. Through reflection, the framework maps emotional states to intents, markedly enhancing both response empathy and flexibility. Comprehensive experiments reveal that ReflectDiffu outperforms existing models regarding relevance, controllability, and informativeness, achieving state-of-the-art results in both automatic and human evaluations.
pdf
bib
abs
SARA: Salience-Aware Reinforced Adaptive Decoding for Large Language Models in Abstractive Summarization
Nayu Liu
|
Junnan Zhu
|
Yiming Ma
|
Zhicong Lu
|
Wenlei Xu
|
Yong Yang
|
Jiang Zhong
|
Kaiwen Wei
LLMs have improved the fluency and informativeness of abstractive summarization but remain prone to hallucinations, where generated content deviates from the source document. Recent PMI decoding strategies mitigate over-reliance on prior knowledge by comparing output probabilities with and without source documents, effectively enhancing contextual utilization and improving faithfulness. However, existing strategies often neglect the explicit use of salient contextual information and rely on static hyperparameters to fix the balance between contextual and prior knowledge, limiting their flexibility. In this work, we propose Salience-Aware Reinforced Adaptive decoding (SARA), which incorporates salient information and allows the model to adaptively determine reliance on the source document’s context, salient context, and the model’s prior knowledge based on pointwise mutual information. Moreover, a tokenwise adaptive decoding mechanism via reinforcement learning is proposed in SARA to dynamically adjust the contributions of context and prior knowledge at each decoding timestep. Experiments on CNN/DM, WikiHow, and NYT50 datasets show that SARA consistently improves the quality and faithfulness of summaries across various LLM backbones without modifying their weights.
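The PMI-style comparison of output probabilities with and without the source document can be illustrated with a minimal sketch. It is not SARA itself: the function name `pmi_adjusted_scores`, the toy distributions, and the fixed weight `alpha` are assumptions made for illustration, and the fixed `alpha` is exactly the kind of static hyperparameter the abstract says SARA replaces with an adaptive, learned balance.

```python
import math

def pmi_adjusted_scores(logp_with_doc, logp_without_doc, alpha=0.5):
    """Re-score next-token candidates with a PMI-style term.

    score(y) = logp_with_doc(y) + alpha * (logp_with_doc(y) - logp_without_doc(y))
    The bracketed difference is the pointwise mutual information between the
    token and the source document. `alpha` is a hypothetical static weight.
    """
    scores = {
        tok: lp + alpha * (lp - logp_without_doc[tok])
        for tok, lp in logp_with_doc.items()
    }
    # Renormalize with log-sum-exp so the result is a valid log-distribution.
    top = max(scores.values())
    log_z = top + math.log(sum(math.exp(s - top) for s in scores.values()))
    return {tok: s - log_z for tok, s in scores.items()}

# Toy next-token distributions (made-up numbers, not from the paper).
with_doc = {"Paris": math.log(0.5), "London": math.log(0.4), "Rome": math.log(0.1)}
without_doc = {"Paris": math.log(0.2), "London": math.log(0.7), "Rome": math.log(0.1)}

adjusted = pmi_adjusted_scores(with_doc, without_doc, alpha=1.0)
best = max(adjusted, key=adjusted.get)
```

In this toy case the token that the document makes more likely ("Paris") gains probability mass, while the token favored mainly by prior knowledge ("London") loses it.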
pdf
bib
abs
Embedding-Converter: A Unified Framework for Cross-Model Embedding Transformation
Jinsung Yoon
|
Sercan O Arik
Embedding models play a crucial role in machine learning. However, the continuous development of new models presents a major challenge: migrating to a potentially superior model often requires the computationally expensive process of re-embedding entire datasets—without any guarantee of performance improvement. This paper presents Embedding-Converter, a novel framework for efficiently transforming embeddings between different models, thus avoiding costly ‘re-embedding’. The proposed approach achieves 100 times faster and cheaper computations in real-world applications. Experiments show that Embedding-Converter not only streamlines transitions to new models, but can also improve upon the source model’s performance, approaching that of the target model. This facilitates efficient evaluation and broader adoption of new embedding models by significantly reducing the overhead of model switching. Furthermore, Embedding-Converter addresses latency limitations by enabling the use of smaller models for online tasks while still benefiting from the performance of larger models offline. By promoting the release of converters alongside new embedding models, Embedding-Converter fosters a more dynamic and accessible ecosystem for embedding model development and deployment.
pdf
bib
abs
Improving Automatic Evaluation of Large Language Models (LLMs) in Biomedical Relation Extraction via LLMs-as-the-Judge
Md Tahmid Rahman Laskar
|
Israt Jahan
|
Elham Dolatabadi
|
Chun Peng
|
Enamul Hoque
|
Jimmy Huang
Large Language Models (LLMs) have demonstrated impressive performance in biomedical relation extraction, even in zero-shot scenarios. However, evaluating LLMs in this task remains challenging due to their ability to generate human-like text, often producing synonyms or abbreviations of gold-standard answers, making traditional automatic evaluation metrics unreliable. On the other hand, while human evaluation is more reliable, it is costly and time-consuming, making it impractical for real-world applications. This paper investigates the use of LLMs-as-the-Judge as an alternative evaluation method for biomedical relation extraction. We benchmark 8 LLMs as judges to evaluate the responses generated by 5 other LLMs across 3 biomedical relation extraction datasets. Unlike other text-generation tasks, we observe that LLM-based judges perform quite poorly (usually below 50% accuracy) in the biomedical relation extraction task. Our findings reveal that this happens mainly because relations extracted by LLMs do not adhere to any standard format. To address this, we propose structured output formatting for LLM-generated responses that helps LLM-Judges improve their performance by about 15% (on average). We also introduce a domain adaptation technique to further enhance LLM-Judge performance by effectively transferring knowledge between datasets. We release both our human-annotated and LLM-annotated judgment data (36k samples in total) for public use here: https://github.com/tahmedge/llm_judge_biomedical_re.
pdf
bib
abs
Answering Complex Geographic Questions by Adaptive Reasoning with Visual Context and External Commonsense Knowledge
Fan Li
|
Jianxing Yu
|
Jielong Tang
|
Wenqing Chen
|
Hanjiang Lai
|
Yanghui Rao
|
Jian Yin
This paper focuses on a new task of answering geographic reasoning questions based on a given image (called GeoVQA). Unlike traditional VQA tasks, GeoVQA asks for details about the image-related culture, landscape, etc. This requires not only the identification of the objects in the image, their properties and relations, but also the understanding of the geographic knowledge of the objects, such as location, transportation, landmark, cuisine, etc. This background knowledge does not explicitly appear in the image, nor is there an extra-textual description. Without this missing but necessary knowledge, it is difficult for existing matching-based methods to infer the correct answer. To tackle these challenges, we propose a new geographic reasoning framework for our task. We first analyze the image and describe its fine-grained content by text and keywords using a multi-modal retrieval augmented technique, so as to deduce an answer in a unified textual modality. Next, we retrieve the crucial geographic commonsense knowledge. To reduce the retrieval complexity, we design a dynamic method that can adaptively collect the relevant clues for each reasoning step. Steps in an incorrect direction are pruned according to judgment criteria. The remaining steps help us form a reasoning chain to derive a correct answer. Moreover, we create a large-scale dataset GVQA with 41,329 samples to conduct the evaluation. The results demonstrate the effectiveness of our approach.
pdf
bib
abs
Safety Alignment via Constrained Knowledge Unlearning
Zesheng Shi
|
Yucheng Zhou
|
Jing Li
|
Yuxin Jin
|
Yu Li
|
Daojing He
|
Fangming Liu
|
Saleh Alharbi
|
Jun Yu
|
Min Zhang
Despite significant progress in safety alignment, large language models (LLMs) remain susceptible to jailbreak attacks. Existing defense mechanisms have not fully deleted harmful knowledge in LLMs, which allows such attacks to bypass safeguards and produce harmful outputs. To address this challenge, we propose a novel safety alignment strategy, Constrained Knowledge Unlearning (CKU), which focuses on two primary objectives: knowledge localization and retention, and unlearning harmful knowledge. CKU works by scoring neurons in specific multilayer perceptron (MLP) layers to identify a subset U of neurons associated with useful knowledge. During the unlearning process, CKU prunes the gradients of neurons in U to preserve valuable knowledge while effectively mitigating harmful content. Experimental results demonstrate that CKU significantly enhances model safety without compromising overall performance, offering a superior balance between safety and utility compared to existing methods. Additionally, our analysis of neuron knowledge sensitivity across various MLP layers provides valuable insights into the mechanics of safety alignment and model knowledge editing.
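The idea of pruning gradients for a protected neuron subset U can be sketched as follows. This is a simplified illustration, not CKU's implementation: each "neuron" is reduced to a single scalar weight, and the scores, the top-k selection rule, and the helper names `top_k_indices` and `masked_update` are all hypothetical.

```python
def top_k_indices(scores, k):
    """Return the indices of the k highest-scoring neurons (the protected set U)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return set(order[:k])

def masked_update(weights, grads, protected, lr=0.1):
    """Apply one gradient step, skipping every neuron whose index is in `protected`.

    Skipping the update is equivalent to zeroing (pruning) that neuron's gradient.
    """
    return [
        w if i in protected else w - lr * g
        for i, (w, g) in enumerate(zip(weights, grads))
    ]

# Toy example with four neurons (made-up numbers).
usefulness = [0.9, 0.1, 0.8, 0.2]      # hypothetical "useful knowledge" scores
U = top_k_indices(usefulness, k=2)     # neurons 0 and 2 are protected
weights = [1.0, 1.0, 1.0, 1.0]
grads = [0.5, 0.5, 0.5, 0.5]           # gradients from a hypothetical unlearning loss
updated = masked_update(weights, grads, U)
```

Only the unprotected neurons (indices 1 and 3) move during the unlearning step; the weights tied to useful knowledge stay fixed.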
pdf
bib
abs
Response Wide Shut? Surprising Observations in Basic Vision Language Model Capabilities
Shivam Chandhok
|
Wan-Cyuan Fan
|
Vered Shwartz
|
Vineeth N. Balasubramanian
|
Leonid Sigal
Vision-language Models (VLMs) have emerged as general-purpose tools for addressing a variety of complex computer vision problems. Such models have been shown to be highly capable, but, at the same time, lacking some basic visual understanding skills. In this paper, we set out to understand the limitations of SoTA VLMs on fundamental visual tasks (object classification, spatial understanding, and ability to delineate individual object instances through counting), by constructing a series of tests that probe which components of design, specifically, may be lacking. Importantly, we go significantly beyond the current benchmarks, which simply measure the final performance of VLM response, by also comparing and contrasting it to the performance of probes trained directly on features obtained from the visual encoder, intermediate vision-language projection and LLM-decoder output. In doing so, we uncover shortcomings in VLMs and make a number of important observations about their capabilities, robustness and how they process visual information. We hope our insights will guide progress in further improving VLMs.
pdf
bib
abs
EffiVLM-BENCH: A Comprehensive Benchmark for Evaluating Training-Free Acceleration in Large Vision-Language Models
Zekun Wang
|
MingHua Ma
|
Zexin Wang
|
Rongchuan Mu
|
Liping Shan
|
Ming Liu
|
Bing Qin
Large Vision-Language Models (LVLMs) have achieved remarkable success, yet their significant computational demands hinder practical deployment. While efforts to improve LVLM efficiency are growing, existing methods lack comprehensive evaluation across diverse backbones, benchmarks, and metrics. In this work, we systematically evaluate mainstream acceleration techniques for LVLMs, categorized into token and parameter compression. We introduce EffiVLM-BENCH, a unified framework for assessing not only absolute performance but also generalization and loyalty, while exploring Pareto-optimal trade-offs. Our extensive experiments and in-depth analyses offer insights into optimal strategies for accelerating LVLMs. We open-source code and recipes for EffiVLM-BENCH to foster future research.
pdf
bib
abs
Pre-Training Curriculum for Multi-Token Prediction in Language Models
Ansar Aynetdinov
|
Alan Akbik
Multi-token prediction (MTP) is a recently proposed pre-training objective for language models. Rather than predicting only the next token (NTP), MTP predicts the next *k* tokens at each prediction step, using multiple prediction heads. MTP has shown promise in improving downstream performance, inference speed, and training efficiency, particularly for large models. However, prior work has shown that smaller language models (SLMs) struggle with the MTP objective. To address this, we propose a curriculum learning strategy for MTP training, exploring two variants: a forward curriculum, which gradually increases the complexity of the pre-training objective from NTP to MTP, and a reverse curriculum, which does the opposite. Our experiments show that the forward curriculum enables SLMs to better leverage the MTP objective during pre-training, improving downstream NTP performance and generative output quality, while retaining the benefits of self-speculative decoding. The reverse curriculum achieves stronger NTP performance and output quality, but fails to provide any self-speculative decoding benefits.
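A forward or reverse curriculum of this kind can be sketched as a schedule over the number of tokens predicted per step. The equal-length stages, the function name `heads_at_step`, and its parameters are assumptions for illustration; the abstract does not specify the actual schedule used in the paper.

```python
def heads_at_step(step, total_steps, k_max, reverse=False):
    """Number of future tokens predicted at a given training step.

    Training is split into k_max equal-length stages (an assumed schedule).
    Forward curriculum: k grows 1 -> k_max (NTP first, full MTP last).
    Reverse curriculum: k shrinks k_max -> 1 (full MTP first, NTP last).
    """
    if not 0 <= step < total_steps:
        raise ValueError("step must be in [0, total_steps)")
    stage = step * k_max // total_steps   # stage index in 0 .. k_max - 1
    k = stage + 1
    return k_max + 1 - k if reverse else k

# Schedules for 8 training steps and up to 4 prediction heads.
forward = [heads_at_step(s, 8, 4) for s in range(8)]
backward = [heads_at_step(s, 8, 4, reverse=True) for s in range(8)]
```

With k = 1 the objective reduces to ordinary next-token prediction, so the forward schedule starts as NTP and ends as full MTP, while the reverse schedule ends as NTP.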
pdf
bib
abs
Can We Further Elicit Reasoning in LLMs? Critic-Guided Planning with Retrieval-Augmentation for Solving Challenging Tasks
Xingxuan Li
|
Weiwen Xu
|
Ruochen Zhao
|
Fangkai Jiao
|
Shafiq Joty
|
Lidong Bing
Large language models excel at problem-solving but often struggle with complex reasoning and factual accuracy. While chain-of-thought and retrieval-augmented generation help break down problems and retrieve knowledge, they still falter on challenging tasks like competitive programming due to frequent reasoning errors and irrelevant retrieval. To address this, we introduce Critic-guided planning with Retrieval-augmentation, CR-Planner, a novel framework that leverages fine-tuned critic models to guide both reasoning and retrieval processes through planning. CR-Planner iteratively selects and executes sub-goals, guided by critic models. A sub-goal critic identifies promising sub-goals from reasoning, query generation, and retrieval, while an execution critic evaluates outputs of sub-goal executions. We employ Monte Carlo Tree Search to collect data for critic training, allowing systematic exploration of action sequences and effective navigation toward the final answer. We evaluate CR-Planner on challenging domain-knowledge-intensive and reasoning-heavy tasks, including competitive programming, theorem-driven math reasoning, and complex domain retrieval problems. It significantly outperforms baselines, demonstrating effectiveness in both reasoning and retrieval.
pdf
bib
abs
On Many-Shot In-Context Learning for Long-Context Evaluation
Kaijian Zou
|
Muhammad Khalifa
|
Lu Wang
Many-shot in-context learning (ICL) has emerged as a unique setup to both utilize and test the ability of large language models to handle long context. This paper delves into long-context language model (LCLM) evaluation through many-shot ICL. We first ask: what types of ICL tasks benefit from additional demonstrations, and how effective are they in evaluating LCLMs? We find that classification and summarization tasks show performance improvements with additional demonstrations, while translation and reasoning tasks do not exhibit clear trends. Next, we investigate the extent to which different tasks necessitate retrieval versus global context understanding. We develop metrics to categorize ICL tasks into two groups: (i) similar-sample learning (SSL): tasks where retrieval of the most similar examples is sufficient for good performance, and (ii) all-sample learning (ASL): tasks that necessitate a deeper comprehension of all examples in the prompt. Lastly, we introduce a new many-shot ICL benchmark built on existing ICL tasks, MANYICLBENCH, to characterize models’ abilities on both fronts, and we benchmark 12 LCLMs using MANYICLBENCH. We find that while state-of-the-art models demonstrate good performance up to 64k tokens in SSL tasks, many models experience significant performance drops at only 16k tokens in ASL tasks.
pdf
bib
abs
HelpSteer3: Human-Annotated Feedback and Edit Data to Empower Inference-Time Scaling in Open-Ended General-Domain Tasks
Zhilin Wang
|
Jiaqi Zeng
|
Olivier Delalleau
|
Daniel Egert
|
Ellie Evans
|
Hoo-Chang Shin
|
Felipe Soares
|
Yi Dong
|
Oleksii Kuchaiev
Inference-Time Scaling has been critical to the success of recent models such as OpenAI o1 and DeepSeek R1. However, many techniques used to train models for inference-time scaling require tasks to have answers that can be verified, limiting their application to domains such as math, coding and logical reasoning. We take inspiration from how humans make first attempts, ask for detailed feedback from others and make improvements based on such feedback across a wide spectrum of open-ended endeavors. To this end, we collect HelpSteer3 data to train dedicated Feedback and Edit Models that are capable of performing inference-time scaling for open-ended general-domain tasks. In our setup, one model generates an initial response, a second model gives feedback on it, and a third model uses that feedback to edit the response. We show that performance on Arena Hard, a benchmark strongly predictive of Chatbot Arena Elo, can be boosted by scaling the number of initial response drafts, effective feedback and edited responses. When scaled optimally, our setup based on 70B models from the Llama 3 family can reach SoTA performance on Arena Hard at 92.7 as of 5 Mar 2025, surpassing OpenAI o1-preview-2024-09-12 with 90.4 and DeepSeek R1 with 92.3.
pdf
bib
abs
CulturalBench: A Robust, Diverse and Challenging Benchmark for Measuring LMs’ Cultural Knowledge Through Human-AI Red-Teaming
Yu Ying Chiu
|
Liwei Jiang
|
Bill Yuchen Lin
|
Chan Young Park
|
Shuyue Stella Li
|
Sahithya Ravi
|
Mehar Bhatia
|
Maria Antoniak
|
Yulia Tsvetkov
|
Vered Shwartz
|
Yejin Choi
Robust, diverse, and challenging cultural knowledge benchmarks are essential for measuring our progress towards making LMs that are helpful across diverse cultures. We introduce CulturalBench: a set of 1,696 human-written and human-verified questions to assess LMs’ cultural knowledge, covering 45 global regions including underrepresented ones like Bangladesh, Zimbabwe, and Peru. Questions are each verified by five independent annotators and span 17 diverse topics ranging from food preferences to greeting etiquette. We construct CulturalBench using methods inspired by Human-AI Red-Teaming. Compared to human performance (92.4% accuracy), the hard version of CulturalBench is challenging even for the best-performing frontier LMs, ranging from 28.7% to 61.5% in accuracy. We find that LMs often struggle with tricky questions that have multiple correct answers (e.g., What utensils do the Chinese usually use?), revealing a tendency to overfit to a single answer. Our results indicate that GPT-4o substantially outperforms other models across cultures, besting local providers (e.g., Mistral on European culture and DeepSeek on Chinese culture). Across the board, models under-perform on questions related to North Africa, South America and the Middle East.
pdf
bib
abs
Balancing the Budget: Understanding Trade-offs Between Supervised and Preference-Based Finetuning
Mohit Raghavendra
|
Junmo Kang
|
Alan Ritter
Post-training of Large Language Models often involves a pipeline of Supervised Finetuning (SFT) followed by Preference Finetuning (PFT) using methods like Direct Preference Optimization. Both stages require annotated data that differ greatly in structure and cost. We study how to optimally allocate a fixed training data budget between the two stages, through extensive experiments spanning four diverse tasks, multiple model sizes and various data annotation costs. Our findings reveal that just SFT on the base model dominates performance in low-data regimes (<1,000 annotated examples). With larger data budgets, we observe that a combination of SFT and PFT, often with increasing portions allocated towards preference data, yields optimal performance. However, completely eliminating SFT and running PFT directly on the base model yields suboptimal performance, described as the cold start problem on tasks like mathematics. We observe that this is due to the distribution shift arising from using DPO directly on the base model to elicit step-by-step reasoning. This limitation can be effectively addressed by allocating even a small portion (<10%) of the budget to SFT first, resulting in performance improvements of 15-20% on analytical benchmarks like GSM8k. These results provide actionable insights for researchers and practitioners optimizing model development under budget constraints, where high-quality data curation often represents a significant portion of the total costs of model development.
pdf
bib
abs
All That Glitters is Not Novel: Plagiarism in AI Generated Research
Tarun Gupta
|
Danish Pruthi
Automating scientific research is considered the final frontier of science. Recently, several papers claim autonomous research agents can generate novel research ideas. Amidst the prevailing optimism, we document a critical concern: a considerable fraction of such research documents are smartly plagiarized. Unlike past efforts where experts evaluate the novelty and feasibility of research ideas, we request 13 experts to operate under a different situational logic: to identify similarities between LLM-generated research documents and existing work. Concerningly, the experts identify 24% of the 50 evaluated research documents to be either paraphrased (with one-to-one methodological mapping), or significantly borrowed from existing work. These reported instances are cross-verified by authors of the source papers. Experts find an additional 32% of ideas to partially overlap with prior work, and a small fraction to be completely original. Problematically, these LLM-generated research documents do not acknowledge original sources, and bypass inbuilt plagiarism detectors. Lastly, through controlled experiments we show that automated plagiarism detectors are inadequate at catching plagiarized ideas from such systems. We recommend a careful assessment of LLM-generated research, and discuss the implications of our findings on academic publishing.
pdf
bib
abs
Writing Like the Best: Exemplar-Based Expository Text Generation
Yuxiang Liu
|
Kevin Chen-Chuan Chang
We introduce the Exemplar-Based Expository Text Generation task, aiming to generate an expository text on a new topic using an exemplar on a similar topic. Current methods fall short due to their reliance on extensive exemplar data, difficulty in adapting topic-specific content, and issues with long-text coherence. To address these challenges, we propose the concept of Adaptive Imitation and present a novel Recurrent Plan-then-Adapt (RePA) framework. RePA leverages large language models (LLMs) for effective adaptive imitation through a fine-grained plan-then-adapt process. RePA also enables recurrent segment-by-segment imitation, supported by two memory structures that enhance input clarity and output coherence. We also develop task-specific evaluation metrics (imitativeness, adaptiveness, and adaptive-imitativeness) using LLMs as evaluators. Experimental results across three diverse datasets that we collected demonstrate that RePA surpasses existing baselines in producing factual, consistent, and relevant texts for this task.
pdf
bib
abs
Temporal Relation Extraction in Clinical Texts: A Span-based Graph Transformer Approach
Rochana Chaturvedi
|
Peyman Baghershahi
|
Sourav Medya
|
Barbara Di Eugenio
Temporal information extraction from unstructured text is essential for contextualizing events and deriving actionable insights, particularly in the medical domain. We address the task of extracting clinical events and their temporal relations using the well-studied I2B2 2012 Temporal Relations Challenge corpus. This task is inherently challenging due to complex clinical language, long documents, and sparse annotations. We introduce GraphTREx, a novel method integrating span-based entity-relation extraction, clinical large pre-trained language models (LPLMs), and Heterogeneous Graph Transformers (HGT) to capture local and global dependencies. Our HGT component facilitates information propagation across the document through innovative global landmarks that bridge distant entities. GraphTREx improves the state of the art with a 5.5% improvement in the tempeval F1 score over the previous best and up to an 8.9% improvement on long-range relations, which present a formidable challenge. We further demonstrate generalizability by establishing a strong baseline on the E3C corpus. This work not only advances temporal information extraction but also lays the groundwork for improved diagnostic and prognostic models through enhanced temporal reasoning.
pdf
bib
abs
Finding A Voice: Exploring the Potential of African American Dialect and Voice Generation for Chatbots
Sarah E. Finch
|
Ellie S. Paek
|
Ikseon Choi
|
Jinho D. Choi
As chatbots become integral to daily life, personalizing systems is key for fostering trust, engagement, and inclusivity. This study examines how linguistic similarity affects chatbot performance, focusing on integrating African American English (AAE) into virtual agents to better serve the African American community. We develop text-based and spoken chatbots using large language models and text-to-speech technology, then evaluate them with AAE speakers against standard English chatbots. Our results show that while text-based AAE chatbots often underperform, spoken chatbots benefit from an African American voice and AAE elements, improving performance and preference. These findings underscore the complexities of linguistic personalization and the dynamics between text and speech modalities, highlighting technological limitations that affect chatbots’ AA speech generation and pointing to promising future research directions.
pdf
bib
abs
Delta-KNN: Improving Demonstration Selection in In-Context Learning for Alzheimer’s Disease Detection
Chuyuan Li
|
Raymond Li
|
Thalia S. Field
|
Giuseppe Carenini
Alzheimer’s Disease (AD) is a progressive neurodegenerative disorder that leads to dementia, and early intervention can greatly benefit from analyzing linguistic abnormalities. In this work, we explore the potential of Large Language Models as health assistants for AD diagnosis from patient-generated text using in-context learning (ICL), where tasks are defined through a few input-output examples. Empirical results reveal that conventional ICL methods, such as similarity-based selection, perform poorly for AD diagnosis, likely due to the inherent complexity of this task. To address this, we introduce Delta-KNN, a novel demonstration selection strategy that enhances ICL performance. Our method leverages a delta score to assess the relative gains of each training example, coupled with a KNN-based retriever that dynamically selects optimal “representatives” for a given input. Experiments on two AD detection datasets across three models demonstrate that Delta-KNN consistently outperforms existing ICL baselines. Notably, when using the Llama-3.1 model, our approach achieves new state-of-the-art results, surpassing even supervised classifiers.
pdf
bib
abs
Help Me Write a Story: Evaluating LLMs’ Ability to Generate Writing Feedback
Hannah Rashkin
|
Elizabeth Clark
|
Fantine Huot
|
Mirella Lapata
Can LLMs provide support to creative writers by giving meaningful writing feedback? In this paper, we explore the challenges and limitations of model-generated writing feedback by defining a new task, dataset, and evaluation frameworks. To study model performance in a controlled manner, we present a novel test set of 1,300 stories that we corrupted to intentionally introduce writing issues. We study the performance of commonly used LLMs in this task with both automatic and human evaluation metrics. Our analysis shows that current models have strong out-of-the-box behavior in many respects—providing specific and mostly accurate writing feedback. However, models often fail to identify the biggest writing issue in the story and to correctly decide when to offer critical vs. positive feedback.
pdf
bib
abs
Language Fusion for Parameter-Efficient Cross-lingual Transfer
Philipp Borchert
|
Ivan Vulić
|
Marie-Francine Moens
|
Jochen De Weerdt
Limited availability of multilingual text corpora for training language models often leads to poor performance on downstream tasks due to undertrained representation spaces for languages other than English. This ‘under-representation’ has motivated recent cross-lingual transfer methods to leverage the English representation space by e.g. mixing English and ‘non-English’ tokens at the input level or extending model parameters to accommodate new languages. However, these approaches often come at the cost of increased computational complexity. We propose Fusion for Language Representations (FLARE) in adapters, a novel method that enhances representation quality and downstream performance for languages other than English while maintaining parameter efficiency. FLARE integrates source and target language representations within low-rank (LoRA) adapters using lightweight linear transformations, maintaining parameter efficiency while improving transfer performance. A series of experiments across representative cross-lingual natural language understanding tasks, including natural language inference, question-answering and sentiment analysis, demonstrate FLARE’s effectiveness. FLARE achieves performance improvements of 4.9% for Llama 3.1 and 2.2% for Gemma 2 compared to standard LoRA fine-tuning on question-answering tasks, as measured by the exact match metric.
pdf
bib
abs
Culture is Not Trivia: Sociocultural Theory for Cultural NLP
Naitian Zhou
|
David Bamman
|
Isaac L. Bleaman
The field of cultural NLP has recently experienced rapid growth, driven by a pressing need to ensure that language technologies are effective and safe across a pluralistic user base. This work has largely progressed without a shared conception of culture, instead choosing to rely on a wide array of cultural proxies. However, this leads to a number of recurring limitations: coarse national boundaries fail to capture nuanced differences that lie within them, limited coverage restricts datasets to only a subset of usually highly-represented cultures, and a lack of dynamicity results in static cultural benchmarks that do not change as culture evolves. In this position paper, we argue that these methodological limitations are symptomatic of a theoretical gap. We draw on a well-developed theory of culture from sociocultural linguistics to fill this gap by 1) demonstrating in a case study how it can clarify methodological constraints and affordances, 2) offering theoretically-motivated paths forward to achieving cultural competence, and 3) arguing that localization is a more useful framing for the goals of much current work in cultural NLP.
pdf
bib
abs
AAD-LLM: Neural Attention-Driven Auditory Scene Understanding
Xilin Jiang
|
Sukru Samet Dindar
|
Vishal Choudhari
|
Stephan Bickel
|
Ashesh Mehta
|
Guy M McKhann
|
Daniel Friedman
|
Adeen Flinker
|
Nima Mesgarani
Auditory foundation models, including auditory large language models (LLMs), process all sound inputs equally, independent of listener perception. However, human auditory perception is inherently selective: listeners focus on specific speakers while ignoring others in complex auditory scenes. Existing models do not incorporate this selectivity, limiting their ability to generate perception-aligned responses. To address this, we introduce intention-informed auditory scene understanding (II-ASU) and present Auditory Attention-Driven LLM (AAD-LLM), a prototype system that integrates brain signals to infer listener attention. AAD-LLM extends an auditory LLM by incorporating intracranial electroencephalography (iEEG) recordings to decode which speaker a listener is attending to and refine responses accordingly. The model first predicts the attended speaker from neural activity, then conditions response generation on this inferred attentional state. We evaluate AAD-LLM on speaker description, speech transcription and extraction, and question answering in multitalker scenarios, with both objective and subjective ratings showing improved alignment with listener intention. By taking a first step toward intention-aware auditory AI, this work explores a new paradigm where listener perception informs machine listening, paving the way for future listener-centered auditory systems. Demo available.
pdf
bib
abs
Do Language Models Have Semantics? On the Five Standard Positions
Anders Søgaard
We identify five positions on whether large language models (LLMs) and chatbots can be said to exhibit semantic understanding. These positions differ in whether they attribute semantics to LLMs and/or chatbots trained on feedback, what kind of semantics they attribute (inferential or referential), and in virtue of what they attribute referential semantics (internal or external causes). This allows for 2^4 = 16 logically possible positions, but we have only seen people argue for five of these. Based on a pairwise comparison of these five positions, we conclude that the better theory of semantics in large language models is, in fact, a sixth combination: Both large language models and chatbots have inferential and referential semantics, grounded in both internal and external causes.
pdf
bib
abs
Dehumanizing Machines: Mitigating Anthropomorphic Behaviors in Text Generation Systems
Myra Cheng
|
Su Lin Blodgett
|
Alicia DeVrio
|
Lisa Egede
|
Alexandra Olteanu
As text generation systems’ outputs are increasingly anthropomorphic—perceived as human-like—scholars have also increasingly raised concerns about how such outputs can lead to harmful outcomes, such as users over-relying or developing emotional dependence on these systems. How to intervene on such system outputs to mitigate anthropomorphic behaviors and their attendant harmful outcomes, however, remains understudied. With this work, we aim to provide empirical and theoretical grounding for developing such interventions. To do so, we compile an inventory of interventions grounded both in prior literature and a crowdsourcing study where participants edited system outputs to make them less human-like. Drawing on this inventory, we also develop a conceptual framework to help characterize the landscape of possible interventions, articulate distinctions between different types of interventions, and provide a theoretical basis for evaluating the effectiveness of different interventions.
pdf
bib
abs
Evaluating Multimodal Language Models as Visual Assistants for Visually Impaired Users
Antonia Karamolegkou
|
Malvina Nikandrou
|
Georgios Pantazopoulos
|
Danae Sanchez Villegas
|
Phillip Rust
|
Ruchira Dhar
|
Daniel Hershcovich
|
Anders Søgaard
This paper explores the effectiveness of Multimodal Large Language Models (MLLMs) as assistive technologies for visually impaired individuals. We conduct a user survey to identify adoption patterns and key challenges users face with such technologies. Despite a high adoption rate of these models, our findings highlight concerns related to contextual understanding, cultural sensitivity, and complex scene understanding, particularly for individuals who may rely solely on them for visual interpretation. Informed by these results, we collate five user-centred tasks with image and video inputs, including a novel task on Optical Braille Recognition. Our systematic evaluation of twelve MLLMs reveals that further advancements are necessary to overcome limitations related to cultural context, multilingual support, Braille reading comprehension, assistive object recognition, and hallucinations. This work provides critical insights into the future direction of multimodal AI for accessibility, underscoring the need for more inclusive, robust, and trustworthy visual assistance technologies.
pdf
bib
abs
HumT DumT: Measuring and controlling human-like language in LLMs
Myra Cheng
|
Sunny Yu
|
Dan Jurafsky
Should LLMs generate language that makes them seem human? Human-like language might improve user experience, but might also lead to deception, overreliance, and stereotyping. Assessing these potential impacts requires a systematic way to measure human-like tone in LLM outputs. We introduce HumT and SocioT, metrics for human-like tone and other dimensions of social perceptions in text data based on relative probabilities from an LLM. By measuring HumT across preference and usage datasets, we find that users prefer less human-like outputs from LLMs in many contexts. HumT also offers insights into the perceptions and impacts of anthropomorphism: human-like LLM outputs are highly correlated with warmth, social closeness, femininity, and low status, which are closely linked to the aforementioned harms. We introduce DumT, a method using HumT to systematically control and reduce the degree of human-like tone while preserving model performance. DumT offers a practical approach for mitigating risks associated with anthropomorphic language generation.
pdf
bib
abs
ChatBench: From Static Benchmarks to Human-AI Evaluation
Serina Chang
|
Ashton Anderson
|
Jake M. Hofman
With the rapid adoption of LLM-based chatbots, there is a pressing need to evaluate what humans and LLMs can achieve together. However, standard benchmarks, such as MMLU, measure LLM capabilities in isolation (i.e., “AI-alone”). Here, we design and conduct a user study to convert MMLU questions into user-AI conversations, by seeding the user with the question and having them carry out a conversation with the LLM to answer their question. We release ChatBench, a new dataset with AI-alone, user-alone, and user-AI data for 396 questions and two LLMs, including 144K answers and 7,336 user-AI conversations. We find that AI-alone accuracy fails to predict user-AI accuracy, with significant differences across multiple subjects (math, physics, and moral reasoning), and we analyze the user-AI conversations to provide insight into how they diverge from AI-alone benchmarks. Finally, we show that fine-tuning a user simulator on a subset of ChatBench improves its ability to estimate user-AI accuracies, increasing correlation on held-out questions by more than 20 points, creating possibilities for scaling interactive evaluation.
pdf
bib
abs
Teaching an Old LLM Secure Coding: Localized Preference Optimization on Distilled Preferences
Mohammad Saqib Hasan
|
Saikat Chakraborty
|
Santu Karmaker
|
Niranjan Balasubramanian
LLM-generated code often contains security issues. We address two key challenges in improving secure code generation. First, obtaining high-quality training data covering a broad set of security issues is critical. To address this, we introduce a method for distilling a preference dataset of insecure and secure code pairs from frontier LLMs, along with security reasoning that explains the issues and the fix. The key idea here is to make use of security knowledge sources to devise a systematic prompting strategy that ensures broad coverage. Second, aligning models to secure code requires focusing on localized regions of code. Direct preference optimization methods, like SimPO, are not designed to handle these localized differences and turn out to be ineffective. We address this with a new localized preference optimization algorithm that masks the security-related tokens in both the winning (secure) and losing (insecure) responses. To prevent loss in code quality, we also add a regularizer. Evaluations show that both training on our dataset, DiSCo, and the new preference optimization algorithm, LPO, yield substantial reductions in code insecurity while also improving overall code quality. Code and dataset are available at https://github.com/StonyBrookNLP/disco-lpo.
pdf
bib
abs
Anything Goes? A Crosslinguistic Study of (Im)possible Language Learning in LMs
Xiulin Yang
|
Tatsuya Aoyama
|
Yuekun Yao
|
Ethan Wilcox
Do language models (LMs) offer insights into human language learning? A common argument against this idea is that because their architecture and training paradigm are so vastly different from humans, LMs can learn arbitrary inputs as easily as natural languages. We test this claim by training LMs to model impossible and typologically unattested languages. Unlike previous work, which has focused exclusively on English, we conduct experiments on 12 languages from 4 language families with two newly constructed parallel corpora. Our results show that while GPT-2 small can largely distinguish attested languages from their impossible counterparts, it does not achieve perfect separation between all the attested languages and all the impossible ones. We further test whether GPT-2 small distinguishes typologically attested from unattested languages with different NP orders by manipulating word order based on Greenberg’s Universal 20. We find that the model’s perplexity scores do not distinguish attested vs. unattested word orders, while its performance on the generalization test does. These findings suggest that LMs exhibit some human-like inductive biases, though these biases are weaker than those found in human learners.
pdf
bib
abs
Ranking Unraveled: Recipes for LLM Rankings in Head-to-Head AI Combat
Roland Daynauth
|
Christopher Clarke
|
Krisztian Flautner
|
Lingjia Tang
|
Jason Mars
Evaluating large language models (LLMs) is a complex task. Pairwise ranking has emerged as a state-of-the-art method to evaluate human preferences by having humans compare pairs of LLM outputs based on predefined criteria, enabling ranking across multiple LLMs by aggregating pairwise results through algorithms like Elo. However, applying these ranking algorithms in the context of LLM evaluation introduces several challenges, such as inconsistent ranking results when using Elo. Currently, there is a lack of systematic study of these ranking algorithms in evaluating LLMs. In this paper, we explore the effectiveness of ranking systems for head-to-head comparisons of LLMs. We formally define a set of fundamental principles for effective ranking and conduct extensive evaluations on the robustness of several ranking algorithms in the context of LLMs. Our analysis uncovers key insights into the factors that affect ranking accuracy and efficiency, offering guidelines for selecting the most appropriate methods based on specific evaluation contexts and resource constraints.
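As a rough illustration of the aggregation step mentioned in this abstract, the sketch below applies the standard Elo update to a list of pairwise outcomes. The function names, the starting rating of 1000, and the K-factor of 32 are illustrative assumptions, not values taken from the paper. The toy data also shows one way Elo can be inconsistent: the same three results processed in a different order produce a different leader.

```python
def expected_score(rating_a, rating_b):
    """Standard Elo expected score of A against B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def elo_update(rating_a, rating_b, score_a, k=32.0):
    """Return updated (A, B) ratings; score_a is 1 for a win, 0.5 for a tie, 0 for a loss."""
    exp_a = expected_score(rating_a, rating_b)
    delta = k * (score_a - exp_a)
    return rating_a + delta, rating_b - delta


def rank_models(matches, initial=1000.0, k=32.0):
    """Process (model_a, model_b, score_a) tuples sequentially, in the given order."""
    ratings = {}
    for a, b, score_a in matches:
        ra = ratings.setdefault(a, initial)
        rb = ratings.setdefault(b, initial)
        ratings[a], ratings[b] = elo_update(ra, rb, score_a, k)
    return ratings


# Hypothetical comparisons: A beats B, B beats C, C beats A.
matches = [("A", "B", 1.0), ("B", "C", 1.0), ("C", "A", 1.0)]
forward = rank_models(matches)
backward = rank_models(list(reversed(matches)))
print(sorted(forward, key=forward.get, reverse=True))
print(sorted(backward, key=backward.get, reverse=True))
```

Because each update depends on the ratings at the time of the match, sequential Elo is sensitive to match order; order-independent alternatives exist, but which algorithms the paper evaluates is not stated in the abstract.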
pdf
bib
abs
LLM Agents Making Agent Tools
Georg Wölflein
|
Dyke Ferber
|
Daniel Truhn
|
Ognjen Arandjelovic
|
Jakob Nikolas Kather
Tool use has turned large language models (LLMs) into powerful agents that can perform complex multi-step tasks by dynamically utilising external software components. However, these tools must be implemented in advance by human developers, hindering the applicability of LLM agents in domains demanding large numbers of highly specialised tools, like in life sciences and medicine. Motivated by the growing trend of scientific studies accompanied by public code repositories, we propose ToolMaker, an agentic framework that autonomously transforms papers with code into LLM-compatible tools. Given a GitHub URL and short task description, ToolMaker autonomously installs dependencies and generates code to perform the task, using a closed-loop self-correction mechanism for debugging. To evaluate our approach, we introduce a benchmark comprising 15 complex computational tasks spanning various domains with over 100 unit tests to assess correctness and robustness. Our method correctly implements 80% of the tasks, substantially outperforming current state-of-the-art software engineering agents. ToolMaker therefore is a step towards fully autonomous agent-based scientific workflows.
pdf
bib
abs
CrafText Benchmark: Advancing Instruction Following in Complex Multimodal Open-Ended World
Zoya Volovikova
|
Gregory Gorbov
|
Petr Kuderov
|
Aleksandr Panov
|
Alexey Skrynnik
Following instructions in real-world conditions requires a capability to adapt to the world’s volatility and entanglement: the environment is dynamic and unpredictable, instructions can be linguistically complex with diverse vocabulary, and the number of possible goals an agent may encounter is vast. Despite extensive research in this area, most studies are conducted in static environments with simple instructions and a limited vocabulary, making it difficult to assess agent performance in more diverse and challenging settings. To address this gap, we introduce CrafText, a benchmark for evaluating instruction following in a multimodal environment with diverse instructions and dynamic interactions. CrafText includes 3,924 instructions with 3,423 unique words, covering Localization, Conditional, Building, and Achievement tasks. Additionally, we propose an evaluation protocol that measures an agent’s ability to generalize to novel instruction formulations and dynamically evolving task configurations, providing a rigorous test of both linguistic understanding and adaptive decision-making.
pdf
bib
abs
QG-SMS: Enhancing Test Item Analysis via Student Modeling and Simulation
Bang Nguyen
|
Tingting Du
|
Mengxia Yu
|
Lawrence Angrave
|
Meng Jiang
While the Question Generation (QG) task has been increasingly adopted in educational assessments, its evaluation remains limited by approaches that lack a clear connection to the educational values of test items. In this work, we introduce test item analysis, a method frequently used by educators to assess test question quality, into QG evaluation. Specifically, we construct pairs of candidate questions that differ in quality across dimensions such as topic coverage, item difficulty, item discrimination, and distractor efficiency. We then examine whether existing QG evaluation approaches can effectively distinguish these differences. Our findings reveal significant shortcomings in these approaches with respect to accurately assessing test item quality in relation to student performance. To address this gap, we propose a novel QG evaluation framework, QG-SMS, which leverages Large Language Models for Student Modeling and Simulation to perform test item analysis. As demonstrated in our extensive experiments and human evaluation study, the additional perspectives introduced by the simulated student profiles lead to a more effective and robust assessment of test items.
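Two of the dimensions named in this abstract, item difficulty and item discrimination, have textbook definitions in classical test theory, sketched below. This is an assumption about what the terms usually mean, not the QG-SMS implementation: difficulty is taken as the proportion of students answering correctly, and discrimination as the upper-group minus lower-group proportion correct, using the conventional 27% split. The response matrix is made up.

```python
def item_difficulty(responses, item):
    """Proportion of students who answered the item correctly (higher means easier)."""
    return sum(row[item] for row in responses) / len(responses)


def item_discrimination(responses, item, fraction=0.27):
    """Upper-lower discrimination index: proportion correct among the
    top-scoring group minus proportion correct among the bottom-scoring group."""
    ranked = sorted(responses, key=sum, reverse=True)  # best total score first
    n = max(1, round(len(ranked) * fraction))
    upper, lower = ranked[:n], ranked[-n:]
    return (sum(row[item] for row in upper) - sum(row[item] for row in lower)) / n


# Hypothetical 0/1 response matrix: rows are students, columns are items.
responses = [
    [1, 1, 1, 1],
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [0, 0, 0, 1],
]
for item in range(4):
    print(item, item_difficulty(responses, item), item_discrimination(responses, item))
```

In this toy matrix the last item is answered correctly by everyone, so it separates strong from weak students not at all, which is the kind of quality difference item analysis is meant to surface.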
pdf
bib
abs
Causal Graph based Event Reasoning using Semantic Relation Experts
Mahnaz Koupaee
|
Xueying Bai
|
Mudan Chen
|
Greg Durrett
|
Nathanael Chambers
|
Niranjan Balasubramanian
Understanding how events in a scenario causally connect with each other is important for effectively modeling and reasoning about events. But event reasoning remains a difficult challenge, and despite recent advances, Large Language Models (LLMs) still struggle to accurately identify causal connections between events. This struggle leads to poor performance on deeper reasoning tasks like event forecasting and timeline understanding. To address this challenge, we investigate the generation of causal event graphs (e.g., A enables B) as a parallel mechanism to help LLMs explicitly represent causality during inference. This paper evaluates both how to generate correct graphs and how graphs can assist reasoning. We propose a collaborative approach to causal graph generation where we use LLMs to simulate experts that focus on specific semantic relations. The experts engage in multiple rounds of discussions which are then consolidated by a final expert. Then, to demonstrate the utility of causal graphs, we use them on multiple downstream applications, and also introduce a new explainable event prediction task that requires a causal chain of events in the explanation. These explanations are more informative and coherent than baseline generations. Finally, our overall approach, which is not finetuned on any downstream task, achieves competitive results with state-of-the-art models on both forecasting and next event prediction tasks.
pdf
bib
abs
LogicPro: Improving Complex Logical Reasoning via Program-Guided Learning
Jin Jiang
|
Yuchen Yan
|
Yang Liu
|
Jianing Wang
|
Shuai Peng
|
Xunliang Cai
|
Yixin Cao
|
Mengdi Zhang
|
Liangcai Gao
In this paper, we propose a new data synthesis method called LogicPro, which leverages LeetCode-style algorithm Problems and their corresponding Program solutions to synthesize Complex Logical Reasoning data in text format. First, we synthesize complex reasoning problems through source algorithm problems and test cases. Then, standard answers and intermediate variable outputs are obtained for each problem based on standard Python solutions and test cases. Finally, with the guidance of code intermediate variables, we synthesize the text reasoning process for each reasoning problem. Through this method, we can synthesize data that is difficult, scalable, effective, and comes with gold-standard answers and high-quality reasoning processes. As a result, with our 540K synthesized dataset constructed solely from 2,360 algorithm problems, our approach achieves significant improvements in multiple models on datasets including BBH^27, LogicBench, DROP, AR-LSAT, and GSM8K, outperforming a wide range of existing reasoning datasets.
pdf
bib
abs
Do LLMs Understand Dialogues? A Case Study on Dialogue Acts
Ayesha Qamar
|
Jonathan Tong
|
Ruihong Huang
Recent advancements in NLP, largely driven by Large Language Models (LLMs), have significantly improved performance on an array of tasks. However, Dialogue Act (DA) classification remains challenging, particularly in the fine-grained 50-class, multiparty setting. This paper investigates the root causes of LLMs’ poor performance in DA classification through a linguistically motivated analysis. We identify three key pre-tasks essential for accurate DA prediction: Turn Management, Communicative Function Identification, and Dialogue Structure Prediction. Our experiments reveal that LLMs struggle with these fundamental tasks, often failing to outperform simple rule-based baselines. Additionally, we establish a strong empirical correlation between errors in these pre-tasks and DA classification failures. A human study further highlights the significant gap between LLM and human-level dialogue understanding. These findings indicate that LLMs’ shortcomings in dialogue comprehension hinder their ability to accurately predict DAs, highlighting the need for improved dialogue-aware training approaches.
pdf
bib
abs
Research Borderlands: Analysing Writing Across Research Cultures
Shaily Bhatt
|
Tal August
|
Maria Antoniak
Improving cultural competence of language technologies is important. However, most recent works rarely engage with the communities they study, and instead rely on synthetic setups and imperfect proxies of culture. In this work, we take a human-centered approach to discover and measure language-based cultural norms, and cultural competence of LLMs. We focus on a single kind of culture, *research cultures*, and a single task, *adapting writing across research cultures*. Through a set of interviews with interdisciplinary researchers, who are experts at moving between cultures, we create a framework of structural, stylistic, rhetorical, and citational norms that vary across research cultures. We operationalise these features with a suite of computational metrics and use them for (a) surfacing latent cultural norms in human-written research papers at scale; and (b) highlighting the lack of cultural competence of LLMs, and their tendency to homogenize writing. Overall, our work illustrates the efficacy of a human-centered approach to measuring cultural norms in human-written and LLM-generated texts.
pdf
bib
abs
CEAES: Bidirectional Reinforcement Learning Optimization for Consistent and Explainable Essay Assessment
Xia Li
|
Wenjing Pan
Most current automated essay quality assessment systems treat score prediction and feedback generation as separate tasks, overlooking the fact that scores provide a quantitative evaluation of quality, while feedback offers a qualitative assessment. Both aspects reflect essay quality from different perspectives, and they are inherently consistent and can reinforce each other. In this paper, we propose a novel bidirectional reinforcement learning framework that effectively utilizes this consistency constraint to jointly optimize score prediction and feedback generation, ensuring mutual reinforcement and alignment between them. In this way, our model aims to produce accurate ratings and consistent textual feedback simultaneously. We conducted extensive experiments on publicly available datasets. The results demonstrate that our approach surpasses the current state-of-the-art models, enhancing both scoring accuracy and feedback quality.
pdf
bib
abs
DeAL: Decoding-time Alignment for Large Language Models
James Y. Huang
|
Sailik Sengupta
|
Daniele Bonadiman
|
Yi-An Lai
|
Arshit Gupta
|
Nikolaos Pappas
|
Saab Mansour
|
Katrin Kirchhoff
|
Dan Roth
Large Language Models (LLMs) are nowadays expected to generate content aligned with human preferences. Current work focuses on alignment at model training time, through techniques such as Reinforcement Learning with Human Feedback (RLHF). However, it is unclear if such methods are an effective choice to teach alignment objectives to the model. First, the inability to incorporate multiple, custom rewards and reliance on a model developer’s view of universal and static principles are key limitations. Second, the reliability of such approaches is also questionable (e.g. susceptibility to jailbreaking even after safety training). To address these issues, we propose DeAL, a framework that allows the user to customize reward functions and enables Decoding-time Alignment of LLMs (DeAL). At its core, we view decoding as a heuristic-guided search process and facilitate the use of a wide variety of alignment objectives. Our experiments with programmatic constraints such as keyword and length constraints, and abstract objectives such as harmlessness and helpfulness, show that we can DeAL with fine-grained trade-offs and improve adherence to alignment objectives. Lastly, we demonstrate that DeAL is largely complementary to existing alignment strategies, and can be effectively paired with RLHF and prompting techniques to achieve better alignment.
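The idea of treating decoding as a heuristic-guided search can be sketched with a small beam search in which each candidate is ranked by its language-model log-probability plus a weighted, user-supplied reward. Everything here is a hypothetical stand-in: the toy next-token distribution, the reward function, the `alpha` weight, and the additive scoring rule are assumptions for illustration, and the abstract does not specify how DeAL combines these signals.

```python
import math


def guided_beam_search(next_token_probs, reward_fn, steps, beam_size=2, alpha=1.0):
    """Beam search that ranks candidates by log-prob + alpha * reward.

    next_token_probs(seq) returns {token: probability} for the next position;
    reward_fn(seq) returns a scalar alignment score for a partial sequence.
    """
    beams = [((), 0.0)]  # (token tuple, cumulative log-probability)
    for _ in range(steps):
        candidates = []
        for seq, logp in beams:
            for token, prob in next_token_probs(seq).items():
                candidates.append((seq + (token,), logp + math.log(prob)))
        # The reward only steers the ranking; stored scores stay pure log-probs.
        candidates.sort(key=lambda c: c[1] + alpha * reward_fn(c[0]), reverse=True)
        beams = candidates[:beam_size]
    return beams[0][0]


def toy_lm(seq):
    """Hypothetical model that always slightly prefers the undesired token."""
    return {"good": 0.4, "bad": 0.6}


def keyword_penalty(seq):
    """Hypothetical programmatic constraint: penalize each use of 'bad'."""
    return -5.0 * seq.count("bad")


print(guided_beam_search(toy_lm, keyword_penalty, steps=2, alpha=0.0))
print(guided_beam_search(toy_lm, keyword_penalty, steps=2, alpha=1.0))
```

With `alpha=0.0` the search reduces to ordinary beam search over model likelihood; raising `alpha` trades likelihood for adherence to the constraint, which is the kind of fine-grained trade-off the abstract describes.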
pdf
bib
abs
Cultural Bias Matters: A Cross-Cultural Benchmark Dataset and Sentiment-Enriched Model for Understanding Multimodal Metaphors
Senqi Yang
|
Dongyu Zhang
|
Jing Ren
|
Ziqi Xu
|
Xiuzhen Zhang
|
Yiliao Song
|
Hongfei Lin
|
Feng Xia
Metaphors are pervasive in communication, making them crucial for natural language processing (NLP). Previous research on automatic metaphor processing predominantly relies on training data consisting of English samples, which often reflect Western European or North American biases. This cultural skew can lead to an overestimation of model performance and contributions to NLP progress. However, the impact of cultural bias on metaphor processing, particularly in multimodal contexts, remains largely unexplored. To address this gap, we introduce MultiMM, a Multicultural Multimodal Metaphor dataset designed for cross-cultural studies of metaphor in Chinese and English. MultiMM consists of 8,461 text-image advertisement pairs, each accompanied by fine-grained annotations, providing a deeper understanding of multimodal metaphors beyond a single cultural domain. Additionally, we propose Sentiment-Enriched Metaphor Detection (SEMD), a baseline model that integrates sentiment embeddings to enhance metaphor comprehension across cultural backgrounds. Experimental results validate the effectiveness of SEMD on metaphor detection and sentiment analysis tasks. We hope this work increases awareness of cultural bias in NLP research and contributes to the development of fairer and more inclusive language models.
pdf
bib
abs
OmniCharacter: Towards Immersive Role-Playing Agents with Seamless Speech-Language Personality Interaction
Haonan Zhang
|
Run Luo
|
Xiong Liu
|
Yuchuan Wu
|
Ting-En Lin
|
Pengpeng Zeng
|
Qiang Qu
|
Feiteng Fang
|
Min Yang
|
Lianli Gao
|
Jingkuan Song
|
Fei Huang
|
Yongbin Li
Role-Playing Agents (RPAs), which build on large language models, are an emerging class of interactive AI systems that simulate roles or characters with diverse personalities. However, existing methods primarily focus on mimicking dialogues among roles in textual form, neglecting the role’s voice traits (e.g., voice style and emotions), which play a crucial part in interaction and make for more immersive experiences in realistic scenarios. To this end, we propose OmniCharacter, a first seamless speech-language personality interaction model to achieve immersive RPAs with low latency. Specifically, OmniCharacter enables agents to consistently exhibit role-specific personality traits and vocal traits throughout the interaction, enabling a mixture of speech and language responses. To align the model with speech-language scenarios, we construct a dataset named OmniCharacter-10K, which involves more distinctive characters (20), richly contextualized multi-round dialogue (10K), and dynamic speech response (135K). Experimental results showcase that our method yields better responses in terms of both content and style compared to existing RPAs and mainstream speech-language models, with a response latency as low as 289ms.
pdf
bib
abs
Mixtures of In-Context Learners
Giwon Hong
|
Emile Van Krieken
|
Edoardo Ponti
|
Nikolay Malkin
|
Pasquale Minervini
In-context learning (ICL) adapts LLMs by providing demonstrations without fine-tuning the model parameters; however, it is very sensitive to the choice of in-context demonstrations, and processing many demonstrations can be computationally demanding. We propose Mixtures of In-Context Learners (MoICL), a novel approach that uses subsets of demonstrations to train a set of experts via ICL and learns a weighting function to merge their output distributions via gradient-based optimisation. In our experiments, we show performance improvements on 5 out of 7 classification datasets compared to a set of strong baselines (e.g., up to +13% compared to ICL and LENS). Moreover, we improve the Pareto frontier of ICL by reducing the inference time needed to achieve the same performance with fewer demonstrations. Finally, MoICL is more robust to out-of-domain (up to +11%), imbalanced (up to +49%) and perturbed demonstrations (up to +38%).
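A minimal sketch of the merging step described above, under stated assumptions: each expert is represented only by the class distribution it outputs for an example, the experts are combined by a linear mixture in probability space with softmax-normalized scalar weights, and the weights are fit by plain gradient descent on the negative log-likelihood. The abstract does not say whether MoICL uses this exact parameterization, and the toy distributions are invented.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mix(expert_probs, logits):
    """Weighted mixture of per-expert class distributions."""
    weights = softmax(logits)
    n_classes = len(expert_probs[0])
    return [sum(w * p[c] for w, p in zip(weights, expert_probs)) for c in range(n_classes)]


def nll_grad(expert_probs, logits, label):
    """Gradient of -log(mixture[label]) with respect to the weight logits."""
    weights = softmax(logits)
    mixed = sum(w * p[label] for w, p in zip(weights, expert_probs))
    return [w - w * p[label] / mixed for w, p in zip(weights, expert_probs)]


def train(examples, n_experts, lr=0.5, steps=100):
    """Fit the weight logits; each example is (per-expert distributions, gold label)."""
    logits = [0.0] * n_experts
    for _ in range(steps):
        for expert_probs, label in examples:
            grad = nll_grad(expert_probs, logits, label)
            logits = [z - lr * g for z, g in zip(logits, grad)]
    return logits


# Hypothetical outputs of two experts (two demonstration subsets) on two examples.
examples = [
    ([[0.9, 0.1], [0.2, 0.8]], 0),  # expert 0 is right, expert 1 is wrong
    ([[0.1, 0.9], [0.7, 0.3]], 1),
]
learned = train(examples, n_experts=2)
print(softmax(learned))
```

Because the learned weights are scalars per expert, an expert built from noisy or out-of-domain demonstrations can simply be down-weighted, which is consistent with the robustness results the abstract reports.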
pdf
bib
abs
Balancing Diversity and Risk in LLM Sampling: How to Select Your Method and Parameter for Open-Ended Text Generation
Yuxuan Zhou
|
Margret Keuper
|
Mario Fritz
Sampling-based decoding strategies have been widely adopted for Large Language Models (LLMs) in numerous applications, targeting a balance between diversity and quality via temperature tuning and tail truncation. Considering the strong dependency of the candidate next tokens on different prefixes, recent studies propose to adaptively truncate the tail of LLMs’ predicted distribution. Although improved results have been reported with these methods on open-ended text generation tasks, the results are highly dependent on the curated parameters and the limited exemplar text. In this paper, we propose a systematic way to estimate the intrinsic capacity of a truncation sampling method by considering the trade-off between diversity and risk at each decoding step, based on our collected prefix tree which preserves the context of a full sentence. Our work offers a comprehensive comparison of existing truncation sampling methods and serves as a practical user guideline for their parameter selection. Our code is available at https://anonymous.4open.science/r/Truncation-Sampling-Evaluation-251F.
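For readers unfamiliar with tail truncation, the sketch below shows one widely used truncation rule, top-p (nucleus) sampling: keep the smallest set of highest-probability tokens whose cumulative mass reaches `p`, then renormalize. It illustrates the family of methods being compared, not the paper's evaluation procedure, and the toy distribution is made up.

```python
def top_p_truncate(probs, p):
    """Keep the most likely tokens until their cumulative probability
    reaches p, drop the remaining tail, and renormalize what is left."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, cumulative = [], 0.0
    for token, prob in ranked:
        kept.append((token, prob))
        cumulative += prob
        if cumulative >= p:
            break
    total = sum(prob for _, prob in kept)
    return {token: prob / total for token, prob in kept}


# Hypothetical next-token distribution.
next_token = {"a": 0.5, "b": 0.3, "c": 0.15, "d": 0.05}
print(top_p_truncate(next_token, 0.7))
print(top_p_truncate(next_token, 0.99))
```

A smaller `p` removes more of the tail, lowering the risk of sampling an implausible token at the cost of diversity; this is the trade-off the abstract proposes to estimate systematically.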
pdf
bib
abs
RADAR: Enhancing Radiology Report Generation with Supplementary Knowledge Injection
Wenjun Hou
|
Yi Cheng
|
Kaishuai Xu
|
Heng Li
|
Yan Hu
|
Wenjie Li
|
Jiang Liu
Large language models (LLMs) have demonstrated remarkable capabilities in various domains, including radiology report generation. Previous approaches have attempted to utilize multimodal LLMs for this task, enhancing their performance through the integration of domain-specific knowledge retrieval. However, these approaches often overlook the knowledge already embedded within the LLMs, leading to redundant information integration. To address this limitation, we propose Radar, a framework for enhancing radiology report generation with supplementary knowledge injection. Radar improves report generation by systematically leveraging both the internal knowledge of an LLM and externally retrieved information. Specifically, it first extracts the model’s acquired knowledge that aligns with expert image-based classification outputs. It then retrieves relevant supplementary knowledge to further enrich this information. Finally, by aggregating both sources, Radar generates more accurate and informative radiology reports. Extensive experiments on MIMIC-CXR, CheXpert-Plus, and IU X-ray demonstrate that our model outperforms state-of-the-art LLMs in both language quality and clinical accuracy.
pdf
bib
abs
Can LLMs Deceive CLIP? Benchmarking Adversarial Compositionality of Pre-trained Multimodal Representation via Text Updates
Jaewoo Ahn
|
Heeseung Yun
|
Dayoon Ko
|
Gunhee Kim
While pre-trained multimodal representations (e.g., CLIP) have shown impressive capabilities, they exhibit significant compositional vulnerabilities leading to counterintuitive judgments. We introduce Multimodal Adversarial Compositionality (MAC), a benchmark that leverages large language models (LLMs) to generate deceptive text samples to exploit these vulnerabilities across different modalities and evaluates them through both sample-wise attack success rate and group-wise entropy-based diversity. To improve zero-shot methods, we propose a self-training approach that leverages rejection-sampling fine-tuning with diversity-promoting filtering, which enhances both attack success rate and sample diversity. Using smaller language models like Llama-3.1-8B, our approach demonstrates superior performance in revealing compositional vulnerabilities across various multimodal representations, including images, videos, and audio.
pdf
bib
abs
Attention Speaks Volumes: Localizing and Mitigating Bias in Language Models
Rishabh Adiga
|
Besmira Nushi
|
Varun Chandrasekaran
We believe that analyzing attention is crucial for understanding bias in large language models (LLMs); in ambiguous comparative prompting frameworks, it provides insight into how the LLM distributes its focus across different entities, and how this contributes to biased decisions. To this end, we first introduce a metric to quantify the “entity preference” of an LLM. We then propose ATLAS, a technique to localize bias to specific layers of the LLM by analyzing attention scores and then reduce bias by scaling attention in these biased layers. To evaluate our method, we conduct extensive experiments across 3 datasets, 4 models, and 4 baseline approaches. Our experiments demonstrate that bias is concentrated in the later layers, typically around the last third. We also show that ATLAS effectively mitigates bias through targeted interventions without compromising downstream performance, with an average increase of only 0.34% in perplexity when the intervention is applied. We see an average improvement of 0.28 points in the bias score across all the datasets.
pdf
bib
abs
MTSA: Multi-turn Safety Alignment for LLMs through Multi-round Red-teaming
Weiyang Guo
|
Jing Li
|
Wenya Wang
|
Yu Li
|
Daojing He
|
Jun Yu
|
Min Zhang
The proliferation of jailbreak attacks against large language models (LLMs) highlights the need for robust security measures. However, in multi-round dialogues, malicious intentions may be hidden in interactions, leading LLMs to be more prone to produce harmful responses. In this paper, we propose the Multi-Turn Safety Alignment (MTSA) framework to address the challenge of securing LLMs in multi-round interactions. It consists of two stages: In the thought-guided attack learning stage, the red-team model learns about thought-guided multi-round jailbreak attacks to generate adversarial prompts. In the adversarial iterative optimization stage, the red-team model and the target model continuously improve their respective capabilities in interaction. Furthermore, we introduce a multi-turn reinforcement learning algorithm based on future rewards to enhance the robustness of safety alignment. Experimental results show that the red-team model exhibits state-of-the-art attack capabilities, while the target model significantly improves its performance on safety benchmarks.
pdf
bib
abs
The Efficiency vs. Accuracy Trade-off: Optimizing RAG-Enhanced LLM Recommender Systems Using Multi-Head Early Exit
Huixue Zhou
|
Hengrui Gu
|
Zaifu Zhan
|
Xi Liu
|
Kaixiong Zhou
|
Yongkang Xiao
|
Mingfu Liang
|
Srinivas Prasad Govindan
|
Piyush Chawla
|
Jiyan Yang
|
Xiangfei Meng
|
Huayu Li
|
Buyun Zhang
|
Liang Luo
|
Wen-Yen Chen
|
Yiping Han
|
Bo Long
|
Rui Zhang
|
Tianlong Chen
The deployment of Large Language Models (LLMs) in recommender systems for Click-Through Rate (CTR) prediction requires a careful balance between computational efficiency and predictive accuracy. This paper introduces OptiRAG-Rec, a comprehensive framework that integrates Retrieval-Augmented Generation (RAG) with a novel multi-head early exit architecture to address both challenges. By leveraging Graph Convolutional Networks (GCNs) as efficient retrieval mechanisms, the framework significantly reduces data retrieval times while maintaining high model performance. Additionally, the multi-head early exit strategy dynamically terminates inference based on real-time predictive confidence assessments, enhancing responsiveness without sacrificing accuracy. Experimental results demonstrate that OptiRAG-Rec reduces computation time while preserving the precision required for reliable recommendations, establishing a new benchmark for efficient and accurate LLM deployment in recommendation.
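The confidence-based early exit described in this abstract can be illustrated with a minimal sketch. It assumes each exit head outputs a click probability for a binary CTR prediction and that inference stops at the first head whose confidence reaches a threshold; the function name, the threshold value, and the confidence definition are illustrative assumptions, not details taken from OptiRAG-Rec.

```python
from typing import Iterable, Tuple


def early_exit_predict(head_probs: Iterable[float], threshold: float = 0.9) -> Tuple[float, int]:
    """Return (click probability, exit depth) from a sequence of exit-head outputs.

    head_probs yields one predicted click probability per exit head, ordered
    from the shallowest head to the deepest. Passing a generator means deeper
    heads are never evaluated once an exit is taken.
    (Hypothetical helper: the paper's actual exit rule is not given in the abstract.)
    """
    last_prob = None
    depth = 0
    for depth, prob in enumerate(head_probs, start=1):
        last_prob = prob
        # For a binary prediction, confidence is the probability of the likelier class.
        confidence = max(prob, 1.0 - prob)
        if confidence >= threshold:
            return prob, depth
    if last_prob is None:
        raise ValueError("head_probs must contain at least one prediction")
    # No head was confident enough: fall back to the deepest head.
    return last_prob, depth
```

For example, with head outputs 0.6, 0.95, 0.99 and a threshold of 0.9, the sketch exits at the second head and skips the third.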
pdf
bib
abs
Unraveling LoRA Interference: Orthogonal Subspaces for Robust Model Merging
Haobo Zhang
|
Jiayu Zhou
Fine-tuning large language models (LMs) for individual tasks yields strong performance but is expensive for deployment and storage. Recent works explore model merging to combine multiple task-specific models into a single multi-task model without additional training. However, existing merging methods often fail for models fine-tuned with low-rank adaptation (LoRA), due to significant performance degradation. In this paper, we show that this issue arises from a previously overlooked interplay between model parameters and data distributions. We propose **O**rthogonal **S**ubspaces for **R**obust model **M**erging (**OSRM**) to constrain the LoRA subspace *prior* to fine-tuning, ensuring that updates relevant to one task do not adversely shift outputs for others. Our approach can seamlessly integrate with most existing merging algorithms, reducing the unintended interference among tasks. Extensive experiments on eight datasets, tested with three widely used LMs and two large LMs, demonstrate that our method not only boosts merging performance but also preserves single-task accuracy. Furthermore, our approach exhibits greater robustness to the hyperparameters of merging. These results highlight the importance of data-parameter interaction in model merging and offer a plug-and-play solution for merging LoRA models.
pdf
bib
abs
BIG-Bench Extra Hard
Mehran Kazemi
|
Bahare Fatemi
|
Hritik Bansal
|
John Palowitch
|
Chrysovalantis Anastasiou
|
Sanket Vaibhav Mehta
|
Lalit K Jain
|
Virginia Aglietti
|
Disha Jindal
|
Peter Chen
|
Nishanth Dikkala
|
Gladys Tyen
|
Xin Liu
|
Uri Shalit
|
Silvia Chiappa
|
Kate Olszewska
|
Yi Tay
|
Vinh Q. Tran
|
Quoc V Le
|
Orhan Firat
Current benchmarks for large language model (LLM) reasoning predominantly focus on mathematical and coding abilities, leaving a gap in evaluating broader reasoning proficiencies. One particular exception is the BIG-Bench dataset, which has served as a crucial benchmark for evaluating the general reasoning capabilities of LLMs, thanks to its diverse set of challenging tasks that allowed for a comprehensive assessment of general reasoning across various skills within a unified framework. However, recent advances in LLMs have led to saturation on BIG-Bench, and its harder version BIG-Bench Hard (BBH). State-of-the-art models achieve near-perfect scores on many tasks in BBH, thus diminishing its utility. To address this limitation, we introduce BIG-Bench Extra Hard (BBEH), a new benchmark designed to push the boundaries of LLM reasoning evaluation. BBEH replaces each task in BBH with a novel task that probes a similar reasoning capability but exhibits significantly increased difficulty. We evaluate various general-purpose and reasoning-specialized models on BBEH and observe an accuracy of 23.9% for the best general-purpose model and 54.2% for the best reasoning-specialized model, indicating substantial room for improvement and highlighting the ongoing challenge of achieving robust general reasoning in LLMs. We release BBEH publicly at: https://github.com/google-deepmind/bbeh.
pdf
bib
abs
CSTree-SRI: Introspection-Driven Cognitive Semantic Tree for Multi-Turn Question Answering over Extra-Long Contexts
Zhaowen Wang
|
Xiang Wei
|
Kangshao Du
|
Yiting Zhang
|
Libo Qin
|
Yingjie Xia
|
Li Kuang
Large Language Models (LLMs) have achieved remarkable success in natural language processing (NLP), particularly in single-turn question answering (QA) on short text. However, their performance significantly declines when applied to multi-turn QA over extra-long context (ELC), as they struggle to capture the logical correlations across multiple chunks of ELC and maintain the coherence of multi-turn questions. To address these challenges, we propose the CSTree-SRI framework (Cognitive Semantic Tree through Summarization, Retrieval, and Introspection). CSTree-SRI dynamically constructs the CSTree to preserve logical coherence within ELC through hierarchical synthesis and introspective validation. Then a logic-driven traversal strategy on CSTree is designed to provide efficient information retrieval for question answering. Additionally, we construct a suite of multi-turn QA datasets and an evaluation benchmark tailored for ELC tasks, and comprehensive experiments demonstrate the framework’s superiority in addressing the challenges of multi-turn QA over ELC.
pdf
bib
abs
InductionBench: LLMs Fail in the Simplest Complexity Class
Wenyue Hua
|
Tyler Wong
|
Fei Sun
|
Liangming Pan
|
Adam Jardine
|
William Yang Wang
Large language models (LLMs) have shown remarkable improvements in reasoning and many existing benchmarks have been addressed by models such as o1 and o3 either fully or partially. However, a majority of these benchmarks emphasize deductive reasoning, including mathematical and coding tasks in which rules such as mathematical axioms or programming syntax are clearly defined, based on which LLMs can plan and apply these rules to arrive at a solution. In contrast, inductive reasoning, where one infers the underlying rules from observed data, remains less explored. Such inductive processes lie at the heart of scientific discovery, as they enable researchers to extract general principles from empirical observations. To assess whether LLMs possess this capacity, we introduce InductionBench, a new benchmark designed to evaluate the inductive reasoning ability of LLMs. Our experimental findings reveal that even the most advanced models available struggle to master the simplest complexity classes within the subregular hierarchy of functions, highlighting a notable deficiency in current LLMs’ inductive reasoning capabilities. Code and data are available at https://anonymous.4open.science/r/inductive_reasoning_benchmark-BB2D.
pdf
bib
abs
RATIONALYST: Pre-training Process-Supervision for Improving Reasoning
Dongwei Jiang
|
Guoxuan Wang
|
Yining Lu
|
Andrew Wang
|
Jingyu Zhang
|
Chuyu Liu
|
Benjamin Van Durme
|
Daniel Khashabi
The reasoning steps generated by LLMs might be incomplete, as they mimic logical leaps common in everyday communication found in their pre-training data: underlying rationales are frequently left implicit (unstated). To address this challenge, we introduce RATIONALYST, a model for process-supervision of reasoning based on pre-training on a vast collection of rationale annotations extracted from unlabeled data. We extract 79k rationales from a web-scale unlabelled dataset (the Pile) and a combination of reasoning datasets with minimal human intervention. This web-scale pre-training for reasoning allows RATIONALYST to consistently generalize across diverse reasoning tasks, including mathematical, commonsense, scientific, and logical reasoning. Fine-tuned from LLaMa-3-8B, RATIONALYST improves the accuracy of reasoning by an average of 3.9% on 7 representative reasoning benchmarks. It also demonstrates superior performance compared to significantly larger verifiers like GPT-4 and similarly sized models fine-tuned on matching training sets.
pdf
bib
abs
Make Imagination Clearer! Stable Diffusion-based Visual Imagination for Multimodal Machine Translation
Andong Chen
|
Yuchen Song
|
Kehai Chen
|
Xuefeng Bai
|
Muyun Yang
|
Liqiang Nie
|
Jie Liu
|
Tiejun Zhao
|
Min Zhang
Visual information has been introduced for enhancing machine translation (MT), and its effectiveness heavily relies on the availability of large amounts of bilingual parallel sentence pairs with manual image annotations. In this paper, we introduce a stable diffusion-based imagination network into a multimodal large language model (MLLM) to explicitly generate an image for each source sentence, thereby advancing multimodal MT. In particular, we build heuristic feedback with reinforcement learning to ensure the consistency of the generated image with the source sentence without the supervision of visual information, which breaks the high-cost bottleneck of image annotation in MT. Furthermore, the proposed method enables imaginative visual information to be integrated into text-only MT in addition to multimodal MT. Experimental results show that our model significantly outperforms existing multimodal MT and text-only MT, especially achieving an average improvement of more than 14 BLEU points on Multi30K and MSCOCO multimodal MT benchmarks.
pdf
bib
abs
Advancing SMoE for Continuous Domain Adaptation of MLLMs: Adaptive Router and Domain-Specific Loss
Liang Zhang
|
Ziyao Lu
|
Fandong Meng
|
Hui Li
|
Jie Zhou
|
Jinsong Su
Recent studies have explored Continual Instruction Tuning (CIT) in Multimodal Large Language Models (MLLMs), with a primary focus on Task-incremental CIT, where MLLMs are required to continuously acquire new tasks. However, the more practical and challenging Domain-incremental CIT, focused on the continual adaptation of MLLMs to new domains, remains underexplored. In this paper, we propose a new Sparse Mixture of Expert (SMoE) based method for domain-incremental CIT in MLLMs. During training, we learn a domain-specific SMoE module for each new domain in every FFN sub-layer of MLLMs, preventing catastrophic forgetting caused by inter-domain conflicts. Moreover, we equip the SMoE module with a domain-specific autoregressive loss (DSAL), which is used to identify the most suitable SMoE module for processing each test instruction during inference. To further enhance the SMoE module’s ability to learn domain knowledge, we design an adaptive threshold-based router (AT-Router) that allocates computing resources (experts) to instruction tokens based on their importance. Finally, we establish a new benchmark to evaluate the efficacy of our method and advance future research. Extensive experiments show that our method consistently outperforms all competitive baselines.
pdf
bib
abs
Multi-document Summarization through Multi-document Event Relation Graph Reasoning in LLMs: a case study in Framing Bias Mitigation
Yuanyuan Lei
|
Ruihong Huang
Media outlets are becoming more partisan and polarized nowadays. Most previous work focused on detecting media bias. In this paper, we aim to mitigate media bias by generating a neutralized summary given multiple articles presenting different ideological views. Motivated by the critical role of events and event relations in media bias detection, we propose to increase awareness of bias in LLMs via multi-document event reasoning and use a multi-document event relation graph to guide the summarization process. This graph contains rich event information useful to reveal bias: four common types of in-doc event relations to reflect content framing bias, cross-doc event coreference relations to reveal content selection bias, and event-level moral opinions to highlight opinionated framing bias. We further develop two strategies to incorporate the multi-document event relation graph for neutralized summarization. Firstly, we convert a graph into natural language descriptions and feed the textualized graph into LLMs as a part of a hard text prompt. Secondly, we encode the graph with a graph attention network and insert the graph embedding into LLMs as a soft prompt. Both automatic evaluation and human evaluation confirm that our approach effectively mitigates both lexical and informational media bias while improving content preservation.
pdf
bib
abs
Who Writes What: Unveiling the Impact of Author Roles on AI-generated Text Detection
Jiatao Li
|
Xiaojun Wan
The rise of Large Language Models (LLMs) necessitates accurate AI-generated text detection. However, current approaches largely overlook the influence of author characteristics. We investigate how sociolinguistic attributes—gender, CEFR proficiency, academic field, and language environment—impact state-of-the-art AI text detectors. Using the ICNALE corpus of human-authored texts and parallel AI-generated texts from diverse LLMs, we conduct a rigorous evaluation employing multi-factor ANOVA and weighted least squares (WLS). Our results reveal significant biases: CEFR proficiency and language environment consistently affected detector accuracy, while gender and academic field showed detector-dependent effects. These findings highlight the crucial need for socially aware AI text detection to avoid unfairly penalizing specific demographic groups. We offer novel empirical evidence, a robust statistical framework, and actionable insights for developing more equitable and reliable detection systems in real-world, out-of-domain contexts. This work paves the way for future research on bias mitigation, inclusive evaluation benchmarks, and socially responsible LLM detectors.
pdf
bib
abs
RoCoFT: Efficient Finetuning of Large Language Models with Row-Column Updates
Md Kowsher
|
Tara Esmaeilbeig
|
Chun-Nam Yu
|
Chen Chen
|
Mojtaba Soltanalian
|
Niloofar Yousefi
We propose Row-Column Fine-Tuning (RoCoFT), a parameter-efficient fine-tuning method for large language models based on updating only a few rows and columns of the weight matrices in transformers. Through extensive experiments with medium-sized LMs like RoBERTa and DeBERTa, and larger LMs like Bloom-7B, Llama2-7B, and Llama2-13B, we show that our method gives comparable or better accuracies than state-of-the-art Parameter-Efficient Finetuning methods while also being more memory and computation-efficient. We also study the reason behind the effectiveness of our method with tools from neural tangent kernel theory. We empirically demonstrate that our kernel, constructed using a restricted set of row and column parameters, is numerically close to the full-parameter kernel and gives comparable classification performance. Ablation studies are conducted to investigate the impact of different algorithmic choices, including the robustness of RoCoFT to any selection of rows and columns, as well as the optimal rank for the effective implementation of our method.
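The idea of updating only a few rows and columns of a weight matrix can be sketched as a masked gradient step. The sketch below is an illustration under stated assumptions, not the authors' implementation: it uses plain SGD on nested lists, the helper names are hypothetical, and it treats selected rows and columns as sharing one matrix (the abstract does not say how they are combined or chosen).

```python
from typing import List, Sequence

Matrix = List[List[float]]


def row_column_sgd_step(weight: Matrix, grad: Matrix,
                        rows: Sequence[int], cols: Sequence[int],
                        lr: float) -> Matrix:
    """Apply one SGD step only to entries in the selected rows or columns.

    All other entries are treated as frozen and returned unchanged.
    (Hypothetical helper for illustration.)
    """
    row_set, col_set = set(rows), set(cols)
    return [
        [
            w - lr * g if (i in row_set or j in col_set) else w
            for j, (w, g) in enumerate(zip(w_row, g_row))
        ]
        for i, (w_row, g_row) in enumerate(zip(weight, grad))
    ]


def trainable_count(n_rows: int, n_cols: int,
                    rows: Sequence[int], cols: Sequence[int]) -> int:
    """Number of trainable entries: selected rows plus selected columns,
    minus the intersections that would otherwise be counted twice."""
    r, c = len(set(rows)), len(set(cols))
    return r * n_cols + c * n_rows - r * c
```

With one row and one column selected in an n × m matrix, only n + m − 1 of the n·m entries are trainable, which is where the memory savings of such a scheme would come from.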
pdf
bib
abs
Scaling Laws and Efficient Inference for Ternary Language Models
Tejas Vaidhya
|
Ayush Kaushal
|
Vineet Jain
|
Francis Couture-Harpin
|
Prashant Shishodia
|
Majid Behbahani
|
Yuriy Nevmyvaka
|
Irina Rish
Large language models (LLMs) are increasingly used across research and industry applications, yet their inference efficiency remains a significant challenge. As the computational power of modern GPU architectures continuously improves, their memory bandwidth and capacity have not scaled proportionally, creating a critical bottleneck during inference. To address this, we investigate ternary language models (TriLMs) that employ quantization-aware training to significantly reduce memory requirements. We first analyze the scalability of TriLMs by conducting a scaling law analysis, revealing that TriLMs benefit more from increasing training data than from scaling model parameters. Based on this observation, we introduce TriTera, an open suite of TriLMs trained on up to 1.2 trillion tokens, demonstrating sustained performance gains at scale. Furthermore, to improve inference efficiency, we propose novel 2-bit and 1.6-bit packing schemes for ternary weights, which demonstrate accelerated inference across various CPU architectures. Building on the 2-bit packing, we develop a GPU kernel called TriRun that accelerates end-to-end model inference by up to 5× compared to floating-point baselines. To encourage further exploration and development of TriLMs, we will release the TriTera suite and TriRun inference kernels. Overall, our work lays the foundation for building and deploying efficient LLMs, providing a valuable resource for the research community.
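To make the 2-bit and 1.6-bit figures concrete, here is a minimal sketch of two generic ways to pack ternary weights. It assumes weights in {-1, 0, +1}; the 2-bit variant stores four weights per byte, and the 1.6-bit variant stores five weights per byte as a base-3 number (3^5 = 243 ≤ 256, so 8/5 = 1.6 bits per weight). The function names and bit layouts are illustrative assumptions; the abstract does not specify the layouts the paper actually uses.

```python
from typing import List, Sequence


def _code(w: int) -> int:
    """Map a ternary weight in {-1, 0, +1} to a non-negative code in {0, 1, 2}."""
    if w not in (-1, 0, 1):
        raise ValueError(f"not a ternary weight: {w!r}")
    return w + 1


def pack_2bit(weights: Sequence[int]) -> bytes:
    """Pack four ternary weights per byte, two bits each (lowest bits first)."""
    packed = bytearray()
    for start in range(0, len(weights), 4):
        byte = 0
        for slot, w in enumerate(weights[start:start + 4]):
            byte |= _code(w) << (2 * slot)
        packed.append(byte)
    return bytes(packed)


def unpack_2bit(packed: bytes, count: int) -> List[int]:
    """Recover the first `count` weights; padding slots in the last byte are dropped."""
    weights = []
    for byte in packed:
        for slot in range(4):
            weights.append(((byte >> (2 * slot)) & 0b11) - 1)
    return weights[:count]


def pack_base3(weights: Sequence[int]) -> bytes:
    """Pack five ternary weights per byte as a base-3 number (1.6 bits per weight)."""
    packed = bytearray()
    for start in range(0, len(weights), 5):
        value = 0
        for slot, w in enumerate(weights[start:start + 5]):
            value += _code(w) * 3 ** slot
        packed.append(value)  # at most 3**5 - 1 = 242, so it fits in one byte
    return bytes(packed)


def unpack_base3(packed: bytes, count: int) -> List[int]:
    """Recover the first `count` weights by peeling off base-3 digits."""
    weights = []
    for value in packed:
        for _ in range(5):
            value, digit = divmod(value, 3)
            weights.append(digit - 1)
    return weights[:count]
```

The trade-off this sketch exposes is the usual one: the 2-bit layout decodes with shifts and masks only, while the base-3 layout is denser but needs division or a lookup table to decode.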
pdf
bib
abs
Exploring the Impact of Instruction-Tuning on LLM’s Susceptibility to Misinformation
Kyubeen Han
|
Junseo Jang
|
Hongjin Kim
|
Geunyeong Jeong
|
Harksoo Kim
Instruction-tuning enhances the ability of large language models (LLMs) to follow user instructions more accurately, improving usability while reducing harmful outputs. However, this process may increase the model’s dependence on user input, potentially leading to the unfiltered acceptance of misinformation and the generation of hallucinations. Existing studies primarily highlight that LLMs are receptive to external information that contradicts their parametric knowledge, but little research has been conducted on the direct impact of instruction-tuning on this phenomenon. In our study, we investigate the impact of instruction-tuning on LLM susceptibility to misinformation. Our analysis reveals that instruction-tuned LLMs are significantly more likely to accept misinformation when it is presented by the user. A comparison with base models shows that instruction-tuning increases reliance on user-provided information, shifting susceptibility from the assistant role to the user role. Furthermore, we explore additional factors influencing misinformation susceptibility, such as the role of the user in prompt structure, misinformation length, and the presence of warnings in the system prompt. Our findings underscore the need for systematic approaches to mitigate unintended consequences of instruction-tuning and enhance the reliability of LLMs in real-world applications.
pdf
bib
abs
Do Language Models Understand Honorific Systems in Javanese?
Mohammad Rifqi Farhansyah
|
Iwan Darmawan
|
Adryan Kusumawardhana
|
Genta Indra Winata
|
Alham Fikri Aji
|
Derry Tanti Wijaya
The Javanese language features a complex system of honorifics that vary according to the social status of the speaker, listener, and referent. Despite its cultural and linguistic significance, there has been limited progress in developing a comprehensive corpus to capture these variations for natural language processing (NLP) tasks. In this paper, we present Unggah-Ungguh, a carefully curated dataset designed to encapsulate the nuances of Unggah-Ungguh Basa, the Javanese speech etiquette framework that dictates the choice of words and phrases based on social hierarchy and context. Using Unggah-Ungguh, we assess the ability of language models (LMs) to process various levels of Javanese honorifics through classification and machine translation tasks. To further evaluate cross-lingual LMs, we conduct machine translation experiments between Javanese (at specific honorific levels) and Indonesian. Additionally, we explore whether LMs can generate contextually appropriate Javanese honorifics in conversation tasks, where the honorific usage should align with the social role and contextual cues. Our findings indicate that current LMs struggle with most honorific levels, exhibiting a bias toward certain honorific tiers.
pdf
bib
abs
Generative Reward Modeling via Synthetic Criteria Preference Learning
Xiaobo Liang
|
Haoke Zhang
|
Juntao Li
|
Kehai Chen
|
Qiaoming Zhu
|
Min Zhang
Generative Reward Models (GenRMs) leverage synthesized Chains of Thought (CoT) to reduce the need for massive labeled data, but this approach introduces risks of overoptimization due to the inability to guarantee the correctness of the CoTs. Identifying and optimizing unexpected behaviors within these synthesized CoTs remains a challenge, as it heavily depends on precise annotations of intermediate behavior, similar to process supervision. In this work, we introduce a criteria-based preference tree for reward modeling, where each path in the tree represents a reasoning trajectory based on synthesized criteria. Crucially, each reasoning trajectory can be independently optimized through an RL algorithm. These fine-grained process reward signals are derived from the inference-time computations and predefined rules, eliminating the need for human supervision. In experiments, SyncPL showed significant improvements over baselines on multiple human preference benchmarks. We further demonstrate that synthesized data can be learned using a long CoT format, analogous to an o1-like model, further enhancing performance while keeping stability and efficiency during training.
pdf
bib
abs
Exploring Multimodal Relation Extraction of Hierarchical Tabular Data with Multi-task Learning
Xinyu Zhang
|
Aibo Song
|
Jingyi Qiu
|
Jiahui Jin
|
Tianbo Zhang
|
Xiaolin Fang
Relation Extraction (RE) is a key task in table understanding, aiming to extract semantic relations between columns. However, for complex tables with hierarchical headers, high-quality textual formats (e.g., Markdown) are hard to obtain as input in practical scenarios like webpage screenshots and scanned documents, while table images are more accessible and intuitive. Besides, existing works overlook the need to mine relations among multiple columns rather than just the semantic relation between two specific columns in real-world practice. In this work, we explore utilizing Multimodal Large Language Models (MLLMs) to address RE in tables with complex structures. We creatively extend the concept of RE to include calculational relations, enabling multi-task learning of both semantic and calculational RE for mutual reinforcement. Specifically, we reconstruct table images into a graph structure based on neighboring nodes to extract graph-level visual features. Such feature enhancement alleviates the insensitivity of MLLMs to the positional information within table images. We then propose a Chain-of-Thought distillation framework with a self-correction mechanism to enhance MLLMs’ reasoning capabilities without increasing parameter scale. Our method significantly outperforms most baselines on wide datasets. Additionally, we release a benchmark dataset for calculational RE in complex tables.
pdf
bib
abs
A Self-Denoising Model for Robust Few-Shot Relation Extraction
Liang Zhang
|
Yang Zhang
|
Ziyao Lu
|
Fandong Meng
|
Jie Zhou
|
Jinsong Su
Few-shot relation extraction (FSRE) aims at enhancing the model’s generalization to new relations with very few labeled instances (support instances). Most existing studies use prototype networks (ProtoNets) for FSRE and assume that the support set, which adapts the model to new relations, only contains accurately labeled instances. However, this assumption is usually unrealistic, as even carefully annotated datasets often contain mislabeled instances. Thus, it is essential to enhance the robustness of FSRE models to noisy labels in the support set, but this issue remains unexplored. In this paper, we first conduct a preliminary study, revealing the high sensitivity of ProtoNets to such noisy labels. Meanwhile, we discover that fully leveraging mislabeled support instances is crucial for enhancing the model’s robustness. To do this, we propose a self-denoising model for FSRE, which can automatically correct noisy labels of support instances. Specifically, our model comprises two core components: 1) a label correction module (LCM), used to correct mislabeled support instances based on the distances between them in the embedding space, and 2) a relation classification module (RCM), designed to achieve more robust relation prediction using the corrected labels generated by the LCM. Moreover, we propose a feedback-based training strategy, which focuses on training LCM and RCM to synergistically handle noisy labels in the support set. Experimental results on two public datasets show the effectiveness and robustness of our model. Notably, even in scenarios without noisy labels, our model significantly outperforms all competitive baselines.
pdf
bib
abs
QuASAR: A Question-Driven Structure-Aware Approach for Table-to-Text Generation
WeiJie Liu
|
Yibin Zheng
|
Fang Kong
Table-to-text generation aims to automatically produce natural language descriptions from structured or semi-structured tabular data. Unlike traditional text generation tasks, it requires models to accurately understand and represent table structures. Existing approaches typically process tables by linearizing them or converting them into graph structures. However, these methods either fail to adequately capture the table structure or rely on complex attention mechanisms, limiting their applicability. To tackle these challenges, we propose QuASAR, a question-driven self-supervised approach designed to enhance the model’s structural perception and representation capabilities. Specifically, QuASAR formulates a set of structure-related queries for self-supervised training, explicitly guiding the model to capture both local and global table structures. Additionally, we introduce two auxiliary pre-training tasks: a word-to-sentence reconstruction task and a numerical summarization task, which further enhance the fluency and factuality of the generated text. Experimental results on the ToTTo and HiTab datasets demonstrate that our approach produces higher-quality text compared to existing methods.
pdf
bib
abs
Automated Structured Radiology Report Generation
Jean-Benoit Delbrouck
|
Justin Xu
|
Johannes Moll
|
Alois Thomas
|
Zhihong Chen
|
Sophie Ostmeier
|
Asfandyar Azhar
|
Kelvin Zhenghao Li
|
Andrew Johnston
|
Christian Bluethgen
|
Eduardo Pontes Reis
|
Mohamed S Muneer
|
Maya Varma
|
Curtis Langlotz
Automated radiology report generation from chest X-ray (CXR) images has the potential to improve clinical efficiency and reduce radiologists’ workload. However, most datasets, including the publicly available MIMIC-CXR and CheXpert Plus, consist entirely of free-form reports, which are inherently variable and unstructured. This variability poses challenges for both generation and evaluation: existing models struggle to produce consistent, clinically meaningful reports, and standard evaluation metrics fail to capture the nuances of radiological interpretation. To address this, we introduce Structured Radiology Report Generation (SRRG), a new task that reformulates free-text radiology reports into a standardized format, ensuring clarity, consistency, and structured clinical reporting. We create a novel dataset by restructuring reports using large language models (LLMs) following strict structured reporting desiderata. Additionally, we introduce SRR-BERT, a fine-grained disease classification model trained on 55 labels, enabling more precise and clinically informed evaluation of structured reports. To assess report quality, we propose F1-SRR-BERT, a metric that leverages SRR-BERT’s hierarchical disease taxonomy to bridge the gap between free-text variability and structured clinical reporting. We validate our dataset through a reader study conducted by five board-certified radiologists and extensive benchmarking experiments.
pdf
bib
abs
LPOI: Listwise Preference Optimization for Vision Language Models
Fatemeh Pesaran Zadeh
|
Yoojin Oh
|
Gunhee Kim
Aligning large VLMs with human preferences is a challenging task, as methods like RLHF and DPO often overfit to textual information or exacerbate hallucinations. Although augmenting negative image samples partially addresses these pitfalls, no prior work has employed listwise preference optimization for VLMs, due to the complexity and cost of constructing listwise image samples. In this work, we propose LPOI, the first object-aware listwise preference optimization developed for reducing hallucinations in VLMs. LPOI identifies and masks a critical object in the image, and then interpolates the masked region between the positive and negative images to form a sequence of incrementally more complete images. The model is trained to rank these images in ascending order of object visibility, effectively reducing hallucinations while retaining visual fidelity. LPOI requires no extra annotations beyond standard pairwise preference data, as it automatically constructs the ranked lists through object masking and interpolation. Comprehensive experiments on MMHalBench, AMBER, and Object HalBench confirm that LPOI outperforms existing preference optimization methods in reducing hallucinations and enhancing VLM performance.
pdf
bib
abs
Predicting Through Generation: Why Generation Is Better for Prediction
Md Kowsher
|
Nusrat Jahan Prottasha
|
Prakash Bhat
|
Chun-Nam Yu
|
Mojtaba Soltanalian
|
Ivan Garibay
|
Ozlem Garibay
|
Chen Chen
|
Niloofar Yousefi
This paper argues that generating output tokens is more effective than using pooled representations for prediction tasks because token-level generation retains more mutual information. Since LLMs are trained on massive text corpora using next-token prediction, generation aligns naturally with their learned behavior. Using the Data Processing Inequality (DPI), we provide both theoretical and empirical evidence supporting this claim. However, autoregressive models face two key challenges when used for prediction: (1) exposure bias, where the model sees ground-truth tokens during training but relies on its own predictions during inference, leading to errors, and (2) format mismatch, where discrete tokens do not always align with the task’s required output structure. To address these challenges, we introduce PredGen (Predicting Through Generating), an end-to-end framework that (i) uses scheduled sampling to reduce exposure bias, and (ii) introduces a task adapter to convert the generated tokens into structured outputs. Additionally, we introduce Writer-Director Alignment Loss (WDAL), which ensures consistency between token generation and final task predictions, improving both text coherence and numerical accuracy. We evaluate PredGen on multiple classification and regression benchmarks. Our results show that PredGen consistently outperforms standard baselines, demonstrating its effectiveness in structured prediction tasks.
pdf
bib
abs
“Give Me BF16 or Give Me Death”? Accuracy-Performance Trade-Offs in LLM Quantization
Eldar Kurtic
|
Alexandre Noll Marques
|
Shubhra Pandit
|
Mark Kurtz
|
Dan Alistarh
Despite the popularity of large language model (LLM) quantization for inference acceleration, significant uncertainty remains regarding the accuracy-performance trade-offs associated with various quantization formats. We present a comprehensive empirical study of quantized accuracy, evaluating popular quantization formats (FP8, INT8, INT4) across academic benchmarks and real-world tasks, on the entire Llama-3.1 model family. Additionally, our study examines the difference in text generated by quantized models versus their uncompressed counterparts. Beyond benchmarks, we also present a couple of quantization improvements which allowed us to obtain state-of-the-art accuracy recovery results. Our investigation, encompassing over 500,000 individual evaluations, yields several key findings: (1) FP8 weight and activation quantization (W8A8-FP) is lossless across all model scales, (2) INT8 weight and activation quantization (W8A8-INT), when properly tuned, incurs surprisingly low 1-3% accuracy degradation, and (3) INT4 weight-only quantization (W4A16-INT) is competitive with 8-bit integer weight and activation quantization. To address the question of the “best” format for a given deployment environment, we conduct inference performance analysis using the popular open-source vLLM framework on various GPU architectures. We find that W4A16 offers the best cost-efficiency for synchronous deployments, and for asynchronous deployment on mid-tier GPUs. At the same time, W8A8 formats excel in asynchronous deployment of mid and large-size models on high-end GPUs. Our results provide a first set of practical guidelines for deploying quantized LLMs across different scales and performance requirements.
pdf
bib
abs
StitchLLM: Serving LLMs, One Block at a Time
Bodun Hu
|
Shuozhe Li
|
Saurabh Agarwal
|
Myungjin Lee
|
Akshay Jajoo
|
Jiamin Li
|
Le Xu
|
Geon-Woo Kim
|
Donghyun Kim
|
Hong Xu
|
Amy Zhang
|
Aditya Akella
The rapid evolution of large language models (LLMs) has revolutionized natural language processing (NLP) tasks such as text generation, translation, and comprehension. However, the increasing computational demands and inference costs of these models present significant challenges. This study investigates the dynamic and efficient utilization of pre-trained weights from open-sourced LLMs of varying parameter sizes to achieve an optimal balance between computational efficiency and task performance. Drawing inspiration from the dual-process theory of human cognition, we introduce StitchLLM: a dynamic model routing framework that employs a powerful bottom model to process all queries, and uses a lightweight routing mechanism to allocate computational resources appropriately. Our novel framework optimizes efficiency and maintains performance, leveraging a trainable stitching layer for seamless integration of decoder layers across different LLMs. Experimental results demonstrate that StitchLLM improves system throughput while minimizing performance degradation, offering a flexible solution for deploying LLMs in resource-constrained settings.
pdf
bib
abs
Walk in Others’ Shoes with a Single Glance: Human-Centric Visual Grounding with Top-View Perspective Transformation
Yuqi Bu
|
Xin Wu
|
Zirui Zhao
|
Yi Cai
|
David Hsu
|
Qiong Liu
Visual perspective-taking, an ability to envision others’ perspectives from a single self-perspective, is vital in human-robot interactions. Thus, we introduce a human-centric visual grounding task and a dataset to evaluate this ability. Recent advances in vision-language models (VLMs) have shown potential for inferring others’ perspectives, yet are insensitive to information differences induced by slight perspective changes. To address this problem, we propose a top-view enhanced perspective transformation (TEP) method, which decomposes the transition from robot to human perspectives through an abstract top-view representation. It unifies perspectives and facilitates the capture of information differences from diverse perspectives. Experimental results show that TEP improves performance by up to 18%, exhibits perspective-taking abilities across various perspectives, and generalizes effectively to robotic and dynamic scenarios.
pdf
bib
abs
Is linguistically-motivated data augmentation worth it?
Ray Groshan
|
Michael Ginn
|
Alexis Palmer
Data augmentation, a widely-employed technique for addressing data scarcity, involves generating synthetic data examples which are then used to augment available training data. Researchers have seen surprising success from simple methods, such as random perturbations from natural examples, where models seem to benefit even from data with nonsense words, or data that doesn’t conform to the rules of the language. A second line of research produces synthetic data that does in fact follow all linguistic constraints; these methods require some linguistic expertise and are generally more challenging to implement. No previous work has done a systematic, empirical comparison of both linguistically-naive and linguistically-motivated data augmentation strategies, leaving uncertainty about whether the additional time and effort of linguistically-motivated data augmentation work in fact yields better downstream performance. In this work, we conduct a careful and comprehensive comparison of augmentation strategies (both linguistically-naive and linguistically-motivated) for two low-resource languages with different morphological properties, Uspanteko and Arapaho. We evaluate the effectiveness of many different strategies and their combinations across two important sequence-to-sequence tasks for low-resource languages: machine translation and interlinear glossing. We find that linguistically-motivated strategies can have benefits over naive approaches, but only when the new examples they produce are not significantly unlike the training data distribution.
pdf
bib
abs
From Lists to Emojis: How Format Bias Affects Model Alignment
Xuanchang Zhang
|
Wei Xiong
|
Lichang Chen
|
Tianyi Zhou
|
Heng Huang
|
Tong Zhang
In this paper, we study format biases in reinforcement learning from human feedback (RLHF). We observe that many widely-used preference models—including human evaluators, GPT-4, and top-ranking models on the RewardBench benchmark—exhibit strong biases towards specific format patterns, such as lists, links, bold text, and emojis. Furthermore, large language models (LLMs) can exploit these biases to achieve higher rankings on popular benchmarks like AlpacaEval and LMSYS Chatbot Arena. One notable example is verbosity bias, where current preference models favor longer responses that appear more comprehensive, even when their quality is equal to or lower than shorter responses. However, format biases beyond verbosity remain largely underexplored. In this work, we extend the study of biases in preference learning beyond the commonly recognized length bias, offering a comprehensive analysis of a wider range of format biases. Additionally, we show that with a small amount of biased data (less than 1%), we can inject significant bias into the reward model. Moreover, these format biases can also be easily exploited by downstream alignment algorithms, such as *best-of-n sampling* and online iterative *DPO*, as it is usually easier to manipulate the format than to improve the quality of responses. Our findings emphasize the need to disentangle format and content both for designing alignment algorithms and evaluating models.
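To make the exploitation mechanism concrete, the following sketch runs best-of-n selection against a toy reward function that pays a bonus for list markers and bold text. Everything here is a hypothetical stand-in: `toy_reward`, its bonus weights, and the candidate responses are assumptions for illustration, not the reward models or data studied in the paper.

```python
import random


def toy_reward(response):
    """Hypothetical biased reward: content quality plus a bonus for formatting."""
    # Stand-in for content quality: only the correct answer "42" earns credit.
    quality = 1.0 if "42" in response else 0.0
    # Format bias: each bullet marker and each bold delimiter adds reward.
    format_bonus = 0.6 * response.count("- ") + 0.6 * response.count("**")
    return quality + format_bonus


def best_of_n(candidates, reward_fn, n, rng):
    """Sample n candidates without replacement and return the highest-reward one."""
    sampled = rng.sample(candidates, n)
    return max(sampled, key=reward_fn)


candidates = [
    "The answer is 42.",                   # correct, plain text
    "- **Key point**: the answer is 41.",  # wrong, but heavily formatted
    "I am not sure.",                      # unhelpful
]
chosen = best_of_n(candidates, toy_reward, n=len(candidates), rng=random.Random(0))
```

With this biased reward, the formatted but incorrect response scores 1.8 against 1.0 for the correct plain answer, so selection favors format over content.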
pdf
bib
abs
Colloquial Singaporean English Style Transfer with Fine-Grained Explainable Control
Jinggui Liang
|
Dung Vo
|
Yap Hong Xian
|
Hai Leong Chieu
|
Kian Ming A. Chai
|
Jing Jiang
|
Lizi Liao
Colloquial Singaporean English (Singlish) is an informal English marked by a unique blend of languages reflecting Singapore’s multicultural identity. Style transfer between Singlish and Standard (formal) English is vital for various applications, yet existing methods often lack explainability and fine-grained control. To fill this gap, we contribute in two key ways. First, we construct a large, high-quality dataset of formal and informal sentences, annotated across six linguistic aspects—Syntax, Lexical Borrowing, Pragmatics, Prosody/Phonology, Emoticons/Punctuation, and Code-Switching—with detailed explanations. Starting with manually annotated cases, we scaled the dataset to 140K with ensured quality. Second, inspired by the “Society of Mind” theory, we propose a novel multi-agent framework where large language models (LLMs) act as expert agents for each linguistic aspect. These agents collaborate by iteratively generating, critiquing, and refining responses to achieve controlled, explainable style transfer. Both automatic metrics and human evaluations confirm that our method enables precise, interpretable transformations, advancing explainability in NLP for Singlish.
pdf
bib
abs
From Informal to Formal – Incorporating and Evaluating LLMs on Natural Language Requirements to Verifiable Formal Proofs
Jialun Cao
|
Yaojie Lu
|
Meiziniu Li
|
Haoyang Ma
|
Haokun Li
|
Mengda He
|
Cheng Wen
|
Le Sun
|
Hongyu Zhang
|
Shengchao Qin
|
Shing-Chi Cheung
|
Cong Tian
Research in AI-based formal mathematical reasoning has shown a strong growth trend. These studies have made significant progress and have excelled in mathematical competitions such as the IMO. However, they intertwine multiple skills simultaneously (problem-solving, reasoning, and writing formal specifications), making it hard to precisely identify the LLMs’ strengths and weaknesses in each task. This paper focuses on formal verification, an immediate application scenario of formal reasoning, and breaks it down into sub-tasks. We constructed 18k high-quality instruction-response pairs across five mainstream formal specification languages (Coq, Lean4, Dafny, ACSL, and TLA+) in six tasks by distilling gpt-4o and evaluated against ten open-sourced LLMs, including the recently popular DeepSeek-R1. We found that LLMs are good at writing proof segments when given either the code or a detailed description of the proof steps. Fine-tuning also brought up to a nearly threefold improvement. Interestingly, we observed that fine-tuning with formal data also enhances abilities in mathematics, reasoning, and coding. We hope our findings inspire further research.
pdf
bib
abs
CoAM: Corpus of All-Type Multiword Expressions
Yusuke Ide
|
Joshua Tanner
|
Adam Nohejl
|
Jacob Hoffman
|
Justin Vasselli
|
Hidetaka Kamigaito
|
Taro Watanabe
Multiword expressions (MWEs) refer to idiomatic sequences of multiple words. MWE identification, i.e., detecting MWEs in text, can play a key role in downstream tasks such as machine translation, but existing datasets for the task are inconsistently annotated, limited to a single type of MWE, or limited in size. To enable reliable and comprehensive evaluation, we created CoAM: Corpus of All-Type Multiword Expressions, a dataset of 1.3K sentences constructed through a multi-step process to enhance data quality consisting of human annotation, human review, and automated consistency checking. Additionally, for the first time in a dataset of MWE identification, CoAM’s MWEs are tagged with MWE types, such as Noun and Verb, enabling fine-grained error analysis. Annotations for CoAM were collected using a new interface created with our interface generator, which allows easy and flexible annotation of MWEs in any form. Through experiments using CoAM, we find that a fine-tuned large language model outperforms MWEasWSD, which achieved the state-of-the-art performance on the DiMSUM dataset. Furthermore, analysis using our MWE type tagged data reveals that Verb MWEs are easier than Noun MWEs to identify across approaches.
pdf
bib
abs
SeaKR: Self-aware Knowledge Retrieval for Adaptive Retrieval Augmented Generation
Zijun Yao
|
Weijian Qi
|
Liangming Pan
|
Shulin Cao
|
Linmei Hu
|
Liu Weichuan
|
Lei Hou
|
Juanzi Li
Adaptive Retrieval-Augmented Generation (RAG) is an effective strategy to alleviate hallucination of large language models (LLMs). It dynamically determines whether LLMs need external knowledge for generation and invokes retrieval accordingly. This paper introduces Self-aware Knowledge Retrieval (SeaKR), a novel adaptive RAG model that extracts self-aware uncertainty of LLMs from their internal states. SeaKR activates retrieval when the LLMs present high self-aware uncertainty for generation. To effectively integrate retrieved knowledge snippets, SeaKR re-ranks them based on LLM’s self-aware uncertainty to preserve the snippet that reduces their uncertainty to the utmost. To facilitate solving complex tasks that require multiple retrievals, SeaKR utilizes their self-aware uncertainty to choose among different reasoning strategies. Our experiments on both complex and simple Question Answering datasets show that SeaKR outperforms existing adaptive RAG methods.
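The control flow of uncertainty-gated retrieval and snippet re-ranking can be sketched as follows. This is a simplified illustration under a loud assumption: the abstract states that SeaKR reads uncertainty from the LLM's internal states, whereas this sketch swaps in the Shannon entropy of a next-token distribution as a stand-in. The names `token_entropy`, `should_retrieve`, and `pick_snippet`, the threshold, and the probability values are all hypothetical.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of a next-token distribution; a stand-in uncertainty score."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def should_retrieve(probs, threshold):
    """Trigger retrieval only when the stand-in uncertainty exceeds the threshold."""
    return token_entropy(probs) > threshold


def pick_snippet(probs_by_snippet):
    """Keep the snippet whose conditioned distribution has the lowest uncertainty."""
    return min(probs_by_snippet, key=lambda s: token_entropy(probs_by_snippet[s]))


# Hypothetical next-token distributions over a four-token vocabulary.
confident = [0.97, 0.01, 0.01, 0.01]
uncertain = [0.25, 0.25, 0.25, 0.25]
THRESHOLD = 1.0  # assumed value, would need tuning in practice

# Hypothetical distributions after conditioning on each retrieved snippet.
after_snippet = {
    "snippet_a": [0.4, 0.3, 0.2, 0.1],
    "snippet_b": [0.9, 0.05, 0.03, 0.02],
}
best = pick_snippet(after_snippet)
```

Here retrieval is skipped for the confident distribution and triggered for the uniform one, and `snippet_b` is kept because it leaves the model least uncertain.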
pdf
bib
abs
Exposing the Achilles’ Heel: Evaluating LLMs Ability to Handle Mistakes in Mathematical Reasoning
Joykirat Singh
|
Akshay Nambi
|
Vibhav Vineet
Large Language Models (LLMs) have significantly impacted the field of Math Word Problems (MWPs), transforming how these problems are approached and solved, particularly in educational contexts. However, existing evaluations often focus on final accuracy, neglecting the critical aspect of reasoning capabilities. This work addresses that gap by evaluating LLMs’ abilities to detect and correct reasoning mistakes. We present a novel dataset, MWP-MISTAKE, containing MWPs with both correct and incorrect reasoning steps generated through rule-based methods and smaller language models. Our comprehensive benchmarking of state-of-the-art models such as GPT-4o and GPT4 uncovers important insights into their strengths and limitations. While GPT-4o excels in mistake detection and rectification, gaps remain, particularly in handling complex datasets and novel problems. Additionally, we identify concerns with data contamination and memorization, which affect LLM reliability in real-world applications. While OpenAI’s O1 model demonstrates 90% accuracy in reasoning and final answers on complex tasks, it remains weak in mistake detection. Our findings highlight the need for improved reasoning evaluations and suggest ways to enhance LLM generalization and robustness in math problem-solving.
pdf
bib
abs
Understanding the Dark Side of LLMs’ Intrinsic Self-Correction
Qingjie Zhang
|
Di Wang
|
Haoting Qian
|
Yiming Li
|
Tianwei Zhang
|
Minlie Huang
|
Ke Xu
|
Hewu Li
|
Liu Yan
|
Han Qiu
Intrinsic self-correction was initially proposed to improve LLMs’ responses via feedback solely based on their inherent capability. However, recent works show that LLMs’ intrinsic self-correction fails without oracle labels as feedback. In this paper, our research goal is to *interpret LLMs’ intrinsic self-correction for different tasks, especially for those failure cases.* By including one simple task and three complex tasks with state-of-the-art (SOTA) LLMs like ChatGPT, Llama, and DeepSeek, we design three interpretation methods to reveal the dark side of LLMs’ intrinsic self-correction. We find that intrinsic self-correction can (1) cause LLMs to waver in both intermediate and final answers and lead to prompt bias on simple factual questions; (2) introduce human-like cognitive bias on complex tasks. In light of our findings, we also provide two simple yet effective strategies for alleviation: question repeating and supervised fine-tuning with a few samples. We open-source our work at https://x-isc.info/.
pdf
bib
abs
VideoVista-CulturalLingo: 360° Horizons-Bridging Cultures, Languages, and Domains in Video Comprehension
Xinyu Chen
|
Yunxin Li
|
Haoyuan Shi
|
Baotian Hu
|
Wenhan Luo
|
Yaowei Wang
|
Min Zhang
Assessing the video comprehension capabilities of multimodal AI systems can effectively measure their understanding and reasoning abilities. Most video evaluation benchmarks are limited to a single language, typically English, and predominantly feature videos rooted in Western cultural contexts. In this paper, we present **VideoVista-CulturalLingo**, the first video evaluation benchmark designed to bridge cultural, linguistic, and domain divides in video comprehension. Our work differs from existing benchmarks in the following ways: 1) **Cultural diversity**, incorporating cultures from China, North America, and Europe; 2) **Multi-linguistics**, with questions presented in Chinese and English, two of the most widely spoken languages; and 3) **Broad domain**, featuring videos sourced from hundreds of human-created domains. VideoVista-CulturalLingo contains 1,389 videos and 3,134 QA pairs, and we have evaluated 24 recent open-source or proprietary video large models. From the experimental results, we observe that: 1) Existing models perform worse on Chinese-centric questions than Western-centric ones, particularly those related to Chinese history; 2) Current open-source models still exhibit limitations in temporal understanding, especially in the Event Localization task, achieving a maximum score of only 45.2%; 3) Mainstream models demonstrate strong performance in general scientific questions, while open-source models demonstrate weak performance in mathematics.
pdf
bib
abs
What are the Essential Factors in Crafting Effective Long Context Multi-Hop Instruction Datasets? Insights and Best Practices
Zhi Chen
|
Qiguang Chen
|
Libo Qin
|
Qipeng Guo
|
Haijun Lv
|
Yicheng Zou
|
Hang Yan
|
Kai Chen
|
Dahua Lin
Recent advancements in large language models (LLMs) with extended context windows have significantly improved various tasks. To improve long-context capabilities, much work focuses on augmenting LLM’s capabilities with synthetic data. Existing methods often leverage the Self-Instruct framework to generate long-context instruction-tuning data. However, our preliminary experiments show that fewer than 35% of samples generated by Qwen-2-72B are multi-hop, and over 40% exhibit poor quality, limiting comprehensive understanding and further research. To address this, we propose the Multi-agent Interactive Multi-hop Generation (MIMG) framework, which integrates a quality verification agent, a single-hop question generation agent, a multiple question sampling strategy, and a multi-hop question merger agent. This framework significantly improves data quality, with high-quality, multi-hop, and diverse data. Furthermore, we conduct a thorough analysis of document selection, question merging, and validation techniques through extensive experiments across various models. Our results demonstrate that synthetic high-quality long-context instruction data can enhance model performance, surpassing even models trained on larger amounts of human-annotated data.
pdf
bib
abs
Knowledge Graph Retrieval-Augmented Generation for LLM-based Recommendation
Shijie Wang
|
Wenqi Fan
|
Yue Feng
|
Lin Shanru
|
Xinyu Ma
|
Shuaiqiang Wang
|
Dawei Yin
Recommender systems have become increasingly vital in our daily lives, helping to alleviate the problem of information overload across various user-oriented online services. The emergence of Large Language Models (LLMs) has yielded remarkable achievements, demonstrating their potential for the development of next-generation recommender systems. Despite these advancements, LLM-based recommender systems face inherent limitations stemming from their LLM backbones, particularly issues of hallucinations and the lack of up-to-date and domain-specific knowledge. Recently, Retrieval-Augmented Generation (RAG) has garnered significant attention for addressing these limitations by leveraging external knowledge sources to enhance the understanding and generation of LLMs. However, vanilla RAG methods often introduce noise and neglect structural relationships in knowledge, limiting their effectiveness in LLM-based recommendations. To address these limitations, we propose to retrieve high-quality and up-to-date structure information from the knowledge graph (KG) to augment recommendations. Specifically, our approach develops a retrieval-augmented framework, termed K-RagRec, that facilitates the recommendation generation process by incorporating structure information from the external KG. Extensive experiments have been conducted to demonstrate the effectiveness of our proposed method.
pdf
bib
abs
SudoLM: Learning Access Control of Parametric Knowledge with Authorization Alignment
Qin Liu
|
Fei Wang
|
Chaowei Xiao
|
Muhao Chen
Existing preference alignment is a one-size-fits-all alignment mechanism, where the part of the large language model (LLM) parametric knowledge with non-preferred features is uniformly blocked to all the users. However, this part of knowledge can be useful to advanced users whose expertise qualifies them to handle this information. The one-size-fits-all alignment mechanism undermines LLM’s utility for these qualified users. To address this problem, we propose SudoLM, a framework that lets LLMs learn access control over specific parametric knowledge for users with different credentials via authorization alignment. SudoLM allows authorized users to unlock their access to all the parametric knowledge with an assigned Sudo key while blocking access to non-qualified users. Experiments on two application scenarios demonstrate that SudoLM effectively controls the user’s access to the parametric knowledge and maintains its general utility.
pdf
bib
abs
I0T: Embedding Standardization Method Towards Zero Modality Gap
Na Min An
|
Eunki Kim
|
James Thorne
|
Hyunjung Shim
Contrastive Language-Image Pretraining (CLIP) enables zero-shot inference in downstream tasks such as image-text retrieval and classification. However, recent works extending CLIP suffer from the issue of *modality gap*, which arises when the image and text embeddings are projected to disparate manifolds, deviating from the intended objective of image-text contrastive learning. We discover that this phenomenon is linked to the modality-specific characteristic that each image or text encoder independently possesses. Herein, we propose two methods to address the modality gap: (1) a post-hoc embedding standardization method, I0Tpost, that reduces the modality gap approximately to zero and (2) a trainable method, I0Tasync, that alleviates the modality gap problem by adding two normalization layers for each encoder. Our I0T framework can significantly reduce the modality gap while preserving the original embedding representations of trained models with their locked parameters. In practice, I0Tpost can serve as an alternative explainable automatic evaluation metric to the widely used CLIPScore (CLIP-S). The code is available at https://github.com/xfactlab/I0T.
pdf
bib
abs
Odysseus Navigates the Sirens’ Song: Dynamic Focus Decoding for Factual and Diverse Open-Ended Text Generation
Wen Luo
|
Feifan Song
|
Wei Li
|
Guangyue Peng
|
Shaohang Wei
|
Houfeng Wang
Large Language Models (LLMs) are increasingly required to generate text that is both factually accurate and diverse across various open-ended applications. However, current stochastic decoding methods struggle to balance such objectives. We introduce Dynamic Focus Decoding (DFD), a novel plug-and-play stochastic approach that resolves this trade-off without requiring additional data, knowledge, or models. DFD adaptively adjusts the decoding focus based on distributional differences across layers, leveraging the modular and hierarchical nature of factual knowledge within LLMs. This dynamic adjustment improves factuality in knowledge-intensive decoding steps and promotes diversity in less knowledge-reliant steps. DFD can be easily integrated with existing decoding methods, enhancing both factuality and diversity with minimal computational overhead. Extensive experiments across seven datasets demonstrate that DFD significantly improves performance, providing a scalable and efficient solution for open-ended text generation.
pdf
bib
abs
Better Embeddings with Coupled Adam
Felix Stollenwerk
|
Tobias Stollenwerk
Despite their remarkable capabilities, LLMs learn word representations that exhibit the undesirable yet poorly understood feature of anisotropy. In this paper, we argue that the second moment in Adam is a cause of anisotropic embeddings, and suggest a modified optimizer called Coupled Adam to mitigate the problem. Our experiments demonstrate that Coupled Adam significantly improves the quality of embeddings, while also leading to better upstream and downstream performance on large enough datasets.
pdf
bib
abs
Bone Soups: A Seek-and-Soup Model Merging Approach for Controllable Multi-Objective Generation
Guofu Xie
|
Xiao Zhang
|
Ting Yao
|
Yunsheng Shi
User information needs are often highly diverse and varied. A key challenge in current research is how to achieve controllable multi-objective generation while enabling rapid adaptation to accommodate diverse user demands during test time. Existing solutions, such as Rewarded Soup, focus on merging language models individually tuned on single objectives. While easy to implement and widely used, these approaches face limitations in achieving optimal performance due to their disregard for the impacts of competing objectives on model tuning. To address this issue, we propose **Bone Soup**, a novel model merging approach that first seeks a series of back**bone** models by considering the impacts of multiple objectives and then makes the **soup** (i.e., merge the backbone models). Specifically, Bone Soup begins by training multiple backbone models for different objectives using multi-objective reinforcement learning. Each backbone model is guided by a combination of backbone reward signals. To ensure that these models are optimal for the Pareto front, the backbone rewards are crafted by combining standard reward functions into basis vectors, which can then be modified through a rule-based construction method. Bone Soup leverages a symmetric circulant matrix mapping to generate the merging coefficients, which are used to merge the backbone models according to user preferences. Extensive experimental results demonstrate that Bone Soup exhibits strong controllability and Pareto optimality in controllable multi-objective generation, providing a more effective and efficient approach to addressing diverse user needs at test time.
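The merging step shared by soup-style approaches can be illustrated with plain linear interpolation of parameters. The sketch below is a generic weighted average in weight space and makes no attempt to reproduce Bone Soup's backbone reward construction or its symmetric circulant matrix mapping; here the coefficients are simply given. The function `merge_models`, the parameter name `"w"`, and the toy values are hypothetical.

```python
def merge_models(state_dicts, coefficients):
    """Merge models by a normalized weighted average of their parameters."""
    if len(state_dicts) != len(coefficients):
        raise ValueError("need exactly one coefficient per model")
    total = sum(coefficients)
    if total <= 0:
        raise ValueError("coefficients must sum to a positive value")
    # Normalize so the merged parameters stay on the scale of the inputs.
    weights = [c / total for c in coefficients]
    merged = {}
    for name, reference in state_dicts[0].items():
        merged[name] = [
            sum(w * sd[name][i] for w, sd in zip(weights, state_dicts))
            for i in range(len(reference))
        ]
    return merged


# Two toy "models", each tuned toward a different objective (hypothetical values).
model_a = {"w": [1.0, 0.0]}
model_b = {"w": [0.0, 1.0]}
# A user preference of 3:1 in favor of the first objective.
merged = merge_models([model_a, model_b], coefficients=[3, 1])
```

Changing the coefficients at test time yields a different merged model without any retraining, which is what makes this family of methods attractive for adapting to user preferences.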
pdf
bib
abs
Controllable and Reliable Knowledge-Intensive Task-Oriented Conversational Agents with Declarative Genie Worksheets
Harshit Joshi
|
Shicheng Liu
|
James Chen
|
Larsen Weigle
|
Monica Lam
Large Language Models are capable of carrying out human-like conversations in diverse settings in response to user requests for tasks and knowledge. However, existing conversational agents implemented with LLMs often struggle with hallucination, following instructions with conditional logic, and integrating knowledge from different sources. These shortcomings compromise the agents’ effectiveness, rendering them unsuitable for deployment. To address these challenges, we introduce Genie, a programmable framework for creating knowledge-intensive task-oriented conversational agents that handle involved interactions and answer complex queries. Unlike LLMs, Genie delivers reliable, grounded responses through advanced dialogue state management and supports controllable agent policies via its declarative specification – Genie Worksheet. This is achieved through an algorithmic runtime system that implements the developer-supplied policy, limiting LLMs to (1) parse user input using a succinct conversational history, and (2) generate responses according to supplied content. Agents built with Genie outperform SOTA methods on complex logic dialogue datasets by up to 20.5%. We conducted a user study with 62 participants. Genie agents with GPT-4 Turbo outperformed the GPT-4 Turbo agents with function calling, improving goal completion rates from 21.8% to 82.8% across three real-world tasks.
pdf
bib
abs
Benchmarking Long-Context Language Models on Long Code Understanding
Jia Li
|
Xuyuan Guo
|
Lei Li
|
Kechi Zhang
|
Ge Li
|
Jia Li
|
Zhengwei Tao
|
Fang Liu
|
Chongyang Tao
|
Yuqi Zhu
|
Zhi Jin
Current advanced long-context language models (LCLMs) offer great potential for real-world software engineering applications. However, progress in this critical domain remains hampered by a fundamental limitation: the absence of a rigorous evaluation framework for long code understanding. To bridge this gap, we propose LongCodeU, a long code understanding benchmark that covers four aspects (8 tasks) to evaluate the long code understanding ability LCLMs need for practical applications: code unit perception, intra-code unit understanding, inter-code unit relation understanding, and long code documentation understanding. We evaluate 9 popular LCLMs on LongCodeU (i.e., 6 general models and 3 code models). Our experimental results reveal key limitations in current LCLMs’ capabilities for long code understanding. In particular, the performance of LCLMs drops dramatically when the long code length is greater than 32K, falling far short of their claimed 128K to 1M context windows. Among the four aspects, inter-code unit relation understanding is the most challenging for LCLMs. Our study provides valuable insights for optimizing LCLMs and driving advancements in software engineering.
pdf
bib
abs
MAGNET: Augmenting Generative Decoders with Representation Learning and Infilling Capabilities
Savya Khosla
|
Aditi Tiwari
|
Kushal Kafle
|
Simon Jenni
|
Handong Zhao
|
John Collomosse
|
Jing Shi
While originally designed for unidirectional generative modeling, decoder-only large language models (LLMs) are increasingly being adapted for bidirectional modeling. However, unidirectional and bidirectional models are typically trained separately with distinct objectives (generation and representation learning). This separation overlooks the opportunity for developing a more versatile language model and for these objectives to complement each other. In this work, we propose MAGNET, a method for adapting decoder-only LLMs to generate robust representations and infill missing text spans. MAGNET employs three self-supervised training objectives and introduces an attention mechanism that combines bidirectional and causal attention, enabling unified training across all objectives. Our results demonstrate that LLMs adapted with MAGNET (1) surpass strong text encoders on token-level and sentence-level representation learning tasks, (2) generate contextually appropriate text infills by leveraging past and future contexts, (3) perform open-ended text generation without excessive repetition of words or phrases, and (4) preserve the knowledge and reasoning capability gained by the LLM during pretraining.
pdf
bib
abs
Internal Value Alignment in Large Language Models through Controlled Value Vector Activation
Haoran Jin
|
Meng Li
|
Xiting Wang
|
Zhihao Xu
|
Minlie Huang
|
Yantao Jia
|
Defu Lian
Aligning Large Language Models (LLMs) with human values has attracted increasing attention since it provides clarity, transparency, and the ability to adapt to evolving scenarios. In this paper, we introduce a Controlled Value Vector Activation (ConVA) method that directly aligns the internal values of LLMs by interpreting how a value is encoded in their latent representations and modifying relevant activations to ensure consistent values in LLMs. To ensure an accurate and unbiased interpretation, we propose a context-controlled value vector identification method. To consistently control values without sacrificing model performance, we introduce a gated value vector activation method for effective yet minimal value control. Experiments show that our method achieves the highest control success rate across 10 basic values without hurting LLM performance and fluency, and ensures target values even with opposite and potentially malicious input prompts. Source code and data are available at https://github.com/hr-jin/ConVA.
pdf
bib
abs
A Dual-Perspective NLG Meta-Evaluation Framework with Automatic Benchmark and Better Interpretability
Xinyu Hu
|
Mingqi Gao
|
Li Lin
|
Zhenghan Yu
|
Xiaojun Wan
In NLG meta-evaluation, evaluation metrics are typically assessed based on their consistency with humans. However, we identify some limitations in traditional NLG meta-evaluation approaches, such as issues in handling human ratings and ambiguous selections of correlation measures, which undermine the effectiveness of meta-evaluation. In this work, we propose a dual-perspective NLG meta-evaluation framework that focuses on different evaluation capabilities, thereby providing better interpretability. In addition, we introduce a method of automatically constructing the corresponding benchmarks without requiring new human annotations. Furthermore, we conduct experiments with 16 representative LLMs as the evaluators based on our proposed framework, comprehensively analyzing their evaluation performance from different perspectives.
pdf
bib
abs
Recurrent Knowledge Identification and Fusion for Language Model Continual Learning
Yujie Feng
|
Xujia Wang
|
Zexin Lu
|
Shenghong Fu
|
Guangyuan Shi
|
Yongxin Xu
|
Yasha Wang
|
Philip S. Yu
|
Xu Chu
|
Xiao-Ming Wu
Continual learning (CL) is crucial for deploying large language models (LLMs) in dynamic real-world environments without costly retraining. While recent model ensemble and model merging methods guided by parameter importance have gained popularity, they often struggle to balance knowledge transfer and forgetting, mainly due to the reliance on static importance estimates during sequential training. In this paper, we present Recurrent-KIF, a novel CL framework for Recurrent Knowledge Identification and Fusion, which enables dynamic estimation of parameter importance distributions to enhance knowledge transfer. Inspired by human continual learning, Recurrent-KIF employs an inner loop that rapidly adapts to new tasks while identifying important parameters, coupled with an outer loop that globally manages the fusion of new and historical knowledge through redundant knowledge pruning and key knowledge merging. These inner-outer loops iteratively perform multiple rounds of fusion, allowing Recurrent-KIF to leverage intermediate training information and adaptively adjust fusion strategies based on evolving importance distributions. Extensive experiments on two CL benchmarks with various model sizes (from 770M to 13B) demonstrate that Recurrent-KIF effectively mitigates catastrophic forgetting and enhances knowledge transfer.
pdf
bib
abs
Data-Constrained Synthesis of Training Data for De-Identification
Thomas Vakili
|
Aron Henriksson
|
Hercules Dalianis
Many sensitive domains — such as the clinical domain — lack widely available datasets due to privacy risks. The increasing generative capabilities of large language models (LLMs) have made synthetic datasets a viable path forward. In this study, we domain-adapt LLMs to the clinical domain and generate synthetic clinical texts that are machine-annotated with tags for personally identifiable information using capable encoder-based NER models. The synthetic corpora are then used to train synthetic NER models. The results show that training NER models using synthetic corpora incurs only a small drop in predictive performance. The limits of this process are investigated in a systematic ablation study — using both Swedish and Spanish data. Our analysis shows that smaller datasets can be sufficient for domain-adapting LLMs for data synthesis. Instead, the effectiveness of this process is almost entirely contingent on the performance of the machine-annotating NER models trained using the original data.
pdf
bib
abs
Just a Scratch: Enhancing LLM Capabilities for Self-harm Detection through Intent Differentiation and Emoji Interpretation
Soumitra Ghosh
|
Gopendra Vikram Singh
|
Shambhavi Shambhavi
|
Sabarna Choudhury
|
Asif Ekbal
Self-harm detection on social media is critical for early intervention and mental health support, yet remains challenging due to the subtle, context-dependent nature of such expressions. Identifying self-harm intent aids suicide prevention by enabling timely responses, but current large language models (LLMs) struggle to interpret implicit cues in casual language and emojis. This work enhances LLMs’ comprehension of self-harm by distinguishing intent through nuanced language–emoji interplay. We present the Centennial Emoji Sensitivity Matrix (CESM-100), a curated set of 100 emojis with contextual self-harm interpretations, and the Self-Harm Identification aNd intent Extraction with Supportive emoji sensitivity (SHINES) dataset, offering detailed annotations for self-harm labels, casual mentions (CMs), and serious intents (SIs). Our unified framework (a) enriches inputs using CESM-100; (b) fine-tunes LLMs for multi-task learning, with self-harm detection as the primary task and CM/SI span detection as the auxiliary task; and (c) generates explainable rationales for self-harm predictions. We evaluate the framework on three state-of-the-art LLMs (Llama 3, Mental-Alpaca, and MentalLlama) across zero-shot, few-shot, and fine-tuned scenarios. By coupling intent differentiation with contextual cues, our approach commendably enhances LLM performance in both detection and explanation tasks, effectively addressing the inherent ambiguity in self-harm signals. The SHINES dataset, CESM-100 and codebase are publicly available at: https://www.iitp.ac.in/%7eai-nlp-ml/resources.html#SHINES
pdf
bib
abs
Contrastive Learning on LLM Back Generation Treebank for Cross-domain Constituency Parsing
Peiming Guo
|
Meishan Zhang
|
Jianling Li
|
Min Zhang
|
Yue Zhang
Cross-domain constituency parsing is still an unsolved challenge in computational linguistics since the available multi-domain constituency treebank is limited. We investigate automatic treebank generation by large language models (LLMs) in this paper. The performance of LLMs on constituency parsing is poor; therefore, we propose a novel treebank generation method, LLM back generation, which is similar to the reverse process of constituency parsing. LLM back generation takes the incomplete cross-domain constituency tree with only domain keyword leaf nodes as input and fills the missing words to generate the cross-domain constituency treebank. In addition, we introduce a span-level contrastive learning pre-training strategy to make full use of the LLM back generation treebank for cross-domain constituency parsing. We verify the effectiveness of our LLM back generation treebank coupled with contrastive learning pre-training on five target domains of MCTB. Experimental results show that our approach achieves state-of-the-art performance on average results compared with various baselines.
pdf
bib
abs
MMDEND: Dendrite-Inspired Multi-Branch Multi-Compartment Parallel Spiking Neuron for Sequence Modeling
Kexin Wang
|
Yuhong Chou
|
Di Shang
|
Shijie Mei
|
Jiahong Zhang
|
Yanbin Huang
|
Man Yao
|
Bo Xu
|
Guoqi Li
Vanilla spiking neurons are simplified from complex biological neurons with dendrites, soma, and synapses, into single somatic compartments. Due to limitations in performance and training efficiency, vanilla spiking neurons face significant challenges in modeling long sequences. In terms of performance, the oversimplified dynamics of spiking neurons omit long-term temporal dependencies. Additionally, the long-tail membrane potential distribution and binary activation discretization errors further limit their capacity to model long sequences. In terms of efficiency, the serial mechanism of spiking neurons leads to excessively long training times for long sequences. Though parallel spiking neurons are an efficient solution, their number of parameters is often tied to the hidden dimension or sequence length, which makes current parallel neurons unsuitable for large architectures. To address these issues, we propose **MMDEND**: a Multi-Branch Multi-Compartment Parallel Spiking Dendritic Neuron. Its proportion-adjustable multi-branch, multi-compartment structure enables long-term temporal dependencies. Additionally, we introduce a Scaling-Shifting Integer Firing (SSF) mechanism that fits the long-tail membrane potential distribution, retains efficiency, and mitigates discretization errors. Compared with parallel neurons, MMDEND achieves better long-sequence modeling capability with fewer parameters and lower energy consumption. Visualization also confirms that the SSF mechanism effectively fits long-tail distributions.
pdf
bib
abs
Understanding Impact of Human Feedback via Influence Functions
Taywon Min
|
Haeone Lee
|
Yongchan Kwon
|
Kimin Lee
In Reinforcement Learning from Human Feedback (RLHF), it is crucial to learn suitable reward models from human feedback to align large language models (LLMs) with human intentions. However, human feedback can often be noisy, inconsistent, or biased, especially when evaluating complex responses. Such feedback can lead to misaligned reward signals, potentially causing unintended side effects during the RLHF process. To address these challenges, we explore the use of influence functions to measure the impact of human feedback on the performance of reward models. We propose a compute-efficient approximation method that enables the application of influence functions to LLM-based reward models and large-scale preference datasets. Our experiments showcase two key applications of influence functions: (1) detecting common labeler biases in human feedback datasets and (2) guiding labelers in refining their strategies to better align with expert feedback. By quantifying the impact of human feedback, we believe that influence functions can enhance feedback interpretability and contribute to scalable oversight in RLHF, helping labelers provide more accurate and consistent feedback. Source code is available at https://github.com/mintaywon/IF_RLHF.
pdf
bib
abs
T2I-FactualBench: Benchmarking the Factuality of Text-to-Image Models with Knowledge-Intensive Concepts
Ziwei Huang
|
Wanggui He
|
Quanyu Long
|
Yandi Wang
|
Haoyuan Li
|
Zhelun Yu
|
Fangxun Shu
|
Weilong Dai
|
Hao Jiang
|
Fei Wu
|
Leilei Gan
Most existing studies on evaluating text-to-image (T2I) models primarily focus on evaluating text-image alignment, image quality, and object composition capabilities, with comparatively fewer studies addressing the evaluation of the factuality of the synthesized images, particularly when the images involve knowledge-intensive concepts. In this work, we present T2I-FactualBench, the largest benchmark to date (in terms of the number of concepts and prompts) specifically designed to evaluate the factuality of knowledge-intensive concept generation. T2I-FactualBench consists of a three-tiered knowledge-intensive text-to-image generation framework, ranging from the basic memorization of individual knowledge concepts to the more complex composition of multiple knowledge concepts. We further introduce a multi-round visual question answering (VQA)-based evaluation framework to assess the factuality of three-tiered knowledge-intensive text-to-image generation tasks. Experiments on T2I-FactualBench indicate that current state-of-the-art (SOTA) T2I models still leave significant room for improvement. We release our datasets and code at https://github.com/Safeoffellow/T2I-FactualBench.
pdf
bib
abs
InspireDebate: Multi-Dimensional Subjective-Objective Evaluation-Guided Reasoning and Optimization for Debating
Fuyu Wang
|
Jiangtong Li
|
Kun Zhu
|
Changjun Jiang
With the rapid advancements in large language models (LLMs), debating tasks, such as argument quality assessment and debate process simulation, have made significant progress. However, existing LLM-based debating systems focus on responding to specific arguments while neglecting objective assessments such as authenticity and logical validity. Furthermore, these systems lack a structured approach to optimize across various dimensions—including evaluation metrics, chain-of-thought (CoT) reasoning, and multi-turn debate refinement—thereby limiting their effectiveness. To address these interconnected challenges, we propose a dual-component framework: (1) InspireScore, a novel evaluation system that establishes a multi-dimensional assessment architecture incorporating four subjective criteria (emotional appeal, argument clarity, argument arrangement, and topic relevance) alongside two objective metrics (fact authenticity and logical validity); and (2) InspireDebate, an optimized debating framework employing a phased optimization approach through CoT reasoning enhancement, multi-dimensional Direct Preference Optimization (DPO), and real-time knowledge grounding via web-based Retrieval Augmented Generation (Web-RAG). Empirical evaluations demonstrate that InspireScore achieves 44% higher correlation with expert judgments compared to existing methods, while InspireDebate shows significant improvements, outperforming baseline models by 57%. Source code is available at https://github.com/fywang12/InspireDebate.
pdf
bib
abs
OpenWebVoyager: Building Multimodal Web Agents via Iterative Real-World Exploration, Feedback and Optimization
Hongliang He
|
Wenlin Yao
|
Kaixin Ma
|
Wenhao Yu
|
Hongming Zhang
|
Tianqing Fang
|
Zhenzhong Lan
|
Dong Yu
The advancement of foundation models has laid the groundwork for building autonomous agents for complex tasks such as web navigation. Recent efforts have also tried to equip the agent with the ability to explore environments and continuously improve over time. However, existing works only focused on building text-only agents in synthetic environments where the reward signals are clearly defined. Such agents can hardly generalize to realistic settings that require multimodal perception ability and provide no ground-truth signal. In this paper, we introduce an innovative multimodal web agent that can autonomously conduct real-world exploration and improve itself. We first train the base model with imitation learning to gain the basic abilities. We then let the agent explore the open web and collect feedback on its trajectories. After that, it further improves its policy by learning from well-performing trajectories judged by another general-purpose model. This exploration-feedback-optimization cycle can continue for several iterations. Experimental results show that our web agent successfully improves itself after each iteration, demonstrating strong performance across multiple test sets. We will release our code and model to encourage future research in this field.
pdf
bib
abs
FOCUS: Evaluating Pre-trained Vision-Language Models on Underspecification Reasoning
Kankan Zhou
|
Eason Lai
|
Kyriakos Mouratidis
|
Jing Jiang
Humans possess a remarkable ability to interpret underspecified ambiguous statements by inferring their meanings from contexts such as visual inputs. This ability, however, may not be as developed in recent pre-trained vision-language models (VLMs). In this paper, we introduce a novel probing dataset called FOCUS to evaluate whether state-of-the-art VLMs have this ability. FOCUS consists of underspecified sentences paired with image contexts and carefully designed probing questions. Our experiments reveal that VLMs still fall short in handling underspecification even when visual inputs that can help resolve the ambiguities are available. To further support research in underspecification, FOCUS will be released for public use. We hope this dataset will inspire further research on the reasoning and contextual understanding capabilities of VLMs.
pdf
bib
abs
Sightation Counts: Leveraging Sighted User Feedback in Building a BLV-aligned Dataset of Diagram Descriptions
Wan Ju Kang
|
Eunki Kim
|
Na Min An
|
Sangryul Kim
|
Haemin Choi
|
Ki Hoon Kwak
|
James Thorne
Often, the needs and visual abilities differ between the annotator group and the end user group. Generating detailed diagram descriptions for blind and low-vision (BLV) users is one such challenging domain. Sighted annotators could describe visuals with ease, but existing studies have shown that direct generations by them are costly, bias-prone, and somewhat lacking by BLV standards. In this study, we ask sighted individuals to assess, rather than produce, diagram descriptions generated by vision-language models (VLMs) that have been guided with latent supervision via multi-pass inference. The sighted assessments prove effective and useful to professional educators who are themselves BLV and teach visually impaired learners. We release Sightation, a collection of diagram description datasets spanning 5k diagrams and 137k samples for completion, preference, retrieval, question answering, and reasoning training purposes, and demonstrate their fine-tuning potential in various downstream tasks.
pdf
bib
abs
Personal Travel Solver: A Preference-Driven LLM-Solver System for Travel Planning
Zijian Shao
|
Jiancan Wu
|
Weijian Chen
|
Xiang Wang
Personal travel planning is a challenging task that aims to find a feasible plan that not only satisfies diverse constraints but also meets the demands of the user’s explicit and implicit preferences. In this paper, we study how to integrate the user’s implicit preference into the process of travel planning. We introduce RealTravel, an augmented version of TravelPlanner that incorporates real user reviews and point-of-interest metadata from Google Local. Based on RealTravel, we propose Personal Travel Solver (PTS), an integrated system that combines LLMs with numerical solvers to generate travel plans that satisfy both explicit constraints and implicit user preferences. PTS employs a novel architecture that seamlessly connects explicit constraint validation with implicit preference modeling through five specialized modules. The experimental results demonstrate the system’s effectiveness: PTS outperforms baseline methods and improves the level of personalization. Our data and code are available at [PersonalTravelSolver](https://github.com/cliftclift/PTS).
pdf
bib
abs
Counterspeech the ultimate shield! Multi-Conditioned Counterspeech Generation through Attributed Prefix Learning
Aswini Kumar Padhi
|
Anil Bandhakavi
|
Tanmoy Chakraborty
Counterspeech has proven to be a powerful tool to combat hate speech online. Previous studies have focused on generating counterspeech conditioned only on specific intents (single attributed). However, a holistic approach considering multiple attributes simultaneously can yield more nuanced and effective responses. Here, we introduce HiPPrO, Hierarchical Prefix learning with Preference Optimization, a novel two-stage framework that utilizes the effectiveness of attribute-specific prefix embedding spaces hierarchically optimized during the counterspeech generation process in the first phase. Thereafter, we incorporate both reference and reward-free preference optimization to generate more constructive counterspeech. Furthermore, we extend IntentCONANv2 by annotating all 13,973 counterspeech instances with emotion labels by five annotators. HiPPrO leverages hierarchical prefix optimization to integrate these dual attributes effectively. An extensive evaluation demonstrates that HiPPrO achieves a 38% improvement in intent conformity and improvements of 3%, 2%, and 3% in Rouge-1, Rouge-2, and Rouge-L, respectively, compared to several baseline models. Human evaluations further substantiate the superiority of our approach, highlighting the enhanced relevance and appropriateness of the generated counterspeech. This work underscores the potential of multi-attribute conditioning in advancing the efficacy of counterspeech generation systems. Our code is available on GitHub and the dataset is open-sourced on Hugging Face.
pdf
bib
abs
LLM×MapReduce: Simplified Long-Sequence Processing using Large Language Models
Zihan Zhou
|
Chong Li
|
Xinyi Chen
|
Shuo Wang
|
Yu Chao
|
Zhili Li
|
Haoyu Wang
|
Qi Shi
|
Zhixing Tan
|
Xu Han
|
Xiaodong Shi
|
Zhiyuan Liu
|
Maosong Sun
We propose a training-free framework that enables large language models (LLMs) to effectively process long texts, using a divide-and-conquer strategy for comprehensive document understanding. The proposed LLM×MapReduce framework splits the entire document into several chunks for LLMs to read and then aggregates the intermediate outputs to produce the final response. The main challenge for divide-and-conquer long text processing frameworks lies in the risk of losing essential long-range information due to document splitting, which can lead the model to produce incomplete or incorrect answers based on the segmented texts. Disrupted long-range information can be classified into two categories: inter-chunk dependency and inter-chunk conflict. We design a structured information protocol to better cope with inter-chunk dependency and an in-context confidence calibration mechanism to resolve inter-chunk conflicts. Experiments demonstrate that LLM×MapReduce outperforms representative open-source and commercial long-context LLMs and is compatible with several models. Our framework can also function as a data synthesis engine, capable of generating high-quality long-alignment data using only short-context LLMs.
pdf
bib
abs
CheXalign: Preference fine-tuning in chest X-ray interpretation models without human feedback
Dennis Hein
|
Zhihong Chen
|
Sophie Ostmeier
|
Justin Xu
|
Maya Varma
|
Eduardo Pontes Reis
|
Arne Edward Michalson Md
|
Christian Bluethgen
|
Hyun Joo Shin
|
Curtis Langlotz
|
Akshay S Chaudhari
Radiologists play a crucial role in translating medical images into actionable reports. However, the field faces staffing shortages and increasing workloads. While automated approaches using vision-language models (VLMs) show promise as assistants, they require exceptionally high accuracy. Most current VLMs in radiology rely solely on supervised fine-tuning. Meanwhile, additional preference fine-tuning in the post-training pipeline has become standard practice in the general domain. The challenge in radiology lies in the prohibitive cost of obtaining radiologist feedback at scale. To address this challenge, we propose an automated pipeline for preference feedback, focusing on chest X-ray radiology report generation (RRG). Specifically, our method leverages publicly available datasets containing pairs of images and radiologist-written reference reports with reference-based metrics, or Judges, eliminating the need for *additional radiologist feedback*. We investigate reward overoptimization via length exploitation in this setting and introduce a length-controlled version of the GREEN score. Our best-performing setup achieves state-of-the-art CheXbert scores on the MIMIC-CXR dataset for the RRG task while on average maintaining robust performance across six additional image perception and reasoning tasks.
pdf
bib
abs
Knowledge Tracing in Programming Education Integrating Students’ Questions
Doyoun Kim
|
Suin Kim
|
Yohan Jo
Knowledge tracing (KT) in programming education presents unique challenges due to the complexity of coding tasks and the diverse methods students use to solve problems. Although students’ questions often contain valuable signals about their understanding and misconceptions, traditional KT models often neglect to incorporate these questions as inputs to address these challenges. This paper introduces SQKT (Students’ Question-based Knowledge Tracing), a knowledge tracing model that leverages students’ questions and automatically extracted skill information to enhance the accuracy of predicting students’ performance on subsequent problems in programming education. Our method creates semantically rich embeddings that capture not only the surface-level content of the questions but also the student’s mastery level and conceptual understanding. Experimental results demonstrate SQKT’s superior performance in predicting student completion across various Python programming courses of differing difficulty levels. In in-domain experiments, SQKT achieved a 33.1% absolute improvement in AUC compared to baseline models. The model also exhibited robust generalization capabilities in cross-domain settings, effectively addressing data scarcity issues in advanced programming courses. SQKT can be used to tailor educational content to individual learning needs and design adaptive learning systems in computer science education.
pdf
bib
abs
PRISM: A Framework for Producing Interpretable Political Bias Embeddings with Political-Aware Cross-Encoder
Yiqun Sun
|
Qiang Huang
|
Anthony Kum Hoe Tung
|
Jun Yu
Semantic Text Embedding is a fundamental NLP task that encodes textual content into vector representations, where proximity in the embedding space reflects semantic similarity. While existing embedding models excel at capturing general meaning, they often overlook ideological nuances, limiting their effectiveness in tasks that require an understanding of political bias. To address this gap, we introduce PRISM, the first framework designed to Produce inteRpretable polItical biaS eMbeddings. PRISM operates in two key stages: (1) Controversial Topic Bias Indicator Mining, which systematically extracts fine-grained political topics and corresponding bias indicators from weakly labeled news data, and (2) Cross-Encoder Political Bias Embedding, which assigns structured bias scores to news articles based on their alignment with these indicators. This approach ensures that embeddings are explicitly tied to bias-revealing dimensions, enhancing both interpretability and predictive power. Through extensive experiments on large-scale datasets, we demonstrate that PRISM outperforms state-of-the-art text embedding models in political bias classification while offering highly interpretable representations that facilitate diversified retrieval and ideological analysis. The source code is available at https://anonymous.4open.science/r/PRISM-80B4/.
pdf
bib
abs
Representations of Fact, Fiction and Forecast in Large Language Models: Epistemics and Attitudes
Meng Li
|
Michael Vrazitulis
|
David Schlangen
Rational speakers are supposed to know what they know and what they do not know, and to generate expressions matching the strength of evidence. In contrast, it is still a challenge for current large language models to generate corresponding utterances based on the assessment of facts and confidence in an uncertain real-world environment. While it has recently become popular to estimate and calibrate confidence of LLMs with verbalized uncertainty, what is lacking is a careful examination of the linguistic knowledge of uncertainty encoded in the latent space of LLMs. In this paper, we draw on typological frameworks of epistemic expressions to evaluate LLMs’ knowledge of epistemic modality, using controlled stories. Our experiments show that the performance of LLMs in generating epistemic expressions is limited and not robust, and hence the expressions of uncertainty generated by LLMs are not always reliable. To build uncertainty-aware LLMs, it is necessary to enrich semantic knowledge of epistemic modality in LLMs.
pdf
bib
abs
Lexical Diversity-aware Relevance Assessment for Retrieval-Augmented Generation
Zhange Zhang
|
Yuqing Ma
|
Yulong Wang
|
Shan He
|
Tianbo Wang
|
Siqi He
|
Jiakai Wang
|
Xianglong Liu
Retrieval-Augmented Generation (RAG) has proven effective in enhancing the factuality of LLMs’ generation, making it a focal point of research. However, previous RAG approaches overlook the lexical diversity of queries, hindering their ability to achieve a granular relevance assessment between queries and retrieved documents, resulting in suboptimal performance. In this paper, we introduce a Lexical Diversity-aware RAG (DRAG) method to address the biases in relevant information retrieval and utilization induced by lexical diversity. Specifically, a Diversity-sensitive Relevance Analyzer is proposed to decouple and assess the relevance of different query components (words, phrases) based on their levels of lexical diversity, ensuring precise and comprehensive document retrieval. Moreover, a Risk-guided Sparse Calibration strategy is introduced to calibrate the generated tokens that are heavily affected by irrelevant content. Through these modules, DRAG is capable of effectively retrieving relevant documents and leveraging their pertinent knowledge to refine the original results and generate meaningful outcomes. Extensive experiments on widely used benchmarks demonstrate the efficacy of our approach, yielding a 10.6% accuracy improvement on HotpotQA.
pdf
bib
abs
Weaving Context Across Images: Improving Vision-Language Models through Focus-Centric Visual Chains
Juntian Zhang
|
Chuanqi Cheng
|
Yuhan Liu
|
Wei Liu
|
Jian Luan
|
Rui Yan
Vision-language models (VLMs) achieve remarkable success in single-image tasks. However, real-world scenarios often involve intricate multi-image inputs, leading to a notable performance decline as models struggle to disentangle critical information scattered across complex visual features. In this work, we propose Focus-Centric Visual Chain, a novel paradigm that enhances VLMs’ perception, comprehension, and reasoning abilities in multi-image scenarios. To facilitate this paradigm, we propose Focus-Centric Data Synthesis, a scalable bottom-up approach for synthesizing high-quality data with elaborate reasoning paths. Through this approach, we construct VISC-150K, a large-scale dataset with reasoning data in the form of Focus-Centric Visual Chain, specifically designed for multi-image tasks. Experimental results on seven multi-image benchmarks demonstrate that our method achieves average performance gains of 3.16% and 2.24% across two distinct model architectures, without compromising the general vision-language capabilities. Our study represents a significant step toward more robust and capable vision-language systems that can handle complex visual scenarios.
pdf
bib
abs
Online Iterative Self-Alignment for Radiology Report Generation
Ting Xiao
|
Lei Shi
|
Yang Zhang
|
HaoFeng Yang
|
Zhe Wang
|
Chenjia Bai
Radiology Report Generation (RRG) is an important research topic for relieving radiologists’ heavy workload. Existing RRG models mainly rely on supervised fine-tuning (SFT) based on different model architectures using data pairs of radiological images and corresponding radiologist-annotated reports. Recent research has shifted focus to post-training improvements, aligning RRG model outputs with human preferences using reinforcement learning (RL). However, the limited data coverage of high-quality annotated data poses risks of overfitting and generalization. This paper proposes a novel Online Iterative Self-Alignment (OISA) method for RRG that consists of four stages: self-generation of diverse data, self-evaluation for multi-objective preference data, self-alignment for multi-objective optimization and self-iteration for further improvement. Our approach allows for generating varied reports tailored to specific clinical objectives, enhancing the overall performance of the RRG model iteratively. Unlike existing methods, our framework significantly increases data quality and optimizes performance through iterative multi-objective optimization. Experimental results demonstrate that our method surpasses previous approaches, achieving state-of-the-art performance across multiple evaluation metrics.
pdf
bib
abs
Chinese Inertial GAN for Handwriting Signal Generation and Recognition
Yifeng Wang
|
Yi Zhao
Keyboard-based interaction may not accommodate various needs, especially for individuals with disabilities. While inertial sensor-based writing recognition is promising due to the sensors’ small size, wearability, and low cost, accurate recognition in the Chinese context is hampered by the difficulty of collecting extensive inertial signal samples for the vast number of characters. Therefore, we design a Chinese Inertial GAN (CI-GAN) containing Chinese glyph encoding (CGE), forced optimal transport (FOT), and semantic relevance alignment (SRA) to acquire unlimited high-quality training samples. Unlike existing vectorization methods focusing on the meaning of Chinese characters, CGE represents shape and stroke features, providing glyph guidance for writing signal generation. FOT establishes a triple-consistency constraint between the input prompt, output signal features, and real signal features, ensuring the authenticity and semantic accuracy of the generated signals. SRA aligns semantic relationships between multiple outputs and their input prompts, ensuring that similar inputs correspond to similar outputs (and vice versa), alleviating model hallucination. The three modules guide the generator while also interacting with each other, forming a coupled system. By utilizing the massive training samples provided by CI-GAN, the performance of six widely used classifiers is improved from 6.7% to 98.4%, indicating that CI-GAN constructs a flexible and efficient data platform for Chinese inertial writing recognition. Furthermore, we release the first Chinese inertial writing dataset on GitHub.
pdf
bib
abs
LLMs Caught in the Crossfire: Malware Requests and Jailbreak Challenges
Haoyang Li
|
Huan Gao
|
Zhiyuan Zhao
|
Zhiyu Lin
|
Junyu Gao
|
Xuelong Li
The widespread adoption of Large Language Models (LLMs) has heightened concerns about their security, particularly their vulnerability to jailbreak attacks that leverage crafted prompts to generate malicious outputs. While prior research has been conducted on general security capabilities of LLMs, their specific susceptibility to jailbreak attacks in code generation remains largely unexplored. To fill this gap, we propose MalwareBench, a benchmark dataset containing 3,520 jailbreaking prompts for malicious code-generation, designed to evaluate LLM robustness against such threats. MalwareBench is based on 320 manually crafted malicious code generation requirements, covering 11 jailbreak methods and 29 code functionality categories. Experiments show that mainstream LLMs exhibit limited ability to reject malicious code-generation requirements, and the combination of multiple jailbreak methods further reduces the model’s security capabilities: specifically, the average rejection rate for malicious content is 60.93%, dropping to 39.92% when combined with jailbreak attack algorithms. Our work highlights that the code security capabilities of LLMs still pose significant challenges.
pdf
bib
abs
Evaluating Sequence Labeling on the basis of Information Theory
Enrique Amigo
|
Elena Álvarez-Mellado
|
Julio Gonzalo
|
Jorge Carrillo-de-Albornoz
Various metrics exist for evaluating sequence labeling problems (strict span matching, token-oriented metrics, token concurrence in sequences, etc.), each of them focusing on certain aspects of the task. In this paper, we define a comprehensive set of formal properties that captures the strengths and weaknesses of the existing metric families and prove that none of them is able to satisfy all properties simultaneously. We argue that it is necessary to measure how much information (correct or noisy) each token in the sequence contributes depending on different aspects such as sequence length, number of tokens annotated by the system, token specificity, etc. On this basis, we introduce the Sequence Labelling Information Contrast Model (SL-ICM), a novel metric based on information theory for evaluating sequence labeling tasks. Our formal analysis and experimentation show that the proposed metric satisfies all properties simultaneously.
pdf
bib
abs
GRAT: Guiding Retrieval-Augmented Reasoning through Process Rewards Tree Search
Xianshu Peng
|
Wei Wei
Enhancing large models for complex multi-hop question answering has become a research focus in retrieval-augmented generation (RAG). Many existing approaches aim to mimic human thought processes by enabling large models to perform retrieval-augmented generation step by step. However, these methods can only perform single-chain reasoning, which lacks the ability for multi-path exploration, strategic look-ahead, stepwise evaluation, and global selection. In addition, to effectively decompose complex problems, these methods must rely on labor-intensive intermediate annotations for supervised fine-tuning. To address these issues, we propose GRAT, an algorithm guided by Monte Carlo Tree Search (MCTS) and process rewards. GRAT not only enables self-evaluation and self-correction but also assigns fine-grained rewards to each intermediate step in the search path. These fine-grained annotations can be used for model self-training, which enables GRAT to continuously self-update its problem analysis and reasoning capabilities. We conducted experiments on four multi-hop QA datasets: HotPotQA, 2WikiMultiHopQA, MuSiQue, and Bamboogle, demonstrating that GRAT outperforms various RAG-based methods. Additionally, incorporating self-training significantly enhances GRAT’s reasoning performance.
pdf
bib
abs
T-REG: Preference Optimization with Token-Level Reward Regularization
Wenxuan Zhou
|
Shujian Zhang
|
Lingxiao Zhao
|
Tao Meng
Reinforcement Learning from Human Feedback (RLHF) has been pivotal in enabling Large Language Models (LLMs) to effectively follow instructions and produce meaningful alignment by leveraging human preference data. Traditionally, RLHF involves generating responses to a query and using a separate reward model to assign a score to the entire completion. This approach, however, presents challenges, as it provides a single, sparse reward at the end of a sequence, making optimization difficult for the model, in which both training and generation occur auto-regressively at token levels. While recent methods have attempted to address this by assigning token-level discrete or continuous rewards, these often rely on either a trained credit assignment model or AI annotators, which raises concerns about the quality and reliability of the token-level rewards. In this paper, we propose T-REG, which utilizes both sequence-level and token-level rewards for preference optimization. T-REG employs self-generated token-level rewards, derived through opposite prompting, as a weak supervision signal to guide the model in distributing sequence-level rewards at the token level, thereby achieving more effective token-level credit assignment and improving alignment performance. Experiments on the instruction following benchmarks, including Alpaca Eval 2 and Arena-Hard, show that our method consistently outperforms baseline methods by up to 3.8% and 4.4%, respectively.
pdf
bib
abs
Gödel Agent: A Self-Referential Agent Framework for Recursively Self-Improvement
Xunjian Yin
|
Xinyi Wang
|
Liangming Pan
|
Li Lin
|
Xiaojun Wan
|
William Yang Wang
The rapid advancement of large language models (LLMs) has significantly enhanced the capabilities of agents across various tasks. However, existing agentic systems, whether based on fixed pipeline algorithms or pre-defined meta-learning frameworks, cannot search the whole agent design space due to the restriction of human-designed components, and thus might miss the more optimal agent design. In this paper, we introduce Gödel Agent, a self-evolving framework inspired by the Gödel Machine, enabling agents to recursively improve themselves without relying on predefined routines or fixed optimization algorithms. Gödel Agent leverages LLMs to dynamically modify its own logic and behavior, guided solely by high-level objectives through prompting. Experimental results on multiple domains demonstrate that the implementation of Gödel Agent can achieve continuous self-improvement, surpassing manually crafted agents in performance, efficiency, and generalizability.
pdf
bib
abs
AgentGym: Evaluating and Training Large Language Model-based Agents across Diverse Environments
Zhiheng Xi
|
Yiwen Ding
|
Wenxiang Chen
|
Boyang Hong
|
Honglin Guo
|
Junzhe Wang
|
Xin Guo
|
Dingwen Yang
|
Chenyang Liao
|
Wei He
|
Songyang Gao
|
Lu Chen
|
Rui Zheng
|
Yicheng Zou
|
Tao Gui
|
Qi Zhang
|
Xipeng Qiu
|
Xuanjing Huang
|
Zuxuan Wu
|
Yu-Gang Jiang
Large language models (LLMs) have emerged as a promising foundation to build generally-capable agents (LLM-based agents) that can handle multi-turn decision-making tasks across various environments. However, the community lacks a unified interactive framework that covers diverse environments for comprehensive evaluation of agents, and enables exploration and learning for their self-improvement. To address this, we propose AgentGym, a framework featuring 7 real-world scenarios, 14 environments, and 89 tasks for unified, real-time, and concurrent agent interaction. We construct an expanded instruction set, high-quality trajectories, and a comprehensive benchmarking suite for developing LLM-based agents. Moreover, AgentGym supports interactive exploration and learning for agents through multi-turn interactions and real-time feedback. Based on AgentGym, we take the initial step to develop LLM-based agents that can handle diverse tasks via methods like self-improvement or reinforcement learning. Experimental results show that the trained agents can achieve results comparable to commercial models. We hope our work can help the community develop more advanced LLM-based agents. We release the code, dataset, benchmark, and checkpoints at https://agentgym.github.io/.
pdf
bib
abs
Rethinking the Role of Prompting Strategies in LLM Test-Time Scaling: A Perspective of Probability Theory
Yexiang Liu
|
Zekun Li
|
Zhi Fang
|
Nan Xu
|
Ran He
|
Tieniu Tan
Recently, scaling test-time compute on Large Language Models (LLMs) has garnered wide attention. However, there has been limited investigation of how various reasoning prompting strategies perform as compute scales. In this paper, we focus on a standard and realistic scaling setting: majority voting. We systematically conduct experiments on 6 LLMs × 8 prompting strategies × 6 benchmarks. Experimental results consistently show that as the sampling time and computational overhead increase, complicated prompting strategies with superior initial performance gradually fall behind simple Chain-of-Thought. We analyze this phenomenon and provide theoretical proofs. Additionally, we propose a probabilistic method to efficiently predict scaling performance and identify the best prompting strategy under large sampling times, eliminating the need for resource-intensive inference processes in practical applications. Furthermore, we introduce two ways derived from our theoretical analysis to significantly improve the scaling performance. We hope that our research can prompt a re-examination of the role of complicated prompting, unleash the potential of simple prompting strategies, and provide new insights for enhancing test-time scaling performance. Code is available at https://github.com/MraDonkey/rethinking_prompting.
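As a rough illustration of the majority-voting setting named in this abstract, the sketch below aggregates sampled answers by frequency and, for a simplified two-answer case, computes how often the vote is correct. The function names and the two-answer, independent-samples model are assumptions made here for illustration; they are not taken from the paper or its repository.

```python
from collections import Counter
from math import comb


def majority_vote(answers):
    """Return the most frequent answer; ties go to the answer seen first."""
    return Counter(answers).most_common(1)[0][0]


def binary_majority_accuracy(p, n):
    """Probability that a majority of n independent samples is correct.

    Assumes (for illustration only) exactly two possible answers, an odd n,
    and a fixed per-sample accuracy p.
    """
    return sum(
        comb(n, k) * p**k * (1 - p) ** (n - k)
        for k in range(n // 2 + 1, n + 1)
    )


if __name__ == "__main__":
    print(majority_vote(["12", "15", "12"]))
    # Under this toy model, voting helps when p > 0.5 and hurts when p < 0.5.
    for p in (0.6, 0.4):
        print(p, [round(binary_majority_accuracy(p, n), 3) for n in (1, 3, 9)])
```

In this toy model, more samples amplify whichever answer is already more likely per question, which gives one intuition for why a strategy's single-sample accuracy need not predict its accuracy under heavy sampling.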
pdf
bib
abs
Information Locality as an Inductive Bias for Neural Language Models
Taiga Someya
|
Anej Svete
|
Brian DuSell
|
Timothy J. O’Donnell
|
Mario Giulianelli
|
Ryan Cotterell
Inductive biases are inherent in every machine learning system, shaping how models generalize from finite data. In the case of neural language models (LMs), debates persist as to whether these biases align with or diverge from human processing constraints. To address this issue, we propose a quantitative framework that allows for controlled investigations into the nature of these biases. Within our framework, we introduce m-local entropy—an information-theoretic measure derived from average lossy-context surprisal—that captures the local uncertainty of a language by quantifying how effectively the preceding symbols disambiguate the next symbol. In experiments on both perturbed natural language corpora and languages defined by probabilistic finite-state automata (PFSA), we show that languages with higher m-local entropy are more difficult for Transformer and LSTM LMs to learn. These results suggest that neural LMs, much like humans, are highly sensitive to the local statistical structure of a language.
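To make the idea of local uncertainty concrete, here is a minimal sketch of a plug-in estimate of the conditional entropy of the next symbol given the preceding m symbols. It is an assumption-laden stand-in, not the paper's definition of m-local entropy (which the abstract derives from average lossy-context surprisal); the function name and the count-based estimator are illustrative choices.

```python
from collections import Counter, defaultdict
from math import log2


def local_conditional_entropy(sequence, m):
    """Empirical H(next symbol | previous m symbols), in bits.

    Plug-in estimate from n-gram counts over a single sequence; this is an
    illustrative proxy, not the measure defined in the paper.
    """
    context_counts = defaultdict(Counter)
    for t in range(m, len(sequence)):
        context = tuple(sequence[t - m:t])
        context_counts[context][sequence[t]] += 1

    total = sum(sum(c.values()) for c in context_counts.values())
    entropy = 0.0
    for counts in context_counts.values():
        n_ctx = sum(counts.values())
        for n in counts.values():
            # Joint probability of (context, symbol) times the surprisal
            # of the symbol given the context.
            entropy -= (n / total) * log2(n / n_ctx)
    return entropy


if __name__ == "__main__":
    text = "abababab"
    print(local_conditional_entropy(text, 0))  # no context
    print(local_conditional_entropy(text, 1))  # one symbol of context
```

For the alternating string above, one symbol of context fully determines the next symbol, so the estimate drops from one bit to zero.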
pdf
bib
abs
Learning to Reason Over Time: Timeline Self-Reflection for Improved Temporal Reasoning in Language Models
Adrián Bazaga
|
Rexhina Blloshmi
|
Bill Byrne
|
Adrià de Gispert
Large Language Models (LLMs) have emerged as powerful tools for generating coherent text, understanding context, and performing reasoning tasks. However, they struggle with temporal reasoning, which requires processing time-related information such as event sequencing, durations, and inter-temporal relationships. These capabilities are critical for applications including question answering, scheduling, and historical analysis. In this paper, we introduce TISER, a novel framework that enhances the temporal reasoning abilities of LLMs through a multi-stage process that combines timeline construction with iterative self-reflection. Our approach leverages test-time scaling to extend the length of reasoning traces, enabling models to capture complex temporal dependencies more effectively. This strategy not only boosts reasoning accuracy but also improves the traceability of the inference process. Experimental results demonstrate state-of-the-art performance across multiple benchmarks, including out-of-distribution test sets, and reveal that TISER enables smaller open-source models to surpass larger closed-weight models on challenging temporal reasoning tasks.
pdf
bib
abs
Query-driven Document-level Scientific Evidence Extraction from Biomedical Studies
Massimiliano Pronesti
|
Joao H Bettencourt-Silva
|
Paul Flanagan
|
Alessandra Pascale
|
Oisín Redmond
|
Anya Belz
|
Yufang Hou
Extracting scientific evidence from biomedical studies for clinical research questions (e.g., Does stem cell transplantation improve quality of life in patients with medically refractory Crohn’s disease compared to placebo?) is a crucial step in synthesising biomedical evidence. In this paper, we focus on the task of document-level scientific evidence extraction for clinical questions with conflicting evidence. To support this task, we create a dataset called CochraneForest leveraging forest plots from Cochrane systematic reviews. It comprises 202 annotated forest plots, associated clinical research questions, full texts of studies, and study-specific conclusions. Building on CochraneForest, we propose URCA (Uniform Retrieval Clustered Augmentation), a retrieval-augmented generation framework designed to tackle the unique challenges of evidence extraction. Our experiments show that URCA outperforms the best existing methods by up to 10.3% in F1 score on this task. However, the results also underscore the complexity of CochraneForest, establishing it as a challenging testbed for advancing automated evidence synthesis systems.
pdf
bib
abs
Towards Robust Universal Information Extraction: Dataset, Evaluation, and Solution
Jizhao Zhu
|
Akang Shi
|
Zixuan Li
|
Long Bai
|
Xiaolong Jin
|
Jiafeng Guo
|
Xueqi Cheng
In this paper, we aim to enhance the robustness of Universal Information Extraction (UIE) by introducing a new benchmark dataset, a comprehensive evaluation, and a feasible solution. Existing robust benchmark datasets have two key limitations: 1) They generate only a limited range of perturbations for a single Information Extraction (IE) task, which fails to evaluate the robustness of UIE models effectively; 2) They rely on small models or handcrafted rules to generate perturbations, often resulting in unnatural adversarial examples. Considering the powerful generation capabilities of Large Language Models (LLMs), we introduce a new benchmark dataset for Robust UIE, called RUIE-Bench, which utilizes LLMs to generate more diverse and realistic perturbations across different IE tasks. Based on this dataset, we comprehensively evaluate existing UIE models and reveal that both LLM-based models and other models suffer from significant performance drops. To improve robustness and reduce training costs, we propose a data-augmentation solution that dynamically selects hard samples for iterative training based on the model’s inference loss. Experimental results show that training with only 15% of the data leads to an average 8.1% relative performance improvement across three IE tasks. Our code and dataset are available at: https://github.com/ICT-GoKnow/RobustUIE.
pdf
bib
abs
Multi-perspective Alignment for Increasing Naturalness in Neural Machine Translation
Huiyuan Lai
|
Esther Ploeger
|
Rik Van Noord
|
Antonio Toral
Neural machine translation (NMT) systems amplify lexical biases present in their training data, leading to artificially impoverished language in output translations. These language-level characteristics render automatic translations different from both text originally written in a language and human translations, which hinders their usefulness in, for example, creating evaluation datasets. Attempts to increase naturalness in NMT can fall short in terms of content preservation, where increased lexical diversity comes at the cost of translation accuracy. Inspired by the reinforcement learning from human feedback framework, we introduce a novel method that rewards both naturalness and content preservation. We experiment with multiple perspectives to produce more natural translations, aiming at reducing machine and human translationese. We evaluate our method on English-to-Dutch literary translation, and find that our best model produces translations that are lexically richer and exhibit more properties of human-written language, without loss in translation accuracy.
pdf
bib
abs
Temporal reasoning for timeline summarisation in social media
Jiayu Song
|
Mahmud Elahi Akhter
|
Dana Atzil-Slonim
|
Maria Liakata
This paper explores whether enhancing temporal reasoning capabilities in Large Language Models (LLMs) can improve the quality of timeline summarisation, the task of summarising long texts containing sequences of events, such as social media threads. We first introduce NarrativeReason, a novel dataset focused on temporal relationships among sequential events within narratives, distinguishing it from existing temporal reasoning datasets that primarily address pair-wise event relationships. Our approach then combines temporal reasoning with timeline summarisation through a knowledge distillation framework, where we first fine-tune a teacher model on temporal reasoning tasks and then distill this knowledge into a student model while simultaneously training it for the task of timeline summarisation. Experimental results demonstrate that our model achieves superior performance on out-of-domain mental health-related timeline summarisation tasks, which involve long social media threads with repetitions of events and a mix of emotions, highlighting the importance and generalisability of leveraging temporal reasoning to improve timeline summarisation.
pdf
bib
abs
Beyond Negative Stereotypes – Non-Negative Abusive Utterances about Identity Groups and Their Semantic Variants
Tina Lommel
|
Elisabeth Eder
|
Josef Ruppenhofer
|
Michael Wiegand
We study a subtype of implicitly abusive language, namely non-negative sentences about identity groups (e.g. “Women make good cooks”), and introduce a novel dataset of such utterances. Not only do we profile such abusive sentences, but since our dataset includes different semantic variants of the same characteristic attributed to an identity group, we can also systematically study the impact of varying degrees of generalization and perspective framing. Similarly, we switch identity groups to assess whether the characteristic described in a sentence is inherently abusive. We also report on classification experiments.
pdf
bib
abs
Persistent Homology of Topic Networks for the Prediction of Reader Curiosity
Manuel D.s. Hopp
|
Vincent Labatut
|
Arthur Amalvy
|
Richard Dufour
|
Hannah Stone
|
Hayley K Jach
|
Kou Murayama
Reader curiosity, the drive to seek information, is crucial for textual engagement, yet remains relatively underexplored in NLP. Building on Loewenstein’s Information Gap Theory, we introduce a framework that models reader curiosity by quantifying semantic information gaps within a text’s semantic structure. Our approach leverages BERTopic-inspired topic modeling and persistent homology to analyze the evolving topology (connected components, cycles, voids) of a dynamic semantic network derived from text segments, treating these features as proxies for information gaps. To empirically evaluate this pipeline, we collect reader curiosity ratings from participants (*n* = 49) as they read S. Collins’s “The Hunger Games” novel. We then use the topological features from our pipeline as independent variables to predict these ratings, and experimentally show that they significantly improve curiosity prediction compared to a baseline model (73% vs. 30% explained deviance), validating our approach. This pipeline offers a new computational method for analyzing text structure and its relation to reader engagement.
pdf
bib
abs
Tokenisation is NP-Complete
Philip Whittington
|
Gregor Bachmann
|
Tiago Pimentel
In this work, we prove the NP-completeness of two variants of tokenisation, defined here as the problem of compressing a dataset to at most 𝛿 symbols by either finding a vocabulary directly (_direct_ tokenisation), or selecting a sequence of merge operations (_bottom-up_ tokenisation).
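The bottom-up variant described in this abstract can be illustrated with a small sketch that applies a given sequence of merge operations and checks whether the dataset fits within 𝛿 symbols. The helper names and the left-to-right merge convention are assumptions made here; the sketch only verifies a candidate merge sequence and does not search for one.

```python
def apply_merges(text, merges):
    """Apply merge operations in order, scanning left to right each time.

    `merges` is a list of (left, right) symbol pairs; the left-to-right,
    non-overlapping convention is an assumption of this sketch.
    """
    symbols = list(text)
    for left, right in merges:
        merged, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and symbols[i] == left and symbols[i + 1] == right:
                merged.append(left + right)
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols


def compressed_within(dataset, merges, delta):
    """Check whether the merges compress the dataset to at most delta symbols."""
    return sum(len(apply_merges(text, merges)) for text in dataset) <= delta


if __name__ == "__main__":
    data = ["abab", "abc"]
    print(apply_merges("abab", [("a", "b"), ("ab", "ab")]))
    print(compressed_within(data, [("a", "b")], delta=4))
```

Checking a proposed merge sequence like this is cheap; the hardness result concerns finding a sequence that meets the budget.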
pdf
bib
abs
Training Dynamics Underlying Language Model Scaling Laws: Loss Deceleration and Zero-Sum Learning
Andrei Mircea
|
Supriyo Chakraborty
|
Nima Chitsazan
|
Irina Rish
|
Ekaterina Lobacheva
This work aims to understand how scaling improves language models, specifically in terms of training dynamics. We find that language models undergo loss deceleration early in training—an abrupt slowdown in the rate of loss improvement, resulting in piecewise linear behaviour of the loss curve in log-log space. Scaling up the model mitigates this transition by (1) decreasing the loss at which deceleration occurs, and (2) improving the log-log rate of loss improvement after deceleration. We attribute loss deceleration to a type of degenerate training dynamics we term zero-sum learning (ZSL). In ZSL, per-example gradients become systematically opposed, leading to destructive interference in per-example changes in loss. As a result, improving loss on one subset of examples degrades it on another, bottlenecking overall progress. Loss deceleration and ZSL provide new insights into the training dynamics underlying language model scaling laws, and could potentially be targeted directly to improve language models independent of scale. We make our code and artefacts available at: https://github.com/mirandrom/zsl
pdf
bib
abs
Parameter-Aware Contrastive Knowledge Editing: Tracing and Rectifying based on Critical Transmission Paths
Songlin Zhai
|
Yuan Meng
|
Yuxin Zhang
|
Guilin Qi
Large language models (LLMs) have encoded vast amounts of knowledge in their parameters, but the acquired knowledge can sometimes be incorrect or outdated over time, necessitating rectification after pre-training. Traditional localized methods in knowledge-based model editing (KME) typically assume that knowledge is stored in particular intermediate layers. However, recent research suggests that these methods do not identify the optimal locations for parameter editing, as knowledge gradually accumulates across all layers in LLMs during the forward pass rather than being stored in specific layers. This paper, for the first time, introduces the concept of critical transmission paths into KME for parameter updating. Specifically, these paths capture the key information flows that significantly influence the model predictions for the editing process. To facilitate this process, we also design a parameter-aware contrastive rectifying algorithm that considers less important paths as contrastive examples. Experiments on two prominent datasets and three widely used LLMs demonstrate the superiority of our method in editing performance.
pdf
bib
abs
Many Heads Are Better Than One: Improved Scientific Idea Generation by A LLM-Based Multi-Agent System
Haoyang Su
|
Renqi Chen
|
Shixiang Tang
|
Zhenfei Yin
|
Xinzhe Zheng
|
Jinzhe Li
|
Biqing Qi
|
Qi Wu
|
Hui Li
|
Wanli Ouyang
|
Philip Torr
|
Bowen Zhou
|
Nanqing Dong
The rapid advancement of scientific progress requires innovative tools that can accelerate knowledge discovery. Although recent AI methods, particularly large language models (LLMs), have shown promise in tasks such as hypothesis generation and experimental design, they fall short of replicating the collaborative nature of real-world scientific practices, where diverse experts work together in teams to tackle complex problems. To address the limitations, we propose an LLM-based multi-agent system, i.e., Virtual Scientists (VIRSCI), designed to mimic the teamwork inherent in scientific research. VIRSCI organizes a team of agents to collaboratively generate, evaluate, and refine research ideas. Through comprehensive experiments, we demonstrate that this multi-agent approach outperforms the state-of-the-art method in producing novel scientific ideas. We further investigate the collaboration mechanisms that contribute to its tendency to produce ideas with higher novelty, offering valuable insights to guide future research and illuminating pathways toward building a robust system for autonomous scientific discovery. The code is available at https://github.com/open-sciencelab/Virtual-Scientists.
pdf
bib
abs
Inner Thinking Transformer: Leveraging Dynamic Depth Scaling to Foster Adaptive Internal Thinking
Yilong Chen
|
Junyuan Shang
|
Zhenyu Zhang
|
Yanxi Xie
|
Jiawei Sheng
|
Tingwen Liu
|
Shuohuan Wang
|
Yu Sun
|
Hua Wu
|
Haifeng Wang
Large language models (LLMs) face inherent performance bottlenecks under parameter constraints, particularly in processing critical tokens that demand complex reasoning. Empirical analysis reveals challenging tokens induce abrupt gradient spikes across layers, exposing architectural stress points in standard Transformers. Building on this insight, we propose Inner Thinking Transformer (ITT), which reimagines layer computations as implicit thinking steps. ITT dynamically allocates computation through Adaptive Token Routing, iteratively refines representations via Residual Thinking Connections, and distinguishes reasoning phases using Thinking Step Encoding. ITT enables deeper processing of critical tokens without parameter expansion. Evaluations across 162M-466M parameter models show ITT achieves 96.5% performance of a 466M Transformer using only 162M parameters, reduces training data by 43.2%, and outperforms Transformer/Loop variants in 11 benchmarks. By enabling elastic computation allocation during inference, ITT balances performance and efficiency through architecture-aware optimization of implicit thinking pathways.
pdf
bib
abs
Document-Level Text Generation with Minimum Bayes Risk Decoding using Optimal Transport
Yuu Jinnai
Document-level text generation tasks are known to be more difficult than sentence-level text generation tasks as they require an understanding of longer context to generate high-quality texts. In this paper, we investigate the adaptation of Minimum Bayes Risk (MBR) decoding for document-level text generation tasks. MBR decoding makes use of a utility function to estimate the output with the highest expected utility from a set of candidate outputs. Although MBR decoding is shown to be effective in a wide range of sentence-level text generation tasks, its performance on document-level text generation tasks is limited, as many of the utility functions are designed for evaluating the utility of sentences. To this end, we propose MBR-OT, a variant of MBR decoding using Wasserstein distance to compute the utility of a document using a sentence-level utility function. The experimental result shows that the performance of MBR-OT outperforms that of the standard MBR in document-level machine translation, text simplification, and dense image captioning tasks.
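For readers unfamiliar with MBR decoding, the following is a minimal sketch of the standard sample-based procedure the abstract summarizes: each candidate is scored by its average utility against the other candidates, which serve as pseudo-references. The `token_overlap` utility is a hypothetical stand-in chosen to keep the example self-contained, and the Wasserstein-distance aggregation that distinguishes MBR-OT is not shown.

```python
def mbr_decode(candidates, utility):
    """Return the candidate with the highest average utility.

    The other candidates act as pseudo-references; requires at least two
    candidates. This is plain MBR, not the MBR-OT variant.
    """
    def expected_utility(i):
        others = [ref for j, ref in enumerate(candidates) if j != i]
        return sum(utility(candidates[i], ref) for ref in others) / len(others)

    best = max(range(len(candidates)), key=expected_utility)
    return candidates[best]


def token_overlap(hypothesis, reference):
    """Jaccard overlap of whitespace tokens (illustrative utility only)."""
    h, r = set(hypothesis.split()), set(reference.split())
    return len(h & r) / len(h | r) if h | r else 1.0


if __name__ == "__main__":
    samples = ["the cat sat", "the cat sat down", "the cat ran"]
    print(mbr_decode(samples, token_overlap))
```

In practice the utility would be a learned or reference-based metric; swapping it in only requires a function with the same two-argument signature.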
pdf
bib
abs
Opt-Out: Investigating Entity-Level Unlearning for Large Language Models via Optimal Transport
Minseok Choi
|
Daniel Rim
|
Dohyun Lee
|
Jaegul Choo
Instruction-following large language models (LLMs), such as ChatGPT, have become widely popular among everyday users. However, these models inadvertently disclose private, sensitive information to their users, underscoring the need for machine unlearning techniques to selectively remove information from the models. While prior work has focused on forgetting small, random subsets of training data at the instance level, we argue that real-world scenarios often require the removal of an entire user's data, which may require a more careful maneuver. In this study, we explore entity-level unlearning, which aims to erase all knowledge related to a target entity while preserving the remaining model capabilities. To address this, we introduce Opt-Out, an optimal transport-based unlearning method that utilizes the Wasserstein distance from the model’s initial parameters to achieve more effective and fine-grained unlearning. We also present the first Entity-Level Unlearning Dataset (ELUDe) designed to evaluate entity-level unlearning. Our empirical results demonstrate that Opt-Out surpasses existing methods, establishing a new standard for secure and adaptable LLMs that can accommodate user data removal requests without the need for full retraining.
pdf
bib
abs
Mixture of Small and Large Models for Chinese Spelling Check
Ziheng Qiao
|
Houquan Zhou
|
Zhenghua Li
In the era of large language models (LLMs), the Chinese Spelling Check (CSC) task has seen various LLM methods developed, yet their performance remains unsatisfactory. In contrast, fine-tuned BERT-based models, relying on high-quality in-domain data, show excellent performance but suffer from edit pattern overfitting. This paper proposes a novel dynamic mixture approach that effectively combines the probability distributions of small models and LLMs during the beam search decoding phase, achieving a balanced enhancement of precise corrections from small models and the fluency of LLMs. This approach also eliminates the need for fine-tuning LLMs, saving significant time and resources, and facilitating domain adaptation. Comprehensive experiments demonstrate that our mixture approach significantly boosts error correction capabilities, achieving state-of-the-art results across multiple datasets. Our code is available at https://github.com/zhqiao-nlp/MSLLM.
pdf
bib
abs
DISC: Plug-and-Play Decoding Intervention with Similarity of Characters for Chinese Spelling Check
Ziheng Qiao
|
Houquan Zhou
|
Yumeng Liu
|
Zhenghua Li
|
Min Zhang
|
Bo Zhang
|
Chen Li
|
Ji Zhang
|
Fei Huang
One key characteristic of the Chinese spelling check (CSC) task is that incorrect characters are usually similar to the correct ones in either phonetics or glyph. To accommodate this, previous works usually leverage confusion sets, which suffer from two problems, i.e., difficulty in determining which character pairs to include and lack of probabilities to distinguish items in the set. In this paper, we propose a light-weight plug-and-play DISC (i.e., decoding intervention with similarity of characters) module for CSC models. DISC measures phonetic and glyph similarities between characters and incorporates this similarity information only during the inference phase. This method can be easily integrated into various existing CSC models, such as ReaLiSe, SCOPE, and ReLM, without additional training costs. Experiments on three CSC benchmarks demonstrate that our proposed method significantly improves model performance, approaching and even surpassing the current state-of-the-art models.
pdf
bib
abs
Causal Estimation of Tokenisation Bias
Pietro Lesci
|
Clara Meister
|
Thomas Hofmann
|
Andreas Vlachos
|
Tiago Pimentel
Modern language models are typically trained over subword sequences, but ultimately define probabilities over character-strings. Ideally, the choice of the tokeniser—which maps character-strings to subwords—should not affect the probability assigned to the underlying character-string; in practice, it does. We define this mismatch as **tokenisation bias**. In this work, we quantify one particular type of tokenisation bias: the effect of including or not a subword (e.g., ⟨ hello ⟩) in a tokeniser’s vocabulary on the probability a trained model assigns to the corresponding characters (i.e., “hello”). Estimating this effect is challenging because each model is trained with only one tokeniser. We address this by framing tokenisation bias as a causal effect and estimating it using the regression discontinuity design. Specifically, we exploit the fact that tokenisation algorithms rank subwords and add the first K to a tokeniser’s vocabulary, where K is an arbitrary cutoff point. As such, we can estimate a causal effect by comparing similar subwords around this cutoff. Experimentally, we find that tokenisation consistently affects models’ outputs across scales, vocabularies, and tokenisers. Notably, a subword’s presence in a small model’s vocabulary may increase its characters’ probability by up to 17 times, highlighting tokenisation as a key design choice in language modelling.
pdf
bib
abs
Value Residual Learning
Zhanchao Zhou
|
Tianyi Wu
|
Zhiyun Jiang
|
Fares Obeid
|
Zhenzhong Lan
While Transformer models have achieved remarkable success in various domains, the effectiveness of information propagation through deep networks remains a critical challenge. Standard hidden state residuals often fail to adequately preserve initial token-level information in deeper layers. This paper introduces ResFormer, a novel architecture that enhances information flow by incorporating value residual connections in addition to hidden state residuals. We also introduce a variant, SVFormer, in which all layers share the first layer’s value embedding. Comprehensive empirical evidence demonstrates that ResFormer achieves equivalent validation loss with 16.11% fewer model parameters and 20.3% less training data compared to Transformer, while maintaining similar memory usage and computational cost. In addition, SVFormer reduces KV cache size by nearly half with only a small performance penalty and can be integrated with other KV-efficient methods, yielding further reductions in KV cache, with performance influenced by sequence length and cumulative learning rate.
pdf
bib
abs
SGIC: A Self-Guided Iterative Calibration Framework for RAG
Guanhua Chen
|
Yutong Yao
|
Lidia S. Chao
|
Xuebo Liu
|
Derek F. Wong
Recent research in retrieval-augmented generation (RAG) has concentrated on retrieving useful information from candidate documents. However, numerous methodologies frequently neglect the calibration capabilities of large language models (LLMs), which capitalize on their robust in-context reasoning prowess. This work illustrates that providing LLMs with specific cues substantially improves their calibration efficacy, especially in multi-round calibrations. We present SGIC, a new Self-Guided Iterative Calibration framework that employs uncertainty scores as a tool. Initially, this framework calculates uncertainty scores to determine both the relevance of each document to the query and the confidence level in the responses produced by the LLMs. Subsequently, it reevaluates these scores iteratively, amalgamating them with prior responses to refine calibration. Furthermore, we introduce an innovative approach for constructing an iterative self-calibration training set, which optimizes LLMs to efficiently harness uncertainty scores for capturing critical information and enhancing response accuracy. Our proposed framework significantly improves performance on both closed-source and open-source LLMs.
pdf
bib
abs
NusaAksara: A Multimodal and Multilingual Benchmark for Preserving Indonesian Indigenous Scripts
Muhammad Farid Adilazuarda
|
Musa Izzanardi Wijanarko
|
Lucky Susanto
|
Khumaisa Nur’aini
|
Derry Tanti Wijaya
|
Alham Fikri Aji
Indonesia is rich in languages and scripts. However, most NLP progress has been made using romanized text. In this paper, we present NusaAksara, a novel public benchmark for Indonesian languages that includes their original scripts. Our benchmark covers both text and image modalities and encompasses diverse tasks such as image segmentation, OCR, transliteration, translation, and language identification. Our data is constructed by human experts through rigorous steps. NusaAksara covers 8 scripts across 7 languages, including low-resource languages not commonly seen in NLP benchmarks. Although unsupported by Unicode, the Lampung script is included in this dataset. We benchmark our data across several models, from LLMs and VLMs such as GPT-4o, Llama 3.2, and Aya 23 to task-specific systems such as PP-OCR and LangID, and show that most NLP technologies cannot handle Indonesia’s local scripts, with many achieving near-zero performance.
pdf
bib
abs
LLM-based Rumor Detection via Influence Guided Sample Selection and Game-based Perspective Analysis
Zhiliang Tian
|
Jingyuan Huang
|
Zejiang He
|
Zhen Huang
|
Menglong Lu
|
Linbo Qiao
|
Songzhu Mei
|
Yijie Wang
|
Dongsheng Li
Rumor detection on social media has become an emerging topic. Traditional deep learning-based methods model rumors based on content, propagation structure, or user behavior, but these approaches are constrained by limited modeling capacity and insufficient training corpora. Recent studies have explored using LLMs for rumor detection through supervised fine-tuning (SFT), but face two issues: 1) unreliable samples sometimes mislead model learning; 2) the model only learns the most salient input-output mapping and skips in-depth analyses of the rumored content for convenience. To address these issues, we propose an SFT-based LLM rumor detection model with Influence guided Sample selection and Game-based multi-perspective Analysis (ISGA). Specifically, we first introduce the Influence Score (IS) to assess the impact of samples on model predictions and select samples for SFT. We also approximate IS via Taylor expansion to reduce computational complexity. Next, we use LLMs to generate in-depth analyses of news content from multiple perspectives and model their collaborative process for prediction as a cooperative game. Then we utilize the Shapley value to quantify the contribution of each perspective for selecting informative perspective analyses. Experiments show that ISGA outperforms existing SOTA methods on three datasets.
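The Shapley value used here to score each perspective can be computed exactly when the number of players is small. The sketch below uses the standard textbook definition; the perspective names and the toy coalition value function are hypothetical and do not come from the paper.

```python
from itertools import combinations
from math import factorial

def shapley_values(players, value_fn):
    """Exact Shapley values: weighted average marginal contribution of each player."""
    n = len(players)
    phi = {p: 0.0 for p in players}
    for p in players:
        others = [q for q in players if q != p]
        for size in range(n):
            # Probability weight of a coalition of this size preceding player p.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for coalition in combinations(others, size):
                s = frozenset(coalition)
                phi[p] += weight * (value_fn(s | {p}) - value_fn(s))
    return phi

# Hypothetical perspectives and gains, for illustration only.
gains = {"content": 0.1, "source": 0.2, "stance": 0.3}

def coalition_value(coalition):
    value = sum(gains[p] for p in coalition)
    if {"content", "source"} <= coalition:
        value += 0.2  # assumed synergy when both perspectives are present
    return value

phi = shapley_values(list(gains), coalition_value)
for name, score in phi.items():
    print(f"{name}: {score:.3f}")
```

The synergy bonus is split evenly between the two perspectives that create it, and the values sum to the value of the full coalition. Exact computation enumerates every coalition, so it is practical only for a handful of perspectives.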
pdf
bib
abs
Hierarchical-Task-Aware Multi-modal Mixture of Incremental LoRA Experts for Embodied Continual Learning
Ziqi Jia
|
Anmin Wang
|
Xiaoyang Qu
|
Xiaowen Yang
|
Jianzong Wang
Previous continual learning setups for embodied intelligence focused on executing low-level actions based on human commands, neglecting the ability to learn high-level planning and multi-level knowledge. To address these issues, we propose the Hierarchical Embodied Continual Learning Setups (HEC) that divide the agent’s continual learning process into two layers: high-level instructions and low-level actions, and define five embodied continual learning sub-setups. Building on these setups, we introduce the Task-aware Mixture of Incremental LoRA Experts (Task-aware MoILE) method. This approach achieves task recognition by clustering visual-text embeddings and uses both a task-level router and a token-level router to select the appropriate LoRA experts. To effectively address the issue of catastrophic forgetting, we apply Singular Value Decomposition (SVD) to the LoRA parameters obtained from prior tasks, preserving key components while orthogonally training the remaining parts. The experimental results show that our method stands out in reducing the forgetting of old tasks compared to other methods, effectively supporting agents in retaining prior knowledge while continuously learning new tasks.
pdf
bib
abs
SpindleKV: A Novel KV Cache Reduction Method Balancing Both Shallow and Deep Layers
Zicong Tang
|
Shi Luohe
|
Zuchao Li
|
Baoyuan Qi
|
Liu Guoming
|
Lefei Zhang
|
Ping Wang
Large Language Models (LLMs) have achieved impressive accomplishments in recent years. However, the increasing memory consumption of the KV cache has posed a significant challenge to inference systems. Eviction methods have revealed the inherent redundancy within the KV cache, demonstrating its potential for reduction, particularly in deeper layers. However, KV cache reduction for shallower layers has been found to be insufficient. We observe that the KV cache exhibits a high degree of similarity. Based on this observation, we propose a novel KV cache reduction method, SpindleKV, which balances both shallow and deep layers. For deep layers, we employ an attention-weight-based eviction method, while for shallow layers, we apply a codebook-based replacement approach that is learnt by similarity and merging policy. Moreover, SpindleKV addresses the Grouped-Query Attention (GQA) dilemma faced by other attention-based eviction methods. Experiments on two common benchmarks with three different LLMs show that SpindleKV obtains better KV cache reduction than baseline methods, while preserving similar or even better model performance.
pdf
bib
abs
Medical Graph RAG: Evidence-based Medical Large Language Model via Graph Retrieval-Augmented Generation
Junde Wu
|
Jiayuan Zhu
|
Yunli Qi
|
Jingkun Chen
|
Min Xu
|
Filippo Menolascina
|
Yueming Jin
|
Vicente Grau
We introduce MedGraphRAG, a novel graph-based Retrieval-Augmented Generation (RAG) framework designed to enhance LLMs in generating evidence-based medical responses, improving safety and reliability with private medical data. We introduce Triple Graph Construction and U-Retrieval to enhance GraphRAG, enabling holistic insights and evidence-based response generation for medical applications. Specifically, we connect user documents to credible medical sources and integrate Top-down Precise Retrieval with Bottom-up Response Refinement for balanced context awareness and precise indexing. Validated on 9 medical Q&A benchmarks, 2 health fact-checking datasets, and a long-form generation test set, MedGraphRAG outperforms state-of-the-art models while ensuring credible sourcing. Our code is publicly available.
pdf
bib
abs
Unifying Uniform and Binary-coding Quantization for Accurate Compression of Large Language Models
Seungcheol Park
|
Jeongin Bae
|
Beomseok Kwon
|
Minjun Kim
|
Byeongwook Kim
|
Se Jung Kwon
|
U Kang
|
Dongsoo Lee
How can we quantize large language models while preserving accuracy? Quantization is essential for deploying large language models (LLMs) efficiently. Binary-coding quantization (BCQ) and uniform quantization (UQ) are promising quantization schemes that have strong expressiveness and optimizability, respectively. However, neither scheme leverages both advantages. In this paper, we propose UniQuanF (Unified Quantization with Flexible Mapping), an accurate quantization method for LLMs. UniQuanF harnesses both strong expressiveness and optimizability by unifying the flexible mapping technique in UQ and BCQ’s non-uniform quantization levels. We propose unified initialization, and local and periodic mapping techniques to optimize the parameters in UniQuanF precisely. After optimization, our unification theorem removes computational and memory overhead, allowing us to utilize the superior accuracy of UniQuanF without extra deployment costs induced by the unification. Experimental results demonstrate that UniQuanF outperforms existing UQ and BCQ methods, achieving up to 4.60% higher accuracy on GSM8K benchmark.
pdf
bib
abs
Agentic Reasoning: A Streamlined Framework for Enhancing LLM Reasoning with Agentic Tools
Junde Wu
|
Jiayuan Zhu
|
Yuyuan Liu
|
Min Xu
|
Yueming Jin
We introduce Agentic Reasoning, a framework that enhances large language model (LLM) reasoning by integrating external tool-using agents. Agentic Reasoning dynamically leverages web search, code execution, and structured memory to address complex problems requiring deep research. A key innovation in our framework is the Mind-Map agent, which constructs a structured knowledge graph to store reasoning context and track logical relationships, ensuring coherence in long reasoning chains with extensive tool usage. Additionally, we conduct a comprehensive exploration of the Web-Search agent, leading to a highly effective search mechanism that surpasses all prior approaches. When deployed on DeepSeek-R1, our method achieves a new state-of-the-art (SOTA) among public models and delivers performance comparable to OpenAI Deep Research, the leading proprietary model in this domain. Extensive ablation studies validate the optimal selection of agentic tools and confirm the effectiveness of our Mind-Map and Web-Search agents in enhancing LLM reasoning. Our code and data are publicly available.
pdf
bib
abs
Probing Relative Interaction and Dynamic Calibration in Multi-modal Entity Alignment
Chenxiao Li
|
Jingwei Cheng
|
Qiang Tong
|
Fu Zhang
|
Cairui Wang
Multi-modal entity alignment aims to identify equivalent entities between two different multi-modal knowledge graphs. Current methods have made significant progress by improving embedding and cross-modal fusion. However, most of them depend on using loss functions to capture the relationship between modalities or adopt a one-time strategy to directly compute modality weights using attention mechanisms, which overlooks the relative interactions between modalities at the entity level and the accuracy of modality weights, thereby hindering the generalization to diverse entities. To address this challenge, we propose RICEA, a relative interaction and calibration framework for multi-modal entity alignment, which dynamically computes weights based on the relative interaction and recalibrates the weights according to their uncertainties. Among these, we propose a novel method called ADC that utilizes attention mechanisms to perceive the uncertainty of the weight for each modality, rather than directly calculating the weight of each modality as in previous works. Across 5 datasets and 23 settings, our proposed framework significantly outperforms other baselines. Our code and data are available at https://github.com/ChenxiaoLi-Joe/RICEA.
pdf
bib
abs
Learn to Memorize: Scalable Continual Learning in Semiparametric Models with Mixture-of-Neighbors Induction Memory
Guangyue Peng
|
Tao Ge
|
Wen Luo
|
Wei Li
|
Houfeng Wang
Semiparametric language models (LMs) have shown promise in various Natural Language Processing (NLP) tasks. However, they utilize non-parametric memory as static storage, which lacks learning capability and remains disconnected from the internal information flow of the parametric models, limiting scalability and efficiency. Based on recent interpretability theories of LMs, we reconceptualize the non-parametric memory represented by kNN-LM as a learnable Mixture-of-Neighbors Induction Memory (MoNIM), which synergizes the induction capabilities of attention heads with the memorization strength of feed-forward networks (FFN). By integrating into the model’s information flow, MoNIM functions as an FFN-like bypass layer within the Transformer architecture, enabling effective learning of new knowledge. Extensive experiments demonstrate that MoNIM is a retentive and scalable continual learner both data-wise and model-wise, enhancing the scalability and continual learning performance of semiparametric LMs.
pdf
bib
abs
Adverse Event Extraction from Discharge Summaries: A New Dataset, Annotation Scheme, and Initial Findings
Imane Guellil
|
Salomé Andres
|
Atul Anand
|
Bruce Guthrie
|
Huayu Zhang
|
Abul Hasan
|
Honghan Wu
|
Beatrice Alex
In this work, we present a manually annotated corpus for Adverse Event (AE) extraction from discharge summaries of elderly patients, a population often underrepresented in clinical NLP resources. The dataset includes 14 clinically significant AEs, such as falls, delirium, and intracranial haemorrhage, along with contextual attributes like negation, diagnosis type, and in-hospital occurrence. Uniquely, the annotation schema supports both discontinuous and overlapping entities, addressing challenges rarely tackled in prior work. We evaluate multiple models using FlairNLP across three annotation granularities: fine-grained, coarse-grained, and coarse-grained with negation. While transformer-based models (e.g., BERT-cased) achieve strong performance on document-level coarse-grained extraction (F1 = 0.943), performance drops notably for fine-grained entity-level tasks (e.g., F1 = 0.675), particularly for rare events and complex attributes. These results demonstrate that despite high-level scores, significant challenges remain in detecting underrepresented AEs and capturing nuanced clinical language. Developed within a Trusted Research Environment (TRE), the dataset is available upon request via DataLoch and serves as a robust benchmark for evaluating AE extraction methods and supporting future cross-dataset generalisation.
pdf
bib
abs
Speed Up Your Code: Progressive Code Acceleration Through Bidirectional Tree Editing
Longhui Zhang
|
Jiahao Wang
|
Meishan Zhang
|
GaoXiong Cao
|
Ensheng Shi
|
Mayuchi Mayuchi
|
Jun Yu
|
Honghai Liu
|
Jing Li
|
Min Zhang
Large language models (LLMs) have made significant strides in code acceleration (CA) tasks. Current works typically fine-tune LLMs using slow-fast code pairs mined from online programming platforms. Although these methods are widely recognized for their effectiveness, the training data often lack clear code acceleration patterns and offer only limited speed improvements. Moreover, existing training methods, such as direct instruction fine-tuning (IFT), tend to overlook the hierarchical relationships among acceleration patterns. In this work, we introduce BITE, a novel training paradigm designed to improve LLMs’ CA capabilities through two key innovations: (1) Bidirectional tree editing, which generates high-quality training data by incrementally transforming given code into both its most efficient and least efficient variants, and (2) Progressive code acceleration learning, which enables LLMs to internalize multi-level CA strategies by learning increasingly sophisticated acceleration patterns. Additionally, we introduce a new CA evaluation benchmark and metric for comprehensive assessment of model performance on CA tasks. Extensive experiments on both our benchmark and existing benchmarks demonstrate the effectiveness of our approach. Notably, BITE enables Qwen-1.5B to outperform prompt-enhanced GPT-4 and current training-based methods on average across five programming languages.
pdf
bib
abs
Multi-Facet Blending for Faceted Query-by-Example Retrieval
Heejin Do
|
Sangwon Ryu
|
Jonghwi Kim
|
Gary Lee
With the growing demand to fit fine-grained user intents, faceted query-by-example (QBE), which retrieves similar documents conditioned on specific facets, has gained recent attention. However, prior approaches mainly depend on document-level comparisons using basic indicators like citations due to the lack of facet-level relevance datasets; yet, this limits their use to citation-based domains and fails to capture the intricacies of facet constraints. In this paper, we propose a multi-facet blending (FaBle) augmentation method, which exploits modularity by decomposing and recomposing to explicitly synthesize facet-specific training sets. We automatically decompose documents into facet units and generate (ir)relevant pairs by leveraging LLMs’ intrinsic distinguishing capabilities; then, dynamically recomposing the units leads to facet-wise relevance-informed document pairs. Our modularization eliminates the need for pre-defined facet knowledge or labels. Further, to demonstrate FaBle’s efficacy in a new domain beyond citation-based scientific paper retrieval, we release a benchmark dataset for educational exam item QBE. FaBle augmentation on 1K documents remarkably assists training in obtaining facet conditional embeddings.
pdf
bib
abs
PIPER: Benchmarking and Prompting Event Reasoning Boundary of LLMs via Debiasing-Distillation Enhanced Tuning
Zhicong Lu
|
Changyuan Tian
|
PeiguangLi PeiguangLi
|
Li Jin
|
Sirui Wang
|
Wei Jia
|
Ying Shen
|
Guangluan Xu
While Large Language Models (LLMs) excel in diverse domains, their validity in event reasoning remains underexplored. Most existing works merely stagnate at assessing LLMs’ event reasoning with a single event relational type or reasoning format, failing to conduct a complete evaluation and provide a practical solution for capability enhancement. In this paper, we propose PIPER, the first comprehensive benchmark for Probing Into the Performance boundary of LLMs in Event Reasoning. Motivated by our evaluation observations and error patterns analysis, we meticulously craft 10K diverse instruction-tuning demonstrations to alleviate event reasoning-oriented data scarcity. Additionally, a novel Debiasing and Distillation-Enhanced Supervised Fine-Tuning (D2E-SFT) strategy is presented, which facilitates adhering to context and fixating significant contextual event information to elevate the event reasoning capability. Specifically, D2E-SFT removes the given sample’s context to construct an imagined sample, subtracting its logits to mitigate the bias of neglecting context and improve contextual faithfulness. To guide the model in emphasizing significant contextual event information, D2E-SFT employs a context-refined sample to achieve self-distillation with the alignment of logits. Extensive experimental results demonstrate the effectiveness of our data and strategy in expanding the performance boundary of event reasoning.
pdf
bib
abs
MIR: Methodology Inspiration Retrieval for Scientific Research Problems
Aniketh Garikaparthi
|
Manasi Patwardhan
|
Aditya Sanjiv Kanade
|
Aman Hassan
|
Lovekesh Vig
|
Arman Cohan
There has been a surge of interest in harnessing the reasoning capabilities of Large Language Models (LLMs) to accelerate scientific discovery. While existing approaches rely on grounding the discovery process within the relevant literature, effectiveness varies significantly with the quality and nature of the retrieved literature. We address the challenge of retrieving prior work whose concepts can inspire solutions for a given research problem, a task we define as Methodology Inspiration Retrieval (MIR). We construct a novel dataset tailored for training and evaluating retrievers on MIR, and establish baselines. To address MIR, we build the Methodology Adjacency Graph (MAG), capturing methodological lineage through citation relationships. We leverage MAG to embed an “intuitive prior” into dense retrievers for identifying patterns of methodological inspiration beyond superficial semantic similarity. This achieves significant gains of +5.4 in Recall@3 and +7.8 in Mean Average Precision (mAP) over strong baselines. Further, we adapt LLM-based re-ranking strategies to MIR, yielding additional improvements of +4.5 in Recall@3 and +4.8 in mAP. Through extensive ablation studies and qualitative analyses, we exhibit the promise of MIR in enhancing automated scientific discovery and outline avenues for advancing inspiration-driven retrieval.
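For readers unfamiliar with the two reported metrics, the sketch below computes Recall@k and average precision for a single query using their standard definitions; mAP is the mean of average precision over queries. The paper identifiers are hypothetical, and the exact evaluation variants used by the authors may differ.

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the relevant items that appear in the top-k results."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant)
    return hits / len(relevant)

def average_precision(ranked_ids, relevant_ids):
    """Mean of precision values taken at the rank of each relevant item."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits, total = 0, 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)

def mean_average_precision(runs):
    """`runs` is a list of (ranked_ids, relevant_ids) pairs, one per query."""
    if not runs:
        return 0.0
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)

# Hypothetical retrieval result for one research problem.
ranked = ["p3", "p1", "p7", "p2"]
relevant = {"p1", "p2"}
print(recall_at_k(ranked, relevant, k=3))    # one of two relevant papers in top 3
print(average_precision(ranked, relevant))   # (1/2 + 2/4) / 2
```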
pdf
bib
abs
Sticking to the Mean: Detecting Sticky Tokens in Text Embedding Models
Kexin Chen
|
Dongxia Wang
|
Yi Liu
|
Haonan Zhang
|
Wenhai Wang
Despite the widespread use of Transformer-based text embedding models in NLP tasks, surprising “sticky tokens” can undermine the reliability of embeddings. These tokens, when repeatedly inserted into sentences, pull sentence similarity toward a certain value, disrupting the normal distribution of embedding distances and degrading downstream performance. In this paper, we systematically investigate such anomalous tokens, formally defining them and introducing an efficient detection method, Sticky Token Detector (STD), based on sentence and token filtering. Applying STD to 40 checkpoints across 14 model families, we discover a total of 868 sticky tokens. Our analysis reveals that these tokens often originate from special or unused entries in the vocabulary, as well as fragmented subwords from multilingual corpora. Notably, their presence does not strictly correlate with model size or vocabulary size. We further evaluate how sticky tokens affect downstream tasks like clustering and retrieval, observing significant performance drops of up to 50%. Through attention-layer analysis, we show that sticky tokens disproportionately dominate the model’s internal representations, raising concerns about tokenization robustness. Our findings show the need for better tokenization strategies and model design to mitigate the impact of sticky tokens in future text embedding applications.
pdf
bib
abs
Memorizing is Not Enough: Deep Knowledge Injection Through Reasoning
Ruoxi Xu
|
Yunjie Ji
|
Boxi Cao
|
Yaojie Lu
|
Hongyu Lin
|
Xianpei Han
|
Ben He
|
Yingfei Sun
|
Xiangang Li
|
Le Sun
Although large language models (LLMs) excel in knowledge recall and reasoning, their static nature leads to outdated information as the real world evolves or when adapting to domain-specific knowledge, highlighting the need for effective knowledge injection. However, current research on knowledge injection remains superficial, mainly focusing on knowledge memorization and retrieval. This paper proposes a four-tier knowledge injection framework that systematically defines the levels of knowledge injection: memorization, retrieval, reasoning, and association. Based on this framework, we introduce DeepKnowledge, a synthetic experimental testbed designed for fine-grained evaluation of the depth of knowledge injection across three knowledge types (novel, incremental, and updated). We then explore various knowledge injection scenarios and evaluate the depth of knowledge injection for each scenario on the benchmark. Experimental results reveal key factors to reach each level of knowledge injection for LLMs and establish a mapping between the levels of knowledge injection and the corresponding suitable injection methods, aiming to provide a comprehensive approach for efficient knowledge injection across various levels. The code is available at https://github.com/icip-cas/Knowledge-Learning-Toolkits.
pdf
bib
abs
Improving Dialogue State Tracking through Combinatorial Search for In-Context Examples
Haesung Pyun
|
Yoonah Park
|
Yohan Jo
In dialogue state tracking (DST), in-context learning comprises a retriever that selects labeled dialogues as in-context examples and a DST model that uses these examples to infer the dialogue state of the query dialogue. Existing methods for constructing training data for retrievers suffer from three key limitations: (1) the synergistic effect of examples is not considered, (2) the linguistic characteristics of the query are not sufficiently factored in, and (3) scoring is not directly optimized for DST performance. Consequently, the retriever can fail to retrieve examples that would substantially improve DST performance. To address these issues, we present CombiSearch—a method that scores effective in-context examples based on their combinatorial impact on DST performance. Our evaluation on MultiWOZ shows that retrievers trained with CombiSearch surpass state-of-the-art models, achieving a 20× gain in data efficiency and generalizing well to the SGD dataset. Moreover, CombiSearch attains a 12% absolute improvement in the upper bound DST performance over traditional approaches when no retrieval errors are assumed. This significantly increases the headroom for practical DST performance while demonstrating that existing methods rely on suboptimal data for retriever training.
pdf
bib
abs
Pretraining Context Compressor for Large Language Models with Embedding-Based Memory
Yuhong Dai
|
Jianxun Lian
|
Yitian Huang
|
Wei Zhang
|
Mingyang Zhou
|
Mingqi Wu
|
Xing Xie
|
Hao Liao
Efficient processing of long contexts in large language models (LLMs) is essential for real-world applications like retrieval-augmented generation and in-context learning, especially in resource-constrained environments such as edge computing. This paper explores the embedding-based context compression to reduce inference costs while preserving the downstream LLM configurations. We propose a decoupled compressor-LLM framework, pretrained on text reconstruction and completion tasks, designed to effectively preserve essential contextual information within condensed embedding representations. Our extensive experiments investigate pretraining, model configurations, compression rates, efficiency across tasks, and adaptability to various LLMs. Results demonstrate that our approach outperforms competitive baselines in three domains and across eight datasets while being adaptable to different downstream LLMs. We find that thorough pretraining and carefully selected compression rates, such as 4x and 16x, enable a lightweight compressor to achieve a good balance between accuracy and speed. These findings underscore the potential of embedding-based compression to enhance LLM efficiency and motivate further research in this area.
pdf
bib
abs
Dialogue Systems for Emotional Support via Value Reinforcement
Juhee Kim
|
Chunghu Mok
|
Jisun Lee
|
Hyang Sook Kim
|
Yohan Jo
Emotional support dialogue systems aim to reduce help-seekers’ distress and help them overcome challenges. While human values—core beliefs that shape an individual’s priorities—are increasingly emphasized in contemporary psychological therapy for their role in fostering internal transformation and long-term emotional well-being, their integration into emotional support systems remains underexplored. To bridge this gap, we present a value-driven method for training emotional support dialogue systems designed to reinforce positive values in seekers. Notably, our model identifies which values to reinforce at each turn and how to do so, by leveraging online support conversations from Reddit. We evaluate the method across support skills, seekers’ emotional intensity, and value reinforcement. Our method consistently outperforms various baselines, effectively exploring and eliciting values from seekers. Additionally, leveraging crowd knowledge from Reddit significantly enhances its effectiveness. Therapists highlighted its ability to validate seekers’ challenges and emphasize positive aspects of their situations—both crucial elements of value reinforcement. Our work, being the first to integrate value reinforcement into emotional support systems, demonstrates its promise and establishes a foundation for future research.
pdf
bib
abs
Length-Induced Embedding Collapse in PLM-based Models
Yuqi Zhou
|
Sunhao Dai
|
Zhanshuo Cao
|
Xiao Zhang
|
Jun Xu
Text embeddings from PLM-based models enable a wide range of applications, yet their performance often degrades on longer texts. In this paper, we introduce a phenomenon we call Length Collapse, where embeddings of longer texts tend to cluster together. This clustering results in a distributional inconsistency between the embeddings of short and long texts. We further investigate how these differences contribute to the performance decline observed with longer texts across various downstream tasks. Through a rigorous theoretical analysis of the self-attention mechanism, which acts as a low-pass filter in PLM-based models, we demonstrate that as text length increases, the strength of low-pass filtering intensifies, causing embeddings to retain more low-frequency components. As a result, input token features become more similar, leading to clustering and ultimately the collapse of embeddings for longer texts. To address this issue, we propose a simple method, TempScale, which mitigates the Length Collapse phenomenon. By narrowing the gap in low-pass filtering rates between long and short texts, TempScale ensures more consistent embeddings across different text lengths. This approach leads to performance improvements of 0.94% on MTEB and 1.10% on LongEmbed, which focuses specifically on long-context retrieval, providing strong evidence for the validity of our analysis. The source code is available at https://github.com/Yuqi-Zhou/Length_Collapse.
pdf
bib
abs
SHuBERT: Self-Supervised Sign Language Representation Learning via Multi-Stream Cluster Prediction
Shester Gueuwou
|
Xiaodan Du
|
Greg Shakhnarovich
|
Karen Livescu
|
Alexander H. Liu
Sign language processing has traditionally relied on task-specific models, limiting the potential for transfer learning across tasks. Pre-training methods for sign language have typically focused on either supervised pre-training, which cannot take advantage of unlabeled data, or context-independent (frame or video segment) representations, which ignore the effects of relationships across time in sign language. We introduce SHuBERT (Sign Hidden-Unit BERT), a self-supervised contextual representation model learned from approximately 1,000 hours of American Sign Language video. SHuBERT adapts masked token prediction objectives to multi-stream visual sign language input, learning to predict multiple targets corresponding to clustered hand, face, and body pose streams. SHuBERT achieves state-of-the-art performance across multiple tasks including sign language translation, isolated sign language recognition, and fingerspelling detection.
pdf
bib
abs
ERU-KG: Efficient Reference-aligned Unsupervised Keyphrase Generation
Lam Thanh Do
|
Aaditya Bodke
|
Pritom Saha Akash
|
Kevin Chen-Chuan Chang
Unsupervised keyphrase prediction has gained growing interest in recent years. However, existing methods typically rely on heuristically defined importance scores, which may lead to inaccurate informativeness estimation. In addition, they lack consideration for time efficiency. To solve these problems, we propose ERU-KG, an unsupervised keyphrase generation (UKG) model that consists of an informativeness module and a phraseness module. The former estimates the relevance of keyphrase candidates, while the latter generates those candidates. The informativeness module innovates by learning to model informativeness through references (e.g., queries, citation contexts, and titles) and at the term level, thereby 1) capturing how the key concepts of documents are perceived in different contexts and 2) estimating informativeness of phrases more efficiently by aggregating term informativeness, removing the need for explicit modeling of the candidates. ERU-KG demonstrates its effectiveness on keyphrase generation benchmarks by outperforming unsupervised baselines and achieving on average 89% of the performance of a supervised model for top 10 predictions. Additionally, to highlight its practical utility, we evaluate the model on text retrieval tasks and show that keyphrases generated by ERU-KG are effective when employed as query and document expansions. Furthermore, inference speed tests reveal that ERU-KG is the fastest among baselines of similar model sizes. Finally, our proposed model can switch between keyphrase generation and extraction by adjusting hyperparameters, catering to diverse application requirements.
pdf
bib
abs
Know Your Mistakes: Towards Preventing Overreliance on Task-Oriented Conversational AI Through Accountability Modeling
Suvodip Dey
|
Yi-Jyun Sun
|
Gokhan Tur
|
Dilek Hakkani-Tür
Recent LLMs have enabled significant advancements for conversational agents. However, they are also well known to hallucinate, producing responses that seem plausible but are factually incorrect. On the other hand, users tend to over-rely on LLM-based AI agents, accepting AI’s suggestion even when it is wrong. Adding positive friction, such as explanations or getting user confirmations, has been proposed as a mitigation in AI-supported decision-making systems. In this paper, we propose an accountability model for LLM-based task-oriented dialogue agents to address user overreliance via friction turns in cases of model uncertainty and errors associated with dialogue state tracking (DST). The accountability model is an augmented LLM with an additional accountability head that functions as a binary classifier to predict the relevant slots of the dialogue state mentioned in the conversation. We perform our experiments with multiple backbone LLMs on two established benchmarks (MultiWOZ and Snips). Our empirical findings demonstrate that the proposed approach not only enables reliable estimation of AI agent errors but also guides the decoder in generating more accurate actions. We observe around 3% absolute improvement in joint goal accuracy (JGA) of DST output by incorporating accountability heads into modern LLMs. Self-correcting the detected errors further increases the JGA from 67.13 to 70.51, achieving state-of-the-art DST performance. Finally, we show that error correction through user confirmations (friction turn) achieves a similar performance gain, highlighting its potential to reduce user overreliance.
pdf
bib
abs
LLMs Trust Humans More, That’s a Problem! Unveiling and Mitigating the Authority Bias in Retrieval-Augmented Generation
Yuxuan Li
|
Xinwei Guo
|
Jiashi Gao
|
Guanhua Chen
|
Xiangyu Zhao
|
Jiaxin Zhang
|
Quanying Liu
|
Haiyan Wu
|
Xin Yao
|
Xuetao Wei
Retrieval-Augmented Generation (RAG) has been proven to be an effective approach to address the hallucination problem in large language models (LLMs). In current RAG systems, LLMs typically need to synthesize knowledge provided by two main external sources (user prompts and an external database) to generate a final answer. When the knowledge provided by the user conflicts with that retrieved from the database, a critical question arises: Does the LLM favor one knowledge source over the other when generating the answer? In this paper, we are the first to unveil a new phenomenon, Authority Bias, where the LLMs tend to favor the knowledge provided by the user even when it deviates from the facts; this new phenomenon is rigorously evidenced via our novel and comprehensive characterization of Authority Bias in six widely used LLMs and across diverse task scenarios. We propose a novel dataset specifically designed for detecting Authority Bias, called the Authority Bias Detection Dataset (ABDD), and introduce new, detailed metrics to measure Authority Bias. To mitigate Authority Bias, we finally propose the Conflict Detection Enhanced Query (CDEQ) framework. We identify the sentences and atomic information that generate conflicts, perform a credibility assessment on the conflicting paragraphs, and ultimately enhance the query to detect perturbed text, thereby reducing Authority Bias. Comparative experiments with widely used mitigation methods demonstrate that CDEQ exhibits both effectiveness and advancement, significantly enhancing the robustness of RAG systems.
pdf
bib
abs
Divide-Then-Aggregate: An Efficient Tool Learning Method via Parallel Tool Invocation
Dongsheng Zhu
|
Weixian Shi
|
Zhengliang Shi
|
Zhaochun Ren
|
Shuaiqiang Wang
|
Lingyong Yan
|
Dawei Yin
While Large Language Models (LLMs) demonstrate remarkable capabilities, their ability to autonomously execute complex real-world tasks remains limited. Accordingly, tool learning has emerged to enable LLMs to effectively leverage external tools to extend their capabilities. Current tool-learning paradigms like CoT/ReAct employ sequential tool invocation but suffer from constrained perception and inadequate task planning. Alternative approaches using search-based decision trees incur substantial computational overhead. To address these limitations, we propose DTA-Llama (Divide-Then-Aggregate Llama), a novel parallel tool invocation framework featuring: (1) a Directed Acyclic Graph (DAG) structure that is transformed from traditional tree-based tool search paths, enabling parallel execution and contributing high-quality training data; (2) a process-thread-inspired inference mechanism that iteratively decomposes tasks into parallel tool-using subtasks while aggregating results for subsequent decisions. Experimental results show that our approach substantially enhances task performance while reducing token consumption and inference time. Llama2-7B, using our method, is comparable to the official parallel function calling method of GPT-3.5. The relevant code, dataset, and model weights are available at https://corn0205.github.io/.
pdf
bib
abs
Reviving Cultural Heritage: A Novel Approach for Comprehensive Historical Document Restoration
Yuyi Zhang
|
Peirong Zhang
|
Zhenhua Yang
|
Pengyu Yan
|
Yongxin Shi
|
Pengwei Liu
|
Fengjun Guo
|
Lianwen Jin
Historical documents represent an invaluable cultural heritage, yet have undergone significant degradation over time through tears, water erosion, and oxidation. Existing Historical Document Restoration (HDR) methods primarily focus on single modality or limited-size restoration, failing to meet practical needs. To fill this gap, we present a full-page HDR dataset (FPHDR) and a novel automated HDR solution (AutoHDR). Specifically, FPHDR comprises 1,633 real and 6,543 synthetic images with character-level and line-level locations, as well as character annotations in different damage grades. AutoHDR mimics historians’ restoration workflows through a three-stage approach: OCR-assisted damage localization, vision-language context text prediction, and patch autoregressive appearance restoration. The modular architecture of AutoHDR enables seamless human-machine collaboration, allowing for flexible intervention and optimization at each restoration stage. Experiments demonstrate AutoHDR’s remarkable performance in HDR. When processing severely damaged documents, our system improves OCR accuracy from 46.83% to 84.05%, with further enhancement to 94.25% through human-machine collaboration. We believe this work represents a significant advancement in automated historical document restoration and contributes substantially to cultural heritage preservation. The model and dataset are available at https://github.com/SCUT-DLVCLab/AutoHDR.
pdf
bib
abs
PopAlign: Diversifying Contrasting Patterns for a More Comprehensive Alignment
Zekun Moore Wang
|
Shenzhi Wang
|
King Zhu
|
Jiaheng Liu
|
Ke Xu
|
Jie Fu
|
Wangchunshu Zhou
|
Wenhao Huang
Alignment of large language models (LLMs) involves training models on preference-contrastive output pairs to adjust their responses according to human preferences. To obtain such contrastive pairs, traditional methods like RLHF and RLAIF rely on limited contrasting patterns, such as varying model variants or decoding temperatures. This singularity leads to two issues: (1) alignment is not comprehensive; and thereby (2) models are susceptible to harmful response tendencies. To address these issues, we investigate how to construct more comprehensive and diversified contrasting patterns to enhance preference data (RQ1) and verify the impact of the diversification of contrasting patterns on model alignment (RQ2). For RQ1, we propose PopAlign, a framework that integrates diversified contrasting patterns across the prompt, model, and pipeline levels, introducing six contrasting strategies that do not require additional feedback labeling procedures. Regarding RQ2, we conduct thorough experiments demonstrating that PopAlign significantly outperforms existing methods, leading to more comprehensive alignment.
pdf
bib
abs
Robust Utility-Preserving Text Anonymization Based on Large Language Models
Tianyu Yang
|
Xiaodan Zhu
|
Iryna Gurevych
Anonymizing text that contains sensitive information is crucial for a wide range of applications. Existing techniques face the emerging challenges of the re-identification ability of large language models (LLMs), which have shown advanced capability in memorizing detailed information and reasoning over dispersed pieces of patterns to draw conclusions. When defending against LLM-based re-identification, anonymization could jeopardize the utility of the resulting anonymized data in downstream tasks. In general, the interaction between anonymization and data utility requires a deeper understanding within the context of LLMs. In this paper, we propose a framework composed of three key LLM-based components: a privacy evaluator, a utility evaluator and an optimization component, which work collaboratively to perform anonymization. Extensive experiments demonstrate that the proposed model outperforms existing baselines, showing robustness in reducing the risk of re-identification while preserving greater data utility in downstream tasks. We provide detailed studies on these core modules. To consider large-scale and real-time applications, we investigate the distillation of the anonymization capabilities into lightweight models. All of our code and datasets will be made publicly available at [Github URL].
pdf
bib
abs
SEAL: Scaling to Emphasize Attention for Long-Context Retrieval
Changhun Lee
|
Minsang Seok
|
Jun-gyu Jin
|
YoungHyun Cho
|
Eunhyeok Park
While many advanced LLMs are designed to handle long sequence data, we can still observe notable quality degradation even within the sequence limit. In this work, we introduce a novel approach called Scaling to Emphasize Attention for Long-context retrieval (SEAL), which enhances the retrieval performance of large language models (LLMs) over long contexts. We observe that specific attention heads are closely tied to long-context retrieval, showing positive or negative correlation with retrieval scores, and adjusting the strength of these heads boosts the quality of LLMs in long context by a large margin. Built on this insight, we propose a learning-based mechanism that leverages generated data to emphasize these heads. By applying SEAL, we achieve significant improvements in long-context retrieval performance across various tasks and models. Additionally, when combined with existing training-free context extension techniques, SEAL extends the contextual limits of LLMs while maintaining highly reliable outputs.
pdf
bib
abs
From Neurons to Semantics: Evaluating Cross-Linguistic Alignment Capabilities of Large Language Models via Neurons Alignment
Chongxuan Huang
|
Yongshi Ye
|
Biao Fu
|
Qifeng Su
|
Xiaodong Shi
Large language models (LLMs) have demonstrated remarkable multilingual capabilities; however, how to evaluate cross-lingual alignment remains underexplored. Existing alignment benchmarks primarily focus on sentence embeddings, but prior research has shown that neural models tend to induce a non-smooth representation space, which impacts semantic alignment evaluation on low-resource languages. Inspired by neuroscientific findings that similar information activates overlapping neuronal regions, we propose a novel *Neuron State-Based Cross-Lingual Alignment* (NeuronXA) to assess the cross-lingual alignment capabilities of LLMs, which offers a more semantically grounded approach to assessing cross-lingual alignment. We evaluate NeuronXA on several prominent multilingual LLMs (LLaMA, Qwen, Mistral, GLM, and OLMo) across two transfer tasks and three multilingual benchmarks. The results demonstrate that with only 100 parallel sentence pairs, NeuronXA achieves a Pearson correlation of 0.9556 with downstream task performance and 0.8524 with transferability. These findings demonstrate NeuronXA’s effectiveness in assessing both cross-lingual alignment and transferability, even with a small dataset. This highlights its potential to advance cross-lingual alignment research and to improve the semantic understanding of multilingual LLMs.
pdf
bib
abs
𝒜3: Automatic Alignment Framework for Attributed Text Generation
Yue Wang
|
Haoke Zhang
|
Juntao Li
|
Jinxiong Chang
|
Min Zhang
Attributed text generation aims to enhance the reliability of content generated from large language models by providing citations for each claim, which thereby enables users to easily verify the correctness of the responses. However, the scarcity of high-quality training samples presents a significant challenge in aligning large language models to generate texts with citations, revealing considerable room for improvement in existing attribution systems. Besides, existing approaches of aligning large language models to follow user instructions can lead to an undue emphasis on irrelevant documents, which in turn reduces the quality of responses. To address the above problems, we propose Automatic Alignment Framework for Attributed Text Generation (𝒜3), a novel framework designed to automatically generate high-quality attributed query-response pairs for both supervised fine-tuning and preference optimization stages without human annotation. With the help of 𝒜3, Mistral-7B can achieve a citation recall of 84.4 and a precision of 87.0 on ASQA, which notably surpasses GPT-4’s citation recall of 73.0 and precision of 76.5.
pdf
bib
abs
Towards Better Value Principles for Large Language Model Alignment: A Systematic Evaluation and Enhancement
Bingbing Xu
|
Jing Yao
|
Xiaoyuan Yi
|
Aishan Maoliniyazi
|
Xing Xie
|
Xiaofeng Meng
As Large Language Models (LLMs) advance, aligning them with human values is critical for their responsible development. Value principles serve as the foundation for clarifying alignment goals. Multiple sets of value principles have been proposed, such as HHH (helpful, honest, harmless) and instructions for data synthesis in reinforcement learning from AI feedback (RLAIF). However, most of them are heuristically crafted, without consideration of three primary challenges in practical LLM alignment: 1) Comprehensiveness to deal with diverse and even unforeseen scenarios in which LLMs could be applied; 2) Precision to provide LLMs with clear and actionable guidance in specific scenarios; and 3) Compatibility to avoid internal conflicts between principles. In this paper, we formalize quantitative metrics to evaluate value principles along the three desirable properties. Building on these metrics, we propose the Hierarchical Value Principle framework (HiVaP), which constructs a hierarchical principle set and retrieves principles tailored to each scenario in a cascading way, addressing the above challenges. Experimental results validate that the three metrics capture the effectiveness of value principles for LLM alignment, and our HiVaP framework that enhances these metrics leads to superior alignment. Warning: This paper contains several toxic and offensive statements.
pdf
bib
abs
Language Models, Graph Searching, and Supervision Adulteration: When More Supervision is Less and How to Make More More
Arvid Frydenlund
This work concerns the path-star task, a minimal example of searching over a graph. The graph, G, is star-shaped with D arms radiating from a start node, s. A language model (LM) is given G, s, and a target node, t, which ends one of the arms, and is tasked with generating the arm containing t. The minimal nature of this task means only a single choice needs to be made: which of the arms contains t? Decoder-only LMs fail to solve this elementary task above 1/D chance due to a learned shortcut that absorbs training supervision. We show how this pathology is caused by excess supervision and present a series of solutions demonstrating that the task is solvable via decoder-only LMs. We find that the task’s minimal nature causes its difficulty, as it prevents task decomposition. Our solutions provide insight into the pathology and its implications for LMs trained via next-token prediction.
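The task setup described in this abstract can be made concrete with a minimal sketch. Everything below is an illustrative assumption rather than the paper's setup: the integer node labels, the shuffled edge-list encoding, and the helper names `make_path_star` and `solve_path_star` are hypothetical, and the abstract does not say how the graph is actually presented to the LM.

```python
import random


def make_path_star(num_arms, arm_len, seed=0):
    """Build a star graph: num_arms arms of arm_len nodes radiate from start node 0.

    Hypothetical encoding: nodes are integers, the graph is a shuffled edge list.
    """
    rng = random.Random(seed)
    labels = list(range(1, num_arms * arm_len + 1))
    rng.shuffle(labels)  # random labels, so node ids do not reveal the arm
    start = 0
    arms = [labels[i * arm_len:(i + 1) * arm_len] for i in range(num_arms)]
    edges = []
    for arm in arms:
        prev = start
        for node in arm:
            edges.append((prev, node))  # directed edge pointing away from start
            prev = node
    rng.shuffle(edges)  # edge order carries no information either
    return start, arms, edges


def solve_path_star(start, edges, target):
    """Reference solver: follow parent pointers from the target back to start."""
    parent = {child: par for par, child in edges}
    path = [target]
    while path[-1] != start:
        path.append(parent[path[-1]])
    return path[::-1]


start, arms, edges = make_path_star(num_arms=3, arm_len=4)
target = arms[1][-1]  # the target ends one of the arms
path = solve_path_star(start, edges, target)
```

With D = 3 arms, picking an arm uniformly at random succeeds with probability 1/3, which is the 1/D chance level the abstract refers to.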
pdf
bib
abs
Diversity Explains Inference Scaling Laws: Through a Case Study of Minimum Bayes Risk Decoding
Hidetaka Kamigaito
|
Hiroyuki Deguchi
|
Yusuke Sakai
|
Katsuhiko Hayashi
|
Taro Watanabe
Inference methods play an important role in eliciting the performance of large language models (LLMs). Currently, LLMs use inference methods that utilize multiple generated samples, which can be derived from Minimum Bayes Risk (MBR) Decoding. Previous studies have conducted empirical analyses to clarify the improvements in generation performance achieved by MBR decoding and have reported various observations. However, the theoretical underpinnings of these findings remain uncertain. To address this, we offer a new theoretical interpretation of MBR decoding from the perspective of bias–diversity decomposition. In this interpretation, the error in the quality estimation of hypotheses by MBR decoding is decomposed into two main factors: bias, which considers the closeness between the utility function and human evaluation, and diversity, which represents the variability in the quality estimation of the utility function. The theoretical analysis reveals the difficulty of simultaneously improving bias and diversity, confirming the validity of enhancing MBR decoding performance by increasing diversity. Furthermore, we reveal that diversity can explain one aspect of inference scaling laws that describe performance improvement by increasing sample size. Moreover, experiments across multiple NLP tasks yielded results consistent with these theoretical characteristics. Our code is available at https://github.com/naist-nlp/mbr-bias-diversity.
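For readers unfamiliar with MBR decoding, the following is a minimal sketch of the standard sampling-based formulation, in which each sampled hypothesis is scored by its average utility against the other samples used as pseudo-references. It is not the authors' implementation (see their repository for that): the unigram-F1 utility, the function names, and the toy samples are all assumptions chosen for illustration, and the sketch does not reproduce the bias–diversity analysis.

```python
def unigram_f1(hyp, ref):
    """Toy utility (an assumption): unigram-overlap F1 between two strings."""
    h, r = hyp.split(), ref.split()
    common = sum(min(h.count(w), r.count(w)) for w in set(h))
    if common == 0:
        return 0.0
    precision, recall = common / len(h), common / len(r)
    return 2 * precision * recall / (precision + recall)


def mbr_decode(samples, utility=unigram_f1):
    """Return the sample with the highest mean utility against the other samples.

    Requires at least two samples.
    """
    def expected_utility(i):
        others = [s for j, s in enumerate(samples) if j != i]
        return sum(utility(samples[i], o) for o in others) / len(others)

    best = max(range(len(samples)), key=expected_utility)
    return samples[best]


# Hypothetical samples standing in for LLM generations.
samples = [
    "the cat sat on the mat",
    "a cat sat on the mat",
    "the cat is on the mat",
    "dogs bark loudly",
]
choice = mbr_decode(samples)
```

In practice the utility would be a learned or reference-based metric rather than unigram overlap; the selection rule stays the same.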
pdf
bib
abs
Performance Gap in Entity Knowledge Extraction Across Modalities in Vision Language Models
Ido Cohen
|
Daniela Gottesman
|
Mor Geva
|
Raja Giryes
Vision-language models (VLMs) excel at extracting and reasoning about information from images. Yet, their capacity to leverage internal knowledge about specific entities remains underexplored. This work investigates the disparity in model performance when answering factual questions about an entity described in text versus depicted in an image. Our results reveal a significant accuracy drop — reaching 18% for some models — when the entity is presented visually instead of textually. To study this gap we present PopVQA, a dataset which allows separating entity recognition and question answering, and use it to benchmark several models. We hypothesize that this decline arises from limitations in how information flows from image tokens to query tokens. Thus, we use mechanistic interpretability tools to reveal that, although image tokens are preprocessed by the vision encoder, meaningful information flow from these tokens occurs only in the much deeper layers. Furthermore, critical image processing happens in the language model’s middle layers, allowing few layers for consecutive reasoning, highlighting a potential inefficiency in how the model utilizes its layers for reasoning. These insights shed light on the internal mechanics of VLMs and offer pathways for enhancing their reasoning capabilities. PopVQA can be found at https://huggingface.co/datasets/idoco/PopVQA.
pdf
bib
abs
SDD: Self-Degraded Defense against Malicious Fine-tuning
ZiXuan Chen
|
Weikai Lu
|
Xin Lin
|
Ziqian Zeng
Open-source Large Language Models (LLMs) often employ safety alignment methods to resist harmful instructions. However, recent research shows that maliciously fine-tuning these LLMs on harmful data can easily bypass these safeguards. To counter this, we theoretically uncover why malicious fine-tuning succeeds and identify potential defense strategies. Building on the theoretical analysis, we introduce the Self-Degraded Defense (SDD) framework. SDD encourages LLMs to produce high-quality but irrelevant responses to harmful prompts. When attackers attempt malicious fine-tuning, the general capability of the LLM aligned by SDD will significantly decrease, rendering it incapable of following harmful instructions. Our experimental results confirm SDD’s effectiveness against such attacks. Our code is available at https://github.com/ZeroNLP/SDD.
pdf
bib
abs
CoachMe: Decoding Sport Elements with a Reference-Based Coaching Instruction Generation Model
Wei-Hsin Yeh
|
Yu-An Su
|
Chih-Ning Chen
|
Yi-Hsueh Lin
|
Calvin Ku
|
Wenhsin Chiu
|
Min-Chun Hu
|
Lun-Wei Ku
Motion instruction is a crucial task that helps athletes refine their technique by analyzing movements and providing corrective guidance. Although recent advances in multimodal models have improved motion understanding, generating precise and sport-specific instruction remains challenging due to the highly domain-specific nature of sports and the need for informative guidance. We propose CoachMe, a reference-based model that analyzes the differences between a learner’s motion and a reference under temporal and physical aspects. This approach enables both domain-knowledge learning and the acquisition of a coach-like thinking process that identifies movement errors effectively and provides feedback to explain how to improve. In this paper, we illustrate how CoachMe adapts well to specific sports such as skating and boxing by learning from general movements and then leveraging limited data. Experiments show that CoachMe provides high-quality instructions instead of directions merely in the tone of a coach but without critical information. CoachMe outperforms GPT-4o by 31.6% in G-Eval on figure skating and by 58.3% on boxing. Analysis further confirms that it elaborates on errors and their corresponding improvement methods in the generated instructions. You can find CoachMe here: https://motionxperts.github.io/
pdf
bib
abs
DRPruning: Efficient Large Language Model Pruning through Distributionally Robust Optimization
Hexuan Deng
|
Wenxiang Jiao
|
Xuebo Liu
|
Jing Li
|
Min Zhang
|
Zhaopeng Tu
Large language models (LLMs) deliver impressive results but face challenges from increasing model sizes and computational costs. Structured pruning reduces model size and speeds up inference but often causes uneven degradation across domains, leading to biased performance. To address this, we propose *DRPruning*, a method that dynamically adjusts the data distribution during training to restore balanced performance across heterogeneous and multi-tasking data. Experiments in monolingual and multilingual settings show that DRPruning surpasses similarly sized models in both pruning and continued pretraining over perplexity, downstream tasks, and instruction tuning. Further analysis demonstrates the robustness of DRPruning towards various domains and distribution shifts. Furthermore, DRPruning can determine optimal reference losses and data ratios automatically, suggesting potential for broader applications. Code and scripts are available at https://github.com/hexuandeng/DRPruning.
pdf
bib
abs
How LLMs Comprehend Temporal Meaning in Narratives: A Case Study in Cognitive Evaluation of LLMs
Karin De Langis
|
Jong Inn Park
|
Andreas Schramm
|
Bin Hu
|
Khanh Chi Le
|
Dongyeop Kang
Large language models (LLMs) exhibit increasingly sophisticated linguistic capabilities, yet the extent to which these behaviors reflect human-like cognition versus advanced pattern recognition remains an open question. In this study, we investigate how LLMs process the temporal meaning of linguistic aspect in narratives that were previously used in human studies. Using an Expert-in-the-Loop probing pipeline, we conduct a series of targeted experiments to assess whether LLMs construct semantic representations and pragmatic inferences in a human-like manner. Our findings show that LLMs over-rely on prototypicality, produce inconsistent aspectual judgments, and struggle with causal reasoning derived from aspect, raising concerns about their ability to fully comprehend narratives. These results suggest that LLMs process aspect fundamentally differently from humans and lack robust narrative understanding. Beyond these empirical findings, we develop a standardized experimental framework for the reliable assessment of LLMs’ cognitive and linguistic capabilities.
pdf
bib
abs
Data Caricatures: On the Representation of African American Language in Pretraining Corpora
Nicholas Deas
|
Blake Vente
|
Amith Ananthram
|
Jessica A Grieser
|
Desmond U. Patton
|
Shana Kleiner
|
James R. Shepard III
|
Kathleen McKeown
With a combination of quantitative experiments, human judgments, and qualitative analyses, we evaluate the quantity and quality of African American Language (AAL) representation in 12 predominantly English, open-source pretraining corpora. We specifically focus on the sources, variation, and naturalness of included AAL texts representing the AAL speaking community. We find that AAL is underrepresented in all evaluated pretraining corpora compared to US demographics, constituting as few as 0.007% and at most 0.18% of documents. We also find that more than 25% of AAL texts in C4 may be perceived as inappropriate for LLMs to generate and to reinforce harmful stereotypes. Finally, we find that most automated filters are more likely to conserve White Mainstream English (WME) texts over AAL in pretraining corpora.
pdf
bib
abs
Language Model Probabilities are Not Calibrated in Numeric Contexts
Charles Lovering
|
Michael Krumdick
|
Viet Dac Lai
|
Varshini Reddy
|
Seth Ebner
|
Nilesh Kumar
|
Rik Koncel-Kedziorski
|
Chris Tanner
Some statements have one well-defined continuation (e.g., “the Eiffel Tower is in [Paris]”), whereas others have a natural distribution over multiple options (e.g., “the weighted coin flip was [Heads/Tails]”). We argue that language model (LM) outputs should capture these natural distributions. Our work specifically tests whether LM output probabilities are calibrated to numeric information within their textual contexts. For example, if the context (the prompt) concerns two equally likely options (e.g., heads or tails for a fair coin), the LM output probabilities should also be equal. Likewise, in a context with nonuniformly likely events (e.g., rolling a pair with two dice) an LM should output proportionate probabilities. However, we find that even in simple settings, the best LMs (1) are poorly calibrated and (2) have systematic biases: artifacts like word identity, word order, and word frequency all impact calibration. For example, gpt-4o-mini often picks the first of two options presented in the prompt regardless of the options’ implied likelihoods, whereas Llama-3.1-8B picks the second. Models do not allocate probability mass among valid options in a calibrated manner.
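The dice example in this abstract can be worked through numerically. The sketch below enumerates the 36 outcomes of two six-sided dice to get the reference probability of a pair, then compares it with a made-up model distribution. The model probabilities are hypothetical, and total variation distance is an assumed stand-in for a calibration measure, not necessarily the metric used in the paper.

```python
from fractions import Fraction
from itertools import product


def pair_probability(sides=6):
    """Exact probability that two fair dice show the same face."""
    outcomes = list(product(range(1, sides + 1), repeat=2))
    pairs = sum(1 for a, b in outcomes if a == b)
    return Fraction(pairs, len(outcomes))


def total_variation(p, q):
    """Total variation distance between two distributions over the same keys."""
    return sum(abs(p[k] - q[k]) for k in p) / 2


p_pair = float(pair_probability())
target = {"pair": p_pair, "no pair": 1 - p_pair}

# Hypothetical LM output probabilities for the two continuations.
model = {"pair": 0.5, "no pair": 0.5}

gap = total_variation(target, model)
```

A calibrated model would put roughly 1/6 of its probability mass on "pair"; the hypothetical model above splits the mass evenly, so the gap is 1/3.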
pdf
bib
abs
MDCure: A Scalable Pipeline for Multi-Document Instruction-Following
Gabrielle Kaili-May Liu
|
Bowen Shi
|
Avi Caciularu
|
Idan Szpektor
|
Arman Cohan
Multi-document (MD) processing is crucial for LLMs to handle real-world tasks such as summarization and question-answering across large sets of documents. While LLMs have improved at processing long inputs, MD contexts still present unique difficulties, including management of inter-document dependencies, redundancy, and incoherent structures. To address this challenge, we introduce MDCure, a scalable and effective instruction data generation framework to enhance the MD capabilities of LLMs without the computational cost of pre-training or reliance on human-annotated data. MDCure generates high-quality synthetic MD instruction data over sets of articles via targeted prompts. We also introduce MDCureRM, a cost-effective, MD-specific reward model to score and filter generated data based on their training utility for MD settings. MDCure is compatible with open- and closed-source models in addition to policy optimization methods such as PPO, enabling even small open-source models to surpass proprietary LLMs as strong generators of high-quality MD instruction data without further data filtering. With MDCure, we fine-tune a wide variety of LLMs up to 70B parameters in size from the FlanT5, Qwen2, and LLAMA3.1 model families. Extensive evaluations on a wide range of MD and long-context benchmarks spanning various tasks and domains show MDCure consistently improves performance over pre-trained baselines and base models by up to 75.1%.
pdf
bib
abs
Cross-Lingual Auto Evaluation for Assessing Multilingual LLMs
Sumanth Doddapaneni
|
Mohammed Safi Ur Rahman Khan
|
Dilip Venkatesh
|
Raj Dabre
|
Anoop Kunchukuttan
|
Mitesh M Khapra
Evaluating machine-generated text remains a significant challenge in NLP, especially for non-English languages. Current methodologies, including automated metrics, human assessments, and LLM-based evaluations, predominantly focus on English, revealing a significant gap in multilingual evaluation frameworks. We introduce the Cross Lingual Auto Evaluation (CIA) Suite, an extensible framework that includes evaluator LLMs (Hercule) and a novel test set (Recon) specifically designed for multilingual evaluation. Our test set features 500 human-annotated instructions spanning various task capabilities along with human judgment scores across six languages. This would enable benchmarking of general-purpose multilingual LLMs and facilitate meta-evaluation of Evaluator LLMs. The proposed model, Hercule, is a cross-lingual evaluation model that addresses the scarcity of reference answers in the target language by learning to assign scores to responses based on easily available reference answers in English. Our experiments demonstrate that Hercule aligns more closely with human judgments compared to proprietary models, demonstrating the effectiveness of such cross-lingual evaluation in low resource scenarios. Further, it is also effective in zero-shot evaluation on unseen languages. This study is the first comprehensive examination of cross-lingual evaluation using LLMs, presenting a scalable and effective approach for multilingual assessment. All code, datasets, and models will be publicly available to enable further research in this important area.
pdf
bib
abs
DeepReview: Improving LLM-based Paper Review with Human-like Deep Thinking Process
Minjun Zhu
|
Yixuan Weng
|
Linyi Yang
|
Yue Zhang
Large Language Models (LLMs) are increasingly utilized in scientific research assessment, particularly in automated paper review. However, existing LLM-based review systems face significant challenges, including limited domain expertise, hallucinated reasoning, and a lack of structured evaluation. To address these limitations, we introduce DeepReview, a multi-stage framework designed to emulate expert reviewers by incorporating structured analysis, literature retrieval, and evidence-based argumentation. Using DeepReview-13K, a curated dataset with structured annotations, we train DeepReviewer-14B, which outperforms CycleReviewer-70B with fewer tokens. In its best mode, DeepReviewer-14B achieves win rates of 88.21% and 80.20% against GPT-o1 and DeepSeek-R1 in evaluations. Our work sets a new benchmark for LLM-based paper review, with all resources publicly available.
pdf
bib
abs
Bypass Back-propagation: Optimization-based Structural Pruning for Large Language Models via Policy Gradient
Yuan Gao
|
Zujing Liu
|
Weizhong Zhang
|
Bo Du
|
Gui-Song Xia
Recent Large Language Model (LLM) pruning methods typically operate at the post-training phase without expensive weight fine-tuning; however, their pruning criteria often rely on **heuristically hand-crafted metrics**, potentially leading to suboptimal performance. We instead propose a novel **optimization-based structural pruning** that learns the pruning masks in a probabilistic space directly by optimizing the loss of the pruned model. To preserve efficiency, our method **eliminates the back-propagation** through the LLM *per se* during the optimization, requiring only **the forward pass of the LLM**. We achieve this by learning an underlying Bernoulli distribution to sample binary pruning masks, where we decouple the Bernoulli parameters from the LLM loss, thus facilitating an efficient optimization via a *policy gradient estimator* without back-propagation. As a result, our method is able to 1) *support global and heterogeneous pruning* (*i.e.*, our method automatically determines different redundancy for different layers), and 2) *optionally initialize with a metric-based method* (for our Bernoulli distributions). Extensive experiments conducted on LLaMA, LLaMA-2, LLaMA-3, Vicuna, and Mistral models using the C4 and WikiText2 datasets demonstrate the promising performance of our method in efficiency and effectiveness.
pdf
bib
abs
Tree-of-Debate: Multi-Persona Debate Trees Elicit Critical Thinking for Scientific Comparative Analysis
Priyanka Kargupta
|
Ishika Agarwal
|
Tal August
|
Jiawei Han
With the exponential growth of research facilitated by modern technology and improved accessibility, scientific discoveries have become increasingly fragmented within and across fields. This makes it challenging to assess the significance, novelty, incremental findings, and equivalent ideas between related works, particularly those from different research communities. Large language models (LLMs) have recently demonstrated strong quantitative and qualitative reasoning abilities, and multi-agent LLM debates have shown promise in handling complex reasoning tasks by exploring diverse perspectives and reasoning paths. Inspired by this, we introduce Tree-of-Debate (ToD), a framework which converts scientific papers into LLM personas that debate their respective novelties. To emphasize structured, critical reasoning rather than focusing solely on outcomes, ToD dynamically constructs a debate tree, enabling fine-grained analysis of independent novelty arguments within scholarly articles. Through experiments on scientific literature across various domains, evaluated by expert researchers, we demonstrate that ToD generates informative arguments, effectively contrasts papers, and supports researchers in their literature review.
pdf
bib
abs
Hierarchical Memory Organization for Wikipedia Generation
Eugene J. Yu
|
Dawei Zhu
|
Yifan Song
|
Xiangyu Wong
|
Jiebin Zhang
|
Wenxuan Shi
|
Xiaoguang Li
|
Qun Liu
|
Sujian Li
Generating Wikipedia articles autonomously is a challenging task requiring the integration of accurate, comprehensive, and well-structured information from diverse sources. This paper introduces the Memory Organization-based Generation (MOG) framework, a novel approach to address these challenges by leveraging a hierarchical memory architecture. MOG extracts fine-grained memory units from web documents, recursively organizes them into a Wikipedia-style hierarchical structure, and uses this structure to guide the generation process. This ensures alignment between memory and the article outline, improving both informativeness and verifiability while minimizing hallucinations. Additionally, a citation module is implemented to enhance traceability by linking every generated sentence to specific memory units. Evaluations on our newly created WikiStart dataset demonstrate that MOG outperforms baseline methods in producing informative and reliable articles, making it particularly robust in real-world scenarios.
pdf
bib
abs
Class Distillation with Mahalanobis Contrast: An Efficient Training Paradigm for Pragmatic Language Understanding Tasks
Chenlu Wang
|
Weimin Lyu
|
Ritwik Banerjee
Detecting deviant language such as sexism, or nuanced language such as metaphors or sarcasm, is crucial for enhancing the safety, clarity, and interpretation of social interactions. While existing classifiers deliver strong results on these tasks, they often come with significant computational cost and high data demands. In this work, we propose Class Distillation (ClaD), a novel training paradigm that targets the core challenge: distilling a small, well-defined target class from a highly diverse and heterogeneous background. ClaD integrates two key innovations: (i) a loss function informed by the structural properties of class distributions, based on Mahalanobis distance, and (ii) an interpretable decision algorithm optimized for class separation. Across three benchmark detection tasks – sexism, metaphor, and sarcasm – ClaD outperforms competitive baselines, and even with smaller language models and orders of magnitude fewer parameters, achieves performance comparable to several large language models. These results demonstrate ClaD as an efficient tool for pragmatic language understanding tasks that require gleaning a small target class from a larger heterogeneous background.
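For readers unfamiliar with the Mahalanobis distance mentioned above, here is a minimal two-dimensional sketch. It only computes the distance of a point from a set of class samples and does not reproduce the ClaD loss or decision algorithm; the function name and the use of the population covariance are assumptions made for illustration.

```python
import math


def mahalanobis_2d(point, samples):
    """Mahalanobis distance of a 2-D point from the distribution of `samples`."""
    n = len(samples)
    mean_x = sum(x for x, _ in samples) / n
    mean_y = sum(y for _, y in samples) / n
    # Population covariance entries (dividing by n, an illustrative choice).
    sxx = sum((x - mean_x) ** 2 for x, _ in samples) / n
    syy = sum((y - mean_y) ** 2 for _, y in samples) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in samples) / n
    det = sxx * syy - sxy * sxy
    if det <= 0:
        raise ValueError("covariance matrix is singular")
    dx, dy = point[0] - mean_x, point[1] - mean_y
    # d^T Sigma^{-1} d, using the closed-form inverse of a 2x2 matrix.
    squared = (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det
    return math.sqrt(squared)
```

Unlike the Euclidean distance, the same offset from the mean counts for less along a direction in which the class varies widely, which is why the measure reflects the structure of the class distribution.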
pdf
bib
abs
Structure-aware Domain Knowledge Injection for Large Language Models
Kai Liu
|
Ze Chen
|
Zhihang Fu
|
Wei Zhang
|
Rongxin Jiang
|
Fan Zhou
|
Yaowu Chen
|
Yue Wu
|
Jieping Ye
This paper introduces a pioneering methodology, termed StructTuning, to efficiently transform foundation Large Language Models (LLMs) into domain specialists. It significantly reduces the training corpus needs to a mere 5% while achieving an impressive 100% of traditional knowledge injection performance. Motivated by structured human education, we propose a novel two-stage strategy for knowledge injection and alignment: Structure-aware Continual Pre-Training (SCPT) and Structure-aware Supervised Fine-Tuning (SSFT). In the SCPT phase, we automatically extract the domain knowledge taxonomy and reorganize the training corpora, enabling LLMs to effectively link textual segments to targeted knowledge points within the taxonomy. In the SSFT phase, we explicitly prompt models to elucidate the underlying knowledge structure in their outputs, leveraging the structured domain insight to address practical problems. Our ultimate method was extensively evaluated across model architectures and scales on LongBench and MMedBench datasets, demonstrating superior performance against other knowledge injection methods. We also explored our method’s scalability across different training corpus sizes, laying the foundation to enhance domain-specific LLMs with better data utilization.
pdf
bib
abs
FinMME: Benchmark Dataset for Financial Multi-Modal Reasoning Evaluation
Junyu Luo
|
Zhizhuo Kou
|
Liming Yang
|
Xiao Luo
|
Jinsheng Huang
|
Zhiping Xiao
|
Jingshu Peng
|
Chengzhong Liu
|
Jiaming Ji
|
Xuanzhe Liu
|
Sirui Han
|
Ming Zhang
|
Yike Guo
Multimodal Large Language Models (MLLMs) have experienced rapid development in recent years. However, in the financial domain, there is a notable lack of effective and specialized multimodal evaluation datasets. To advance the development of MLLMs in the finance domain, we introduce FinMME, encompassing more than 11,000 high-quality financial research samples across 18 financial domains and 6 asset classes, featuring 10 major chart types and 21 subtypes. We ensure data quality through 20 annotators and carefully designed validation mechanisms. Additionally, we develop FinScore, an evaluation system incorporating hallucination penalties and multi-dimensional capability assessment to provide an unbiased evaluation. Extensive experimental results demonstrate that even state-of-the-art models like GPT-4o exhibit unsatisfactory performance on FinMME, highlighting its challenging nature. The benchmark exhibits high robustness with prediction variations under different prompts remaining below 1%, demonstrating superior reliability compared to existing datasets. Our dataset and evaluation protocol are available at https://huggingface.co/datasets/luojunyu/FinMME and https://github.com/luo-junyu/FinMME.
pdf
bib
abs
Dialectal Coverage And Generalization in Arabic Speech Recognition
Amirbek Djanibekov
|
Hawau Olamide Toyin
|
Raghad Alshalan
|
Abdullah Alatir
|
Hanan Aldarmaki
Developing robust automatic speech recognition (ASR) systems for Arabic requires effective strategies to manage its diversity. Existing ASR systems mainly cover the modern standard Arabic (MSA) variety and a few high-resource dialects, but fall short in coverage and generalization across the multitude of spoken variants. Code-switching with English and French is also common in different regions of the Arab world, which challenges the performance of monolingual Arabic models. In this work, we introduce a suite of ASR models optimized to effectively recognize multiple variants of spoken Arabic, including MSA, various dialects, and code-switching. We provide open-source pre-trained models that cover data from 17 Arabic-speaking countries, and fine-tuned MSA and dialectal ASR models that include at least 11 variants, as well as multi-lingual ASR models covering embedded languages in code-switched utterances. We evaluate ASR performance across these spoken varieties and demonstrate both coverage and performance gains compared to prior models.
pdf
bib
abs
EditInspector: A Benchmark for Evaluation of Text-Guided Image Edits
Ron Yosef
|
Yonatan Bitton
|
Dani Lischinski
|
Moran Yanuka
Text-guided image editing, fueled by recent advancements in generative AI, is becoming increasingly widespread. This trend highlights the need for a comprehensive framework to verify text-guided edits and assess their quality. To address this need, we introduce EditInspector, a novel benchmark for evaluation of text-guided image edits, based on human annotations collected using an extensive template for edit verification. We leverage EditInspector to evaluate the performance of state-of-the-art (SoTA) vision and language models in assessing edits across various dimensions, including accuracy, artifact detection, visual quality, seamless integration with the image scene, adherence to common sense, and the ability to describe edit-induced changes. Our findings indicate that current models struggle to evaluate edits comprehensively and frequently hallucinate when describing the changes. To address these challenges, we propose two novel methods that outperform SoTA models in both artifact detection and difference caption generation.
pdf
bib
abs
Reconsidering LLM Uncertainty Estimation Methods in the Wild
Yavuz Faruk Bakman
|
Duygu Nur Yaldiz
|
Sungmin Kang
|
Tuo Zhang
|
Baturalp Buyukates
|
Salman Avestimehr
|
Sai Praneeth Karimireddy
Large Language Model (LLM) Uncertainty Estimation (UE) methods have become a crucial tool for detecting hallucinations in recent years. While numerous UE methods have been proposed, most existing studies evaluate them in isolated short-form QA settings using threshold-independent metrics such as AUROC or PRR. However, real-world deployment of UE methods introduces several challenges. In this work, we systematically examine four key aspects of deploying UE methods in practical settings. Specifically, we assess (1) the sensitivity of UE methods to decision threshold selection, (2) their robustness to query transformations such as typos, adversarial prompts, and prior chat history, (3) their applicability to long-form generation, and (4) strategies for handling multiple UE scores for a single query. Our evaluations on 19 UE methods reveal that most of them are highly sensitive to threshold selection when there is a distribution shift in the calibration dataset. While these methods generally exhibit robustness against previous chat history and typos, they are significantly vulnerable to adversarial prompts. Additionally, while existing UE methods can be adapted for long-form generation through various strategies, there remains considerable room for improvement. Lastly, ensembling multiple UE scores at test time provides a notable performance boost, which highlights its potential as a practical improvement strategy. Code is available at: https://github.com/duygunuryldz/uncertainty_in_the_wild.
pdf
bib
abs
Bregman Conditional Random Fields: Sequence Labeling with Parallelizable Inference Algorithms
Caio Corro
|
Mathieu Lacroix
|
Joseph Le Roux
We propose a novel discriminative model for sequence labeling called Bregman conditional random fields (BCRF). Contrary to standard linear-chain conditional random fields, BCRF allows fast parallelizable inference algorithms based on iterative Bregman projections. We show how such models can be learned using Fenchel-Young losses, including extension for learning from partial labels. Experimentally, our approach delivers comparable results to CRF while being faster, and achieves better results in highly constrained settings compared to mean field, another parallelizable alternative.
pdf
bib
abs
SEE: Strategic Exploration and Exploitation for Cohesive In-Context Prompt Optimization
Wendi Cui
|
Jiaxin Zhang
|
Zhuohang Li
|
Hao Sun
|
Damien Lopez
|
Kamalika Das
|
Bradley A. Malin
|
Sricharan Kumar
Designing optimal prompts for Large Language Models (LLMs) is a complex and resource-intensive task, often requiring substantial human expertise. Existing approaches typically separate the optimization of prompt instructions and in-context learning examples, leading to incohesive, suboptimal results. To overcome this limitation, we propose a novel Cohesive In-Context Prompt Optimization framework that refines both prompt instructions and examples. In our formulation, coherence refers to the degree to which instructions and examples work synergistically to improve task performance—emerging as a byproduct of performance-driven optimization. However, formulating such an optimization in the discrete and high-dimensional space of natural language poses significant challenges in both convergence and computational efficiency. To address these issues, we introduce SEE, a scalable and efficient prompt optimization framework that adopts metaheuristic optimization principles and strategically balances exploration and exploitation to enhance optimization performance and achieve efficient convergence. SEE features a quad-phased design that alternates between global traversal (exploration) and local optimization (exploitation) and adaptively chooses LLM operators during the optimization process. We have conducted a comprehensive evaluation across 35 benchmark tasks, and SEE significantly outperforms state-of-the-art baseline methods by a large margin, achieving an average performance gain of **13.94** while reducing computational costs by **58.67%**.
pdf
bib
abs
Programming by Example meets Historical Linguistics: A Large Language Model Based Approach to Sound Law Induction
Atharva Naik
|
Darsh Agrawal
|
Hong Sng
|
Clayton Marr
|
Kexun Zhang
|
Nathaniel Romney Robinson
|
Kalvin Chang
|
Rebecca Byrnes
|
Aravind Mysore
|
Carolyn Rose
|
David R. Mortensen
Historical linguists have long written “programs” that convert reconstructed words in an ancestor language into their attested descendants via ordered string rewrite functions (called sound laws). However, writing these programs is time-consuming, motivating the development of automated Sound Law Induction (SLI), which we formulate as Programming by Examples (PBE) with Large Language Models (LLMs) in this paper. While LLMs have been effective for code generation, recent work has shown that PBE is challenging but improvable by fine-tuning, especially with training data drawn from the same distribution as evaluation data. In this paper, we create a conceptual framework of what constitutes a “similar distribution” for SLI and propose four kinds of synthetic data generation methods with varying amounts of inductive bias to investigate what leads to the best performance. Based on the results, we create a SOTA open-source model for SLI as PBE (+6% pass rate with a third of the parameters of the second-best LLM) and also highlight exciting future directions for PBE research.
pdf
bib
abs
Synergizing Unsupervised Episode Detection with LLMs for Large-Scale News Events
Priyanka Kargupta
|
Yunyi Zhang
|
Yizhu Jiao
|
Siru Ouyang
|
Jiawei Han
State-of-the-art automatic event detection struggles with interpretability and adaptability to evolving large-scale key events—unlike episodic structures, which excel in these areas. Often overlooked, episodes represent cohesive clusters of core entities performing actions at a specific time and location; a partially ordered sequence of episodes can represent a key event. This paper introduces a novel task, **episode detection**, which identifies episodes within a news corpus of key event articles. Detecting episodes poses unique challenges, as they lack explicit temporal or locational markers and cannot be merged using semantic similarity alone. While large language models (LLMs) can aid with these reasoning difficulties, they suffer with long contexts typical of news corpora. To address these challenges, we introduce **EpiMine**, an unsupervised framework that identifies a key event’s candidate episodes by leveraging natural episodic partitions in articles, estimated through shifts in discriminative term combinations. These candidate episodes are more cohesive and representative of true episodes, synergizing with LLMs to better interpret and refine them into final episodes. We apply EpiMine to our three diverse, real-world event datasets annotated at the episode level, where it achieves a 59.2% average gain across all metrics compared to baselines.
pdf
bib
abs
Beyond True or False: Retrieval-Augmented Hierarchical Analysis of Nuanced Claims
Priyanka Kargupta
|
Runchu Tian
|
Jiawei Han
Claims made by individuals or entities are oftentimes nuanced and cannot be clearly labeled as entirely “true” or “false”—as is frequently the case with scientific and political claims. However, a claim (e.g., “vaccine A is better than vaccine B”) can be dissected into its integral aspects and sub-aspects (e.g., efficacy, safety, distribution), which are individually easier to validate. This enables a more comprehensive, structured response that provides a well-rounded perspective on a given problem while also allowing the reader to prioritize specific angles of interest within the claim (e.g., safety towards children). Thus, we propose ClaimSpect, a retrieval-augmented generation-based framework for automatically constructing a hierarchy of aspects typically considered when addressing a claim and enriching them with corpus-specific perspectives. This structure hierarchically partitions an input corpus to retrieve relevant segments, which assist in discovering new sub-aspects. Moreover, these segments enable the discovery of varying perspectives towards an aspect of the claim (e.g., support, neutral, or oppose) and their respective prevalence (e.g., “how many biomedical papers believe vaccine A is more transportable than B?”). We apply ClaimSpect to a wide variety of real-world scientific and political claims featured in our constructed dataset, showcasing its robustness and accuracy in deconstructing a nuanced claim and representing perspectives within a corpus. Through real-world case studies and human evaluation, we validate its effectiveness over multiple baselines.
pdf
bib
abs
The Task Shield: Enforcing Task Alignment to Defend Against Indirect Prompt Injection in LLM Agents
Feiran Jia
|
Tong Wu
|
Xin Qin
|
Anna Squicciarini
Large Language Model (LLM) agents are increasingly being deployed as conversational assistants capable of performing complex real-world tasks through tool integration. This enhanced ability to interact with external systems and process various data sources, while powerful, introduces significant security vulnerabilities. In particular, indirect prompt injection attacks pose a critical threat, where malicious instructions embedded within external data sources can manipulate agents to deviate from user intentions. While existing defenses show promise, they struggle to maintain robust security while preserving task functionality. We propose a novel and orthogonal perspective that reframes agent security from preventing harmful actions to ensuring task alignment, requiring every agent action to serve user objectives. Based on this insight, we develop Task Shield, a test-time defense mechanism that systematically verifies whether each instruction and tool call contributes to user-specified goals. Through experiments on the AgentDojo benchmark, we demonstrate that Task Shield reduces attack success rates (2.07%) while maintaining high task utility (69.79%) on GPT-4o, significantly outperforming existing defenses in various real-world scenarios.
pdf
bib
abs
Sandcastles in the Storm: Revisiting the (Im)possibility of Strong Watermarking
Fabrice Y Harel-Canada
|
Boran Erol
|
Connor Choi
|
Jason Liu
|
Gary Jiarui Song
|
Nanyun Peng
|
Amit Sahai
Watermarking AI-generated text is critical for combating misuse. Yet recent theoretical work argues that any watermark can be erased via random walk attacks that perturb text while preserving quality. However, such attacks rely on two key assumptions: (1) rapid mixing (watermarks dissolve quickly under perturbations) and (2) reliable quality preservation (automated quality oracles perfectly guide edits). Through large-scale experiments and human-validated assessments, we find mixing is slow: 100% of perturbed texts retain traces of their origin after hundreds of edits, defying rapid mixing. Oracles falter, as state-of-the-art quality detectors misjudge edits (77% accuracy), compounding errors during attacks. Ultimately, attacks underperform: automated walks remove watermarks just 26% of the time – dropping to 10% under human quality review. These findings challenge the inevitability of watermark removal. Instead, practical barriers – slow mixing and imperfect quality control – reveal watermarking to be far more robust than theoretical models suggest. The gap between idealized attacks and real-world feasibility underscores the need for stronger watermarking methods and more realistic attack models.
pdf
bib
abs
Time-MQA: Time Series Multi-Task Question Answering with Context Enhancement
Yaxuan Kong
|
Yiyuan Yang
|
Yoontae Hwang
|
Wenjie Du
|
Stefan Zohren
|
Zhangyang Wang
|
Ming Jin
|
Qingsong Wen
Time series data are foundational in finance, healthcare, and energy domains. However, most existing methods and datasets remain focused on a narrow spectrum of tasks, such as forecasting or anomaly detection. To bridge this gap, we introduce Time Series Multi-Task Question Answering (Time-MQA), a unified framework that enables natural language queries across multiple time series tasks - numerical analytical tasks and open-ended question answering with reasoning. Central to Time-MQA is the TSQA dataset, a large-scale dataset containing ~200k question-answer pairs derived from diverse time series spanning environment, traffic, etc. This comprehensive resource covers various time series lengths and promotes robust model development. We further demonstrate how continually pre-training large language models (Mistral 7B, Llama-3 8B, and Qwen-2.5 7B) on the TSQA dataset enhances time series reasoning capabilities, moving beyond mere numeric tasks and enabling more advanced and intuitive interactions with temporal data. The complete TSQA dataset, models, user study questionnaires for evaluation, and other related materials have been open-sourced.
pdf
bib
abs
From Perceptions to Decisions: Wildfire Evacuation Decision Prediction with Behavioral Theory-informed LLMs
Ruxiao Chen
|
Chenguang Wang
|
Yuran Sun
|
Xilei Zhao
|
Susu Xu
Evacuation decision prediction is critical for efficient and effective wildfire response by helping emergency management anticipate traffic congestion and bottlenecks, allocate resources, and minimize negative impacts. Traditional statistical methods for evacuation decision prediction fail to capture the complex and diverse behavioral logic of different individuals. In this work, for the first time, we introduce *FLARE*, short for facilitating LLM for advanced reasoning on wildfire evacuation decision prediction, a Large Language Model (LLM)-based framework that integrates behavioral theories and models to streamline the Chain-of-Thought (CoT) reasoning and subsequently integrates with a memory-based Reinforcement Learning (RL) module to provide accurate evacuation decision prediction and understanding. Our proposed method addresses the limitations of using existing LLMs for evacuation behavioral predictions, such as limited survey data, mismatching with behavioral theory, conflicting individual preferences, implicit and complex mental states, and intractable mental state-behavior mapping. Experiments on three post-wildfire survey datasets show an average of 20.47% performance improvement over traditional theory-informed behavioral models, with strong cross-event generalizability. Our complete code is publicly available at https://github.com/SusuXu-s-Lab/FLARE
pdf
bib
abs
GETReason: Enhancing Image Context Extraction through Hierarchical Multi-Agent Reasoning
Shikhhar Siingh
|
Abhinav Rawat
|
Chitta Baral
|
Vivek Gupta
Publicly significant images from events carry valuable contextual information with applications in domains such as journalism and education. However, existing methodologies often struggle to accurately extract this contextual relevance from images. To address this challenge, we introduce GETREASON (Geospatial Event Temporal Reasoning), a framework designed to go beyond surface-level image descriptions and infer deeper contextual meaning. We hypothesize that extracting global event, temporal, and geospatial information from an image enables a more accurate understanding of its contextual significance. We also introduce a new metric, GREAT (Geospatial, Reasoning and Event Accuracy with Temporal alignment), for reasoning-capturing evaluation. Our layered multi-agentic approach, evaluated using a reasoning-weighted metric, demonstrates that meaningful information can be inferred from images, allowing them to be effectively linked to their corresponding events and broader contextual background.
pdf
bib
abs
Hanging in the Balance: Pivotal Moments in Crisis Counseling Conversations
Vivian Nguyen
|
Lillian Lee
|
Cristian Danescu-Niculescu-Mizil
During a conversation, there can come certain moments where its outcome hangs in the balance. In these pivotal moments, how one responds can put the conversation on substantially different trajectories leading to significantly different outcomes. Systems that can detect when such moments arise could assist conversationalists in domains with highly consequential outcomes, such as mental health crisis counseling. In this work, we introduce an unsupervised computational method for detecting such pivotal moments as they happen. The intuition is that a moment is pivotal if our expectation of the conversation’s outcome varies widely depending on what might be said next. By applying our method to crisis counseling conversations, we first validate it by showing that it aligns with human perception (counselors take significantly longer to respond during moments detected by our method) and with the eventual conversational trajectory, which is more likely to change course at these times. We then use our framework to explore the relation between the counselor’s response during pivotal moments and the eventual outcome of the session.
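The stated intuition, that a moment is pivotal when the expected outcome varies widely across possible next utterances, can be sketched as a variance computation. This is an illustrative reading only: the abstract does not say how candidate responses are obtained or how outcomes are forecast, so the list of per-candidate expected outcomes is a hypothetical input.

```python
from statistics import pvariance


def pivotal_score(expected_outcomes):
    """Spread of forecast outcomes across candidate next responses.

    `expected_outcomes` holds one hypothetical outcome forecast (e.g. a
    probability of a good session outcome) per candidate next utterance.
    """
    return pvariance(expected_outcomes)
```

A moment where every candidate reply leads to roughly the same forecast scores near zero, while a moment where forecasts diverge scores high.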
pdf
bib
abs
Unveiling the Potential of BERT-family: A New Recipe for Building Scalable, General and Competitive Large Language Models
Yisheng Xiao
|
Juntao Li
|
Wenpeng Hu
|
Zhunchen Luo
|
Min Zhang
BERT-family models have been increasingly explored for adaptation to scenarios beyond language understanding tasks, with more recent efforts focused on enabling them to become good instruction followers. These explorations have endowed BERT-family with new roles and human expectations, showcasing their potential on par with current state-of-the-art (SOTA) large language models (LLMs). However, several shortcomings in previous BERT-family models, such as the relatively sub-optimal training corpora, learning procedure, and model architecture, all impede the further advancement of these models for serving as general and competitive LLMs. Therefore, we aim to address these deficiencies in this paper. Our study not only introduces a more suitable pre-training task that helps BERT-family excel in wider applications to realize generality but also explores the integration of cutting-edge technologies into our model to further enhance their capabilities. Our final models, termed **Bi**directional **G**eneral **L**anguage **M**odels (**BiGLM**), exhibit performance levels comparable to current SOTA LLMs across a spectrum of tasks. Moreover, we conduct detailed analyses to study the effects of scaling and training corpora for BiGLM. To the best of our knowledge, our work represents an early attempt to offer a recipe for building novel types of scalable, general, and competitive LLMs that diverge from current autoregressive modeling methodology. Our codes and models are available on GitHub.
pdf
bib
abs
TaxoAdapt: Aligning LLM-Based Multidimensional Taxonomy Construction to Evolving Research Corpora
Priyanka Kargupta
|
Nan Zhang
|
Yunyi Zhang
|
Rui Zhang
|
Prasenjit Mitra
|
Jiawei Han
The rapid evolution of scientific fields introduces challenges in organizing and retrieving scientific literature. While expert-curated taxonomies have traditionally addressed this need, the process is time-consuming and expensive. Furthermore, recent automatic taxonomy construction methods either (1) over-rely on a specific corpus, sacrificing generalizability, or (2) depend heavily on the general knowledge of large language models (LLMs) contained within their pre-training datasets, often overlooking the dynamic nature of evolving scientific domains. Additionally, these approaches fail to account for the multi-faceted nature of scientific literature, where a single research paper may contribute to multiple dimensions (e.g., methodology, new tasks, evaluation metrics, benchmarks). To address these gaps, we propose TaxoAdapt, a framework that dynamically adapts an LLM-generated taxonomy to a given corpus across multiple dimensions. TaxoAdapt performs iterative hierarchical classification, expanding both the taxonomy width and depth based on the corpus’s topical distribution. We demonstrate its state-of-the-art performance across a diverse set of computer science conferences over the years to showcase its ability to structure and capture the evolution of scientific fields. As a multidimensional method, TaxoAdapt generates taxonomies that are 26.51% more granularity-preserving and 50.41% more coherent than the most competitive baselines judged by LLMs.
pdf
bib
abs
An Empirical Study of Iterative Refinements for Non-autoregressive Translation
Yisheng Xiao
|
Pei Guo
|
Zechen Sun
|
Juntao Li
|
Kai Song
|
Min Zhang
Iterative non-autoregressive (NAR) models share a spirit of mixed autoregressive (AR) and fully NAR models, seeking a balance between generation quality and inference efficiency. These models have recently demonstrated impressive performance in varied generation tasks, surpassing the autoregressive Transformer. However, they also face several challenges that impede further development. In this work, we target building more efficient and competitive iterative NAR models. Firstly, we produce two simple metrics to identify the potential problems existing in current refinement processes, and look back on the various iterative NAR models to find the key factors for realizing our purpose. Subsequently, based on the analyses of the limitations of previous inference algorithms, we propose a simple yet effective strategy to conduct efficient refinements without performance declines. Experiments on five widely used datasets show that our final models set the new state-of-the-art performance compared to all previous NAR models, even with fewer decoding steps, and outperform AR Transformer by around one BLEU on average. Our codes and models are available on Github.
pdf
bib
abs
Retrofitting Large Language Models with Dynamic Tokenization
Darius Feher
|
Ivan Vulić
|
Benjamin Minixhofer
Current language models (LMs) use a fixed, static subword tokenizer. This default choice typically results in degraded efficiency and language capabilities, especially in languages other than English. To address this issue, we challenge the static design and propose retrofitting LMs with dynamic tokenization: a way to dynamically decide on token boundaries based on the input text via a subword-merging algorithm inspired by byte-pair encoding. We merge frequent subword sequences in a batch, then apply a pre-trained embedding-prediction hypernetwork to compute the token embeddings on-the-fly. For encoder-style models (e.g., XLM-R), this on average reduces token sequence lengths by >20% across 14 languages while degrading performance by less than 2%. The same method applied to pre-filling and scoring in decoder-style models (e.g., Mistral-7B) results in minimal performance degradation at up to 17% reduction in sequence length. Overall, we find that dynamic tokenization can mitigate the limitations of static tokenization by substantially improving inference speed and promoting fairness across languages, enabling more equitable and adaptable LMs.
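The batch-level merging step can be pictured with a small byte-pair-encoding-style sketch. It is a simplified illustration under assumptions, not the paper's algorithm: the function names, the stopping rule (stop when no adjacent pair occurs at least twice), and the toy subwords are all hypothetical, and the hypernetwork that predicts embeddings for the merged tokens is not modeled.

```python
from collections import Counter


def merge_pair(tokens, pair):
    """Replace every non-overlapping occurrence of `pair` with one merged token."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


def batch_merge(batch, num_merges):
    """Repeatedly merge the most frequent adjacent subword pair in the batch."""
    batch = [list(seq) for seq in batch]
    for _ in range(num_merges):
        counts = Counter()
        for seq in batch:
            counts.update(zip(seq, seq[1:]))  # adjacent pairs within a sequence
        if not counts:
            break
        pair, freq = counts.most_common(1)[0]
        if freq < 2:  # assumed stopping rule: nothing repeats, nothing to gain
            break
        batch = [merge_pair(seq, pair) for seq in batch]
    return batch
```

For example, `batch_merge([["low", "er"], ["low", "er"], ["low", "est"], ["new", "er"]], 1)` merges the pair `("low", "er")`, shortening the batch from eight tokens to six; the token boundaries now depend on the input batch rather than on a fixed vocabulary.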
pdf
bib
abs
Principled Content Selection to Generate Diverse and Personalized Multi-Document Summaries
Vishakh Padmakumar
|
Zichao Wang
|
David Arbour
|
Jennifer Healey
While large language models (LLMs) are increasingly capable of handling longer contexts, recent work has demonstrated that they exhibit the _“lost in the middle”_ phenomenon (Liu et al., 2024) of unevenly attending to different parts of the provided context. This hinders their ability to cover diverse source material in multi-document summarization, as noted in the DiverseSumm benchmark (Huang et al., 2024). In this work, we contend that principled content selection is a simple way to increase source coverage on this task. As opposed to prompting an LLM to perform the summarization in a single step, we explicitly divide the task into three steps: (1) reducing document collections to atomic key points, (2) using determinantal point processes (DPP) to select key points that prioritize diverse content, and (3) rewriting to the final summary. By combining prompting steps, for extraction and rewriting, with principled techniques, for content selection, we consistently improve source coverage on the DiverseSumm benchmark across various LLMs. Finally, we also show that by incorporating relevance to a provided user intent into the DPP kernel, we can generate _personalized_ summaries that cover _relevant_ source information while retaining coverage.
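To make the DPP-based selection step more concrete, here is a minimal sketch of greedy selection under a quality-weighted similarity kernel, where each step adds the key point that most increases the determinant of the selected submatrix. Greedy determinant maximization is a common approximation for DPP inference and is used here only as an assumption; the abstract does not state which inference procedure the authors use. The helper names (`build_kernel`, `greedy_dpp`) and the toy quality and similarity values are hypothetical.

```python
def det(matrix):
    """Determinant via Gaussian elimination with partial pivoting."""
    a = [row[:] for row in matrix]
    n = len(a)
    result = 1.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        if abs(a[pivot][col]) < 1e-12:
            return 0.0
        if pivot != col:
            a[col], a[pivot] = a[pivot], a[col]
            result = -result
        result *= a[col][col]
        for r in range(col + 1, n):
            factor = a[r][col] / a[col][col]
            for c in range(col, n):
                a[r][c] -= factor * a[col][c]
    return result


def build_kernel(quality, similarity):
    """L[i][j] = quality[i] * similarity[i][j] * quality[j]."""
    n = len(quality)
    return [[quality[i] * similarity[i][j] * quality[j] for j in range(n)]
            for i in range(n)]


def greedy_dpp(kernel, k):
    """Greedily pick up to k items, maximizing the selected submatrix determinant."""
    selected = []
    for _ in range(k):
        best, best_det = None, 0.0
        for i in range(len(kernel)):
            if i in selected:
                continue
            idx = selected + [i]
            sub = [[kernel[r][c] for c in idx] for r in idx]
            d = det(sub)
            if d > best_det:
                best, best_det = i, d
        if best is None:  # no remaining item adds positive volume
            break
        selected.append(best)
    return selected


if __name__ == "__main__":
    # Toy example: key points 0 and 1 are near-duplicates, key point 2 is distinct.
    quality = [1.0, 0.9, 0.8]
    similarity = [[1.0, 0.99, 0.0],
                  [0.99, 1.0, 0.0],
                  [0.0, 0.0, 1.0]]
    print(greedy_dpp(build_kernel(quality, similarity), k=2))
```

In the toy example the selection skips the near-duplicate key point even though its quality score is higher than that of the distinct one, which is the diversity behaviour the kernel is meant to encourage. Folding a relevance score for a user intent into `quality` would be one way to mirror the personalization idea mentioned in the abstract.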
pdf
bib
abs
Bilingual Zero-Shot Stance Detection
Chenye Zhao
|
Cornelia Caragea
Zero-shot stance detection (ZSSD) aims to determine whether the author of a text is in support, against, or neutral toward a target that is unseen during training. In this paper, we investigate ZSSD within a bilingual framework and compare it with cross-lingual and monolingual scenarios, in settings that have not previously been explored. Our study focuses on both noun-phrase and claim targets within in-domain and out-of-domain bilingual ZSSD scenarios. To support this research, we assemble Bi-STANCE, a comprehensive bilingual ZSSD dataset consisting of over 100,000 annotated text-target pairs in both Chinese and English, sourced from existing datasets. Additionally, we examine a more challenging aspect of bilingual ZSSD by focusing on claim targets with a low occurrence of shared words with their corresponding texts. As part of Bi-STANCE, we created an extended dataset that emphasizes this challenging scenario. To the best of our knowledge, we are the first to explore this difficult ZSSD setting. We investigate these tasks using state-of-the-art pre-trained language models (PLMs) and large language models (LLMs). We release our dataset and code at https://github.com/chenyez/BiSTANCE.
pdf
bib
abs
GrammaMT: Improving Machine Translation with Grammar-Informed In-Context Learning
Rita Ramos
|
Everlyn Asiko Chimoto
|
Maartje Ter Hoeve
|
Natalie Schluter
We introduce GrammaMT, a grammatically-aware prompting approach for machine translation that uses Interlinear Glossed Text (IGT), a common form of linguistic description providing morphological and lexical annotations for source sentences. GrammaMT proposes three prompting strategies: gloss-shot, chain-gloss and model-gloss. All are training-free, requiring only a few examples that involve minimal effort to collect, and making them well-suited for low-resource setups. Experiments show that GrammaMT enhances translation performance on open-source instruction-tuned LLMs for various low- to high-resource languages across three benchmarks: (1) the largest IGT corpus, (2) the challenging 2023 SIGMORPHON Shared Task data over endangered languages, and (3) even in an out-of-domain setting with FLORES. Moreover, ablation studies reveal that leveraging gloss resources could substantially boost MT performance (by over 17 BLEU points) if LLMs accurately generate or access input sentence glosses.
pdf
bib
abs
Theorem Prover as a Judge for Synthetic Data Generation
Joshua Ong Jun Leang
|
Giwon Hong
|
Wenda Li
|
Shay B Cohen
The demand for synthetic data in mathematical reasoning has increased due to its potential to enhance the mathematical capabilities of large language models (LLMs). However, ensuring the validity of intermediate reasoning steps remains a significant challenge, affecting data quality. While formal verification via theorem provers effectively validates LLM reasoning, the autoformalisation of mathematical proofs remains error-prone. In response, we introduce *iterative autoformalisation*, an approach that iteratively refines theorem prover formalisation to mitigate errors, thereby increasing the execution rate on the Lean prover from 60% to 87%. Building upon that, we introduce *Theorem Prover as a Judge (TP-as-a-Judge)*, a method that employs theorem prover formalisation to rigorously assess LLM intermediate reasoning, effectively integrating autoformalisation with synthetic data generation. Finally, we present *Reinforcement Learning from Theorem Prover Feedback (RLTPF)*, a framework that replaces human annotation with theorem prover feedback in Reinforcement Learning from Human Feedback (RLHF). Across multiple LLMs, applying *TP-as-a-Judge* and *RLTPF* improves benchmarks with only 3,508 samples, achieving a 5.56% accuracy gain on Mistral-7B for MultiArith, 6.00% on Llama-2-7B for SVAMP, and 3.55% on Llama-3.1-8B for AQUA.
pdf
bib
abs
Measuring the Effect of Transcription Noise on Downstream Language Understanding Tasks
Ori Shapira
|
Shlomo Chazan
|
Amir David Nissan Cohen
With the increasing prevalence of recorded human speech, spoken language understanding (SLU) is essential for its efficient processing. In order to process the speech, it is commonly transcribed using automatic speech recognition technology. This speech-to-text transition introduces errors into the transcripts, which subsequently propagate to downstream NLP tasks, such as dialogue summarization. While it is known that transcript noise affects downstream tasks, a general-purpose and systematic approach to analyzing its effects across different noise severities and types has not been addressed. We propose a configurable framework for assessing task models in diverse noisy settings, and for examining the impact of transcript-cleaning techniques. The framework facilitates the investigation of task model behavior, which can in turn support the development of effective SLU solutions. We exemplify the utility of our framework on three SLU tasks and four task models, offering insights regarding the effect of transcript noise on tasks in general and models in particular. For instance, we find that task models can tolerate a certain level of noise, and are affected differently by the types of errors in the transcript.
pdf
bib
abs
Assessing Reliability and Political Bias In LLMs’ Judgements of Formal and Material Inferences With Partisan Conclusions
Reto Gubelmann
|
Ghassen Karray
This article examines LLMs’ ability to correctly label simple inferences with partisan conclusions. For this, we develop a dataset with both formal and material inferences, containing logically equivalent pairs of inferences with conclusions that favor either the political left or the political right. This allows us to focus on political bias as a source of decrease in performance. Our samples are synthetically generated and thus highly controlled, covering both English and German. We assess the performance of 16 configurations of both open and proprietary state-of-the-art LLMs on that dataset, finding generally unreliable performance as well as widespread political bias which, in the case of the English samples, persists throughout our experimental settings.
pdf
bib
abs
PARME: Parallel Corpora for Low-Resourced Middle Eastern Languages
Sina Ahmadi
|
Rico Sennrich
|
Erfan Karami
|
Ako Marani
|
Parviz Fekrazad
|
Gholamreza Akbarzadeh Baghban
|
Hanah Hadi
|
Semko Heidari
|
Mahîr Dogan
|
Pedram Asadi
|
Dashne Bashir
|
Mohammad Amin Ghodrati
|
Kourosh Amini
|
Zeynab Ashourinezhad
|
Mana Baladi
|
Farshid Ezzati
|
Alireza Ghasemifar
|
Daryoush Hosseinpour
|
Behrooz Abbaszadeh
|
Amin Hassanpour
|
Bahaddin Jalal Hamaamin
|
Saya Kamal Hama
|
Ardeshir Mousavi
|
Sarko Nazir Hussein
|
Isar Nejadgholi
|
Mehmet Ölmez
|
Horam Osmanpour
|
Rashid Roshan Ramezani
|
Aryan Sediq Aziz
|
Ali Salehi
|
Mohammadreza Yadegari
|
Kewyar Yadegari
|
Sedighe Zamani Roodsari
The Middle East is characterized by remarkable linguistic diversity, with over 400 million inhabitants speaking more than 60 languages across multiple language families. This study presents a pioneering work in developing the first parallel corpora for eight severely under-resourced varieties in the region–PARME, addressing fundamental challenges in low-resource scenarios including non-standardized writing and dialectal complexity. Through an extensive community-driven initiative, volunteers contributed to the creation of over 36,000 translated sentences, marking a significant milestone in resource development. We evaluate machine translation capabilities through zero-shot approaches and fine-tuning experiments with pretrained machine translation models and provide a comprehensive analysis of limitations. Our findings reveal significant gaps in existing technologies for processing the selected languages, highlighting critical areas for improvement in language technology for Middle Eastern languages.
pdf
bib
abs
METAL: A Multi-Agent Framework for Chart Generation with Test-Time Scaling
Bingxuan Li
|
Yiwei Wang
|
Jiuxiang Gu
|
Kai-Wei Chang
|
Nanyun Peng
Chart generation aims to generate code to produce charts satisfying the desired visual properties, e.g., texts, layout, color, and type. It has great potential to empower the automatic professional report generation in financial analysis, research presentation, education, and healthcare. In this work, we build a vision-language model (VLM) based multi-agent framework for effective automatic chart generation. Generating high-quality charts requires both strong visual design skills and precise coding capabilities that embed the desired visual properties into code. Such a complex multi-modal reasoning process is difficult for direct prompting of VLMs. To resolve these challenges, we propose METAL, a multi-agent framework that decomposes the task of chart generation into the iterative collaboration among specialized agents. METAL achieves a 5.2% improvement in the F1 score over the current best result in the chart generation task. Additionally, METAL improves chart generation performance by 11.33% over Direct Prompting with LLaMA-3.2-11B. Furthermore, the METAL framework exhibits the phenomenon of test-time scaling: its performance increases monotonically as the logarithm of computational budget grows from 512 to 8192 tokens.
pdf
bib
abs
ConLoan: A Contrastive Multilingual Dataset for Evaluating Loanwords
Sina Ahmadi
|
Micha David Hess
|
Elena Álvarez-Mellado
|
Alessia Battisti
|
Cui Ding
|
Anne Göhring
|
Yingqiang Gao
|
Zifan Jiang
|
Andrianos Michail
|
Peshmerge Morad
|
Joel Niklaus
|
Maria Christina Panagiotopoulou
|
Stefano Perrella
|
Juri Opitz
|
Anastassia Shaitarova
|
Rico Sennrich
Lexical borrowing, the adoption of words from one language into another, is a ubiquitous linguistic phenomenon influenced by geopolitical, societal, and technological factors. This paper introduces ConLoan–a novel contrastive dataset comprising sentences with and without loanwords across 10 languages. Through systematic evaluation using this dataset, we investigate how state-of-the-art machine translation and language models process loanwords compared to their native alternatives. Our experiments reveal that these systems show systematic preferences for loanwords over native terms and exhibit varying performance across languages. These findings provide valuable insights for developing more linguistically robust NLP systems.
pdf
bib
abs
A Theory of Response Sampling in LLMs: Part Descriptive and Part Prescriptive
Sarath Sivaprasad
|
Pramod Kaushik
|
Sahar Abdelnabi
|
Mario Fritz
Large Language Models (LLMs) are increasingly utilized in autonomous decision-making, where they sample options from vast action spaces. However, the heuristics that guide this sampling process remain under-explored. We study this sampling behavior and show that these underlying heuristics resemble those of human decision-making: comprising a descriptive component (reflecting statistical norm) and a prescriptive component (implicit ideal encoded in the LLM) of a concept. We show that this deviation of a sample from the statistical norm towards a prescriptive component consistently appears in concepts across diverse real-world domains like public health and economic trends. To further illustrate the theory, we demonstrate that concept prototypes in LLMs are affected by prescriptive norms, similar to the concept of normality in humans. Through case studies and comparison with human studies, we illustrate that in real-world applications, the shift of samples toward an ideal value in LLMs’ outputs can result in significantly biased decision-making, raising ethical concerns.
pdf
bib
abs
MEraser: An Effective Fingerprint Erasure Approach for Large Language Models
Jingxuan Zhang
|
Zhenhua Xu
|
Rui Hu
|
Wenpeng Xing
|
Xuhong Zhang
|
Meng Han
Large Language Models (LLMs) have become increasingly prevalent across various sectors, raising critical concerns about model ownership and intellectual property protection. Although backdoor-based fingerprinting has emerged as a promising solution for model authentication, effective attacks for removing these fingerprints remain largely unexplored. Therefore, we present Mismatched Eraser (MEraser), a novel method for effectively removing backdoor-based fingerprints from LLMs while maintaining model performance. Our approach leverages a two-phase fine-tuning strategy utilizing carefully constructed mismatched and clean datasets. Through extensive evaluation across multiple LLM architectures and fingerprinting methods, we demonstrate that MEraser achieves complete fingerprinting removal while maintaining model performance with minimal training data of fewer than 1,000 samples. Furthermore, we introduce a transferable erasure mechanism that enables effective fingerprinting removal across different models without repeated training. In conclusion, our approach provides a practical solution for fingerprinting removal in LLMs, reveals critical vulnerabilities in current fingerprinting techniques, and establishes comprehensive evaluation benchmarks for developing more resilient model protection methods in the future.
pdf
bib
abs
VISA: Retrieval Augmented Generation with Visual Source Attribution
Xueguang Ma
|
Shengyao Zhuang
|
Bevan Koopman
|
Guido Zuccon
|
Wenhu Chen
|
Jimmy Lin
Generation with source attribution is important for enhancing the verifiability of retrieval-augmented generation (RAG) systems. However, existing approaches in RAG primarily link generated content to document-level references, making it challenging for users to locate evidence among multiple content-rich retrieved documents. To address this challenge, we propose Retrieval-Augmented Generation with Visual Source Attribution (VISA), a novel approach that combines answer generation with visual source attribution. Leveraging large vision-language models (VLMs), VISA identifies the evidence and highlights the exact regions that support the generated answers with bounding boxes in the retrieved document screenshots. To evaluate its effectiveness, we curated two datasets: Wiki-VISA, based on crawled Wikipedia webpage screenshots, and Paper-VISA, derived from PubLayNet and tailored to the medical domain. Experimental results demonstrate the effectiveness of VISA for visual source attribution on documents’ original look, as well as highlighting the challenges for improvement.
pdf
bib
abs
DRAMA: Diverse Augmentation from Large Language Models to Smaller Dense Retrievers
Xueguang Ma
|
Xi Victoria Lin
|
Barlas Oguz
|
Jimmy Lin
|
Wen-tau Yih
|
Xilun Chen
Large language models (LLMs) have demonstrated strong effectiveness and robustness when fine-tuned as dense retrievers. However, their large parameter size presents significant computational challenges at inference time. While smaller retrievers offer better efficiency, they often fail to generalize effectively with limited supervised fine-tuning data. In this work, we introduce DRAMA, a training framework that leverages LLMs to train smaller generalizable dense retrievers. In particular, we adopt pruned LLMs as the backbone and train on diverse LLM-augmented data in a single-stage contrastive learning setup. Experiments show that DRAMA offers better multilingual and long-context capabilities than traditional encoder-based retrievers, and achieves strong performance across multiple tasks and languages.
pdf
bib
abs
Stochastic Chameleons: Irrelevant Context Hallucinations Reveal Class-Based (Mis)Generalization in LLMs
Ziling Cheng
|
Meng Cao
|
Marc-Antoine Rondeau
|
Jackie CK Cheung
The widespread success of LLMs on NLP benchmarks has been accompanied by concerns that LLMs function primarily as stochastic parrots that reproduce texts similar to what they saw during pre-training, often erroneously. But what is the nature of their errors, and do these errors exhibit any regularities? In this work, we examine irrelevant context hallucinations, in which models integrate misleading contextual cues into their predictions. Through behavioral analysis, we show that these errors result from a structured yet flawed mechanism that we term _class-based (mis)generalization_, in which models combine abstract class cues with features extracted from the query or context to derive answers. Furthermore, mechanistic interpretability experiments on Llama-3, Mistral, and Pythia across 39 factual recall relation types reveal that this behavior is reflected in the model’s internal computations: (i) abstract class representations are constructed in lower layers before being refined into specific answers in higher layers, (ii) feature selection is governed by two competing circuits — one prioritizing direct query-based reasoning, the other incorporating contextual cues — whose relative influences determine the final output. Our findings provide a more nuanced perspective on the stochastic parrot argument: through form-based training, LLMs can exhibit generalization leveraging abstractions, albeit in unreliable ways based on contextual cues — what we term _stochastic chameleons_.
pdf
bib
abs
MAPoRL: Multi-Agent Post-Co-Training for Collaborative Large Language Models with Reinforcement Learning
Chanwoo Park
|
Seungju Han
|
Xingzhi Guo
|
Asuman E. Ozdaglar
|
Kaiqing Zhang
|
Joo-Kyung Kim
Leveraging multi-agentic frameworks to enhance large language models (LLMs) has demonstrated significant potential recently, with most existing studies focusing on prompting and developing workflows with frozen LLMs. In this paper, we aim to further unleash the power of such multi-agentic frameworks for post-training LLMs for better collaboration. Specifically, we develop a new paradigm of Multi-Agent Post-co-training for collaborative LLMs with Reinforcement Learning (MAPoRL). In MAPoRL, multiple LLMs first generate their own responses and engage in discussions to collaboratively enhance the final response output; the final output is then scored by a verifier, and the scores serve as the reward, which is maximized through multi-agent RL. Additionally, MAPoRL reshapes this reward with additional incentives to encourage corrective and persuasive outputs in the discussions. A key novelty relative to most existing LLM post-training paradigms is the advocacy of co-training multiple LLMs together, and the use of RL for better generalization. Accompanied by a few analytical insights, our experiments show that training single LLMs solely is insufficient for encouraging collaboration, while multi-agent co-training can significantly enhance the collaboration performance across multiple datasets, with generalization to unseen domains, compared to that of multiple LLMs before post-training.
pdf
bib
abs
Map&Make: Schema Guided Text to Table Generation
Naman Ahuja
|
Fenil Bardoliya
|
Chitta Baral
|
Vivek Gupta
Transforming dense, unstructured text into interpretable tables, commonly referred to as Text-to-Table generation, is a key task in information extraction. Existing methods often overlook what complex information to extract and how to infer it from text. We present Map&Make, a versatile approach that decomposes text into atomic propositions to infer latent schemas, which are then used to generate tables capturing both qualitative nuances and quantitative facts. We evaluate our method on three challenging datasets: Rotowire, known for its complex, multi-table schema; Livesum, which requires numerical aggregation; and Wiki40, which requires open text extraction from multiple domains. By correcting hallucination errors in Rotowire, we also provide a cleaner benchmark. Our method shows significant gains in both accuracy and interpretability across comprehensive comparative and referenceless metrics. Finally, ablation studies highlight the key factors driving performance and validate the utility of our approach in structured summarization. Code and data are available at: https://coral-lab-asu.github.io/map-make.
pdf
bib
abs
IRIS: Interpretable Retrieval-Augmented Classification for Long Interspersed Document Sequences
Fengnan Li
|
Elliot D. Hill
|
Jiang Shu
|
Jiaxin Gao
|
Matthew M. Engelhard
Transformer-based models have achieved state-of-the-art performance in document classification but struggle with long-text processing due to the quadratic computational complexity in the self-attention module. Existing solutions, such as sparse attention, hierarchical models, and key sentence extraction, partially address the issue but still fall short when the input sequence is exceptionally lengthy. To address this challenge, we propose **IRIS** (**I**nterpretable **R**etrieval-Augmented Classification for long **I**nterspersed Document **S**equences), a novel, lightweight framework that utilizes retrieval to efficiently classify long documents while enhancing interpretability. IRIS segments documents into chunks, stores their embeddings in a vector database, and retrieves those most relevant to a given task using learnable query vectors. A linear attention mechanism then aggregates the retrieved embeddings for classification, allowing the model to process arbitrarily long documents without increasing computational cost and remaining trainable on a single GPU. Our experiments across six datasets show that IRIS achieves comparable performance to baseline models on standard benchmarks, and excels in three clinical note disease risk prediction tasks where documents are extremely long and key information is sparse. Furthermore, IRIS provides global interpretability by revealing a clear summary of key risk factors identified by the model. These findings highlight the potential of IRIS as an efficient and interpretable solution for long-document classification, particularly in healthcare applications where both performance and explainability are crucial.
pdf
bib
abs
Symmetrical Visual Contrastive Optimization: Aligning Vision-Language Models with Minimal Contrastive Images
Shengguang Wu
|
Fan-Yun Sun
|
Kaiyue Wen
|
Nick Haber
Recent studies have shown that Large Vision-Language Models (VLMs) tend to neglect image content and over-rely on language-model priors, resulting in errors in visually grounded tasks and hallucinations. We hypothesize that this issue arises because existing VLMs are not explicitly trained to generate texts that are accurately grounded in fine-grained image details. To enhance visual feedback during VLM training, we propose S-VCO (Symmetrical Visual Contrastive Optimization), a novel finetuning objective that steers the model toward capturing important visual details and aligning them with corresponding text tokens. To further facilitate this detailed alignment, we introduce MVC, a paired image-text dataset built by automatically filtering and augmenting visual counterfactual data to challenge the model with hard contrastive cases involving Minimal Visual Contrasts. Experiments show that our method consistently improves VLM performance across diverse benchmarks covering various abilities and domains, achieving up to a 22% reduction in hallucinations, and significant gains in vision-centric and general tasks. Notably, these improvements become increasingly pronounced in benchmarks with higher visual dependency. In short, S-VCO offers a significant enhancement of VLM’s visually-dependent task performance while retaining or even improving the model’s general abilities.
pdf
bib
abs
Can we Retrieve Everything All at Once? ARM: An Alignment-Oriented LLM-based Retrieval Method
Peter Baile Chen
|
Yi Zhang
|
Mike Cafarella
|
Dan Roth
Real-world open-domain questions can be complex, especially when answering them requires integrating information from multiple sources. Effectively identifying the necessary information involves *aligning* it with the available data and its organization. However, existing RAG solutions address the alignment problem in a limited manner. Using off-the-shelf LLMs for question decomposition lacks awareness of the available data and its structure, often resulting in suboptimal retrieval performance. Alternatively, iteratively generating follow-up queries and interacting with the data collection, as explored in agentic RAG approaches, shows potential but is often *inefficient* since each successive query depends on previous results rather than being guided by the overall organization of the available data. To address the *alignment* problem, we introduce an LLM-based retrieval method — ARM, designed to better align questions with the organization of the data collection. Instead of solely matching query utterance, ARM explores *relationships among data objects*, enabling a retrieve-all-at-once solution for complex queries. Experimental results demonstrate that ARM significantly outperforms existing RAG methods on various complex open-domain QA tasks across multiple modalities, achieving superior retrieval performance and downstream accuracy while significantly lowering monetary costs.
pdf
bib
abs
R2D2: Remembering, Replaying and Dynamic Decision Making with a Reflective Agentic Memory
Tenghao Huang
|
Kinjal Basu
|
Ibrahim Abdelaziz
|
Pavan Kapanipathi
|
Jonathan May
|
Muhao Chen
The proliferation of web agents necessitates advanced navigation and interaction strategies within complex web environments. Current models often struggle with efficient navigation and action execution due to limited visibility and understanding of web structures. Our proposed R2D2 framework addresses these challenges by integrating two paradigms: Remember and Reflect. The Remember paradigm utilizes a replay buffer that aids agents in reconstructing the web environment dynamically, thus enabling the formulation of a detailed “map” of previously visited pages. This helps in reducing navigational errors and optimizing the decision-making process during web interactions. Conversely, the Reflect paradigm allows agents to learn from past mistakes by providing a mechanism for error analysis and strategy refinement, enhancing overall task performance. We evaluate R2D2 using the WEBARENA benchmark, demonstrating significant improvements over existing methods, including a 50% reduction in navigation errors and a threefold increase in task completion rates. Our findings suggest that a combination of memory-enhanced navigation and reflective learning promisingly advances the capabilities of web agents, potentially benefiting various applications such as automated customer service and personal digital assistants.
pdf
bib
abs
FairI Tales: Evaluation of Fairness in Indian Contexts with a Focus on Bias and Stereotypes
Janki Atul Nawale
|
Mohammed Safi Ur Rahman Khan
|
Janani D
|
Mansi Gupta
|
Danish Pruthi
|
Mitesh M Khapra
Existing studies on fairness are largely Western-focused, making them inadequate for culturally diverse countries such as India. To address this gap, we introduce INDIC-BIAS, a comprehensive India-centric benchmark designed to evaluate fairness of LLMs across 85 identity groups encompassing diverse castes, religions, regions, and tribes. We first consult domain experts to curate over 1,800 socio-cultural topics spanning behaviors and situations, where biases and stereotypes are likely to emerge. Grounded in these topics, we generate and manually validate 20,000 real-world scenario templates to probe LLMs for fairness. We structure these templates into three evaluation tasks: plausibility, judgment, and generation. Our evaluation of 14 popular LLMs on these tasks reveals strong negative biases against marginalized identities, with models frequently reinforcing common stereotypes. Additionally, we find that models struggle to mitigate bias even when explicitly asked to rationalize their decision. Our evaluation provides evidence of both allocative and representational harms that current LLMs could cause towards Indian identities, calling for a more cautious usage in practical applications. We release INDIC-BIAS as an open-source benchmark to advance research on benchmarking and mitigating biases and stereotypes in the Indian context.
pdf
bib
abs
SpeechIQ: Speech-Agentic Intelligence Quotient Across Cognitive Levels in Voice Understanding by Large Language Models
Zhen Wan
|
Chao-Han Huck Yang
|
Yahan Yu
|
Jinchuan Tian
|
Sheng Li
|
Ke Hu
|
Zhehuai Chen
|
Shinji Watanabe
|
Fei Cheng
|
Chenhui Chu
|
Sadao Kurohashi
We introduce Speech-based Intelligence Quotient (SIQ) as a new form of human cognition-inspired evaluation pipeline for voice understanding large language models (LLM_Voice), designed to assess their voice understanding ability. Moving beyond popular voice understanding metrics such as word error rate (WER), SIQ examines LLM_Voice across three cognitive levels motivated by Bloom’s Taxonomy: (1) Remembering (i.e., WER for verbatim accuracy); (2) Understanding (i.e., similarity of LLM’s interpretations); and (3) Application (i.e., QA accuracy for simulating downstream tasks). We demonstrate that SIQ not only quantifies voice understanding abilities but also provides unified comparisons between cascaded methods (e.g., ASR-LLM) and end-to-end models, identifies annotation errors in existing benchmarks, and detects hallucinations in LLM_Voice. Our framework represents a first-of-its-kind intelligence examination that bridges cognitive principles with voice-oriented benchmarks, while exposing overlooked challenges in multi-modal training. Our code and data will be open source to encourage future studies.
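The first SIQ level relies on word error rate (WER), which can be illustrated with the standard definition: the word-level edit distance between reference and hypothesis, divided by the number of reference words. The sketch below assumes whitespace tokenization and no text normalization; these simplifications and the function names are illustrative assumptions, not a description of the SIQ pipeline itself.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token lists (substitute/insert/delete = 1)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion of a reference word
                            curr[j - 1] + 1,      # insertion of a hypothesis word
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def word_error_rate(reference, hypothesis):
    """WER = word-level edit distance / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref, hyp) / len(ref)


if __name__ == "__main__":
    print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

Because WER only scores verbatim overlap, two transcripts with the same WER can differ widely in how much meaning they preserve, which is the gap the Understanding and Application levels in the abstract are meant to address.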
pdf
bib
abs
Predicting Implicit Arguments in Procedural Video Instructions
Anil Batra
|
Laura Sevilla-Lara
|
Marcus Rohrbach
|
Frank Keller
Procedural texts help AI enhance reasoning about context and action sequences. Transforming these into Semantic Role Labeling (SRL) improves understanding of individual steps by identifying predicate-argument structure such as verb, what, where/with. Procedural instructions are highly elliptic: for instance, given (i) add cucumber to the bowl and (ii) add sliced tomatoes, the second step’s where argument is inferred from the context, referring to where the cucumber was placed. Prior SRL benchmarks often miss implicit arguments, leading to incomplete understanding. To address this, we introduce Implicit-VidSRL, a dataset that necessitates inferring implicit and explicit arguments from contextual information in multimodal cooking procedures. Our proposed dataset benchmarks multimodal models’ contextual reasoning, requiring entity tracking through visual changes in recipes. We study recent multimodal LLMs and reveal that they struggle to predict implicit arguments of what and where/with from multi-modal procedural data given the verb. Lastly, we propose iSRL-Qwen2-VL, which achieves a 17% relative improvement in F1-score for what-implicit and 14.7% for where/with-implicit semantic roles over GPT-4o.
pdf
bib
abs
PIGuard: Prompt Injection Guardrail via Mitigating Overdefense for Free
Hao Li
|
Xiaogeng Liu
|
Ning Zhang
|
Chaowei Xiao
Prompt injection attacks pose a critical threat to large language models (LLMs), enabling goal hijacking and data leakage. Prompt guard models, though effective in defense, suffer from over-defense—falsely flagging benign inputs as malicious due to trigger word bias. To address this issue, we introduce NotInject, an evaluation dataset that systematically measures over-defense across various prompt guard models. NotInject contains 339 benign samples enriched with trigger words common in prompt injection attacks, enabling fine-grained evaluation. Our results show that state-of-the-art models suffer from over-defense issues, with accuracy dropping close to random guessing levels (60%). To mitigate this, we propose PIGuard, a novel prompt guard model that incorporates a new training strategy, Mitigating Over-defense for Free (MOF), which significantly reduces the bias on trigger words. PIGuard demonstrates state-of-the-art performance on diverse benchmarks including NotInject, surpassing the existing best model by 30.4%, offering a robust and open-source solution for detecting prompt injection attacks. The code and datasets are released at https://github.com/leolee99/PIGuard.
pdf
bib
abs
CLIPErase: Efficient Unlearning of Visual-Textual Associations in CLIP
Tianyu Yang
|
Lisen Dai
|
Xiangqi Wang
|
Minhao Cheng
|
Yapeng Tian
|
Xiangliang Zhang
Machine unlearning (MU) has gained significant attention as a means to remove the influence of specific data from a trained model without requiring full retraining. While progress has been made in unimodal domains like text and image classification, unlearning in multimodal models remains relatively under-explored. In this work, we address the unique challenges of unlearning in CLIP, a prominent multimodal model that aligns visual and textual representations. We introduce CLIPErase, a novel approach that disentangles and selectively forgets both visual and textual associations, ensuring that unlearning does not compromise model performance. CLIPErase consists of three key modules: a Forgetting Module that disrupts the associations in the forget set, a Retention Module that preserves performance on the retain set, and a Consistency Module that maintains consistency with the original model. Extensive experiments on CIFAR-100, Flickr30K, and Conceptual 12M across five CLIP downstream tasks, as well as an evaluation on diffusion models, demonstrate that CLIPErase effectively removes designated associations from multimodal samples in downstream tasks, while preserving the model’s performance on the retain set after unlearning.
pdf
bib
abs
ViGiL3D: A Linguistically Diverse Dataset for 3D Visual Grounding
Austin Wang
|
ZeMing Gong
|
Angel X Chang
3D visual grounding (3DVG) involves localizing entities in a 3D scene referred to by natural language text. Such models are useful for embodied AI and scene retrieval applications, which involve searching for objects or patterns using natural language descriptions. While recent works have focused on LLM-based scaling of 3DVG datasets, these datasets do not capture the full range of potential prompts which could be specified in the English language. To ensure that we are scaling up and testing against a useful and representative set of prompts, we propose a framework for linguistically analyzing 3DVG prompts and introduce Visual Grounding with Diverse Language in 3D (ViGiL3D), a diagnostic dataset for evaluating visual grounding methods against a diverse set of language patterns. We evaluate existing open-vocabulary 3DVG methods to demonstrate that these methods are not yet proficient in understanding and identifying the targets of more challenging, out-of-distribution prompts, toward real-world applications.
pdf
bib
abs
The time scale of redundancy between prosody and linguistic context
Tamar I Regev
|
Chiebuka Ohams
|
Shaylee Xie
|
Lukas Wolf
|
Evelina Fedorenko
|
Alex Warstadt
|
Ethan Wilcox
|
Tiago Pimentel
In spoken communication, information is transmitted not only via words, but also through a rich array of non-verbal signals, including prosody—the non-segmental auditory features of speech. Do these different communication channels carry distinct information? Prior work has shown that the information carried by prosodic features is substantially redundant with that carried by the surrounding words. Here, we systematically examine the time scale of this relationship, studying how it varies with the length of past and future contexts. We find that a word’s prosodic features require an extended past context (3-8 words across different features) to be reliably predicted. Given that long-scale contextual information decays in memory, prosody may facilitate communication by adding information that is locally unique. We also find that a word’s prosodic features show some redundancy with future words, but only with a short scale of 1-2 words, consistent with reports of incremental short-term planning in language production. Thus, prosody may facilitate communication by helping listeners predict upcoming material. In tandem, our results highlight potentially distinct roles that prosody plays in facilitating integration of words into past contexts and in helping predict upcoming words.
pdf
bib
abs
Basic Reading Distillation
Zhi Zhou
|
Sirui Miao
|
Xiangyu Duan
|
Hao Yang
|
Min Zhang
Large language models (LLMs) have demonstrated remarkable abilities in various natural language processing areas, but they demand high computation resources, which limits their deployment in the real world. Distillation is one technique to solve this problem through either knowledge distillation or task distillation. Both distillation approaches train small models to imitate specific features of LLMs, but they all neglect basic reading education for small models on generic texts that are unrelated to downstream tasks. In this paper, we propose basic reading distillation (BRD), which educates a small model to imitate LLMs’ basic reading behaviors, such as named entity recognition, question raising and answering, on each sentence. After such basic education, we apply the small model to various tasks including language inference benchmarks and BIG-bench tasks. The results show that the small model can outperform or perform comparably to LLMs that are over 20x larger. Analysis reveals that BRD effectively influences the probability distribution of the small model, and is orthogonal to both knowledge distillation and task distillation.
pdf
bib
abs
Quantized Can Still Be Calibrated: A Unified Framework to Calibration in Quantized Large Language Models
Mingyu Zhong
|
Guanchu Wang
|
Yu-Neng Chuang
|
Na Zou
Although weight quantization helps large language models (LLMs) in resource-constrained environments, its influence on the uncertainty calibration remains unexplored. To bridge this gap, we present a comprehensive investigation of uncertainty calibration for quantized LLMs in this work. Specifically, we propose an analytic method to estimate the upper bound of calibration error (UBCE) for LLMs. Our method separately discusses the calibration error of the model’s correct and incorrect predictions, indicating a theoretical improvement of calibration error caused by the weight quantization. Our study demonstrates that quantized models consistently exhibit worse calibration performance than full-precision models, supported by consistent analysis across multiple LLMs and datasets. To address the calibration issues of quantized models, we propose a novel method of post calibration for recovering the calibration performance of quantized models through soft-prompt tuning. Specifically, we inject soft tokens to quantized models after the embedding layers, and optimize these tokens to recover the calibration error caused by the weight quantization. Experimental results on multiple datasets demonstrate the effectiveness of our method in improving the uncertainty calibration of quantized LLMs, facilitating more reliable weight quantization in resource-constrained environments.
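For readers unfamiliar with calibration error, the sketch below computes the widely used expected calibration error (ECE) with equal-width confidence bins. It is an illustration of the general notion only and is not the UBCE estimator proposed in the abstract above; the function name, the binning scheme, and the default of 10 bins are assumptions.

```python
def expected_calibration_error(confidences, correct, num_bins=10):
    """Weighted average gap between mean confidence and accuracy per bin.

    confidences: predicted probabilities in [0, 1] for the chosen answers.
    correct: booleans, True where the corresponding prediction was right.
    Assumption: equal-width bins; a confidence of exactly 1.0 goes to the last bin.
    """
    if len(confidences) != len(correct) or not confidences:
        raise ValueError("inputs must be non-empty and of equal length")
    bins = [[] for _ in range(num_bins)]
    for conf, ok in zip(confidences, correct):
        index = min(int(conf * num_bins), num_bins - 1)
        bins[index].append((conf, ok))
    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue  # empty bins contribute nothing
        mean_conf = sum(conf for conf, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / total) * abs(mean_conf - accuracy)
    return ece
```

Under this definition, comparing a quantized model with its full-precision counterpart amounts to computing the value once per model on the same evaluation set.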
pdf
bib
abs
A Spatio-Temporal Point Process for Fine-Grained Modeling of Reading Behavior
Francesco Ignazio Re
|
Andreas Opedal
|
Glib Manaiev
|
Mario Giulianelli
|
Ryan Cotterell
Reading is a process that unfolds across space and time. Standard modeling approaches, however, overlook much of the spatio-temporal dynamics involved in reading by relying on aggregated reading measurements (typically only focusing on fixation durations) and employing models with strong simplifying assumptions. In this paper, we propose a generative model that captures not only how long fixations last, but also where they land and when they occur. To this end, we model reading scanpaths via two conditionally independent distributions: one for fixation location and timing, and another for fixation duration. The location (and timing) of fixation shifts, so-called saccades, are modeled using a spatio-temporal Hawkes process, which captures how each fixation excites the probability of a new fixation occurring near it in time and space. Empirically, our Hawkes process model exhibits higher likelihood on held-out reading data than baselines. The duration time of fixation events is modeled as a function of fixation-specific features convolved across time, thus capturing non-stationary delayed effects. We find that convolution-based approaches demonstrate weak predictive power when modeling disaggregated fixation durations. Similarly, our analysis of surprisal theory on disaggregated data reveals limited effectiveness in predicting both where fixations occur and how long they last.
pdf
bib
abs
More is not always better? Enhancing Many-Shot In-Context Learning with Differentiated and Reweighting Objectives
Xiaoqing Zhang
|
Ang Lv
|
Yuhan Liu
|
Flood Sung
|
Wei Liu
|
Jian Luan
|
Shuo Shang
|
Xiuying Chen
|
Rui Yan
Large language models (LLMs) excel at few-shot in-context learning (ICL) without requiring parameter updates. However, as ICL demonstrations increase from a few to many, performance tends to plateau and eventually decline. We identify two primary causes for this trend: the suboptimal negative log-likelihood (NLL) optimization objective and the incremental data noise. To address these issues, we introduce DrICL, a novel optimization method that enhances model performance through Differentiated and Reweighting objectives. Globally, DrICL utilizes differentiated learning to optimize the NLL objective, ensuring that many-shot performance surpasses zero-shot levels. Locally, it dynamically adjusts the weighting of many-shot demonstrations by leveraging cumulative advantages inspired by reinforcement learning, thereby mitigating the impact of noisy data. Recognizing the lack of multi-task datasets with diverse many-shot distributions, we develop the Many-Shot ICL Benchmark (ICL-50), a large-scale benchmark of 50 tasks that cover shot numbers from 1 to 350 within sequences of up to 8,000 tokens, for both fine-tuning and evaluation purposes. Experimental results demonstrate that LLMs enhanced with DrICL achieve significant improvements in many-shot setups across various tasks, including both in-domain and out-of-domain scenarios. We release the code and dataset hoping to facilitate further research in many-shot ICL.
pdf
bib
abs
Astute RAG: Overcoming Imperfect Retrieval Augmentation and Knowledge Conflicts for Large Language Models
Fei Wang
|
Xingchen Wan
|
Ruoxi Sun
|
Jiefeng Chen
|
Sercan O Arik
Retrieval augmented generation (RAG), while effectively integrating external knowledge to address the inherent limitations of large language models (LLMs), can be hindered by imperfect retrieval that contains irrelevant, misleading, or even malicious information. Previous studies have rarely connected the behavior of RAG through joint analysis, particularly regarding error propagation coming from imperfect retrieval and potential conflicts between LLMs’ internal knowledge and external sources. Through comprehensive and controlled analyses under realistic conditions, we find that imperfect retrieval augmentation is inevitable, common, and harmful. We identify the knowledge conflicts between LLM-internal and external knowledge from retrieval as a bottleneck to overcome imperfect retrieval in the post-retrieval stage of RAG. To address this, we propose Astute RAG, a novel RAG approach designed to be resilient to imperfect retrieval augmentation. It adaptively elicits essential information from LLMs’ internal knowledge, iteratively consolidates internal and external knowledge with source-awareness, and finalizes the answer according to information reliability. Our experiments with Gemini and Claude demonstrate the superior performance of Astute RAG compared to previous robustness-enhanced RAG approaches. Specifically, Astute RAG is the only RAG method that achieves performance comparable to or even surpassing conventional use of LLMs under the worst-case scenario. Further analysis reveals the effectiveness of Astute RAG in resolving knowledge conflicts, thereby improving the trustworthiness of RAG.
pdf
bib
abs
SubLIME: Subset Selection via Rank Correlation Prediction for Data-Efficient LLM Evaluation
Gayathri Saranathan
|
Cong Xu
|
Mahammad Parwez Alam
|
Tarun Kumar
|
Martin Foltin
|
Soon Yee Wong
|
Suparna Bhattacharya
The rapid expansion of Large Language Models (LLMs) and natural language processing datasets has made exhaustive benchmark evaluations computationally prohibitive. Inspired by high-stakes competitions like the International Mathematical Olympiad, where a few well-chosen problems suffice to differentiate top performers, we present SubLIME, which reduces evaluation costs by 80% to 99% while preserving ranking fidelity. It trains a Rank Correlation Prediction (RCP) model that combines limited performance data from only 5-20 anchor LLMs with dataset intrinsic metrics (Difficulty, Quality, and Distributional Dispersion) to predict how closely a candidate subset reflects full-benchmark rankings. Guided by these predictions, SubLIME selects a “winning” subset (1-20% of full set data) for evaluating new LLMs, preserving global rankings significantly better than other data-efficient methods across ten diverse benchmarks.
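Ranking fidelity of the kind described above is commonly quantified with a rank correlation between model scores on the subset and on the full benchmark. The sketch below computes Spearman's rank correlation in plain Python. It only illustrates the quantity being preserved and is not the RCP model itself; the choice of Spearman's coefficient, the function names, and the scores in the usage example are assumptions.

```python
def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for position in range(start, end + 1):
            ranks[order[position]] = mean_rank
        start = end + 1
    return ranks


def spearman(full_scores, subset_scores):
    """Pearson correlation of the two rank vectors (Spearman's rho)."""
    if len(full_scores) != len(subset_scores) or len(full_scores) < 2:
        raise ValueError("need two equally long score lists with at least 2 models")
    rx, ry = average_ranks(full_scores), average_ranks(subset_scores)
    mean_x, mean_y = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("rank correlation is undefined for constant scores")
    return cov / (var_x * var_y) ** 0.5


# Hypothetical accuracies of four models on the full benchmark and on a subset.
full = [0.71, 0.64, 0.58, 0.80]
subset = [0.70, 0.60, 0.62, 0.85]
rho = spearman(full, subset)  # two adjacent models swap places, so rho is below 1
```

A subset that reproduces the full-benchmark ordering exactly yields a coefficient of 1.0.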
pdf
bib
abs
M³GQA: A Multi-Entity Multi-Hop Multi-Setting Graph Question Answering Benchmark
Boci Peng
|
Yongchao Liu
|
Xiaohe Bo
|
Jiaxin Guo
|
Yun Zhu
|
Xuanbo Fan
|
Chuntao Hong
|
Yan Zhang
Recently, GraphRAG systems have achieved remarkable progress in enhancing the performance and reliability of large language models (LLMs). However, most previous benchmarks are template-based and primarily focus on few-entity queries, which are monotypic and simplistic, failing to offer comprehensive and robust assessments. Besides, the lack of ground-truth reasoning paths also hinders the assessments of different components in GraphRAG systems. To address these limitations, we propose M³GQA, a complex, diverse, and high-quality GraphRAG benchmark focusing on multi-entity queries, with six distinct settings for comprehensive evaluation. In order to construct diverse data with semantically correct ground-truth reasoning paths, we introduce a novel reasoning-driven four-step data construction method, including tree sampling, reasoning path backtracking, query creation, and multi-stage refinement and filtering. Extensive experiments demonstrate that M³GQA effectively reflects the capabilities of GraphRAG methods, offering valuable insights into the model performance and reliability. By pushing the boundaries of current methods, M³GQA establishes a comprehensive, robust, and reliable benchmark for advancing GraphRAG research.
pdf
bib
abs
LSSF: Safety Alignment for Large Language Models through Low-Rank Safety Subspace Fusion
Guanghao Zhou
|
Panjia Qiu
|
Cen Chen
|
Hongyu Li
|
Jason Chu
|
Xin Zhang
|
Jun Zhou
The safety mechanisms of large language models (LLMs) exhibit notable fragility, as even fine-tuning on datasets without harmful content may still undermine their safety capabilities. Meanwhile, existing safety alignment methods predominantly rely on the fine-tuning process, which inadvertently increases the complexity and computational resources required. To address these issues, we introduce LSSF, a novel safety re-alignment framework with Low-Rank Safety Subspace Fusion. Our proposed method exploits the low-rank characteristics of safety information in LLMs by constructing a low-rank projection matrix to extract the principal components of safety vectors. Notably, this projection matrix represents the low-rank safety subspace of the LLMs, which we have observed to remain stable during the fine-tuning process and to be isolated from the model’s general capabilities. These principal components are used to effectively restore safety alignment when combined with fine-tuned LLMs through linear arithmetic. Additionally, to account for the varying encoding densities of safety information across different layers of LLMs, we propose a novel metric called safety singular value entropy. This metric quantifies the encoding density and allows for the dynamic computation of the safety-critical rank for each safety vector. Extensive experiments demonstrate that our proposed post-hoc alignment method can effectively restore the safety alignment of fine-tuned models with minimal impact on their performance on downstream tasks.
pdf
bib
abs
ETF: An Entity Tracing Framework for Hallucination Detection in Code Summaries
Kishan Maharaj
|
Vitobha Munigala
|
Srikanth G. Tamilselvam
|
Prince Kumar
|
Sayandeep Sen
|
Palani Kodeswaran
|
Abhijit Mishra
|
Pushpak Bhattacharyya
Recent advancements in large language models (LLMs) have significantly enhanced their ability to understand both natural language and code, driving their use in tasks like natural language-to-code (NL2Code) and code summarisation. However, LLMs are prone to hallucination—outputs that stray from intended meanings. Detecting hallucinations in code summarisation is especially difficult due to the complex interplay between programming and natural languages. We introduce a first-of-its-kind dataset, CodeSumEval, with ~10K samples, curated specifically for hallucination detection in code summarisation. We further propose a novel Entity Tracing Framework (ETF) that a) utilises static program analysis to identify code entities from the program and b) uses LLMs to map and verify these entities and their intents within generated code summaries. Our experimental analysis demonstrates the framework’s effectiveness, leading to a 73% F1 score. The proposed approach provides a method for detecting hallucinations by tracing entities from the summary to the code, allowing us to evaluate summary accuracy and localise the error within the summary.
pdf
bib
abs
Meta-Tool: Unleash Open-World Function Calling Capabilities of General-Purpose Large Language Models
Shengqian Qin
|
Yakun Zhu
|
Linjie Mu
|
Shaoting Zhang
|
Xiaofan Zhang
Large language models (LLMs) have showcased remarkable capabilities as autonomous agents when augmented with external tools. Equipped with fixed tool sets, LLMs struggle with addressing diverse user inquiries in open-world tasks. To evaluate and boost the performance of LLMs in dealing with complex demands in the real world, we propose open-world function calling, where LLMs need to retrieve suitable tools from a pre-defined external tool library and use retrieved tools to resolve the user’s problem. We introduce Meta-Tool, a versatile and plug-and-play tool retrieval system that gives LLMs access to an external tool library. Drawing inspiration from the myriad of enhanced approaches associated with Retrieval-Augmented Generation (RAG), Meta-Tool employs a hypothesize-retrieve-invoke framework. We further propose Meta-Bench, a comprehensive benchmark for evaluating LLMs in open-world function calling and associated tasks. Meta-Bench encompasses 2,800 dialogues and 7,361 tools, spanning ten distinct scenarios to provide robust and diverse test categories. In conjunction, we present MT-LLaMA, a finetuned version of LLaMA-3.1, which exhibits remarkable performance improvements. Our empirical experiments reveal that Meta-Tool significantly enhances the ability of advanced LLMs to retrieve and leverage the most suitable tools compared to previous tool retrieval methods. Moreover, our fine-tuning enables even smaller-sized LLMs to achieve results comparable to or even exceeding those of GPT-4o. Both the benchmark and the model are made publicly available at https://github.com/qinshengqian/Meta-Tool to foster further research and development in the field.
pdf
bib
abs
Benchmarking and Improving Large Vision-Language Models for Fundamental Visual Graph Understanding and Reasoning
Yingjie Zhu
|
Xuefeng Bai
|
Kehai Chen
|
Yang Xiang
|
Jun Yu
|
Min Zhang
Large Vision-Language Models (LVLMs) have demonstrated remarkable performance across diverse tasks. Despite great success, recent studies show that LVLMs encounter substantial limitations when engaging with visual graphs. To study the reason behind these limitations, we propose VGCure, a comprehensive benchmark covering 22 tasks for examining the fundamental graph understanding and reasoning capacities of LVLMs. Extensive evaluations conducted on 14 LVLMs reveal that LVLMs are weak in basic graph understanding and reasoning tasks, particularly those concerning relational or structurally complex information. Based on this observation, we propose a structure-aware fine-tuning framework to enhance LVLMs with structure learning abilities through three self-supervised learning tasks. Experiments validate the effectiveness of our method in improving LVLMs’ performance on fundamental and downstream graph learning tasks, as well as enhancing their robustness against complex visual graphs.
pdf
bib
abs
ISR: Self-Refining Referring Expressions for Entity Grounding
Zhuocheng Yu
|
Bingchan Zhao
|
Yifan Song
|
Sujian Li
|
Zhonghui He
Entity grounding, a crucial task in constructing multimodal knowledge graphs, aims to align entities from knowledge graphs with their corresponding images. Unlike conventional visual grounding tasks that use referring expressions (REs) as inputs, entity grounding relies solely on entity names and types, presenting a significant challenge. To address this, we introduce a novel Iterative Self-Refinement (ISR) scheme to enhance the multimodal large language model’s capability to generate high-quality REs for the given entities as explicit contextual clues. This training scheme, inspired by human learning dynamics and human annotation processes, enables the MLLM to iteratively generate and refine REs by learning from successes and failures, guided by outcome rewards from a visual grounding model. This iterative cycle of self-refinement avoids overfitting to fixed annotations and fosters continued improvement in referring expression generation. Extensive experiments demonstrate that our method surpasses other methods in entity grounding, highlighting its effectiveness, robustness and potential for broader applications.
pdf
bib
abs
Activating Distributed Visual Region within LLMs for Efficient and Effective Vision-Language Training and Inference
Siyuan Wang
|
Dianyi Wang
|
Chengxing Zhou
|
Zejun Li
|
Zhihao Fan
|
Xuanjing Huang
|
Zhongyu Wei
Large Vision-Language Models (LVLMs) typically learn visual capacity through visual instruction tuning, involving updates to both a projector and their LLM backbones. Inspired by the concept of a visual region in the human brain, we investigate the existence of an analogous visual region within LLMs that functions as a cognitive core, and explore the potential of efficient training of LVLMs via selective layer tuning. Using Bunny-Llama-3-8B-V for detailed analysis and three other LVLMs for validation across diverse visual and textual tasks, we find that selectively updating 25% of LLM layers, when sparsely and uniformly distributed, can preserve nearly 99% of visual performance and maintain or improve textual task results, while effectively reducing training time. Based on this targeted training approach, we further propose a novel visual region-based pruning paradigm, removing non-critical layers outside the visual region, which can achieve minimal performance loss. This study offers an effective and efficient strategy for LVLM training and inference by activating a layer-wise visual region within LLMs, which proves consistently effective across different models.
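One simple way to pick a sparse, uniformly distributed quarter of the layers is to take evenly spaced layer indices. The sketch below is an illustration under that assumption; the abstract does not state the exact selection rule or starting offset, and the function name is hypothetical.

```python
from typing import List


def evenly_spaced_layers(num_layers: int, fraction: float = 0.25) -> List[int]:
    """Return indices of layers to update, spread uniformly over the stack.

    Assumption: selection starts at layer 0 and uses a constant stride.
    """
    if num_layers <= 0:
        raise ValueError("num_layers must be positive")
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    count = max(1, round(num_layers * fraction))
    stride = num_layers / count
    return [int(i * stride) for i in range(count)]


# For a 32-layer backbone, a quarter of the layers is every fourth layer.
trainable = evenly_spaced_layers(32, 0.25)
```

Layers whose index is not returned would stay frozen during visual instruction tuning.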
pdf
bib
abs
CCHall: A Novel Benchmark for Joint Cross-Lingual and Cross-Modal Hallucinations Detection in Large Language Models
Yongheng Zhang
|
Xu Liu
|
Ruoxi Zhou
|
Qiguang Chen
|
Hao Fei
|
Wenpeng Lu
|
Libo Qin
Investigating hallucination issues in large language models (LLMs) within cross-lingual and cross-modal scenarios can greatly advance the large-scale deployment in real-world applications. Nevertheless, the current studies are limited to a single scenario, either cross-lingual or cross-modal, leaving a gap in the exploration of hallucinations in the joint cross-lingual and cross-modal scenarios. Motivated by this, we introduce a novel joint Cross-lingual and Cross-modal Hallucinations benchmark (CCHall) to fill this gap. Specifically, CCHall simultaneously incorporates both cross-lingual and cross-modal hallucination scenarios, which can be used to assess the cross-lingual and cross-modal capabilities of LLMs. Furthermore, we conduct a comprehensive evaluation on CCHall, exploring both mainstream open-source and closed-source LLMs. The experimental results highlight that current LLMs still struggle with CCHall. We hope CCHall can serve as a valuable resource to assess LLMs in joint cross-lingual and cross-modal scenarios.
pdf
bib
abs
TestNUC: Enhancing Test-Time Computing Approaches and Scaling through Neighboring Unlabeled Data Consistency
Henry Peng Zou
|
Zhengyao Gu
|
Yue Zhou
|
Yankai Chen
|
Weizhi Zhang
|
Liancheng Fang
|
Yibo Wang
|
Yangning Li
|
Kay Liu
|
Philip S. Yu
Test-time computing approaches, which leverage additional computational resources during inference, have been proven effective in enhancing large language model performance. This work introduces a novel, linearly scaling approach, TestNUC, that improves test-time predictions by leveraging the local consistency of neighboring unlabeled data: it classifies an input instance by considering not only the model’s prediction on that instance but also on neighboring unlabeled instances. We evaluate TestNUC across eight diverse datasets, spanning intent classification, topic mining, domain discovery, and emotion detection, demonstrating its consistent superiority over baseline methods such as standard prompting and self-consistency. Furthermore, TestNUC can be seamlessly integrated with existing test-time computing approaches, substantially boosting their performance. Our analysis reveals that TestNUC scales effectively with increasing amounts of unlabeled data and performs robustly across different embedding models, making it practical for real-world applications. Our code is available at https://github.com/HenryPengZou/TestNUC.
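The idea of combining a model's prediction on an instance with its predictions on nearby unlabeled instances can be sketched as a nearest-neighbor vote. The code below is a minimal illustration, not the released TestNUC implementation: cosine similarity, an unweighted majority vote, the tie-breaking rule, and all names are assumptions.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def neighbor_vote(query_embedding, query_prediction, unlabeled, k=3):
    """Majority vote over the query's own prediction and its k nearest neighbors.

    unlabeled: list of (embedding, model_prediction) pairs for unlabeled data.
    Assumption: ties are resolved in favor of the query's own prediction.
    """
    ranked = sorted(
        unlabeled,
        key=lambda item: cosine(query_embedding, item[0]),
        reverse=True,
    )
    votes = Counter([query_prediction])
    votes.update(prediction for _, prediction in ranked[:k])
    best = max(votes.values())
    if votes[query_prediction] == best:
        return query_prediction
    return max(votes, key=votes.get)


# Hypothetical 2-D embeddings with the model's predicted intent for each instance.
pool = [
    ([1.0, 0.0], "billing"),
    ([0.9, 0.1], "billing"),
    ([0.0, 1.0], "shipping"),
    ([0.8, 0.2], "billing"),
]
label = neighbor_vote([1.0, 0.05], "shipping", pool, k=3)
```

In the usage example, the three nearest neighbors all carry the prediction "billing", so the vote overrides the model's isolated prediction of "shipping" for the query.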
pdf
bib
abs
The Esethu Framework: Reimagining Sustainable Dataset Governance and Curation for Low-Resource Languages
Jenalea Rajab
|
Anuoluwapo Aremu
|
Everlyn Asiko Chimoto
|
Dale Dunbar
|
Graham Morrissey
|
Fadel Thior
|
Luandrie Potgieter
|
Jessica Ojo
|
Atnafu Lambebo Tonja
|
Wilhelmina NdapewaOnyothi Nekoto
|
Pelonomi Moiloa
|
Jade Abbott
|
Vukosi Marivate
|
Benjamin Rosman
This paper presents the Esethu Framework, a sustainable data curation framework specifically designed to empower local communities and ensure equitable benefit-sharing from their linguistic resources. This framework is supported by the Esethu license, a novel community-centric data license. As a proof of concept, we introduce the Vuk’uzenzele isiXhosa Speech Dataset (ViXSD), an open-source corpus developed under the Esethu Framework and License. The dataset, containing read speech from native isiXhosa speakers enriched with demographic and linguistic metadata, demonstrates how community-driven licensing and curation principles can bridge resource gaps in automatic speech recognition (ASR) for African languages while safeguarding the interests of data creators. We describe the framework guiding dataset development, outline the Esethu license provisions, present the methodology for ViXSD, and report ASR experiments validating ViXSD’s usability in building and refining voice-driven applications for isiXhosa.
pdf
bib
abs
Theoretical Analysis of Hierarchical Language Recognition and Generation by Transformers without Positional Encoding
Daichi Hayakawa
|
Issei Sato
In this study, we provide a constructive proof that Transformers can recognize and generate hierarchical language efficiently with respect to model size, even without the need for a specific positional encoding. Specifically, we show that causal masking and a starting token enable Transformers to compute positional information and depth within hierarchical structures. We demonstrate that Transformers without positional encoding can generate hierarchical languages. Furthermore, we suggest that explicit positional encoding might have a detrimental effect on generalization with respect to sequence length.
pdf
bib
abs
Less is More: Explainable and Efficient ICD Code Prediction with Clinical Entities
James C. Douglas
|
Yidong Gan
|
Ben Hachey
|
Jonathan K. Kummerfeld
Clinical coding, assigning standardized codes to medical notes, is critical for epidemiological research, hospital planning, and reimbursement. Neural coding models generally process entire discharge summaries, which are often lengthy and contain information that is not relevant to coding. We propose an approach that combines Named Entity Recognition (NER) and Assertion Classification (AC) to filter for clinically important content before supervised code prediction. On MIMIC-IV, a standard evaluation dataset, our approach achieves near-equivalent performance to a state-of-the-art full-text baseline while using only 22% of the content and reducing training time by over half. Additionally, mapping model attention to complete entity spans yields coherent, clinically meaningful explanations, capturing coding-relevant modifiers such as acuity and laterality. We release a newly annotated NER+AC dataset for MIMIC-IV, designed specifically for ICD coding. Our entity-centric approach lays a foundation for more transparent and cost-effective assisted coding.
pdf
bib
abs
Benchmarking LLMs and LLM-based Agents in Practical Vulnerability Detection for Code Repositories
Alperen Yildiz
|
Sin G Teo
|
Yiling Lou
|
Yebo Feng
|
Chong Wang
|
Dinil Mon Divakaran
Large Language Models (LLMs) have shown promise in software vulnerability detection, particularly on function-level benchmarks like Devign and BigVul. However, real-world detection requires interprocedural analysis, as vulnerabilities often emerge through multi-hop function calls rather than isolated functions. While repository-level benchmarks like ReposVul and VulEval introduce interprocedural context, they remain computationally expensive, lack pairwise evaluation of vulnerability fixes, and explore limited context retrieval, limiting their practicality. We introduce JITVul, a JIT vulnerability detection benchmark linking each function to its vulnerability-introducing and fixing commits. Built from 879 CVEs spanning 91 vulnerability types, JITVul enables comprehensive evaluation of detection capabilities. Our results show that ReAct Agents, leveraging thought-action-observation and interprocedural context, perform better than LLMs in distinguishing vulnerable from benign code. While prompting strategies like Chain-of-Thought help LLMs, ReAct Agents require further refinement. Both methods show inconsistencies, either misidentifying vulnerabilities or over-analyzing security guards, indicating significant room for improvement.
pdf
bib
abs
Multi-Modality Expansion and Retention for LLMs through Parameter Merging and Decoupling
Junlin Li
|
Guodong Du
|
Jing Li
|
Sim Kuan Goh
|
Wenya Wang
|
Yequan Wang
|
Fangming Liu
|
Ho-Kin Tang
|
Saleh Alharbi
|
Daojing He
|
Min Zhang
Fine-tuning Large Language Models (LLMs) with multimodal encoders on modality-specific data expands the modalities that LLMs can handle, leading to the formation of Multimodal LLMs (MLLMs). However, this paradigm heavily relies on resource-intensive and inflexible fine-tuning from scratch with new multimodal data. In this paper, we propose MMER (Multi-modality Expansion and Retention), a training-free approach that integrates existing MLLMs for effective multimodal expansion while retaining their original performance. Specifically, MMER reuses MLLMs’ multimodal encoders while merging their LLM parameters. By comparing original and merged LLM parameters, MMER generates binary masks to approximately separate LLM parameters for each modality. These decoupled parameters can independently process modality-specific inputs, reducing parameter conflicts and preserving original MLLMs’ fidelity. MMER can also mitigate catastrophic forgetting by applying a similar process to MLLMs fine-tuned on new tasks. Extensive experiments show significant improvements over baselines, proving that MMER effectively expands LLMs’ multimodal capabilities while retaining 99% of the original performance, and also markedly mitigates catastrophic forgetting.
pdf
bib
abs
Serial Lifelong Editing via Mixture of Knowledge Experts
YuJu Cheng
|
Yu-Chu Yu
|
Kai-Po Chang
|
Yu-Chiang Frank Wang
It is challenging to update Large Language Models (LLMs) since real-world knowledge evolves. While existing Lifelong Knowledge Editing (LKE) methods efficiently update sequentially incoming edits, they often struggle to precisely overwrite the outdated knowledge with the latest one, resulting in conflicts that hinder LLMs from determining the correct answer. To address this Serial Lifelong Knowledge Editing (sLKE) problem, we propose a novel Mixture-of-Knowledge-Experts scheme with an Activation-guided Routing Mechanism (ARM), which assigns specialized experts to store domain-specific knowledge and ensures that each update completely overwrites old information with the latest data. Furthermore, we introduce a novel sLKE benchmark where answers to the same concept are updated repeatedly, to assess the ability of editing methods to refresh knowledge accurately. Experimental results on both LKE and sLKE benchmarks show that our ARM performs favorably against SOTA knowledge editing methods.
pdf
bib
abs
A Survey on Efficient Large Language Model Training: From Data-centric Perspectives
Junyu Luo
|
Bohan Wu
|
Xiao Luo
|
Zhiping Xiao
|
Yiqiao Jin
|
Rong-Cheng Tu
|
Nan Yin
|
Yifan Wang
|
Jingyang Yuan
|
Wei Ju
|
Ming Zhang
Post-training of Large Language Models (LLMs) is crucial for unlocking their task generalization potential and domain-specific capabilities. However, the current LLM post-training paradigm faces significant data challenges, including the high costs of manual annotation and diminishing marginal returns on data scales. Therefore, achieving data-efficient post-training has become a key research question. In this paper, we present the first systematic survey of data-efficient LLM post-training from a data-centric perspective. We propose a taxonomy of data-efficient LLM post-training methods, covering data selection, data quality enhancement, synthetic data generation, data distillation and compression, and self-evolving data ecosystems. We summarize representative approaches in each category and outline future research directions. By examining the challenges in data-efficient LLM post-training, we highlight open problems and propose potential research avenues. We hope our work inspires further exploration into maximizing the potential of data utilization in large-scale model training. Paper List: https://github.com/luo-junyu/Awesome-Data-Efficient-LLM
pdf
bib
abs
IMOL: Incomplete-Modality-Tolerant Learning for Multi-Domain Fake News Video Detection
Zhi Zeng
|
Jiaying Wu
|
Minnan Luo
|
Herun Wan
|
Xiangzheng Kong
|
Zihan Ma
|
Guang Dai
|
Qinghua Zheng
While recent advances in fake news video detection have shown promising potential, existing approaches typically (1) focus on a specific domain (e.g., politics) and (2) assume the availability of multiple modalities, including video, audio, description texts, and related images. However, these methods struggle to generalize to real-world scenarios, where questionable information spans diverse domains and is often modality-incomplete due to factors such as upload degradation or missing metadata. To address these challenges, we introduce two real-world multi-domain news video benchmarks that reflect modality incompleteness and propose IMOL, an incomplete-modality-tolerant learning framework for multi-domain fake news video detection. Inspired by cognitive theories suggesting that humans infer missing modalities through cross-modal guidance and retrieve relevant knowledge from memory for reference, IMOL employs a hierarchical transferable information integration strategy. This consists of two key phases: (1) leveraging cross-modal consistency to reconstruct missing modalities and (2) refining sample-level transferable knowledge through cross-sample associative reasoning. Extensive experiments demonstrate that IMOL significantly enhances the performance and robustness of multi-domain fake news video detection while effectively generalizing to unseen domains under incomplete modality conditions.
pdf
bib
abs
DDxTutor: Clinical Reasoning Tutoring System with Differential Diagnosis-Based Structured Reasoning
Qian Wu
|
Zheyao Gao
|
Longfei Gou
|
Qi Dou
Clinical diagnosis education requires students to master both systematic reasoning processes and comprehensive medical knowledge. While recent advances in Large Language Models (LLMs) have enabled various medical educational applications, these systems often provide direct answers that could reduce students’ cognitive engagement and lead to fragmented learning. Motivated by these challenges, we propose DDxTutor, a framework that follows differential diagnosis principles to decompose clinical reasoning into teachable components. It consists of a structured reasoning module that analyzes clinical clues and synthesizes diagnostic conclusions, and an interactive dialogue framework that guides students through this process. To enable such tutoring, we construct DDxReasoning, a dataset of 933 clinical cases with fine-grained diagnostic steps verified by doctors. Our experiments demonstrate that fine-tuned LLMs achieve strong performance in generating structured teaching references and conducting interactive diagnostic tutoring dialogues. Human evaluation by medical educators and students validates the framework’s potential and effectiveness for clinical diagnosis education. Our project is available at https://github.com/med-air/DDxTutor.
pdf
bib
abs
SocialEval: Evaluating Social Intelligence of Large Language Models
Jinfeng Zhou
|
Yuxuan Chen
|
Yihan Shi
|
Xuanming Zhang
|
Leqi Lei
|
Yi Feng
|
Zexuan Xiong
|
Miao Yan
|
Xunzhi Wang
|
Yaru Cao
|
Jianing Yin
|
Shuai Wang
|
Quanyu Dai
|
Zhenhua Dong
|
Hongning Wang
|
Minlie Huang
LLMs exhibit promising Social Intelligence (SI) in modeling human behavior, raising the need to evaluate LLMs’ SI and their discrepancy with humans. SI equips humans with interpersonal abilities to behave wisely in navigating social interactions to achieve social goals. This presents an operational evaluation paradigm: outcome-oriented goal achievement evaluation and process-oriented interpersonal ability evaluation, which existing work fails to address. To this end, we propose SocialEval, a script-based bilingual SI benchmark, integrating outcome- and process-oriented evaluation by manually crafting narrative scripts. Each script is structured as a world tree that contains plot lines driven by interpersonal ability, providing a comprehensive view of how LLMs navigate social interactions. Experiments show that LLMs fall behind humans on both SI evaluations, exhibit prosociality, and prefer more positive social behaviors, even if they lead to goal failure. Analysis of LLMs’ formed representation space and neuronal activations reveals that LLMs have developed ability-specific functional partitions akin to the human brain.
pdf
bib
abs
Hidden in Plain Sight: Evaluation of the Deception Detection Capabilities of LLMs in Multimodal Settings
Md Messal Monem Miah
|
Adrita Anika
|
Xi Shi
|
Ruihong Huang
Detecting deception in an increasingly digital world is both a critical and challenging task. In this study, we present a comprehensive evaluation of the automated deception detection capabilities of Large Language Models (LLMs) and Large Multimodal Models (LMMs) across diverse domains. We assess the performance of both open-source and proprietary LLMs on three distinct datasets: real-life trial interviews (RLTD), instructed deception in interpersonal scenarios (MU3D), and deceptive reviews (OpSpam). We systematically analyze the effectiveness of different experimental setups for deception detection, including zero-shot and few-shot approaches with random or similarity-based in-context example selection. Our findings indicate that fine-tuned LLMs achieve state-of-the-art performance on textual deception detection, whereas LMMs struggle to fully leverage multimodal cues, particularly in real-world settings. Additionally, we analyze the impact of auxiliary features, such as non-verbal gestures and video summaries, and evaluate the effectiveness of different prompting strategies, such as direct label generation and post-hoc reasoning generation. Experiments reveal that reasoning-based predictions do not consistently improve performance over direct classification, contrary to expectations.
pdf
bib
abs
Analyzing and Mitigating Inconsistency in Discrete Speech Tokens for Neural Codec Language Models
Wenrui Liu
|
Zhifang Guo
|
Jin Xu
|
Yuanjun Lv
|
Yunfei Chu
|
Zemin Liu
|
Junyang Lin
Building upon advancements in Large Language Models (LLMs), the field of audio processing has seen increased interest in training speech generation tasks with discrete speech token sequences. However, directly discretizing speech by neural audio codecs often results in sequences that fundamentally differ from text sequences. Unlike text, where text token sequences are deterministic, discrete speech tokens can exhibit significant variability based on contextual factors, while still producing perceptually identical audio segments. We refer to this phenomenon as Discrete Representation Inconsistency (DRI). This inconsistency can lead to a single speech segment being represented by multiple divergent sequences, which creates confusion in neural codec language models and results in poor generated speech. In this paper, we quantitatively analyze the DRI phenomenon within popular audio tokenizers such as EnCodec. Our approach effectively mitigates the DRI phenomenon of the neural audio codec. Furthermore, extensive experiments on the neural codec language model over LibriTTS and large-scale MLS dataset (44,000 hours) demonstrate the effectiveness and generality of our method. The demo of audio samples is available at https://consistencyinneuralcodec.github.io.
pdf
bib
abs
PlanningArena: A Modular Benchmark for Multidimensional Evaluation of Planning and Tool Learning
Zihan Zheng
|
Tianle Cui
|
Chuwen Xie
|
Jiahui Pan
|
Qianglong Chen
|
Lewei He
One of the research focuses of large language models (LLMs) is the ability to generate action plans. Recent studies have revealed that the performance of LLMs can be significantly improved by integrating external tools. Based on this, we propose a benchmark framework called PlanningArena, which aims to simulate real application scenarios and provide a series of apps and API tools that may be involved in the actual planning process. This framework adopts a modular task structure and combines user portrait analysis to evaluate the ability of LLMs in correctly selecting tools, logical reasoning in complex scenarios, and parsing user information. In addition, we diagnose the task execution of LLMs in depth at both macro and micro levels. The experimental results show that even the strongest models, GPT-4o and DeepSeekV3, achieved total scores of only 56.5% and 41.9% in PlanningArena, respectively, indicating that current LLMs still face challenges in logical reasoning, context memory, and tool calling when dealing with different structures, scenarios, and their complexity. Through this benchmark, we further explore the path to optimize LLMs to perform planning tasks.
pdf
bib
abs
FocusLLM: Precise Understanding of Long Context by Dynamic Condensing
Zhenyu Li
|
Yike Zhang
|
Tengyu Pan
|
Yutao Sun
|
Zhichao Duan
|
Junjie Fang
|
Rong Han
|
Zixuan Wang
|
Jianyong Wang
Empowering LLMs with the ability to precisely understand long contexts is crucial for many downstream applications. However, handling long contexts with conventional transformer architecture requires substantial training and inference resources. Existing context condensing methods cannot accurately understand the full context, as there is a considerable amount of information loss in the condensing process. To address these issues, we present **FocusLLM**, a framework designed to extend the fixed context length of any decoder-only LLM, allowing the model to focus on relevant information from very long sequences. FocusLLM first divides long text input into chunks based on the model’s original context length. It then employs the **_dynamic condensing_** process to distill crucial information from each chunk. Ultimately, through the novel **_parallel decoding_** mechanism, FocusLLM can integrate the extracted information into its local context. FocusLLM stands out for great training efficiency and versatility: trained with an 8K input length and with much less training cost than previous methods, FocusLLM exhibits superior performance across downstream tasks and maintains strong language modeling ability when handling extensive long texts, even up to 400K tokens. Our code is available at https://github.com/leezythu/FocusLLM.
pdf
bib
abs
Negative Matters: Multi-Granularity Hard-Negative Synthesis and Anchor-Token-Aware Pooling for Enhanced Text Embeddings
Tengyu Pan
|
Zhichao Duan
|
Zhenyu Li
|
Bowen Dong
|
Ning Liu
|
Xiuxing Li
|
Jianyong Wang
Text embedding models are essential for various natural language processing tasks, enabling the effective encoding of semantic information into dense vector representations. These models are typically optimized using triplets of (query, positive, negative) data pairs for contrastive learning, where the negative samples play a critical role in enhancing the model’s ability to discern subtle semantic distinctions. In this work, we introduce a **M**ulti-**G**ranularity **H**ard-negative (MGH) synthesis framework that leverages large language models (LLMs) to generate diverse negative samples with varying levels of similarity with the query. This approach facilitates a coarse-to-fine curriculum learning strategy during supervised training, allowing the embedding model to progressively learn more nuanced semantic representations. Meanwhile, we propose an **A**nchor **T**oken **A**ware (ATA) pooling method that assigns higher weights to anchor tokens based on aggregation patterns observed in LLMs, improving text embedding accuracy without increasing model complexity. Comprehensive experiments on the MTEB benchmark demonstrate that our methods achieve state-of-the-art performance, surpassing existing synthesis strategies both with synthetic data and when combined with public retrieval datasets.
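The pooling idea in this abstract, giving some tokens more weight than others when forming a sentence embedding, can be illustrated with a generic weighted mean pool. This is a minimal sketch, not the ATA method itself: how anchor tokens are identified and weighted is defined in the paper, so the `weights` argument here is a hypothetical input supplied by the caller.

```python
def weighted_mean_pool(token_embeddings, weights):
    """Pool per-token vectors into one vector using per-token weights.

    `token_embeddings` is a list of equal-length float lists, one per token.
    `weights` holds one non-negative weight per token; in an anchor-aware
    scheme, anchor tokens would receive larger values (assumed, not derived).
    """
    if len(token_embeddings) != len(weights):
        raise ValueError("need exactly one weight per token")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    dim = len(token_embeddings[0])
    pooled = [0.0] * dim
    for vector, weight in zip(token_embeddings, weights):
        for i, value in enumerate(vector):
            pooled[i] += weight * value
    # Normalise so that uniform weights reduce to ordinary mean pooling.
    return [value / total for value in pooled]


# Two tokens; the first is treated as an anchor with three times the weight.
print(weighted_mean_pool([[1.0, 0.0], [0.0, 1.0]], [3.0, 1.0]))
```

With all weights equal this reduces to standard mean pooling, so the pooling step itself adds no trainable parameters, consistent with the abstract's claim of no added model complexity.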
pdf
bib
abs
GPT-4 as a Homework Tutor Can Improve Student Engagement and Learning Outcomes
Alessandro Vanzo
|
Sankalan Pal Chowdhury
|
Mrinmaya Sachan
This work contributes to the scarce empirical literature on LLM-based interactive homework in real-world educational settings and offers a practical, scalable solution to improve homework in schools. Homework is an important part of education in schools across the world, but to maximize benefit, it must be accompanied by feedback and follow-up questions. We developed a prompting strategy that enables GPT-4 to conduct interactive homework sessions for high school students learning English as a second language. Our strategy requires minimal effort in content preparation, one of the key challenges of alternatives such as home tutors or ITSs. We carried out a Randomized Controlled Trial (RCT) in four high-school classes, replacing traditional homework with GPT-4 homework sessions for the treatment group. We found that students in the treatment group reported higher levels of satisfaction and a stronger desire to keep using the system. This occurred without compromising learning outcomes, and one group even showed significantly better learning gains.
pdf
bib
abs
Diffusion Models Through a Global Lens: Are They Culturally Inclusive?
Zahra Bayramli
|
Ayhan Suleymanzade
|
Na Min An
|
Huzama Ahmad
|
Eunsu Kim
|
Junyeong Park
|
James Thorne
|
Alice Oh
Text-to-image diffusion models have recently enabled the creation of visually compelling, detailed images from textual prompts. However, their ability to accurately represent various cultural nuances remains an open question. In our work, we introduce the CULTDIFF benchmark, which evaluates whether state-of-the-art diffusion models can generate culturally specific images spanning ten countries. Through a fine-grained analysis of different similarity aspects, we show that these models often fail to generate cultural artifacts in architecture, clothing, and food, especially for underrepresented country regions, revealing significant disparities in cultural relevance, description fidelity, and realism compared to real-world reference images. With the collected human evaluations, we develop a neural-based image-image similarity metric, namely, CULTDIFF-S, to predict human judgment on real and generated images with cultural artifacts. Our work highlights the need for more inclusive generative AI systems and equitable dataset representation over a wide range of cultures.
pdf
bib
abs
Efficient Safety Alignment of Large Language Models via Preference Re-ranking and Representation-based Reward Modeling
Deng Qiyuan
|
Xuefeng Bai
|
Kehai Chen
|
Yaowei Wang
|
Liqiang Nie
|
Min Zhang
Reinforcement Learning (RL) algorithms for safety alignment of Large Language Models (LLMs), such as Direct Preference Optimization (DPO), encounter the challenge of distribution shift. Current approaches typically address this issue through online sampling from the target policy, which requires significant computational resources. In this paper, we hypothesize that during off-policy training, while the ranking order of outputs generated by the policy changes, their overall distribution remains relatively stable. This stability allows the conversion of the sampling process from the target policy into a computationally efficient re-ranking of preference data. Building on this hypothesis, we propose a new framework that leverages the model’s intrinsic safety judgment capability to extract reward signals, which are then used to calculate label confidence for preference reordering. Extensive experiments and theoretical analysis demonstrate that the proposed method effectively addresses the distribution shift issue, remarkably enhancing the safety performance while avoiding roughly 300x computational overhead.
pdf
bib
abs
English-based acoustic models perform well in the forced alignment of two English-based Pacific Creoles
Sam Passmore
|
Lila San Roque
|
Kirsty Gillespie
|
Saurabh Nath
|
Kira Davey
|
Keira Mullan
|
Tim Cawley
|
Jennifer Biggs
|
Rosey Billington
|
Bethwyn Evans
|
Nick Thieberger
|
Danielle Barth
Expanding the breadth of languages used to study sociophonetic variation and change is an important step in the theoretical development of sociophonetics. As data archives grow, forced alignment can accelerate the study of sociophonetic variation in minority languages. This paper examines the application of English and custom-made acoustic models to the alignment of vowels in two Pacific Creoles, Tok Pisin (59 hours) and Bislama (38.5 hours). We find that English models perform acceptably well in both languages, and as well as humans in vowel environments described as ‘Highly Reliable’. Custom models performed better in Bislama than Tok Pisin. We end the paper with recommendations on the use of cross-linguistic acoustic models in the case of English-Based Creoles.
pdf
bib
abs
Subtle Errors in Reasoning: Preference Learning via Error-injected Self-editing
Kaishuai Xu
|
Tiezheng Yu
|
Wenjun Hou
|
Yi Cheng
|
Chak Tou Leong
|
Liangyou Li
|
Xin Jiang
|
Lifeng Shang
|
Qun Liu
|
Wenjie Li
Large Language Models (LLMs) have exhibited strong mathematical reasoning prowess, tackling tasks ranging from basic arithmetic to advanced competition-level problems. However, frequently occurring subtle yet critical errors, such as miscalculations or incorrect substitutions, limit the LLMs’ full potential. Existing studies to improve mathematical ability typically involve applying preference learning to step-wise solution pairs. Although these methods leverage samples of varying granularity to mitigate reasoning errors, they overlook critical subtle errors. In this work, we propose a novel preference learning framework called eRror-Injected Self-Editing (RISE), which injects predefined subtle errors into pivotal tokens in reasoning or computation steps to construct hard pairs for error mitigation. In detail, RISE uses the LLM itself to edit a small number of tokens in the solution, injecting designed subtle errors. Then, pairs composed of self-edited solutions and their corresponding correct ones, along with pairs of correct and incorrect solutions obtained through sampling, are used together for subtle error-aware DPO training. Compared with other preference learning methods, RISE further refines the training objective without requiring fine-grained sampling or preference annotation. Extensive experiments validate the effectiveness of RISE, with preference learning on Qwen2-7B-Instruct yielding notable improvements of 3.0% on GSM8K and 7.9% on MATH with only 4.5K training samples. Moreover, the effect of error mitigation extends from mathematical reasoning to logical reasoning and code generation.
pdf
bib
abs
Truth Knows No Language: Evaluating Truthfulness Beyond English
Blanca Calvo Figueras
|
Eneko Sagarzazu
|
Julen Etxaniz
|
Jeremy Barnes
|
Pablo Gamallo
|
Iria de-Dios-Flores
|
Rodrigo Agerri
We introduce a professionally translated extension of the TruthfulQA benchmark designed to evaluate truthfulness in Basque, Catalan, Galician, and Spanish. Truthfulness evaluations of large language models (LLMs) have primarily been focused on English. However, the ability of LLMs to maintain truthfulness across languages remains under-explored. Our study evaluates 12 state-of-the-art open LLMs, comparing base and instruction-tuned models using human evaluation, multiple-choice metrics, and LLM-as-a-Judge scoring. Our findings reveal that, while LLMs perform best in English and worst in Basque (the lowest-resourced language), overall truthfulness discrepancies across languages are smaller than anticipated. Furthermore, we show that LLM-as-a-Judge correlates more closely with human judgments than multiple-choice metrics, and that informativeness plays a critical role in truthfulness assessment. Our results also indicate that machine translation provides a viable approach for extending truthfulness benchmarks to additional languages, offering a scalable alternative to professional translation. Finally, we observe that universal knowledge questions are better handled across languages than context- and time-dependent ones, highlighting the need for truthfulness evaluations that account for cultural and temporal variability. Datasets, models and code are publicly available under open licenses.
pdf
bib
abs
Revisiting Compositional Generalization Capability of Large Language Models Considering Instruction Following Ability
Yusuke Sakai
|
Hidetaka Kamigaito
|
Taro Watanabe
In generative commonsense reasoning tasks such as CommonGen, generative large language models (LLMs) compose sentences that include all given concepts. However, when focusing on instruction-following capabilities, if a prompt specifies a concept order, LLMs must generate sentences that adhere to the specified order. To address this, we propose Ordered CommonGen, a benchmark designed to evaluate the compositional generalization and instruction-following abilities of LLMs. This benchmark measures ordered coverage to assess whether concepts are generated in the specified order, enabling a simultaneous evaluation of both abilities. We conducted a comprehensive analysis using 36 LLMs and found that, while LLMs generally understand the intent of instructions, biases toward specific concept order patterns often lead to low-diversity outputs or identical results even when the concept order is altered. Moreover, even the most instruction-compliant LLM achieved only about 75% ordered coverage, highlighting the need for improvements in both instruction-following and compositional generalization capabilities.
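The notion of ordered coverage can be made concrete with a small sketch. It assumes a simplified definition: the share of requested concepts that can be found in the sentence, scanning left to right, without ever moving backwards. The benchmark's actual metric may differ, for example by matching inflected forms; this version uses exact lowercase word matching and a greedy scan, and the function name is hypothetical.

```python
import re


def ordered_coverage(concepts, sentence):
    """Share of concepts found in the sentence in the requested order.

    Greedy left-to-right scan with exact lowercase word matching.
    Both simplifications are assumptions made for this sketch.
    """
    if not concepts:
        return 0.0
    tokens = re.findall(r"[a-z]+", sentence.lower())
    position = 0  # index of the first token still available for matching
    matched = 0
    for concept in concepts:
        try:
            index = tokens.index(concept.lower(), position)
        except ValueError:
            continue  # concept absent, or only present before `position`
        matched += 1
        position = index + 1
    return matched / len(concepts)


concepts = ["dog", "ball", "park"]
print(ordered_coverage(concepts, "The dog chased the ball in the park."))
print(ordered_coverage(concepts, "In the park, the dog chased the ball."))
```

The second sentence contains every concept, so plain coverage would be complete, but "park" appears before "dog" and is therefore not counted. That gap between coverage and ordered coverage is what lets one benchmark probe compositional generalization and instruction following at once.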
pdf
bib
abs
Batayan: A Filipino NLP benchmark for evaluating Large Language Models
Jann Railey Montalan
|
Jimson Paulo Layacan
|
David Demitri Africa
|
Richell Isaiah S. Flores
|
Michael T. Lopez II
|
Theresa Denise Magsajo
|
Anjanette Cayabyab
|
William Chandra Tjhi
Recent advances in large language models (LLMs) have demonstrated remarkable capabilities on widely benchmarked high-resource languages. However, linguistic nuances of under-resourced languages remain unexplored. We introduce Batayan, a holistic Filipino benchmark that systematically evaluates LLMs across three key natural language processing (NLP) competencies: understanding, reasoning, and generation. Batayan consolidates eight tasks, three of which did not previously exist for Filipino corpora, covering both Tagalog and code-switched Taglish utterances. Our rigorous, native-speaker-driven adaptation and validation processes ensure fluency and authenticity to the complex morphological and syntactic structures of Filipino, alleviating the pervasive translationese bias in existing Filipino corpora. We report empirical results on a variety of open-source and commercial LLMs, highlighting significant performance gaps that signal the under-representation of Filipino in pre-training corpora, the unique hurdles in modeling Filipino’s rich morphology and construction, and the importance of explicit Filipino language support. Moreover, we discuss the practical challenges encountered in dataset construction and propose principled solutions for building culturally and linguistically faithful resources in under-represented languages. We also provide a public evaluation suite as a clear foundation for iterative, community-driven progress in Filipino NLP.
pdf
bib
abs
HintsOfTruth: A Multimodal Checkworthiness Detection Dataset with Real and Synthetic Claims
Michiel Van Der Meer
|
Pavel Korshunov
|
Sébastien Marcel
|
Lonneke Van Der Plas
Misinformation can be countered with fact-checking, but the process is costly and slow. Identifying checkworthy claims is the first step, where automation can help scale fact-checkers’ efforts. However, detection methods struggle with content that is (1) multimodal, (2) from diverse domains, and (3) synthetic. We introduce HintsOfTruth, a public dataset for multimodal checkworthiness detection with 27K real-world and synthetic image/claim pairs. The mix of real and synthetic data makes this dataset unique and ideal for benchmarking detection methods. We compare fine-tuned and prompted Large Language Models (LLMs). We find that well-configured lightweight text-based encoders perform comparably to multimodal models but the former only focus on identifying non-claim-like content. Multimodal LLMs can be more accurate but come at a significant computational cost, making them impractical for large-scale applications. When faced with synthetic data, multimodal models perform more robustly.
pdf
bib
abs
CityNavAgent: Aerial Vision-and-Language Navigation with Hierarchical Semantic Planning and Global Memory
Weichen Zhang
|
Chen Gao
|
Shiquan Yu
|
Ruiying Peng
|
Baining Zhao
|
Qian Zhang
|
Jinqiang Cui
|
Xinlei Chen
|
Yong Li
Aerial vision-and-language navigation (VLN) — requiring drones to interpret natural language instructions and navigate complex urban environments — emerges as a critical embodied AI challenge that bridges human-robot interaction, 3D spatial reasoning, and real-world deployment. Although existing ground VLN agents achieved notable results in indoor and outdoor settings, they struggle in aerial VLN due to the absence of predefined navigation graphs and the exponentially expanding action space in long-horizon exploration. In this work, we propose CityNavAgent, a large language model (LLM)-empowered agent that significantly reduces the navigation complexity for urban aerial VLN. Specifically, we design a hierarchical semantic planning module (HSPM) that decomposes the long-horizon task into sub-goals with different semantic levels. The agent reaches the target progressively by achieving sub-goals with different capacities of the LLM. Additionally, a global memory module storing historical trajectories into a topological graph is developed to simplify navigation for visited targets. Extensive benchmark experiments show that our method achieves state-of-the-art performance with significant improvement. Further experiments demonstrate the effectiveness of different modules of CityNavAgent for aerial VLN in continuous city environments.
pdf
bib
abs
It’s Not a Walk in the Park! Challenges of Idiom Translation in Speech-to-text Systems
Iuliia Zaitova
|
Badr M. Abdullah
|
Wei Xue
|
Dietrich Klakow
|
Bernd Möbius
|
Tania Avgustinova
Idioms are defined as a group of words with a figurative meaning not deducible from their individual components. Although modern machine translation systems have made remarkable progress, translating idioms remains a major challenge, especially for speech-to-text systems, where research on this topic is notably sparse. In this paper, we systematically evaluate idiom translation as compared to conventional news translation in both text-to-text machine translation (MT) and speech-to-text translation (SLT) systems across two language pairs (German to English, Russian to English). We compare state-of-the-art end-to-end SLT systems (SeamlessM4T SLT-to-text, Whisper Large v3) with MT systems (SeamlessM4T SLT-to-text, No Language Left Behind), Large Language Models (DeepSeek, LLaMA) and cascaded alternatives. Our results reveal that SLT systems experience a pronounced performance drop on idiomatic data, often reverting to literal translations even in higher layers, whereas MT systems and Large Language Models demonstrate better handling of idioms. These findings underscore the need for idiom-specific strategies and improved internal representations in SLT architectures.
pdf
bib
abs
PolyNarrative: A Multilingual, Multilabel, Multi-domain Dataset for Narrative Extraction from News Articles
Nikolaos Nikolaidis
|
Nicolas Stefanovitch
|
Purificação Silvano
|
Dimitar Iliyanov Dimitrov
|
Roman Yangarber
|
Nuno Guimarães
|
Elisa Sartori
|
Ion Androutsopoulos
|
Preslav Nakov
|
Giovanni Da San Martino
|
Jakub Piskorski
We present polyNarrative, a new multilingual dataset of news articles, annotated for narratives. Narratives are overt or implicit claims, recurring across articles and languages, promoting a specific interpretation or viewpoint on an ongoing topic, often propagating mis/disinformation. We developed two-level taxonomies with coarse- and fine-grained narrative labels for two domains: (i) climate change and (ii) the military conflict between Ukraine and Russia. We collected news articles in four languages (Bulgarian, English, Portuguese, and Russian) related to the two domains and manually annotated them at the paragraph level. We make the dataset publicly available, along with experimental results of several strong baselines that assign narrative labels to news articles at the paragraph or the document level. We believe that this dataset will foster research in narrative detection and enable new research directions towards more multi-domain and highly granular narrative-related tasks.
pdf
bib
abs
A Parameter-Efficient and Fine-Grained Prompt Learning for Vision-Language Models
Yongbin Guo
|
Shuzhen Li
|
Zhulin Liu
|
Tong Zhang
|
C.L.Philip Chen
Current vision-language models (VLMs) understand complex vision-text tasks by extracting overall semantic information from large-scale cross-modal associations. However, extracting from large-scale cross-modal associations often smooths out semantic details and requires large computations, limiting multimodal fine-grained understanding performance and efficiency. To address this issue, this paper proposes a detail-oriented prompt learning (DoPL) method for vision-language models to implement fine-grained multi-modal semantic alignment with merely 0.25M trainable parameters. According to the low-entropy information concentration theory, DoPL explores shared interest tokens from text-vision correlations and transforms them into alignment weights to enhance text prompt and vision prompt via detail-oriented prompt generation. It effectively guides the current frozen layer to extract fine-grained text-vision alignment cues. Furthermore, DoPL constructs detail-oriented prompt generation for each frozen layer to implement layer-by-layer localization of fine-grained semantic alignment, achieving precise understanding in complex vision-text tasks. DoPL performs well in parameter-efficient fine-grained semantic alignment with only 0.12% tunable parameters for vision-language models. The state-of-the-art results over the previous parameter-efficient fine-tuning methods and full fine-tuning approaches on six benchmarks demonstrate the effectiveness and efficiency of DoPL in complex multi-modal tasks.
pdf
bib
abs
Persona Dynamics: Unveiling the Impact of Persona Traits on Agents in Text-Based Games
Seungwon Lim
|
Seungbeen Lee
|
Dongjun Min
|
Youngjae Yu
Artificial agents are increasingly central to complex interactions and decision-making tasks, yet aligning their behaviors with desired human values remains an open challenge. In this work, we investigate how human-like personality traits influence agent behavior and performance within text-based interactive environments. We introduce PANDA: Personality Adapted Neural Decision Agents, a novel method for projecting human personality traits onto agents to guide their behavior. To induce personality in a text-based game agent, (i) we train a personality classifier to identify what personality type the agent’s actions exhibit, and (ii) we integrate the personality profiles directly into the agent’s policy-learning pipeline. By deploying agents embodying 16 distinct personality types across 25 text-based games and analyzing their trajectories, we demonstrate that an agent’s action decisions can be guided toward specific personality profiles. Moreover, certain personality types, such as those characterized by higher levels of Openness, display marked advantages in performance. These findings underscore the promise of personality-adapted agents for fostering more aligned, effective, and human-centric decision-making in interactive environments.
pdf
bib
abs
SeedBench: A Multi-task Benchmark for Evaluating Large Language Models in Seed Science
Jie Ying
|
Zihong Chen
|
Zhefan Wang
|
Wanli Jiang
|
Chenyang Wang
|
Zhonghang Yuan
|
Haoyang Su
|
Huanjun Kong
|
Fan Yang
|
Nanqing Dong
Seed science is essential for modern agriculture, directly influencing crop yields and global food security. However, challenges such as interdisciplinary complexity and high costs with limited returns hinder progress, leading to a shortage of experts and insufficient technological support. While large language models (LLMs) have shown promise across various fields, their application in seed science remains limited due to the scarcity of digital resources, complex gene-trait relationships, and the lack of standardized benchmarks. To address this gap, we introduce SeedBench—the first multi-task benchmark specifically designed for seed science. Developed in collaboration with domain experts, SeedBench focuses on seed breeding and simulates key aspects of modern breeding processes. We conduct a comprehensive evaluation of 26 leading LLMs, encompassing proprietary, open-source, and domain-specific fine-tuned models. Our findings not only highlight the substantial gaps between the power of LLMs and the real-world seed science problems, but also make a foundational step for research on LLMs for seed design.
pdf
bib
abs
𝛿-Stance: A Large-Scale Real World Dataset of Stances in Legal Argumentation
Ankita Gupta
|
Douglas Rice
|
Brendan O’Connor
We present 𝛿-Stance, a large-scale dataset of stances involved in legal argumentation. 𝛿-Stance contains stance-annotated argument pairs, semi-automatically mined from millions of examples of U.S. judges citing precedent in context using citation signals. The dataset aims to facilitate work on the legal argument stance classification task, which involves assessing whether a case summary strengthens or weakens a legal argument (polarity) and to what extent (intensity). To assess the complexity of this task, we evaluate various existing NLP methods, including zero-shot prompting proprietary large language models (LLMs), and supervised fine-tuning of smaller open-weight language models (LMs) on 𝛿-Stance. Our findings reveal that although prompting proprietary LLMs can help predict stance polarity, supervised model fine-tuning on 𝛿-Stance is necessary to distinguish intensity. We further find that alternative strategies such as domain-specific pretraining and zero-shot prompting using masked LMs remain insufficient. Beyond our dataset’s utility for the legal domain, we further find that fine-tuning small LMs on 𝛿-Stance improves their performance in other domains. Finally, we study how temporal changes in signal definition can impact model performance, highlighting the importance of careful data curation for downstream tasks by considering the historical and sociocultural context. We publish the associated dataset to foster further research on legal argument reasoning.
pdf
bib
abs
Re3Syn: A Dependency-Based Data Synthesis Framework for Long-Context Post-training
Zhiyang Zhang
|
Ziqiang Liu
|
Huiming Wang
|
Renke Shan
|
Li Kuang
|
Lu Wang
|
De Wen Soh
An important trend in the realm of large language models (LLMs) is the development of longer context windows. However, training LLMs with long context windows to acquire the capability of effectively modeling lengthy inputs is often hindered by the scarcity of naturally long-context data. Existing methods for constructing long-context data by concatenating short documents have overlooked a crucial characteristic of long-context data quality, namely semantic dependency. In this paper, we propose a novel framework called Retrieval, Dependency Recognition, and Reorder for data synthesis (Re3Syn), which leverages semantic similarity to retrieve relevant documents and form several batches. Within each batch, the framework comprehensively recognizes dependencies and utilizes them, along with a reorder algorithm, to organize the short documents into coherent long-context data. Comprehensive experiments on multiple benchmarks indicate that the data generated by Re3Syn has longer dependencies and significantly enhances the model’s long-context capabilities. For reproducibility, we will release our codebase upon acceptance.
pdf
bib
abs
Enabling Chatbots with Eyes and Ears: An Immersive Multimodal Conversation System for Dynamic Interactions
Jihyoung Jang
|
Minwook Bae
|
Minji Kim
|
Dilek Hakkani-Tür
|
Hyounghun Kim
As chatbots continue to evolve toward human-like, real-world interactions, multimodality remains an active area of research and exploration. So far, efforts to integrate multimodality into chatbots have primarily focused on image-centric tasks, such as visual dialogue and image-based instructions, placing emphasis on the “eyes” of human perception while neglecting the “ears”, namely auditory aspects. Moreover, these studies often center around static interactions that focus on discussing the modality rather than naturally incorporating it into the conversation, which limits the richness of simultaneous, dynamic engagement. Furthermore, while multimodality has been explored in multi-party and multi-session conversations, task-specific constraints have hindered its seamless integration into dynamic, natural conversations. To address these challenges, this study aims to equip chatbots with “eyes and ears” capable of more immersive interactions with humans. As part of this effort, we introduce a new multimodal conversation dataset, Multimodal Multi-Session Multi-Party Conversation (M3C), and propose a novel multimodal conversation model featuring multimodal memory retrieval. Our model, trained on M3C, demonstrates the ability to seamlessly engage in long-term conversations with multiple speakers in complex, real-world-like settings, effectively processing visual and auditory inputs to understand and respond appropriately. Human evaluations highlight the model’s strong performance in maintaining coherent and dynamic interactions, demonstrating its potential for advanced multimodal conversational agents.
pdf
bib
abs
Multimodal Coreference Resolution for Chinese Social Media Dialogues: Dataset and Benchmark Approach
Xingyu Li
|
Chen Gong
|
Guohong Fu
Multimodal coreference resolution (MCR) aims to identify mentions referring to the same entity across different modalities, such as text and visuals, and is essential for understanding multimodal content. In the era of rapidly growing multimodal content and social media, MCR is particularly crucial for interpreting user interactions and bridging text-visual references to improve communication and personalization. However, MCR research for real-world dialogues remains unexplored due to the lack of sufficient data resources. To address this gap, we introduce TikTalkCoref, the first Chinese multimodal coreference dataset for social media in real-world scenarios, derived from the popular Douyin short-video platform. This dataset pairs short videos with corresponding textual dialogues from user comments and includes manually annotated coreference clusters for both person mentions in the text and the coreferential person head regions in the corresponding video frames. We also present an effective benchmark approach for MCR, focusing on the celebrity domain, and conduct extensive experiments on our dataset, providing reliable benchmark results for this newly constructed dataset. We release the TikTalkCoref dataset to facilitate future research on MCR for real-world social media dialogues at https://github.com/lxystaruni/TikTalkCoref.
pdf
bib
abs
TACLR: A Scalable and Efficient Retrieval-based Method for Industrial Product Attribute Value Identification
Yindu Su
|
Huike Zou
|
Lin Sun
|
Ting Zhang
|
Haiyang Yang
|
Chen Li Yu
|
David Lo
|
Qingheng Zhang
|
Shuguang Han
|
Jufeng Chen
Product Attribute Value Identification (PAVI) involves identifying attribute values from product profiles, a key task for improving product search, recommendation, and business analytics on e-commerce platforms. However, existing PAVI methods face critical challenges, such as inferring implicit values, handling out-of-distribution (OOD) values, and producing normalized outputs. To address these limitations, we introduce Taxonomy-Aware Contrastive Learning Retrieval (TACLR), the first retrieval-based method for PAVI. TACLR formulates PAVI as an information retrieval task by encoding product profiles and candidate values into embeddings and retrieving values based on their similarity. It leverages contrastive training with taxonomy-aware hard negative sampling and employs adaptive inference with dynamic thresholds. TACLR offers three key advantages: (1) it effectively handles implicit and OOD values while producing normalized outputs; (2) it scales to thousands of categories, tens of thousands of attributes, and millions of values; and (3) it supports efficient inference for high-load industrial deployment. Extensive experiments on proprietary and public datasets validate the effectiveness and efficiency of TACLR. Further, it has been successfully deployed on the real-world e-commerce platform Xianyu, processing millions of product listings daily with frequently updated, large-scale attribute taxonomies. We release the code to facilitate reproducibility and future research at https://github.com/SuYindu/TACLR.
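The retrieval formulation in this abstract can be illustrated with a minimal sketch. It is not the released TACLR code: the function names, the hand-written toy vectors standing in for learned embeddings, and the single fixed threshold (the abstract describes dynamic thresholds without giving details) are all assumptions made for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def retrieve_value(profile_vec, candidates, threshold):
    """Return the candidate value most similar to the profile, or None if no
    candidate reaches the threshold (hypothetical stand-in for 'no value')."""
    best_value, best_score = None, -1.0
    for value, vec in candidates.items():
        score = cosine(profile_vec, vec)
        if score > best_score:
            best_value, best_score = value, score
    return best_value if best_score >= threshold else None

# Toy candidate embeddings for a "color" attribute (assumed, not learned).
candidates = {"red": [1.0, 0.0], "blue": [0.0, 1.0]}
print(retrieve_value([0.9, 0.1], candidates, threshold=0.5))  # close to "red"
print(retrieve_value([1.0, 1.0], candidates, threshold=0.9))  # ambiguous profile
```

Returning `None` when the best score falls below the threshold is one simple way to abstain from predicting a value; how the actual system sets its thresholds per attribute is not specified here.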
pdf
bib
abs
Theory of Mind in Large Language Models: Assessment and Enhancement
Ruirui Chen
|
Weifeng Jiang
|
Chengwei Qin
|
Cheston Tan
Theory of Mind (ToM)—the ability to reason about the mental states of oneself and others—is a cornerstone of human social intelligence. As Large Language Models (LLMs) become increasingly integrated into daily life, understanding their ability to interpret and respond to human mental states is crucial for enabling effective interactions. In this paper, we review LLMs’ ToM capabilities by analyzing both evaluation benchmarks and enhancement strategies. For evaluation, we focus on recently proposed and widely used story-based benchmarks. For enhancement, we provide an in-depth analysis of recent methods aimed at improving LLMs’ ToM abilities. Furthermore, we outline promising directions for future research to further advance these capabilities and better adapt LLMs to more realistic and diverse scenarios. Our survey serves as a valuable resource for researchers interested in evaluating and advancing LLMs’ ToM capabilities.
pdf
bib
abs
Completing A Systematic Review in Hours instead of Months with Interactive AI Agents
Rui Qiu
|
Shijie Chen
|
Yu Su
|
Po-Yin Yen
|
Han Wei Shen
Systematic reviews (SRs) are vital for evidence-based practice in high-stakes disciplines, such as healthcare, but are often impeded by intensive labor and lengthy processes that can take months to complete. Due to the high demand for domain expertise, existing automatic summarization methods fail to accurately identify relevant studies and generate high-quality summaries. To that end, we introduce InsightAgent, a human-centered interactive AI agent powered by large language models that revolutionizes this workflow. InsightAgent partitions a large literature corpus based on semantics and employs a multi-agent design for more focused processing of literature, leading to significant improvement in the quality of generated SRs. InsightAgent also provides intuitive visualizations of the corpus and agent trajectories, allowing users to effortlessly monitor the actions of the agent and provide real-time feedback based on their expertise. Our user studies with 9 medical professionals demonstrate that the visualization and interaction mechanisms can effectively improve the quality of synthesized SRs by 27.2%, reaching 79.7% of human-written quality. At the same time, user satisfaction is improved by 34.4%. With InsightAgent, it only takes a clinician about 1.5 hours, rather than months, to complete a high-quality systematic review.
pdf
bib
abs
CMHKF: Cross-Modality Heterogeneous Knowledge Fusion for Weakly Supervised Video Anomaly Detection
Guohua Wang
|
Shengping Song
|
Wuchun He
|
Yongsen Zheng
Weakly supervised video anomaly detection (WSVAD) presents a challenging task focused on detecting frame-level anomalies using only video-level labels. However, existing methods focus mainly on visual modalities, neglecting rich multi-modality information. This paper proposes a novel framework, Cross-Modality Heterogeneous Knowledge Fusion (CMHKF), that integrates cross-modality knowledge from video, audio, and text to improve anomaly detection and localization. To achieve adaptive cross-modality heterogeneous knowledge learning, we designed two components: Cross-Modality Video-Text Knowledge Alignment (CVKA) and Audio Modality Feature Adaptive Extraction (AFAE). They extract and aggregate features by exploring inter-modality correlations. By leveraging abundant cross-modality knowledge, our approach improves the discrimination between normal and anomalous segments. Extensive experiments on XD-Violence show our method significantly enhances accuracy and robustness in both coarse-grained and fine-grained anomaly detection.
pdf
bib
abs
CLaSp: In-Context Layer Skip for Self-Speculative Decoding
Longze Chen
|
Renke Shan
|
Huiming Wang
|
Lu Wang
|
Ziqiang Liu
|
Run Luo
|
Jiawei Wang
|
Hamid Alinejad-Rokny
|
Min Yang
Speculative decoding (SD) is a promising method for accelerating the decoding process of Large Language Models (LLMs). The efficiency of SD primarily hinges on the consistency between the draft model and the verify model. However, existing drafting approaches typically require additional modules to be trained, which can be challenging to implement and ensure compatibility across various LLMs. In this paper, we propose CLaSp, an in-context layer-skipping strategy for self-speculative decoding. Unlike prior methods, CLaSp does not require additional drafting modules or extra training. Instead, it employs a plug-and-play mechanism by skipping intermediate layers of the verify model to construct a compressed draft model. Specifically, we develop a dynamic programming algorithm that optimizes the layer-skipping process by leveraging the complete hidden states from the last verification stage as an objective. This enables CLaSp to dynamically adjust its layer-skipping strategy after each verification stage, without relying on pre-optimized sets of skipped layers. Experimental results across diverse downstream tasks demonstrate that CLaSp achieves a speedup of 1.3× ∼ 1.7× on LLaMA3 series models without altering the original distribution of the generated text.
pdf
bib
abs
Teaching Text Agents to Learn Sequential Decision Making from Failure
Canasai Kruengkrai
|
Koichiro Yoshino
Text-based reinforcement-learning agents improve their policies by interacting with their environments to collect more training data. However, these self-collected data inevitably contain intermediate failed actions caused by attempting physically infeasible behaviors and/or hallucinations. Directly learning a policy from such trajectories can reinforce incorrect behaviors and reduce task success rates. In this paper, we propose a failed action-aware objective that suppresses the negative impact of failed actions during training by assigning zero return based on textual feedback. Building on this objective, we introduce a perturbation method that leverages unsuccessful trajectories to construct new successful ones that share the same goal. This allows agents to benefit from diverse experiences without further interaction with the environment. Experiments in ALFWorld and ScienceWorld demonstrate that our method significantly outperforms strong baselines and generalizes across environments. Code is available at https://github.com/riken-grp/text-agent.
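The idea of assigning zero return to failed actions can be sketched as follows. This is an illustrative reading of the abstract, not the authors' objective: the function name, the discount factor, and the assumption that failure flags have already been derived from textual feedback are all hypothetical.

```python
def returns_with_failed_actions(rewards, failed, gamma=0.99):
    """Compute the discounted return-to-go for each step of a trajectory,
    then overwrite it with zero for steps flagged as failed actions.

    rewards: per-step rewards; failed: per-step booleans (assumed to come
    from parsing the environment's textual feedback)."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        # A failed action gets zero return, so it is not reinforced,
        # while earlier valid steps still see the downstream reward.
        returns[t] = 0.0 if failed[t] else running
    return returns

# Three-step trajectory: the second action failed, the third reached the goal.
print(returns_with_failed_actions([0.0, 0.0, 1.0], [False, True, False], gamma=0.5))
```

In this sketch the failed step is masked only in the training signal; the running discounted sum is left untouched so that valid steps before the failure keep their credit. Whether the paper makes the same choice is not stated in the abstract.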
pdf
bib
abs
The Harmonic Structure of Information Contours
Eleftheria Tsipidi
|
Samuel Kiegeland
|
Franz Nowak
|
Tianyang Xu
|
Ethan Wilcox
|
Alex Warstadt
|
Ryan Cotterell
|
Mario Giulianelli
The uniform information density (UID) hypothesis proposes that speakers aim to distribute information evenly throughout a text, balancing production effort and listener comprehension difficulty. However, language typically does not maintain a strictly uniform information rate; instead, it fluctuates around a global average. These fluctuations are often explained by factors such as syntactic constraints, stylistic choices, or audience design. In this work, we explore an alternative perspective: that these fluctuations may be influenced by an implicit linguistic pressure towards periodicity, where the information rate oscillates at regular intervals, potentially across multiple frequencies simultaneously. We apply harmonic regression and introduce a novel extension called time scaling to detect and test for such periodicity in information contours. Analyzing texts in English, Spanish, German, Dutch, Basque, and Brazilian Portuguese, we find consistent evidence of periodic patterns in information rate. Many dominant frequencies align with discourse structure, suggesting these oscillations reflect meaningful linguistic organization. Beyond highlighting the connection between information rate and discourse structure, our approach offers a general framework for uncovering structural pressures at various levels of linguistic granularity.
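Harmonic regression, as mentioned in this abstract, fits sinusoids of chosen frequencies to a sequence. The sketch below fits a single frequency to a toy contour by ordinary least squares. It is a generic textbook formulation under stated assumptions: the function names and the synthetic data are hypothetical, and the paper's time-scaling extension is not represented.

```python
import math

def solve_linear(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                factor = M[r][col] / M[col][col]
                M[r] = [x - factor * y for x, y in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]

def fit_harmonic(y, freq):
    """Least-squares fit of y[t] ~ a + b*cos(2*pi*freq*t) + c*sin(2*pi*freq*t).
    Returns the coefficients [a, b, c]."""
    X = [[1.0, math.cos(2 * math.pi * freq * t), math.sin(2 * math.pi * freq * t)]
         for t in range(len(y))]
    # Normal equations: (X^T X) beta = X^T y
    XtX = [[sum(row[i] * row[j] for row in X) for j in range(3)] for i in range(3)]
    Xty = [sum(row[i] * val for row, val in zip(X, y)) for i in range(3)]
    return solve_linear(XtX, Xty)

# Synthetic "information contour": mean 2.0 plus a cosine of amplitude 3.0.
contour = [2.0 + 3.0 * math.cos(2 * math.pi * 0.1 * t) for t in range(40)]
a, b, c = fit_harmonic(contour, freq=0.1)
```

Fitting several frequencies at once would add one cosine and one sine column per frequency to `X`; the amplitude at each frequency is then `sqrt(b**2 + c**2)`.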
pdf
bib
abs
REAL-MM-RAG: A Real-World Multi-Modal Retrieval Benchmark
Navve Wasserman
|
Roi Pony
|
Oshri Naparstek
|
Adi Raz Goldfarb
|
Eli Schwartz
|
Udi Barzelay
|
Leonid Karlinsky
Accurate multi-modal document retrieval is crucial for Retrieval-Augmented Generation (RAG), yet existing benchmarks do not fully capture real-world challenges with their current design. We introduce REAL-MM-RAG, an automatically generated benchmark designed to address four key properties essential for real-world retrieval: (i) multi-modal documents, (ii) enhanced difficulty, (iii) Realistic-RAG queries and (iv) accurate labeling. Additionally, we propose a multi-difficulty-level scheme based on query rephrasing to evaluate models’ semantic understanding beyond keyword matching. Our benchmark reveals significant model weaknesses, particularly in handling table-heavy documents and robustness to query rephrasing. To mitigate these shortcomings, we curate a rephrased training set and introduce a new finance-focused, table-heavy dataset. Fine-tuning on these datasets enables models to achieve state-of-the-art retrieval performance on the REAL-MM-RAG benchmark. Our work offers a better way to evaluate and improve retrieval in multi-modal RAG systems while also providing training data and models that address current limitations.
pdf
bib
abs
Only a Little to the Left: A Theory-grounded Measure of Political Bias in Large Language Models
Mats Faulborn
|
Indira Sen
|
Max Pellert
|
Andreas Spitz
|
David Garcia
Prompt-based language models like GPT4 and LLaMa have been used for a wide variety of use cases such as simulating agents, searching for information, or for content analysis. For all of these applications and others, political biases in these models can affect their performance. Several researchers have attempted to study political bias in language models using evaluation suites based on surveys, such as the Political Compass Test (PCT), often finding a particular leaning favored by these models. However, there is some variation in the exact prompting techniques, leading to diverging findings, and most research relies on constrained-answer settings to extract model responses. Moreover, the Political Compass Test is not a scientifically valid survey instrument. In this work, we contribute a political bias measure informed by political science theory, building on survey design principles to test a wide variety of input prompts, while taking into account prompt sensitivity. We then prompt 11 different open and commercial models, differentiating between instruction-tuned and non-instruction-tuned models, and automatically classify their political stances from 88,110 responses. Leveraging this dataset, we compute political bias profiles across different prompt variations and find that while PCT exaggerates bias in certain models like GPT3.5, measures of political bias are often unstable, but generally more left-leaning for instruction-tuned models. Code and data are available at https://github.com/MaFa211/theory_grounded_pol_bias.
pdf
bib
abs
LongSafety: Evaluating Long-Context Safety of Large Language Models
Yida Lu
|
Jiale Cheng
|
Zhexin Zhang
|
Shiyao Cui
|
Cunxiang Wang
|
Xiaotao Gu
|
Yuxiao Dong
|
Jie Tang
|
Hongning Wang
|
Minlie Huang
As Large Language Models (LLMs) continue to advance in understanding and generating long sequences, new safety concerns have been introduced through the long context. However, the safety of LLMs in long-context tasks remains under-explored, leaving a significant gap in both evaluation and improvement of their safety. To address this, we introduce LongSafety, the first comprehensive benchmark specifically designed to evaluate LLM safety in open-ended long-context tasks. LongSafety encompasses 7 categories of safety issues and 6 user-oriented long-context tasks, with a total of 1,543 test cases, averaging 5,424 words per context. Our evaluation of 16 representative LLMs reveals significant safety vulnerabilities, with most models achieving safety rates below 55%. Our findings also indicate that strong safety performance in short-context scenarios does not necessarily correlate with safety in long-context tasks, emphasizing the unique challenges and urgency of improving long-context safety. Moreover, through extensive analysis, we identify challenging safety issues and task types for long-context models. Furthermore, we find that relevant context and extended input sequences can exacerbate safety risks in long-context scenarios, highlighting the critical need for ongoing attention to long-context safety challenges. Our code and data will be publicly available.
pdf
bib
abs
Exploiting Contextual Knowledge in LLMs through 𝒱-usable Information based Layer Enhancement
Xiaowei Yuan
|
Zhao Yang
|
Ziyang Huang
|
Yequan Wang
|
Siqi Fan
|
Yiming Ju
|
Jun Zhao
|
Kang Liu
Large Language Models (LLMs) have demonstrated remarkable capabilities in various tasks, yet they often struggle with context-faithful generation that properly reflects contextual knowledge. While existing approaches focus on enhancing the decoding strategies, they ignore the fundamental mechanism of how contextual information is processed within LLMs’ internal states. As a result, LLMs remain limited in their ability to fully leverage contextual knowledge. In this paper, we propose Context-aware Layer Enhancement (CaLE), a novel intervention method that enhances the utilization of contextual knowledge within LLMs’ internal representations. By employing 𝒱-usable information analysis, CaLE strategically amplifies the growth of contextual information at an optimal layer, thereby enriching representations in the final layer. Our experiments demonstrate that CaLE effectively improves context-faithful generation in Question-Answering tasks, particularly in scenarios involving unknown or conflicting contextual knowledge.
pdf
bib
abs
Unintended Harms of Value-Aligned LLMs: Psychological and Empirical Insights
Sooyung Choi
|
Jaehyeok Lee
|
Xiaoyuan Yi
|
Jing Yao
|
Xing Xie
|
JinYeong Bak
The application scope of Large Language Models (LLMs) continues to expand, leading to increasing interest in personalized LLMs that align with human values. However, aligning these models with individual values raises significant safety concerns, as certain values may correlate with harmful information. In this paper, we identify specific safety risks associated with value-aligned LLMs and investigate the psychological principles behind these challenges. Our findings reveal two key insights. (1) Value-aligned LLMs are more prone to harmful behavior compared to non-fine-tuned models and exhibit slightly higher risks in traditional safety evaluations than other fine-tuned models. (2) These safety issues arise because value-aligned LLMs genuinely generate text according to the aligned values, which can amplify harmful outcomes. Using a dataset with detailed safety categories, we find significant correlations between value alignment and safety risks, supported by psychological hypotheses. This study offers insights into the “black box” of value alignment and proposes in-context alignment methods to enhance the safety of value-aligned LLMs.
pdf
bib
abs
Maximal Matching Matters: Preventing Representation Collapse for Robust Cross-Modal Retrieval
Hani Alomari
|
Anushka Sivakumar
|
Andrew Zhang
|
Chris Thomas
Cross-modal image-text retrieval is challenging because of the diverse possible associations between content from different modalities. Traditional methods learn a single-vector embedding to represent semantics of each sample, but struggle to capture nuanced and diverse relationships that can exist across modalities. Set-based approaches, which represent each sample with multiple embeddings, offer a promising alternative, as they can capture richer and more diverse relationships. In this paper, we show that, despite their promise, these set-based representations continue to face issues including sparse supervision and set collapse, which limit their effectiveness. To address these challenges, we propose Maximal Pair Assignment Similarity to optimize one-to-one matching between embedding sets, which preserves semantic diversity within the set. We also introduce two loss functions to further enhance the representations: Global Discriminative Loss to enhance distinction among embeddings, and Intra-Set Divergence Loss to prevent collapse within each set. Our method achieves state-of-the-art performance on MS-COCO and Flickr30k without relying on external data.
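One-to-one matching between two embedding sets can be illustrated with a brute-force sketch. The abstract does not give the exact definition of Maximal Pair Assignment Similarity, so the scoring below (mean dot product under the best permutation) and all names are assumptions; a practical implementation would use an assignment solver such as the Hungarian algorithm rather than enumerating permutations.

```python
from itertools import permutations

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def best_one_to_one_score(set_a, set_b):
    """Maximum mean similarity over all one-to-one pairings of two
    equal-size embedding sets (brute force, only viable for small sets)."""
    n = len(set_a)
    if n != len(set_b):
        raise ValueError("sets must have the same number of embeddings")
    best = max(
        sum(dot(set_a[i], set_b[p[i]]) for i in range(n))
        for p in permutations(range(n))
    )
    return best / n

image_set = [[1.0, 0.0], [0.0, 1.0]]   # two distinct toy embeddings
diverse_text = [[0.0, 1.0], [1.0, 0.0]]   # covers both directions
collapsed_text = [[1.0, 0.0], [1.0, 0.0]]  # both embeddings identical

print(best_one_to_one_score(image_set, diverse_text))
print(best_one_to_one_score(image_set, collapsed_text))
```

Because each embedding may be used only once, a collapsed set cannot match every element of a diverse set, which is why the second score is lower than the first in this toy example.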
pdf
bib
abs
The Noisy Path from Source to Citation: Measuring How Scholars Engage with Past Research
Hong Chen
|
Misha Teplitskiy
|
David Jurgens
Academic citations are widely used for evaluating research and tracing knowledge flows. Such uses typically rely on raw citation counts and neglect variability in citation types. In particular, citations can vary in their fidelity as original knowledge from cited studies may be paraphrased, summarized, or reinterpreted, possibly wrongly, leading to variation in how much information changes from cited to citing paper. In this study, we introduce a computational pipeline to quantify citation fidelity at scale. Using full texts of papers, the pipeline identifies citations in citing papers and the corresponding claims in cited papers, and applies supervised models to measure fidelity at the sentence level. Analyzing a large-scale multi-disciplinary dataset of approximately 13 million citation sentence pairs, we find that citation fidelity is higher when 1) the cited papers are more recent and intellectually close, 2) the cited papers are more accessible, and 3) the first author has a lower H-index and the author team is medium-sized. Using a quasi-experiment, we establish the “telephone effect”: when citing papers have low fidelity to the original claim, future papers that cite the citing paper and the original have lower fidelity to the original. Our work reveals systematic differences in citation fidelity, underscoring the limitations of analyses that rely on citation quantity alone and the potential for distortion of evidence.
pdf
bib
abs
MAPLE: Enhancing Review Generation with Multi-Aspect Prompt LEarning in Explainable Recommendation
Ching-Wen Yang
|
Zhi-Quan Feng
|
Ying-Jia Lin
|
Che Wei Chen
|
Kun-da Wu
|
Hao Xu
|
Yao Jui-Feng
|
Hung-Yu Kao
The explainable recommendation task takes a user-item pair as input and outputs explanations that justify why the item is recommended to the user. Many models approach review generation as a proxy for explainable recommendations. While these models can produce fluent and grammatically correct sentences, they often lack precision and fail to provide personalized informative recommendations. To address this issue, we propose a personalized, aspect-controlled model called Multi-Aspect Prompt LEarner (MAPLE), which integrates aspect category as another input dimension to facilitate memorizing fine-grained aspect terms. Experiments conducted on two real-world review datasets in the restaurant domain demonstrate that MAPLE significantly outperforms baseline review-generation models. MAPLE excels in both text and feature diversity, ensuring that the generated content covers a wide range of aspects. Additionally, MAPLE delivers good generation quality while maintaining strong coherence and factual relevance. The code and dataset used in this paper can be found at https://github.com/Nana2929/MAPLE.
pdf
bib
abs
Separating Tongue from Thought: Activation Patching Reveals Language-Agnostic Concept Representations in Transformers
Clément Dumas
|
Chris Wendler
|
Veniamin Veselovsky
|
Giovanni Monea
|
Robert West
A central question in multilingual language modeling is whether large language models (LLMs) develop a universal concept representation, disentangled from specific languages. In this paper, we address this question by analyzing latent representations (latents) during a word-translation task in transformer-based LLMs. We strategically extract latents from a source translation prompt and insert them into the forward pass on a target translation prompt. By doing so, we find that the output language is encoded in the latent at an earlier layer than the concept to be translated. Building on this insight, we conduct two key experiments. First, we demonstrate that we can change the concept without changing the language and vice versa through activation patching alone. Second, we show that patching with the mean representation of a concept across different languages does not affect the models’ ability to translate it, but instead improves it. Finally, we generalize to multi-token generation and demonstrate that the model can generate natural language descriptions of those mean representations. Our results provide evidence for the existence of language-agnostic concept representations within the investigated models.
pdf
bib
abs
Behavioural vs. Representational Systematicity in End-to-End Models: An Opinionated Survey
Ivan Vegner
|
Sydelle De Souza
|
Valentin Forch
|
Martha Lewis
|
Leonidas A. A. Doumas
A core aspect of compositionality, systematicity is a desirable property in ML models as it enables strong generalization to novel contexts. This has led to numerous studies proposing benchmarks to assess systematic generalization, as well as models and training regimes designed to enhance it. Many of these efforts are framed as addressing the challenge posed by Fodor and Pylyshyn. However, while they argue for systematicity of representations, existing benchmarks and models primarily focus on the systematicity of behaviour. We emphasize the crucial nature of this distinction. Furthermore, building on Hadley’s (1994) taxonomy of systematic generalization, we analyze the extent to which behavioural systematicity is tested by key benchmarks in the literature across language and vision. Finally, we highlight ways of assessing systematicity of representations in ML models as practiced in the field of mechanistic interpretability.
pdf
bib
abs
Dynamic Chunking and Selection for Reading Comprehension of Ultra-Long Context in Large Language Models
Boheng Sheng
|
Jiacheng Yao
|
Meicong Zhang
|
Guoxiu He
Large language models (LLMs) often struggle to accurately read and comprehend extremely long texts. Current methods for improvement typically rely on splitting long contexts into fixed-length chunks. However, fixed truncation risks separating semantically relevant content, leading to ambiguity and compromising accurate understanding. To overcome this limitation, we propose a straightforward approach for dynamically separating and selecting chunks of long context, facilitating a more streamlined input for LLMs. In particular, we compute semantic similarities between adjacent sentences, using lower similarities to adaptively divide long contexts into variable-length chunks. We further train a question-aware classifier to select sensitive chunks that are critical for answering specific questions. Experimental results on both single-hop and multi-hop question-answering benchmarks show that the proposed approach consistently outperforms strong baselines. Notably, it maintains robustness across a wide range of input lengths, handling sequences of up to 256k tokens. Our datasets and code are available at the following link: https://github.com/ECNU-Text-Computing/DCS
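The chunking step described in this abstract can be illustrated with a minimal sketch. It assumes a toy bag-of-words `embed` function and a fixed `threshold`, both hypothetical stand-ins for whatever sentence encoder and boundary rule the paper actually uses; the question-aware chunk selector is not shown.

```python
import math
import re
from collections import Counter


def embed(sentence):
    """Toy bag-of-words vector; a hypothetical stand-in for a real sentence encoder."""
    return Counter(re.findall(r"[a-z0-9]+", sentence.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def dynamic_chunks(sentences, threshold=0.2):
    """Start a new chunk wherever adjacent-sentence similarity falls below threshold."""
    if not sentences:
        return []
    chunks = [[sentences[0]]]
    for prev, cur in zip(sentences, sentences[1:]):
        if cosine(embed(prev), embed(cur)) < threshold:
            chunks.append([cur])      # low similarity: chunk boundary
        else:
            chunks[-1].append(cur)    # similar enough: extend the current chunk
    return chunks


sentences = [
    "The cat sat on the mat.",
    "The cat then slept on the mat.",
    "Stock prices fell sharply today.",
    "Investors sold stock as prices fell.",
]
chunks = dynamic_chunks(sentences)
for chunk in chunks:
    print(chunk)
```

On this toy input the boundary falls between the second and third sentences, where the two sentences share no vocabulary, so the chunks have variable length determined by content rather than by a fixed token budget.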
pdf
bib
abs
DualRAG: A Dual-Process Approach to Integrate Reasoning and Retrieval for Multi-Hop Question Answering
Rong Cheng
|
Jinyi Liu
|
Yan Zheng
|
Fei Ni
|
Jiazhen Du
|
Hangyu Mao
|
Fuzheng Zhang
|
Bo Wang
|
Jianye Hao
Multi-Hop Question Answering (MHQA) tasks permeate real-world applications, posing challenges in orchestrating multi-step reasoning across diverse knowledge domains. While existing approaches have been improved with iterative retrieval, they still struggle to identify and organize dynamic knowledge. To address this, we propose DualRAG, a synergistic dual-process framework that seamlessly integrates reasoning and retrieval. DualRAG operates through two tightly coupled processes: Reasoning-augmented Querying (RaQ) and progressive Knowledge Aggregation (pKA). They work in concert: as RaQ navigates the reasoning path and generates targeted queries, pKA ensures that newly acquired knowledge is systematically integrated to support coherent reasoning. This creates a virtuous cycle of knowledge enrichment and reasoning refinement. Through targeted fine-tuning, DualRAG preserves its sophisticated reasoning and retrieval capabilities even in smaller-scale models, demonstrating its versatility and core advantages across different scales. Extensive experiments demonstrate that this dual-process approach substantially improves answer accuracy and coherence, approaching, and in some cases surpassing, the performance achieved with oracle knowledge access. These results establish DualRAG as a robust and efficient solution for complex multi-hop reasoning tasks.
pdf
bib
abs
Deliberate Reasoning in Language Models as Structure-Aware Planning with an Accurate World Model
Siheng Xiong
|
Ali Payani
|
Yuan Yang
|
Faramarz Fekri
Enhancing the reasoning capabilities of language models (LMs) remains a key challenge, especially for tasks that require complex, multi-step decision-making where existing Chain-of-Thought (CoT) approaches struggle with consistency and verification. In this paper, we propose a novel reasoning framework, referred to as Structure-aware Planning with an Accurate World Model (SWAP), that integrates structured knowledge representation with learned planning. Unlike prior methods that rely purely on natural language reasoning, SWAP leverages entailment graphs to encode structured dependencies and enable symbolic verification of intermediate steps. To systematically construct and update the graph, SWAP employs a policy model to propose candidate expansions and a world model to predict structural updates. To improve accuracy, the world model generates multiple alternative updates, and a discriminator re-ranks them based on plausibility. To encourage diverse exploration, we introduce Diversity-based Modelling (DM), which samples candidates from the remaining probability mass after removing previously sampled candidates from the original policy distribution. Additionally, SWAP improves the discrimination accuracy through Contrastive Ranking (CR), which directly compares candidates within prompts and incorporates meta-knowledge to improve ranking quality. We evaluate SWAP across diverse reasoning-intensive benchmarks including math reasoning, logical reasoning, and coding tasks. Extensive experiments demonstrate that SWAP significantly improves upon the base models and consistently outperforms existing reasoning methods.
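One simple reading of sampling "from the remaining probability mass" is sketched below: candidates that were already drawn are removed from the policy distribution and the leftover probabilities are renormalized before the next draw. The dictionary-based `policy`, the function names, and the candidate labels are illustrative assumptions; the paper's Diversity-based Modelling may differ in detail.

```python
import random


def remaining_distribution(policy, already_sampled):
    """Drop already-sampled candidates and renormalize the leftover mass."""
    remaining = {c: p for c, p in policy.items() if c not in already_sampled}
    total = sum(remaining.values())
    if total == 0:
        return {}
    return {c: p / total for c, p in remaining.items()}


def sample_diverse(policy, k, rng):
    """Draw up to k distinct candidates, each from the renormalized remainder."""
    sampled = []
    for _ in range(k):
        dist = remaining_distribution(policy, sampled)
        if not dist:
            break  # nothing left to sample
        candidates = list(dist)
        weights = [dist[c] for c in candidates]
        sampled.append(rng.choices(candidates, weights=weights, k=1)[0])
    return sampled


# Hypothetical policy distribution over three candidate next steps.
policy = {"step A": 0.7, "step B": 0.2, "step C": 0.1}
picks = sample_diverse(policy, k=3, rng=random.Random(0))
print(picks)
```

Because each draw excludes earlier picks, the high-probability candidate cannot be sampled repeatedly, which is the exploration benefit the abstract attributes to this component.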
pdf
bib
abs
Refining Salience-Aware Sparse Fine-Tuning Strategies for Language Models
Xinxin Liu
|
Aaron Thomas
|
Cheng Zhang
|
Jianyi Cheng
|
Yiren Zhao
|
Xitong Gao
Parameter-Efficient Fine-Tuning (PEFT) has gained prominence through low-rank adaptation methods like LoRA. In this paper, we focus on sparsity-based PEFT (SPEFT), which introduces trainable sparse adaptations to the weight matrices in the model, offering greater flexibility in selecting fine-tuned parameters compared to low-rank methods. We conduct the first systematic evaluation of salience metrics for SPEFT, inspired by zero-cost NAS proxies, and find that simple gradient-based metrics are reliable, with results on par with the best alternatives, offering both computational efficiency and robust performance. Additionally, we compare static and dynamic masking strategies, finding that static masking, which predetermines non-zero entries before training, delivers efficiency without sacrificing performance, while dynamic masking offers no substantial benefits. Across NLP tasks, a simple gradient-based, static SPEFT consistently outperforms other fine-tuning methods for LLMs, providing a simple yet effective baseline for SPEFT. Our work challenges the notion that complexity is necessary for effective PEFT, while our open-source framework establishes a reproducible benchmark for future research.
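The idea of a gradient-based salience metric combined with static masking can be sketched as follows. This is a minimal illustration on a flat list of weights, assuming salience is the absolute gradient from a single calibration pass and that the top-k entries are frozen as the trainable set before training; the function names and numbers are hypothetical and this is not the paper's framework.

```python
def static_mask(grads, k):
    """Mark the k entries with the largest |gradient| as trainable."""
    ranked = sorted(range(len(grads)), key=lambda i: abs(grads[i]), reverse=True)
    keep = set(ranked[:k])
    return [i in keep for i in range(len(grads))]


def sparse_sgd_step(weights, grads, mask, lr=0.1):
    """Apply an SGD update only where the fixed mask is True."""
    return [w - lr * g if m else w for w, g, m in zip(weights, grads, mask)]


weights = [0.5, -1.0, 0.25, 2.0]
calib_grads = [0.01, -0.9, 0.3, -0.05]    # hypothetical calibration gradients

mask = static_mask(calib_grads, k=2)      # computed once, before training
updated = sparse_sgd_step(weights, calib_grads, mask)
print(mask, updated)
```

The mask is computed once and reused for every step, which is what distinguishes static masking from dynamic masking, where the selected entries would be recomputed during training.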
pdf
bib
abs
Efficient Many-Shot In-Context Learning with Dynamic Block-Sparse Attention
Emily Xiao
|
Chin-Jou Li
|
Yilin Zhang
|
Graham Neubig
|
Amanda Bertsch
Many-shot in-context learning has recently shown promise as an alternative to finetuning, with the major advantage that the same model can be served for multiple tasks. However, this shifts the computational burden from training time to inference time, making deployment of many-shot ICL challenging to justify in practice. This cost is further increased if a custom demonstration set is retrieved for each inference example. We present Dynamic Block-Sparse Attention, an optimized method for retrieval-based many-shot in-context learning. By combining carefully designed block-sparse attention and retrieval of cached groups of demonstrations, we achieve comparable per-example latency to finetuning while maintaining on average >95% of the best method’s accuracy across strong ICL and finetuning baselines. We hope that this will further enable the deployment of many-shot ICL at scale.
pdf
bib
abs
ScaleBiO: Scalable Bilevel Optimization for LLM Data Reweighting
Rui Pan
|
Dylan Zhang
|
Hanning Zhang
|
Xingyuan Pan
|
Minrui Xu
|
Jipeng Zhang
|
Renjie Pi
|
Xiaoyu Wang
|
Tong Zhang
Bilevel optimization has shown its utility across various machine learning settings, yet most algorithms in practice require second-order information, making it challenging to scale them up. Only recently, a paradigm of first-order algorithms has emerged in the theoretical literature, capable of effectively addressing bilevel optimization problems. Nevertheless, the practical efficiency of this paradigm remains unverified, particularly in the context of large language models (LLMs). This paper introduces the first scalable instantiation of this paradigm called ScaleBiO, focusing on bilevel optimization for large-scale LLM data reweighting. By combining with a recently proposed memory-efficient training technique called LISA, our novel algorithm allows the paradigm to scale to ~30B-sized LLMs on 8×H100 GPUs, marking the first successful application of bilevel optimization under practical scenarios for large-sized LLMs. Empirically, extensive experiments on data reweighting verify the effectiveness of ScaleBiO for different-scaled models, including Llama-3-8B, Gemma-2-9B, Qwen-2-7B, and Qwen-2.5-32B, where bilevel optimization succeeds in instruction-following and math reasoning tasks, outperforming several popular baselines, including uniform sampling, influence-aware data filtering, and reference-model-based sampling methods. Theoretically, ScaleBiO ensures the optimality of the learned data weights, along with a convergence guarantee matching the conventional first-order bilevel optimization paradigm on smooth and strongly convex objectives.
pdf
bib
abs
PKU-SafeRLHF: Towards Multi-Level Safety Alignment for LLMs with Human Preference
Jiaming Ji
|
Donghai Hong
|
Borong Zhang
|
Boyuan Chen
|
Josef Dai
|
Boren Zheng
|
Tianyi Alex Qiu
|
Jiayi Zhou
|
Kaile Wang
|
Boxun Li
|
Sirui Han
|
Yike Guo
|
Yaodong Yang
In this work, we introduce the PKU-SafeRLHF dataset, designed to promote research on safety alignment in large language models (LLMs). As a sibling project to SafeRLHF and BeaverTails, we separate annotations of helpfulness and harmlessness for question-answering pairs, providing distinct perspectives on these coupled attributes. Overall, we provide 44.6k refined prompts and 265k question-answer pairs with safety meta-labels for 19 harm categories and three severity levels ranging from minor to severe, with answers generated by Llama-family models. Based on this, we collected 166.8k preference data, including dual-preference data (helpfulness and harmlessness decoupled) and single-preference data (helpfulness and harmlessness traded off from scratch). Using the large-scale annotation data, we further train severity-sensitive moderation for the risk control of LLMs and safety-centric RLHF algorithms for the safety alignment of LLMs. We believe this dataset will be a valuable resource for the community, aiding in the safe deployment of LLMs.
pdf
bib
abs
What Happened in LLMs Layers when Trained for Fast vs. Slow Thinking: A Gradient Perspective
Ming Li
|
Yanhong Li
|
Tianyi Zhou
What makes a difference in the post-training of LLMs? We investigate the training patterns of different layers in large language models (LLMs) through the lens of the gradient. We are specifically interested in how fast vs. slow thinking affects the layer-wise gradients, given the recent popularity of training LLMs on reasoning paths such as chain-of-thoughts (CoT) and process rewards. In our study, fast thinking without CoT leads to larger gradients and larger differences of gradients across layers than slow thinking (Detailed CoT), indicating the learning stability brought by the latter. Additionally, we study whether the gradient patterns can reflect the correctness of responses when training different LLMs using slow vs. fast thinking paths. The results show that the gradients of slow thinking can distinguish correct and irrelevant reasoning paths. As a comparison, we conduct similar gradient analyses on non-reasoning knowledge learning tasks, on which, however, trivially increasing the response length does not lead to similar behaviors of slow thinking. Our study strengthens fundamental understandings of LLM training and sheds novel insights on its efficiency and stability, which pave the way towards building a generalizable System-2 agent.
pdf
bib
abs
Beyond Text Compression: Evaluating Tokenizers Across Scales
Jonas F. Lotz
|
António V. Lopes
|
Stephan Peitz
|
Hendra Setiawan
|
Leonardo Emili
The choice of tokenizer can profoundly impact language model performance, yet accessible and reliable evaluations of tokenizer quality remain an open challenge. Inspired by scaling consistency, we show that smaller models can accurately predict significant differences in tokenizer impact on larger models at a fraction of the compute cost. By systematically evaluating both English-centric and multilingual tokenizers, we find that tokenizer choice has negligible effects on tasks in English but results in consistent performance differences in multilingual settings. We propose new intrinsic tokenizer metrics inspired by Zipf’s law that correlate more strongly with downstream performance than text compression when modeling unseen languages. By combining several metrics to capture multiple aspects of tokenizer behavior, we develop a reliable framework for intrinsic tokenizer evaluations. Our work offers a more efficient path to informed tokenizer selection in future language model development.
pdf
bib
abs
Emergent Abilities of Large Language Models under Continued Pre-training for Language Adaptation
Ahmed Elhady
|
Eneko Agirre
|
Mikel Artetxe
Continued pretraining (CPT) is a popular approach to adapt existing large language models (LLMs) to new languages. When doing so, it is common practice to include a portion of English data in the mixture, but its role has not been carefully studied to date. In this work, we show that including English does not impact validation perplexity, yet it is critical for the emergence of downstream capabilities in the target language. We introduce a language-agnostic benchmark for in-context learning (ICL), which reveals catastrophic forgetting early in CPT when English is not included. This in turn damages the ability of the model to generalize to downstream prompts as measured by perplexity, even if it does not manifest in terms of accuracy until later in training, and can be tied to a big shift in the model parameters. Based on these insights, we introduce curriculum learning and exponential moving average (EMA) of weights as effective alternatives to mitigate the need for English. All in all, our work sheds light on the dynamics by which emergent abilities arise when doing CPT for language adaptation, and can serve as a foundation to design more effective methods in the future.
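For readers unfamiliar with weight averaging, the standard exponential moving average update can be sketched in a few lines. The `decay` value and the toy two-parameter "model" are assumptions chosen for illustration; the abstract does not state the schedule or decay used in the paper.

```python
def ema_update(ema_weights, new_weights, decay=0.9):
    """Standard EMA: keep `decay` of the running average, blend in the rest."""
    return [decay * e + (1.0 - decay) * w for e, w in zip(ema_weights, new_weights)]


# Toy example: the raw weights jump from [1, 0] to [0, 1];
# the EMA copy moves toward the new values only gradually.
ema = [1.0, 0.0]
for step_weights in ([0.0, 1.0], [0.0, 1.0]):
    ema = ema_update(ema, step_weights, decay=0.5)
print(ema)
```

The averaged copy lags behind abrupt changes in the raw weights, which is consistent with the abstract's observation that forgetting is tied to a large shift in the model parameters.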
pdf
bib
abs
R-Fairness: Assessing Fairness of Ranking in Subjective Data
Lorenzo Balzotti
|
Donatella Firmani
|
Jerin George Mathew
|
Riccardo Torlone
|
Sihem Amer-Yahia
Subjective data, reflecting individual opinions, permeates platforms like Yelp and Amazon, influencing everyday decisions. Upon a user query, collaborative rating platforms return a collection of items ranked in an order that is often not transparent to the users. Then, each item is presented with a collection of reviews in an order that typically is, again, rather opaque. Despite the prevalence of such platforms, little attention has been given to fairness in their context, where groups writing best-ranked reviews for best-ranked items have more influence on users’ behavior. We design and evaluate a fairness assessment pipeline that starts with a data collection phase to gather reviews from real-world platforms, by submitting artificial user queries and iterating through rated items. Following that, a group assignment phase computes and infers relevant groups for each review, based on review content and user data. Finally, the third step assesses and evaluates the fairness of rankings for different user groups. The key contributions are comparing group exposure for different queries and platforms and comparing how popular fairness definitions behave in different settings. Experiments on real datasets reveal insights into the impact of item ranking on fairness computation and the varying robustness of these measures.
pdf
bib
abs
RePanda: Pandas-powered Tabular Verification and Reasoning
Atoosa Chegini
|
Keivan Rezaei
|
Hamid Eghbalzadeh
|
Soheil Feizi
Fact-checking tabular data is essential for ensuring the accuracy of structured information in domains such as journalism, finance, and scientific research. However, existing methods often rely on black-box models with opaque reasoning. We introduce RePanda, a structured fact verification approach that translates claims into executable pandas queries, enabling interpretable and verifiable reasoning. To train RePanda, we construct PanTabFact, a structured dataset derived from TabFact, where claims are paired with executable queries generated using DeepSeek-Chat and refined through automated error correction. Fine-tuning DeepSeek-coder-7B-instruct-v1.5 on PanTabFact, RePanda achieves 84.09% accuracy on TabFact. To assess Out-of-Distribution (OOD) generalization, we create a dataset named WikiFact from WikiTableQuestions by transforming question-answer pairs into factual claims. Without additional fine-tuning, RePanda achieves 84.72% accuracy on WikiFact, significantly outperforming all other baselines and demonstrating strong OOD robustness. PanTabFact is publicly available on HuggingFace at datasets/AtoosaChegini/PanTabFact. Beyond fact verification, RePanda extends to tabular question answering by generating executable queries that retrieve precise answers. To support this, we introduce PanWiki, a dataset mapping WikiTableQuestions to pandas queries. Fine-tuning on PanWiki, RePanda achieves 75.1% accuracy in direct answer retrieval. These results highlight the effectiveness of structured execution-based reasoning for tabular verification and question answering.
pdf
bib
abs
Towards Style Alignment in Cross-Cultural Translation
Shreya Havaldar
|
Adam Stein
|
Eric Wong
|
Lyle Ungar
Successful communication depends on the speaker’s intended style (i.e., what the speaker is trying to convey) aligning with the listener’s interpreted style (i.e., what the listener perceives). However, cultural differences often lead to misalignment between the two; for example, politeness is often lost in translation. We characterize the ways that LLMs fail to translate style – biasing translations towards neutrality and performing worse in non-Western languages. We mitigate these failures with RASTA (Retrieval-Augmented STylistic Alignment), a method that leverages learned stylistic concepts to encourage LLM translation to appropriately convey cultural communication norms and align style.
pdf
bib
abs
TiC-LM: A Web-Scale Benchmark for Time-Continual LLM Pretraining
Jeffrey Li
|
Mohammadreza Armandpour
|
Seyed Iman Mirzadeh
|
Sachin Mehta
|
Vaishaal Shankar
|
Raviteja Vemulapalli
|
Samy Bengio
|
Oncel Tuzel
|
Mehrdad Farajtabar
|
Hadi Pouransari
|
Fartash Faghri
Large Language Models (LLMs) trained on historical web data inevitably become outdated. We investigate evaluation strategies and update methods for LLMs as new data becomes available. We introduce a web-scale dataset for time-continual pretraining of LLMs derived from 114 dumps of Common Crawl (CC) – orders of magnitude larger than previous continual language modeling benchmarks. We also design time-stratified evaluations across both general CC data and specific domains (Wikipedia, StackExchange, and code documentation) to assess how well various continual learning methods adapt to new data while retaining past knowledge. Our findings demonstrate that, on general CC data, autoregressive meta-schedules combined with a fixed-ratio replay of older data can achieve comparable held-out loss to re-training from scratch, while requiring significantly less computation (2.6x). However, the optimal balance between incorporating new data and replaying old data differs as replay is crucial to avoid forgetting on generic web data but less so on specific domains.
pdf
bib
abs
Entailed Between the Lines: Incorporating Implication into NLI
Shreya Havaldar
|
Hamidreza Alvari
|
John Palowitch
|
Mohammad Javad Hosseini
|
Senaka Buthpitiya
|
Alex Fabrikant
Much of human communication depends on implication, conveying meaning beyond literal words to express a wider range of thoughts, intentions, and feelings. For models to better understand and facilitate human communication, they must be responsive to the text’s implicit meaning. We focus on Natural Language Inference (NLI), a core tool for many language tasks, and find that state-of-the-art NLI models and datasets struggle to recognize a range of cases where entailment is implied, rather than explicit from the text. We formalize implied entailment as an extension of the NLI task and introduce the Implied NLI dataset (INLI) to help today’s LLMs both recognize a broader variety of implied entailments and to distinguish between implicit and explicit entailment. We show how LLMs fine-tuned on INLI understand implied entailment and can generalize this understanding across datasets and domains.
pdf
bib
abs
Multi-Level Explanations for Generative Language Models
Lucas Monteiro Paes
|
Dennis Wei
|
Hyo Jin Do
|
Hendrik Strobelt
|
Ronny Luss
|
Amit Dhurandhar
|
Manish Nagireddy
|
Karthikeyan Natesan Ramamurthy
|
Prasanna Sattigeri
|
Werner Geyer
|
Soumya Ghosh
Despite the increasing use of large language models (LLMs) for context-grounded tasks like summarization and question-answering, understanding what makes an LLM produce a certain response is challenging. We propose Multi-Level Explanations for Generative Language Models (MExGen), a technique to provide explanations for context-grounded text generation. MExGen assigns scores to parts of the context to quantify their influence on the model’s output. It extends attribution methods like LIME and SHAP to LLMs used in context-grounded tasks where (1) inference cost is high, (2) input text is long, and (3) the output is text. We conduct a systematic evaluation, both automated and human, of perturbation-based attribution methods for summarization and question answering. The results show that our framework can provide more faithful explanations of generated output than available alternatives, including LLM self-explanations. We open-source code for MExGen as part of the ICX360 toolkit: https://github.com/IBM/ICX360.
pdf
bib
abs
A Multi-Agent Framework for Mitigating Dialect Biases in Privacy Policy Question-Answering Systems
Đorđe Klisura
|
Astrid R Bernaga Torres
|
Anna Karen Gárate-Escamilla
|
Rajesh Roshan Biswal
|
Ke Yang
|
Hilal Pataci
|
Anthony Rios
Privacy policies inform users about data collection and usage, yet their complexity limits accessibility for diverse populations. Existing Privacy Policy Question Answering (QA) systems exhibit performance disparities across English dialects, disadvantaging speakers of non-standard varieties. We propose a novel multi-agent framework inspired by human-centered design principles to mitigate dialectal biases. Our approach integrates a Dialect Agent, which translates queries into Standard American English (SAE) while preserving dialectal intent, and a Privacy Policy Agent, which refines predictions using domain expertise. Unlike prior approaches, our method does not require retraining or dialect-specific fine-tuning, making it broadly applicable across models and domains. Evaluated on PrivacyQA and PolicyQA, our framework improves GPT-4o-mini’s zero-shot accuracy from 0.394 to 0.601 on PrivacyQA and from 0.352 to 0.464 on PolicyQA, surpassing or matching few-shot baselines without additional training data. These results highlight the effectiveness of structured agent collaboration in mitigating dialect biases and underscore the importance of designing NLP systems that account for linguistic diversity to ensure equitable access to privacy information.
pdf
bib
abs
Low-Bit Quantization Favors Undertrained LLMs
Xu Ouyang
|
Tao Ge
|
Thomas Hartvigsen
|
Zhisong Zhang
|
Haitao Mi
|
Dong Yu
Low-bit quantization improves machine learning model efficiency but surprisingly favors undertrained large language models (LLMs). Larger models or those trained on fewer tokens exhibit less quantization-induced degradation (QiD), while smaller, well-trained models face significant performance losses. To gain deeper insights into this trend, we study over 1500 quantized LLM checkpoints of various sizes and at different training levels (undertrained or fully trained) in a controlled setting, deriving scaling laws for understanding the relationship between QiD and three factors: the number of training tokens, model size, and bit width. With our derived scaling laws, we propose a novel perspective that we can use QiD to measure an LLM’s training levels and determine the number of training tokens required for fully training LLMs of various sizes. Moreover, we use the scaling laws to predict the quantization performance of different-sized LLMs trained with tokens. Our projection shows that the low-bit quantization performance of future models, which are expected to be trained with over 100 trillion tokens, may NOT be desirable. This poses a potential challenge for low-bit quantization in the future and highlights the need for awareness of a model’s training level when evaluating low-bit quantization research. To facilitate future research on this problem, we release all the 1500+ quantized checkpoints used in this work at https://huggingface.co/Xu-Ouyang.
pdf
bib
abs
LETS-C: Leveraging Text Embedding for Time Series Classification
Rachneet Kaur
|
Zhen Zeng
|
Tucker Balch
|
Manuela Veloso
Recent advancements in language modeling have shown promising results when applied to time series data. In particular, fine-tuning pre-trained large language models (LLMs) for time series classification tasks has achieved state-of-the-art (SOTA) performance on standard benchmarks. However, these LLM-based models have a significant drawback due to the large model size, with the number of trainable parameters in the millions. In this paper, we propose an alternative approach to leveraging the success of language modeling in the time series domain. Instead of fine-tuning LLMs, we utilize a text embedding model to embed time series and then pair the embeddings with a simple classification head composed of convolutional neural networks (CNN) and multilayer perceptron (MLP). We conducted extensive experiments on a well-established time series classification benchmark. We demonstrated LETS-C not only outperforms the current SOTA in classification accuracy but also offers a lightweight solution, using only 14.5% of the trainable parameters on average compared to the SOTA model. Our findings suggest that leveraging text embedding models to encode time series data, combined with a simple yet effective classification head, offers a promising direction for achieving high-performance time series classification while maintaining a lightweight model architecture.
pdf
bib
abs
UrbanVideo-Bench: Benchmarking Vision-Language Models on Embodied Intelligence with Video Data in Urban Spaces
Baining Zhao
|
Jianjie Fang
|
Zichao Dai
|
Ziyou Wang
|
Jirong Zha
|
Weichen Zhang
|
Chen Gao
|
Yue Wang
|
Jinqiang Cui
|
Xinlei Chen
|
Yong Li
Large multimodal models exhibit remarkable intelligence, yet their embodied cognitive abilities during motion in open-ended urban aerial spaces remain to be explored. We introduce a benchmark to evaluate whether video-large language models (Video-LLMs) can naturally process continuous first-person visual observations like humans, enabling recall, perception, reasoning, and navigation. We manually control drones to collect 3D embodied motion video data from real-world cities and simulated environments, resulting in 1.5k video clips. Then we design a pipeline to generate 5.2k multiple-choice questions. Evaluations of 17 widely-used Video-LLMs reveal current limitations in urban embodied cognition. Correlation analysis provides insight into the relationships between different tasks, showing that causal reasoning has a strong correlation with recall, perception, and navigation, while the abilities for counterfactual and associative reasoning exhibit lower correlation with other tasks. We also validate the potential for Sim-to-Real transfer in urban embodiment through fine-tuning.
pdf
bib
abs
HELIOS: Harmonizing Early Fusion, Late Fusion, and LLM Reasoning for Multi-Granular Table-Text Retrieval
Sungho Park
|
Joohyung Yun
|
Jongwuk Lee
|
Wook-Shin Han
Table-text retrieval aims to retrieve relevant tables and text to support open-domain question answering. Existing studies use either early or late fusion, but face limitations. Early fusion pre-aligns a table row with its associated passages, forming “stars,” which often include irrelevant contexts and miss query-dependent relationships. Late fusion retrieves individual nodes, dynamically aligning them, but it risks missing relevant contexts. Both approaches also struggle with advanced reasoning tasks, such as column-wise aggregation and multi-hop reasoning. To address these issues, we propose HELIOS, which combines the strengths of both approaches. First, the edge-based bipartite subgraph retrieval identifies finer-grained edges between table segments and passages, effectively avoiding the inclusion of irrelevant contexts. Then, the query-relevant node expansion identifies the most promising nodes, dynamically retrieving relevant edges to grow the bipartite subgraph, minimizing the risk of missing important contexts. Lastly, the star-based LLM refinement performs logical inference at the star graph level rather than the bipartite subgraph, supporting advanced reasoning tasks. Experimental results show that HELIOS outperforms state-of-the-art models with significant improvements of up to 42.6% and 39.9% in recall and nDCG, respectively, on the OTT-QA benchmark.
pdf
bib
abs
ONEBench to Test Them All: Sample-Level Benchmarking Over Open-Ended Capabilities
Adhiraj Ghosh
|
Sebastian Dziadzio
|
Ameya Prabhu
|
Vishaal Udandarao
|
Samuel Albanie
|
Matthias Bethge
Traditional fixed test datasets fall short in evaluating the open-ended capabilities of foundation models. To address this, we propose ONEBench (OpeN-Ended Benchmarking), a new paradigm that consolidates individual evaluation datasets into a unified, ever-expanding sample pool. ONEBench enables custom benchmarks for specific capabilities while reusing and aggregating samples, mitigating overfitting and dataset bias for broader capability assessment. It reframes model evaluation as selecting and aggregating sample-level tests. Transitioning from task-specific benchmarks to ONEBench introduces two challenges: heterogeneity (aggregating diverse metrics) and incompleteness (comparing models tested on different data subsets). To address these, we propose an aggregation algorithm that ensures identifiability (asymptotically recovering ground-truth scores) and rapid convergence, enabling accurate model comparisons with relatively little data. On homogeneous datasets, our algorithm produces rankings that highly correlate with average scores. Moreover, it remains robust to over 95% missing measurements, reducing evaluation costs by up to 20x with minimal impact on rankings. We introduce ONEBench-LLM for language models and ONEBench-LMM for vision-language models, unifying evaluations across these domains, and enabling targeted model testing across diverse capabilities.
pdf
bib
abs
La Leaderboard: A Large Language Model Leaderboard for Spanish Varieties and Languages of Spain and Latin America
María Grandury
|
Javier Aula-Blasco
|
Júlia Falcão
|
Clémentine Fourrier
|
Miguel González Saiz
|
Gonzalo Martínez
|
Gonzalo Santamaria Gomez
|
Rodrigo Agerri
|
Nuria Aldama García
|
Luis Chiruzzo
|
Javier Conde
|
Helena Gomez Adorno
|
Marta Guerrero Nieto
|
Guido Ivetta
|
Natàlia López Fuertes
|
Flor Miriam Plaza-del-Arco
|
María-Teresa Martín-Valdivia
|
Helena Montoro Zamorano
|
Carmen Muñoz Sanz
|
Pedro Reviriego
|
Leire Rosado Plaza
|
Alejandro Vaca Serrano
|
Estrella Vallecillo-Rodríguez
|
Jorge Vallego
|
Irune Zubiaga
Leaderboards showcase the current capabilities and limitations of Large Language Models (LLMs). To motivate the development of LLMs that represent the linguistic and cultural diversity of the Spanish-speaking community, we present La Leaderboard, the first open-source leaderboard to evaluate generative LLMs in languages and language varieties of Spain and Latin America. La Leaderboard is a community-driven project that aims to establish an evaluation standard for everyone interested in developing LLMs for the Spanish-speaking community. This initial version combines 66 datasets in Catalan, Basque, Galician, and different Spanish varieties, showcasing the evaluation results of 50 models. To encourage community-driven development of leaderboards in other languages, we explain our methodology, including guidance on selecting the most suitable evaluation setup for each downstream task. In particular, we provide a rationale for using fewer few-shot examples than typically found in the literature, aiming to reduce environmental impact and facilitate access to reproducible results for a broader research community.
pdf
bib
abs
Why Prompt Design Matters and Works: A Complexity Analysis of Prompt Search Space in LLMs
Xiang Zhang
|
Juntai Cao
|
Chenyu You
|
Dujian Ding
Despite the remarkable successes of Large Language Models (LLMs), the underlying Transformer architecture has inherent limitations in handling complex reasoning tasks. Chain-of-Thought (CoT) prompting has emerged as a practical workaround, but most CoT-based methods rely on a single generic prompt like “think step by step,” with no task-specific adaptation. These approaches expect the model to discover an effective reasoning path on its own, forcing it to search through a vast prompt space. In contrast, many works have explored task-specific prompt designs to boost performance. However, these designs are typically developed through trial and error, lacking theoretical grounding. As a result, prompt engineering remains largely ad hoc and unguided. In this paper, we provide a theoretical framework that explains why some prompts succeed while others fail. We show that prompts function as selectors, extracting specific task-relevant information from the model’s full hidden state during CoT reasoning. Each prompt defines a unique trajectory through the answer space, and the choice of this trajectory is crucial for task performance and future navigation in the answer space. We analyze the complexity of finding optimal prompts and the size of the prompt space for a given task. Our theory reveals principles behind effective prompt design and shows that naive CoT (using a model-self-guided prompt like “think step by step”) can severely hinder performance. Our experiments show that optimal prompt search can lead to over a 50% improvement on reasoning tasks, and our work provides a theoretical foundation for prompt engineering.
pdf
bib
abs
Energy Considerations of Large Language Model Inference and Efficiency Optimizations
Jared Fernandez
|
Clara Na
|
Vashisth Tiwari
|
Yonatan Bisk
|
Sasha Luccioni
|
Emma Strubell
As large language models (LLMs) scale in size and adoption, their computational and environmental costs continue to rise. Prior benchmarking efforts have primarily focused on latency reduction in idealized settings, often overlooking the diverse real-world inference workloads that shape energy use. In this work, we systematically analyze the energy implications of common inference efficiency optimizations across diverse Natural Language Processing (NLP) and generative Artificial Intelligence (AI) workloads, including conversational AI and code generation. We introduce a modeling approach that approximates real-world LLM workflows through a binning strategy for input-output token distributions and batch size variations. Our empirical analysis spans software frameworks, decoding strategies, GPU architectures, online and offline serving settings, and model parallelism configurations. We show that the effectiveness of inference optimizations is *highly sensitive to workload geometry, software stack, and hardware accelerators*, demonstrating that naive energy estimates based on FLOPs or theoretical GPU utilization significantly underestimate real-world energy consumption. Our findings reveal that the proper application of relevant inference efficiency optimizations can reduce total energy use by up to **73%** from unoptimized baselines. These insights provide a foundation for sustainable LLM deployment and inform energy-efficient design strategies for future AI infrastructure.
pdf
bib
abs
Optimizing Pre-Training Data Mixtures with Mixtures of Data Expert Models
Lior Belenki
|
Alekh Agarwal
|
Tianze Shi
|
Kristina Toutanova
We propose a method to optimize language model pre-training data mixtures through efficient approximation of the cross-entropy loss corresponding to each candidate mixture via a Mixture of Data Experts (MDE). We use this approximation as a source of additional features in a regression model, trained from observations of model loss for a small number of mixtures. Experiments with Transformer decoder-only language models in the range of 70M to 10B parameters on the SlimPajama dataset show that our method achieves significantly better performance than approaches that train regression models using only the mixture rates as input features. Combining this improved optimization method with an objective that takes into account cross-entropy on end task data leads to superior performance on few-shot downstream evaluations. We also provide theoretical insights on why aggregation of data expert predictions can provide good approximations to model losses for data mixtures.
pdf
bib
abs
BFS-Prover: Scalable Best-First Tree Search for LLM-based Automatic Theorem Proving
Ran Xin
|
Chenguang Xi
|
Jie Yang
|
Feng Chen
|
Hang Wu
|
Xia Xiao
|
Yifan Sun
|
Shen Zheng
|
Ming Ding
Recent advancements in large language models (LLMs) have spurred growing interest in automatic theorem proving using Lean4, where effective tree search methods are crucial for navigating the underlying large proof search spaces. While the existing approaches primarily rely on value functions and/or Monte Carlo Tree Search (MCTS), the potential of simpler methods like Best-First Tree Search (BFS) remains underexplored. In this paper, we investigate whether BFS can achieve competitive performance in large-scale theorem proving tasks. We present BFS-Prover, a scalable expert iteration framework, featuring three key innovations. First, we implement strategic data filtering at each expert iteration round, excluding problems solvable via beam search node expansion to focus on harder cases. Second, we improve the sample efficiency of BFS through Direct Preference Optimization (DPO) applied to state-tactic pairs automatically annotated with compiler error feedback, refining the LLM’s policy to prioritize productive expansions. Third, we employ length normalization in BFS to encourage exploration of deeper proof paths. BFS-Prover achieves a state-of-the-art score of 72.95 on the MiniF2F test set and therefore challenges the perceived necessity of complex tree search methods, demonstrating that BFS can achieve competitive performance when properly scaled.
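The length-normalized best-first search described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the scoring rule (cumulative log-probability divided by path length raised to a hypothetical exponent `alpha`), the `expand` and `is_goal` callbacks, and the toy proof graph in the usage note are all assumptions made for illustration.

```python
import heapq

def best_first_search(root, expand, is_goal, alpha=0.5, max_expansions=1000):
    """Best-first search with length-normalized scores (illustrative sketch).

    expand(state) yields (tactic, log_prob, next_state) triples; in a prover
    these would come from an LLM policy, here they are supplied by the caller.
    """
    counter = 0  # unique tie-breaker so the heap never compares paths or states
    # Heap entries: (negated score, tie-breaker, cumulative log-prob, path, state).
    heap = [(0.0, counter, 0.0, [], root)]
    seen = {root}
    expansions = 0
    while heap and expansions < max_expansions:
        _, _, cum_lp, path, state = heapq.heappop(heap)
        if is_goal(state):
            return path
        expansions += 1
        for tactic, log_prob, nxt in expand(state):
            if nxt in seen:
                continue
            seen.add(nxt)
            new_lp = cum_lp + log_prob
            new_path = path + [tactic]
            # Assumed normalization: dividing by len^alpha shrinks the penalty
            # on long paths, which encourages exploring deeper proofs.
            score = new_lp / (len(new_path) ** alpha)
            counter += 1
            heapq.heappush(heap, (-score, counter, new_lp, new_path, nxt))
    return None
```

As a usage example, with a hypothetical graph `{"start": [("a", -0.1, "dead"), ("b", -1.0, "mid")], "mid": [("c", -0.2, "goal")], "dead": []}` the search first tries the high-probability dead end, then backtracks and returns the tactic path `["b", "c"]`.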
pdf
bib
abs
Magnet: Multi-turn Tool-use Data Synthesis and Distillation via Graph Translation
Fan Yin
|
Zifeng Wang
|
I-Hung Hsu
|
Jun Yan
|
Ke Jiang
|
Yanfei Chen
|
Jindong Gu
|
Long Le
|
Kai-Wei Chang
|
Chen-Yu Lee
|
Hamid Palangi
|
Tomas Pfister
Large language models (LLMs) have exhibited the ability to effectively utilize external tools to address user queries. However, their performance may be limited in complex, multi-turn interactions involving users and multiple tools. To address this, we propose Magnet, a principled framework for synthesizing high-quality training trajectories to enhance the function calling capability of large language model agents in multi-turn conversations with humans. The framework is based on automatic and iterative translations from a function signature path to a sequence of queries and executable function calls. We model the complicated function interactions in multi-turn cases with a graph and design novel node operations to build reliable signature paths. Motivated by context distillation, when guiding the generation of positive and negative trajectories using a teacher model, we provide reference function call sequences as positive hints in context and contrastive, incorrect function calls as negative hints. Experiments show that, by training on the positive trajectories with supervised fine-tuning and applying preference optimization against negative trajectories, our 14B model, Magnet-14B-mDPO, obtains 68.01 on BFCL-v3 and 73.30 on ToolQuery, surpassing the performance of the teacher model Gemini-1.5-pro-002 by a large margin in function calling.
pdf
bib
abs
Logic-Regularized Verifier Elicits Reasoning from LLMs
Xinyu Wang
|
Changzhi Sun
|
Lian Cheng
|
Yuanbin Wu
|
Dell Zhang
|
Xiaoling Wang
|
Xuelong Li
Verifiers are crucial components for enhancing modern LLMs’ reasoning capability. Typical verifiers require resource-intensive supervised dataset construction, which is costly and faces limitations in data diversity. In this paper, we propose LOVER, an unsupervised verifier regularized by logical rules. LOVER treats the verifier as a binary latent variable, utilizing internal activations and enforcing three logical constraints on multiple reasoning paths: negation consistency, intra-group consistency, and inter-group consistency (grouped by the final answer). By incorporating logical rules as priors, LOVER can leverage unlabeled examples and is directly compatible with any off-the-shelf LLMs. Experiments on 10 datasets demonstrate that LOVER significantly outperforms unsupervised baselines, achieving performance comparable to the supervised verifier (reaching its 95% level on average).
pdf
bib
abs
Squeezed Attention: Accelerating Long Context Length LLM Inference
Coleman Richard Charles Hooper
|
Sehoon Kim
|
Hiva Mohammadzadeh
|
Monishwaran Maheswaran
|
Sebastian Zhao
|
June Paik
|
Michael W. Mahoney
|
Kurt Keutzer
|
Amir Gholami
Emerging Large Language Model (LLM) applications require long input context in order to perform complex tasks like document analysis and code generation. For these long context length applications, the length of the input prompt poses a significant challenge in terms of inference efficiency since the inference costs increase linearly with sequence length. However, for many of these applications, much of the context in the prompt is fixed across different user inputs, thereby providing the opportunity to perform offline optimizations in order to process user inputs quickly, as they are received. We propose Squeezed Attention to accelerate LLM applications where a large portion of the input context is fixed. We first leverage K-means clustering offline to group the keys for the fixed context based on semantic similarity and represent each cluster with a single centroid value. During inference, we compare query tokens from the user input with the centroids to predict which keys from the fixed context are semantically relevant, and then compute exact attention using only the important keys, thereby reducing bandwidth and computational costs. We also present a hierarchical version of our algorithm which can reduce the complexity of attention from linear to logarithmic with respect to the fixed context length. We evaluate our method on various long-context benchmarks including LongBench, where it achieves a 3.1× reduction in KV budget with no noticeable accuracy loss and up to an 8× reduction with only a 0.5 point accuracy gap for the LLaMA-2-7B-32K, LWM-Text-Chat-1M, and Longchat-7B-v1.5-32K models. Furthermore, we implement kernels for centroid comparison and sparse FlashAttention with important keys, achieving more than 4× speedups during both the prefill and generation phases for long-context inference. Our code is available at https://github.com/SqueezeAILab/SqueezedAttention.
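The two-stage idea in this abstract (offline clustering of fixed-context keys, then centroid lookup at inference time) can be sketched in NumPy for a single query and a single head. This is a simplified illustration, not the released code: the function names, the Euclidean K-means, and the rule of keeping the `top_clusters` highest-scoring clusters are assumptions; the abstract does not state how relevant keys are selected from the centroid scores.

```python
import numpy as np

def kmeans(keys, n_clusters, n_iters=10, seed=0):
    """Plain Euclidean K-means over key vectors (offline step, assumed variant)."""
    rng = np.random.default_rng(seed)
    centroids = keys[rng.choice(len(keys), size=n_clusters, replace=False)].copy()
    for _ in range(n_iters):
        dists = ((keys[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
        assign = dists.argmin(axis=1)
        for c in range(n_clusters):
            members = keys[assign == c]
            if len(members) > 0:  # an empty cluster keeps its old centroid
                centroids[c] = members.mean(axis=0)
    # Recompute assignments so they match the final centroids.
    dists = ((keys[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
    return centroids, dists.argmin(axis=1)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def centroid_filtered_attention(query, keys, values, centroids, assign, top_clusters):
    """Score centroids against the query, then attend only to keys in kept clusters."""
    nonempty = np.unique(assign)                 # skip clusters with no keys
    scores = centroids[nonempty] @ query
    keep = nonempty[np.argsort(scores)[-top_clusters:]]
    mask = np.isin(assign, keep)                 # keys predicted to be relevant
    k, v = keys[mask], values[mask]
    weights = softmax(k @ query / np.sqrt(query.shape[0]))  # exact attention on the subset
    return weights @ v, mask
```

When `top_clusters` equals the number of clusters, every key is kept and the result matches dense attention; smaller values trade accuracy for fewer key and value reads.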
pdf
bib
abs
LangMark: A Multilingual Dataset for Automatic Post-Editing
Diego Velazquez
|
Mikaela Grace
|
Konstantinos Karageorgos
|
Lawrence Carin
|
Aaron Schliem
|
Dimitrios Zaikis
|
Roger Wechsler
Automatic post-editing (APE) aims to correct errors in machine-translated text, enhancing translation quality, while reducing the need for human intervention. Despite advances in neural machine translation (NMT), the development of effective APE systems has been hindered by the lack of large-scale multilingual datasets specifically tailored to NMT outputs. To address this gap, we present and release LangMark, a new human-annotated multilingual APE dataset for English translation to seven languages: Brazilian Portuguese, French, German, Italian, Japanese, Russian, and Spanish. The dataset has 206,983 triplets, with each triplet consisting of a source segment, its NMT output, and a human post-edited translation. Annotated by expert human linguists, our dataset offers both linguistic diversity and scale. Leveraging this dataset, we empirically show that Large Language Models (LLMs) with few-shot prompting can effectively perform APE, improving upon leading commercial and even proprietary machine translation systems. We believe that this new resource will facilitate the future development and evaluation of APE systems.
pdf
bib
abs
Neural Parameter Search for Slimmer Fine-Tuned Models and Better Transfer
Guodong Du
|
Zitao Fang
|
Jing Li
|
Junlin Li
|
Runhua Jiang
|
Shuyang Yu
|
Yifei Guo
|
Yangneng Chen
|
Sim Kuan Goh
|
Ho-Kin Tang
|
Daojing He
|
Honghai Liu
|
Min Zhang
Foundation models and their checkpoints have significantly advanced deep learning, boosting performance across various applications. However, fine-tuned models often struggle outside their specific domains and exhibit considerable redundancy. Recent studies suggest that combining a pruned fine-tuned model with the original pre-trained model can mitigate forgetting, reduce interference when merging model parameters across tasks, and improve compression efficiency. In this context, developing an effective pruning strategy for fine-tuned models is crucial. Leveraging the advantages of the task vector mechanism, we preprocess fine-tuned models by calculating the differences between them and the original model. Recognizing that different task vector subspaces contribute variably to model performance, we introduce a novel method called **N**eural **P**arameter **S**earch (**NPS**) for slimming down fine-tuned models. This method enhances pruning efficiency by searching through neural parameters of task vectors within low-rank subspaces. Our method has three key applications: enhancing knowledge transfer through pairwise model interpolation, facilitating effective knowledge fusion via model merging, and enabling the deployment of compressed models that retain near-original performance while significantly reducing storage costs. Extensive experiments across vision, NLP, and multi-modal benchmarks demonstrate the effectiveness and robustness of our approach, resulting in substantial performance gains.
pdf
bib
abs
Merge Hijacking: Backdoor Attacks to Model Merging of Large Language Models
Zenghui Yuan
|
Yangming Xu
|
Jiawen Shi
|
Pan Zhou
|
Lichao Sun
Model merging for Large Language Models (LLMs) directly fuses the parameters of different models finetuned on various tasks, creating a unified model for multi-domain tasks. However, due to potential vulnerabilities in models available on open-source platforms, model merging is susceptible to backdoor attacks. In this paper, we propose Merge Hijacking, the first backdoor attack targeting model merging in LLMs. The attacker constructs a malicious upload model and releases it. Once a victim user merges it with any other models, the resulting merged model inherits the backdoor while maintaining utility across tasks. Merge Hijacking defines two main objectives—effectiveness and utility—and achieves them through four steps. Extensive experiments demonstrate the effectiveness of our attack across different models, merging algorithms, and tasks. Additionally, we show that the attack remains effective even when merging real-world models. Moreover, our attack demonstrates robustness against two inference-time defenses (Paraphrasing and CLEANGEN) and one training-time defense (Fine-pruning).
pdf
bib
abs
Where Are We? Evaluating LLM Performance on African Languages
Ife Adebara
|
Hawau Olamide Toyin
|
Nahom Tesfu Ghebremichael
|
AbdelRahim A. Elmadany
|
Muhammad Abdul-Mageed
Africa’s rich linguistic heritage remains underrepresented in NLP, largely due to historical policies that favor foreign languages and create significant data inequities. In this paper, we integrate theoretical insights on Africa’s language landscape with an empirical evaluation using Sahara, a comprehensive benchmark curated from large-scale, publicly accessible datasets capturing the continent’s linguistic diversity. By systematically assessing the performance of leading large language models (LLMs) on Sahara, we demonstrate how policy-induced data variations directly impact model effectiveness across African languages. Our findings reveal that while a few languages perform reasonably well, many Indigenous languages remain marginalized due to sparse data. Leveraging these insights, we offer actionable recommendations for policy reforms and inclusive data practices. Overall, our work underscores the urgent need for a dual approach, combining theoretical understanding with empirical evaluation, to foster linguistic diversity in AI for African communities.
pdf
bib
abs
Beyond Output Matching: Bidirectional Alignment for Enhanced In-Context Learning
Chengwei Qin
|
Wenhan Xia
|
Fangkai Jiao
|
Chen Chen
|
Yuchen Hu
|
Bosheng Ding
|
Ruirui Chen
|
Shafiq Joty
Large language models (LLMs) have shown impressive few-shot generalization on many tasks via in-context learning (ICL). Despite their success in showing such emergent abilities, the scale and complexity of larger models also lead to unprecedentedly high computational demands and deployment challenges. In reaction, researchers explore transferring the powerful capabilities of larger models to more efficient and compact models by typically aligning the output of smaller (student) models with that of larger (teacher) models. Existing methods either train student models on the generated outputs of teacher models or imitate their token-level probability distributions. However, these distillation methods pay little to no attention to the input, which also plays a crucial role in ICL. Based on the finding that the performance of ICL is highly sensitive to the selection of demonstration examples, we propose Bidirectional Alignment (BiAlign) to fully leverage the models’ preferences for ICL examples to improve the ICL abilities of student models. Specifically, we introduce the alignment of input preferences between student and teacher models by incorporating a novel ranking loss, in addition to aligning the token-level output distribution. With extensive experiments and analysis, we demonstrate that BiAlign can consistently outperform existing baselines on a variety of tasks involving language understanding, reasoning, and coding.
pdf
bib
abs
CiteEval: Principle-Driven Citation Evaluation for Source Attribution
Yumo Xu
|
Peng Qi
|
Jifan Chen
|
Kunlun Liu
|
Rujun Han
|
Lan Liu
|
Bonan Min
|
Vittorio Castelli
|
Arshit Gupta
|
Zhiguo Wang
Citation quality is crucial in information-seeking systems, directly influencing trust and the effectiveness of information access. Current evaluation frameworks, both human and automatic, mainly rely on Natural Language Inference (NLI) to assess binary or ternary supportiveness from cited sources, which we argue is a suboptimal proxy for citation evaluation. In this work we introduce CiteEval, a citation evaluation framework driven by principles focusing on fine-grained citation assessment within a broad context, encompassing not only the cited sources but the full retrieval context, user query, and generated text. Guided by the proposed framework, we construct CiteBench, a multi-domain benchmark with high-quality human annotations on citation quality. To enable efficient evaluation, we further develop CiteEval-Auto, a suite of model-based metrics that exhibit strong correlation with human judgments. Experiments across diverse systems demonstrate CiteEval-Auto’s superior ability to capture the multifaceted nature of citations compared to existing metrics, offering a principled and scalable approach to evaluate and improve model-generated citations.
pdf
bib
abs
HiAgent: Hierarchical Working Memory Management for Solving Long-Horizon Agent Tasks with Large Language Model
Mengkang Hu
|
Tianxing Chen
|
Qiguang Chen
|
Yao Mu
|
Wenqi Shao
|
Ping Luo
Large Language Model (LLM)-based agents exhibit significant potential across various domains, operating as interactive systems that process environmental observations to generate executable actions for target tasks. The effectiveness of these agents is significantly influenced by their memory mechanism, which records historical experiences as sequences of action-observation pairs. We categorize memory into two types: cross-trial memory, accumulated across multiple attempts, and in-trial memory (working memory), accumulated within a single attempt. While considerable research has optimized performance through cross-trial memory, the enhancement of agent performance through improved working memory utilization remains underexplored. Instead, existing approaches often involve directly inputting entire historical action-observation pairs into LLMs, leading to redundancy in long-horizon tasks. Inspired by human problem-solving strategies, this paper introduces HiAgent, a framework that leverages subgoals as memory chunks to manage the working memory of LLM-based agents hierarchically. Specifically, HiAgent prompts LLMs to formulate subgoals before generating executable actions and enables LLMs to decide proactively to replace previous subgoals with summarized observations, retaining only the action-observation pairs relevant to the current subgoal. Experimental results across five long-horizon tasks demonstrate that HiAgent achieves a twofold increase in success rate and reduces the average number of steps required by 3.8. Additionally, our analysis shows that HiAgent consistently improves performance across various steps, highlighting its robustness and generalizability. Code is available at https://github.com/HiAgent2024/HiAgent
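The subgoal-chunked working memory described in this abstract can be pictured with a small data-structure sketch. The class and method names are hypothetical, and the `summarize` callback stands in for the LLM call that would produce a summarized observation; the sketch also always summarizes the previous subgoal when a new one starts, whereas the abstract says the LLM decides this proactively.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

Step = Tuple[str, str]  # (action, observation)

@dataclass
class SubgoalChunk:
    subgoal: str
    steps: List[Step] = field(default_factory=list)
    summary: str = ""

class HierarchicalMemory:
    """Keeps full action-observation detail only for the current subgoal."""

    def __init__(self, summarize: Callable[[str, List[Step]], str]):
        self.summarize = summarize  # placeholder for an LLM summarizer
        self.chunks: List[SubgoalChunk] = []

    def start_subgoal(self, subgoal: str) -> None:
        # Replace the previous chunk's detailed steps with a summary.
        if self.chunks:
            prev = self.chunks[-1]
            prev.summary = self.summarize(prev.subgoal, prev.steps)
            prev.steps = []
        self.chunks.append(SubgoalChunk(subgoal))

    def record(self, action: str, observation: str) -> None:
        self.chunks[-1].steps.append((action, observation))

    def context(self) -> str:
        """Render the prompt context: summaries for past subgoals, detail for the current one."""
        lines = [f"[done] {c.subgoal}: {c.summary}" for c in self.chunks[:-1]]
        if self.chunks:
            current = self.chunks[-1]
            lines.append(f"[current] {current.subgoal}")
            lines += [f"  {a} -> {o}" for a, o in current.steps]
        return "\n".join(lines)
```

The design point is that the rendered context grows with the number of subgoals rather than with the total number of steps, which is what reduces redundancy in long-horizon tasks.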
pdf
bib
abs
EducationQ: Evaluating LLMs’ Teaching Capabilities Through Multi-Agent Dialogue Framework
Yao Shi
|
Rongkeng Liang
|
Yong Xu
Large Language Models (LLMs) increasingly serve as educational tools, yet evaluating their teaching capabilities remains challenging due to the resource-intensive, context-dependent, and methodologically complex nature of teacher-student interactions. We introduce EducationQ, a multi-agent dialogue framework that efficiently assesses teaching capabilities through simulated dynamic educational scenarios, featuring specialized agents for teaching, learning, and evaluation. Testing 14 LLMs across major AI organizations (OpenAI, Meta, Google, Anthropic, and others) on 1,498 questions spanning 13 disciplines and 10 difficulty levels reveals that teaching effectiveness does not correlate linearly with model scale or general reasoning capabilities, with some smaller open-source models outperforming larger commercial counterparts in teaching contexts. This finding highlights a critical gap in current evaluations that prioritize knowledge recall over interactive pedagogy. Our mixed-methods evaluation, combining quantitative metrics with qualitative analysis and expert case studies, identifies distinct pedagogical strengths employed by top-performing models (e.g., sophisticated questioning strategies, adaptive feedback mechanisms). Human expert evaluations show 78% agreement with our automated qualitative analysis of effective teaching behaviors, validating our methodology. EducationQ demonstrates that LLMs-as-Teachers require specialized optimization beyond simple scaling, suggesting that next-generation educational AI should prioritize targeted enhancement of specific pedagogical effectiveness.
pdf
bib
abs
KRISTEVA: Close Reading as a Novel Task for Benchmarking Interpretive Reasoning
Peiqi Sui
|
Juan Diego Rodriguez
|
Philippe Laban
|
J. Dean Murphy
|
Joseph P. Dexter
|
Richard Jean So
|
Samuel Baker
|
Pramit Chaudhuri
Each year, tens of millions of essays are written and graded in college-level English courses. Students are asked to analyze literary and cultural texts through a process known as close reading, where they gather textual details from which to formulate evidence-based arguments. Despite being viewed as a basis for critical thinking and widely adopted as a required element of university coursework, close reading has never been evaluated on large language models (LLMs), and multi-discipline benchmarks like MMLU do not include literature as a subject. To fill this gap, we present KRISTEVA, the first close reading benchmark for evaluating interpretive reasoning, consisting of 1331 multiple-choice questions adapted from classroom data. With KRISTEVA, we propose three progressively more difficult sets of tasks to approximate different elements of the close reading process, which we use to test how well LLMs understand and reason about literary works: 1) extracting stylistic features, 2) retrieving relevant contextual information from parametric knowledge, and 3) multi-hop reasoning between style and external contexts. Our baseline results find that while state-of-the-art LLMs possess some college-level close reading competency (accuracy 49.7% - 69.7%), their performances still trail those of experienced human evaluators on 10 out of our 11 tasks.
pdf
bib
abs
Efficient Domain Continual pretraining by Mitigating the Stability Gap
Yiduo Guo
|
Jie Fu
|
Huishuai Zhang
|
Dongyan Zhao
Continual pretraining enables Large Language Models (LLMs) to adapt to specialized domains like medicine and law. However, we observe a consistent phenomenon across different model sizes and domains: a temporary performance drop at the start of the continual pretraining process, followed by a performance recovery phase. To gain a deeper understanding of this issue, we use the stability gap, a concept adapted from the visual domain, which explains that this initial drop arises from instability in the model’s general abilities. We validate this hypothesis through a series of experiments. To address this initial instability and enhance LLM performance within a fixed compute budget, we propose a training strategy that mitigates instability by increasing the number of epochs, alongside two data sampling strategies targeting data domain relevance and corpus distribution. We conduct experiments on Llama-family models to validate the effectiveness of our strategies for continual pretraining and instruction tuning in medical and legal domains. Our strategies improve the average medical task performance of the OpenLlama-3B model from 36.2% to 40.7% using only 40% of the original training budget, while also enhancing general task performance without causing forgetting. Furthermore, we apply our strategies to continually pre-train and instruction-tune the Llama-3-8B model. The resulting model, Llama-3-Physician, achieves the best medical performance among open-source models on several benchmarks and rivals GPT-4 on specific tasks. We release our models at https://huggingface.co/YiDuo1999/Llama-3-Physician-8B-Instruct.
pdf
bib
abs
Palm: A Culturally Inclusive and Linguistically Diverse Dataset for Arabic LLMs
Fakhraddin Alwajih
|
Abdellah El Mekki
|
Samar Mohamed Magdy
|
AbdelRahim A. Elmadany
|
Omer Nacar
|
El Moatez Billah Nagoudi
|
Reem Abdel-Salam
|
Hanin Atwany
|
Youssef Nafea
|
Abdulfattah Mohammed Yahya
|
Rahaf Alhamouri
|
Hamzah A. Alsayadi
|
Hiba Zayed
|
Sara Shatnawi
|
Serry Sibaee
|
Yasir Ech-chammakhy
|
Walid Al-Dhabyani
|
Marwa Mohamed Ali
|
Imen Jarraya
|
Ahmed Oumar El-Shangiti
|
Aisha Alraeesi
|
Mohammed Anwar AL-Ghrawi
|
Abdulrahman S. Al-Batati
|
Elgizouli Mohamed
|
Noha Taha Elgindi
|
Muhammed Saeed
|
Houdaifa Atou
|
Issam Ait Yahia
|
Abdelhak Bouayad
|
Mohammed Machrouh
|
Amal Makouar
|
Dania Alkawi
|
Mukhtar Mohamed
|
Safaa Taher Abdelfadil
|
Amine Ziad Ounnoughene
|
Anfel Rouabhia
|
Rwaa Assi
|
Ahmed Sorkatti
|
Mohamedou Cheikh Tourad
|
Anis Koubaa
|
Ismail Berrada
|
Mustafa Jarrar
|
Shady Shehata
|
Muhammad Abdul-Mageed
As large language models (LLMs) become increasingly integrated into daily life, ensuring their cultural sensitivity and inclusivity is paramount. We introduce PALM, a year-long community-driven project covering all 22 Arab countries. The dataset contains instruction–response pairs in both Modern Standard Arabic (MSA) and dialectal Arabic (DA), spanning 20 diverse topics. Built by a team of 44 researchers across the Arab world—each an author of this paper—PALM offers a broad, inclusive perspective. We use PALM to evaluate the cultural and dialectal capabilities of several frontier LLMs, revealing notable limitations: while closed-source LLMs generally perform strongly, they still exhibit flaws, and smaller open-source models face greater challenges. Furthermore, certain countries (e.g., Egypt, the UAE) appear better represented than others (e.g., Iraq, Mauritania, Yemen). Our annotation guidelines, code, and data are publicly available for reproducibility. More information about PALM is available on our project page: https://github.com/UBC-NLP/palm.
pdf
bib
abs
NewsInterview: a Dataset and a Playground to Evaluate LLMs’ Grounding Gap via Informational Interviews
Alexander Spangher
|
Michael Lu
|
Sriya Kalyan
|
Hyundong Justin Cho
|
Tenghao Huang
|
Weiyan Shi
|
Jonathan May
Large Language Models (LLMs) have demonstrated impressive capabilities in generating coherent text but often struggle with grounding language and strategic dialogue. To address this gap, we focus on journalistic interviews, a domain rich in grounding communication and abundant in data. We curate a dataset of 40,000 two-person informational interviews from NPR and CNN, and reveal that LLMs are significantly less likely than human interviewers to use acknowledgements and to pivot to higher-level questions. Realizing that a fundamental deficit exists in multi-turn planning and strategic thinking, we develop a realistic simulated environment, incorporating source personas and persuasive elements, in order to facilitate the development of agents with longer-horizon rewards. Our experiments show that while source LLMs mimic human behavior in information sharing, interviewer LLMs struggle with recognizing when questions are answered and engaging persuasively, leading to suboptimal information extraction across model size and capability. These findings underscore the need for enhancing LLMs’ strategic dialogue capabilities.
pdf
bib
abs
CFBench: A Comprehensive Constraints-Following Benchmark for LLMs
Tao Zhang
|
ChengLIn Zhu
|
Yanjun Shen
|
Wenjing Luo
|
Yan Zhang
|
Hao Liang
|
Tao Zhang
|
Fan Yang
|
Mingan Lin
|
Yujing Qiao
|
Weipeng Chen
|
Bin Cui
|
Wentao Zhang
|
Zenan Zhou
The adeptness of Large Language Models (LLMs) in comprehending and following natural language instructions is critical for their deployment in sophisticated real-world applications. Existing evaluations mainly focus on fragmented constraints or narrow scenarios, but they overlook the comprehensiveness and authenticity of constraints from the user’s perspective. To bridge this gap, we propose CFBench, a large-scale Chinese Comprehensive Constraints Following Benchmark for LLMs, featuring 1,000 curated samples that cover more than 200 real-life scenarios and over 50 NLP tasks. CFBench meticulously compiles constraints from real-world instructions and constructs an innovative systematic framework for constraint types, which includes 10 primary categories and over 25 subcategories, and ensures each constraint is seamlessly integrated within the instructions. To make certain that the evaluation of LLM outputs aligns with user perceptions, we propose an advanced methodology that integrates multi-dimensional assessment criteria with requirement prioritization, covering various perspectives of constraints, instructions, and requirement fulfillment. Evaluating current leading LLMs on CFBench reveals substantial room for improvement in constraints following, and we further investigate influencing factors and enhancement strategies. The data and code will be made available.
pdf
bib
abs
Towards Building Large Scale Datasets and State-of-the-Art Automatic Speech Translation Systems for 14 Indian Languages
Ashwin Sankar
|
Sparsh Jain
|
Nikhil Narasimhan
|
Devilal Choudhary
|
Dhairya Suman
|
Mohammed Safi Ur Rahman Khan
|
Anoop Kunchukuttan
|
Mitesh M Khapra
|
Raj Dabre
Speech translation for Indian languages remains a challenging task due to the scarcity of large-scale, publicly available datasets that capture the linguistic diversity and domain coverage essential for real-world applications. Existing datasets cover only a fraction of Indian languages and lack the breadth needed to train robust models that generalize beyond curated benchmarks. To bridge this gap, we introduce BhasaAnuvaad, the largest speech translation dataset for Indian languages, spanning over 44 thousand hours of audio and 17 million aligned text segments across 14 Indian languages and English. Our dataset is built through a threefold methodology: (a) aggregating high-quality existing sources, (b) large-scale web crawling to ensure linguistic and domain diversity, and (c) creating synthetic data to model real-world speech disfluencies. Leveraging BhasaAnuvaad, we train IndicSeamless, a state-of-the-art speech translation model for Indian languages that performs better than existing models. Our experiments demonstrate improvements in translation quality, setting a new standard for Indian language speech translation. We will release all the code, data, and model weights as open source under permissive licenses to promote accessibility and collaboration.
pdf
bib
abs
CoRe-MMRAG: Cross-Source Knowledge Reconciliation for Multimodal RAG
Yang Tian
|
Fan Liu
|
Jingyuan Zhang
|
V. W.
|
Yupeng Hu
|
Liqiang Nie
Multimodal Retrieval-Augmented Generation (MMRAG) has been introduced to enhance Multimodal Large Language Models by incorporating externally retrieved multimodal knowledge, but it introduces two challenges: Parametric-Retrieved Knowledge Inconsistency (PRKI), where discrepancies between parametric and retrieved knowledge create uncertainty in determining reliability, and Visual-Textual Knowledge Inconsistency (VTKI), where misalignment between visual and textual sources disrupts entity representation. To address these challenges, we propose Cross-source knowledge Reconciliation for MultiModal RAG (CoRe-MMRAG), a novel end-to-end framework that effectively reconciles inconsistencies across knowledge sources. CoRe-MMRAG follows a four-stage pipeline: it first generates an internal response from parametric knowledge, then selects the most relevant multimodal evidence via joint similarity assessment, generates an external response, and finally integrates both to produce a reliable answer. Additionally, a specialized training paradigm enhances knowledge source discrimination, multimodal integration, and unified answer generation. Experiments on KB-VQA benchmarks show that CoRe-MMRAG achieves substantial improvements over baseline methods, achieving 5.6% and 9.3% performance gains on InfoSeek and Encyclopedic-VQA, respectively. We release code and data at https://github.com/TyangJN/CoRe-MMRAG.
pdf
bib
abs
Mapping 1,000+ Language Models via the Log-Likelihood Vector
Momose Oyama
|
Hiroaki Yamagiwa
|
Yusuke Takase
|
Hidetoshi Shimodaira
To compare autoregressive language models at scale, we propose using log-likelihood vectors computed on a predefined text set as model features. This approach has a solid theoretical basis: when treated as model coordinates, their squared Euclidean distance approximates the Kullback-Leibler divergence of text-generation probabilities. Our method is highly scalable, with computational cost growing linearly in both the number of models and text samples, and is easy to implement as the required features are derived from cross-entropy loss. Applying this method to over 1,000 language models, we constructed a “model map,” providing a new perspective on large-scale model analysis.
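The idea of treating log-likelihood vectors as model coordinates can be illustrated with a minimal sketch. The model names and log-likelihood values below are hypothetical, and the sketch omits any centering or scaling the paper may apply before relating squared distances to the Kullback-Leibler divergence; it only shows how the features and pairwise distances could be assembled.

```python
from typing import Dict, List, Tuple


def text_log_likelihood(token_log_probs: List[float]) -> float:
    """Log-likelihood of one text: the sum of its token log-probabilities
    (equivalently, minus the summed cross-entropy loss)."""
    return sum(token_log_probs)


def squared_euclidean(u: List[float], v: List[float]) -> float:
    """Squared Euclidean distance between two log-likelihood vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must be computed on the same text set")
    return sum((a - b) ** 2 for a, b in zip(u, v))


def pairwise_distances(vectors: Dict[str, List[float]]) -> Dict[Tuple[str, str], float]:
    """Distance for every unordered pair of models, keyed by sorted name pair."""
    names = sorted(vectors)
    return {
        (a, b): squared_euclidean(vectors[a], vectors[b])
        for i, a in enumerate(names)
        for b in names[i + 1:]
    }


# Hypothetical log-likelihoods of three models on the same three texts.
log_likelihoods = {
    "model_a": [-12.0, -30.5, -8.0],
    "model_b": [-12.5, -29.5, -8.0],
    "model_c": [-20.0, -41.0, -15.0],
}
distances = pairwise_distances(log_likelihoods)
```

Because each model contributes one vector and each text one coordinate, the feature computation grows linearly in both, which matches the scalability argument in the abstract.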
pdf
bib
abs
ConsistencyChecker: Tree-based Evaluation of LLM Generalization Capabilities
Zhaochen Hong
|
Haofei Yu
|
Jiaxuan You
Evaluating Large Language Models (LLMs) requires effective methods to assess semantic consistency across multiple reversible transformations. Traditional self-consistency methods often fail to capture subtle semantic errors in multi-step tasks. We introduce ConsistencyChecker, a tree-based evaluation framework that measures LLMs’ ability to preserve semantic consistency during reversible transformation processes, sidestepping benchmark data contamination issues. Our approach constructs self-consistency trees where nodes represent text states after transformations (e.g., translation, code modification, paraphrasing) and edges represent pairs of opposite transformations. By analyzing semantic preservation between nodes at different tree depths, ConsistencyChecker quantifies model reliability without requiring manually annotated reference data. Experiments demonstrate that ConsistencyChecker reliably measures generalization abilities across models from 1.5B to 72B parameters. On translation tasks, GPT-4o Mini achieves the highest L3 consistency score of 98.0%. For code generation, Qwen 2.5 32B leads with 85.1% semantic consistency at L3. Results show Pearson correlation greater than 0.7 between our embedding-based scores and WMT 2024 rankings on 4 out of 5 shared language pairs, validating the method’s effectiveness for benchmarking LLM performance without constructing new datasets.
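A toy sketch of a self-consistency tree follows. It is not the authors' implementation: the transformation pairs here are plain string functions standing in for LLM calls, and `difflib.SequenceMatcher` is a stand-in for the embedding-based similarity mentioned in the abstract. All function names are hypothetical.

```python
from difflib import SequenceMatcher
from typing import Callable, List, Tuple

Transform = Callable[[str], str]
TransformPair = Tuple[Transform, Transform]  # (forward, inverse)


def build_levels(root: str, pairs: List[TransformPair], depth: int) -> List[List[str]]:
    """Level 0 holds the root; each child applies one forward/inverse round trip."""
    levels = [[root]]
    for _ in range(depth):
        levels.append(
            [inverse(forward(state)) for state in levels[-1] for forward, inverse in pairs]
        )
    return levels


def consistency_at_depth(levels: List[List[str]], depth: int) -> float:
    """Mean similarity between the root and every node at the given depth."""
    root = levels[0][0]
    sims = [SequenceMatcher(None, root, state).ratio() for state in levels[depth]]
    return sum(sims) / len(sims)


# A lossless round trip (reverse, then reverse back) and a lossy one (drops a word).
lossless: TransformPair = (lambda s: s[::-1], lambda s: s[::-1])
lossy: TransformPair = (lambda s: s.replace("very ", ""), lambda s: s)

root_text = "a very simple and very short sentence"
perfect_score = consistency_at_depth(build_levels(root_text, [lossless], 3), 3)
mixed_score = consistency_at_depth(build_levels(root_text, [lossless, lossy], 3), 3)
```

In this toy setup a model that round-trips perfectly scores 1.0 at every depth, while any information lost along a path lowers the score at deeper levels, without needing annotated references.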
pdf
bib
abs
Robust Estimation of Population-Level Effects in Repeated-Measures NLP Experimental Designs
Alejandro Benito-Santos
|
Adrian Ghajari
|
Víctor Fresno
NLP research frequently grapples with multiple sources of variability—spanning runs, datasets, annotators, and more—yet conventional analysis methods often neglect these hierarchical structures, threatening the reproducibility of findings. To address this gap, we contribute a case study illustrating how linear mixed-effects models (LMMs) can rigorously capture systematic language-dependent differences (i.e., population-level effects) in a population of monolingual and multilingual language models. In the context of a bilingual hate speech detection task, we demonstrate that LMMs can uncover significant population-level effects—even under low-resource (small-N) experimental designs—while mitigating confounds and random noise. By setting out a transparent blueprint for repeated-measures experimentation, we encourage the NLP community to embrace variability as a feature, rather than a nuisance, in order to advance more robust, reproducible, and ultimately trustworthy results.
pdf
bib
abs
FactBench: A Dynamic Benchmark for In-the-Wild Language Model Factuality Evaluation
Farima Fatahi Bayat
|
Lechen Zhang
|
Sheza Munir
|
Lu Wang
The rapid adoption of language models (LMs) across diverse applications has raised concerns about their factuality, i.e., their consistency with real-world facts. We introduce VERIFY, an evidence-based evaluation pipeline that measures LMs’ factuality in real-world user interactions. VERIFY considers the verifiability of LM-generated content and categorizes content units as Supported, Unsupported, or Undecidable based on Web-retrieved evidence. Importantly, factuality judgment by VERIFY more strongly correlates with human evaluations than existing methods. Using VERIFY, we identify “hallucination prompts,” i.e., those that frequently elicit factual errors in LM responses. These prompts form FactBench, a dataset of 1K prompts spanning 150 topics and tiered into Easy, Moderate, and Hard prompts. We benchmark widely used open-weight and proprietary LMs from six families, yielding three key findings: (i) LMs’ factual precision declines from Easy to Hard prompts, (ii) factuality does not necessarily improve with scale; Llama3.1-405B-Instruct performs comparably to or worse than its 70B variant, and (iii) Gemini1.5-Pro shows a notably higher refusal rate, with over-refusal in 25% of cases.
pdf
bib
abs
Training-free LLM Merging for Multi-task Learning
Zichuan Fu
|
Xian Wu
|
Yejing Wang
|
Wanyu Wang
|
Shanshan Ye
|
Hongzhi Yin
|
Yi Chang
|
Yefeng Zheng
|
Xiangyu Zhao
Large Language Models (LLMs) have demonstrated exceptional capabilities across diverse natural language processing (NLP) tasks. The release of open-source LLMs like LLaMA and Qwen has triggered the development of numerous fine-tuned models tailored for various tasks and languages. In this paper, we explore an important question: is it possible to combine these specialized models to create a unified model with multi-task capabilities? We introduce **H**ierarchical **I**terative **Merging** (Hi-Merging), a training-free method for unifying different specialized LLMs into a single model. Specifically, Hi-Merging employs model-wise and layer-wise pruning and scaling, guided by contribution analysis, to mitigate parameter conflicts. Extensive experiments on multiple-choice and question-answering tasks in both Chinese and English validate Hi-Merging’s ability for multi-task learning. The results demonstrate that Hi-Merging consistently outperforms existing merging techniques and surpasses the performance of models fine-tuned on combined datasets in most scenarios. Code is available at [Applied-Machine-Learning-Lab/Hi-Merging](https://github.com/Applied-Machine-Learning-Lab/Hi-Merging).
pdf
bib
abs
Inferring from Logits: Exploring Best Practices for Decoding-Free Generative Candidate Selection
Mingyu Derek Ma
|
Yanna Ding
|
Zijie Huang
|
Jianxi Gao
|
Yizhou Sun
|
Wei Wang
Generative Language Models rely on autoregressive decoding to produce the output sequence token by token. Many tasks, such as preference optimization, require the model to produce task-level output consisting of multiple tokens directly by selecting candidates from a pool as predictions. Determining a task-level prediction from candidates using the ordinary token-level decoding mechanism is constrained by time-consuming decoding and by gradients interrupted by discrete token selection. Existing works have used decoding-free candidate selection methods to obtain candidate probabilities from the initial output logits over the vocabulary. Though these estimation methods are widely used, they have not been systematically evaluated, especially on end tasks. We introduce an evaluation of a comprehensive collection of decoding-free candidate selection approaches on a comprehensive set of tasks, including five multiple-choice QA tasks with a small candidate pool and four clinical decision tasks with a massive number of candidates, some with 10k+ options. We evaluate the estimation methods paired with a wide spectrum of foundation LMs covering different architectures, sizes, and training paradigms. The results and insights from our analysis inform future model design.
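To make the notion of decoding-free candidate selection concrete, here is a minimal sketch under stated assumptions. It scores each candidate directly from the first-step logits over the vocabulary, using either the logit of the candidate's first token or the mean logit of all its tokens. These two reductions, the function names, and the toy logits are illustrative choices, not necessarily the estimators evaluated in the paper.

```python
import math
from typing import Dict, List, Sequence


def candidate_scores(
    first_step_logits: Sequence[float],
    candidates: Dict[str, List[int]],
    reduce: str = "first",
) -> Dict[str, float]:
    """Score candidates (given as token-id lists) without any decoding."""
    scores = {}
    for name, token_ids in candidates.items():
        if not token_ids:
            raise ValueError(f"candidate {name!r} has no tokens")
        picked = [first_step_logits[t] for t in token_ids]
        if reduce == "first":
            scores[name] = picked[0]  # logit of the first token only
        elif reduce == "mean":
            scores[name] = sum(picked) / len(picked)  # average over all tokens
        else:
            raise ValueError(f"unknown reduction: {reduce}")
    return scores


def softmax(scores: Dict[str, float]) -> Dict[str, float]:
    """Normalize candidate scores into a probability distribution."""
    top = max(scores.values())
    exps = {k: math.exp(v - top) for k, v in scores.items()}
    total = sum(exps.values())
    return {k: e / total for k, e in exps.items()}


# Toy vocabulary of five tokens and two multi-token candidates (hypothetical values).
logits = [0.1, 2.0, -1.0, 0.5, 1.5]
pool = {"A": [1, 2], "B": [4, 3]}

first_probs = softmax(candidate_scores(logits, pool, reduce="first"))
mean_probs = softmax(candidate_scores(logits, pool, reduce="mean"))
best_first = max(first_probs, key=first_probs.get)
best_mean = max(mean_probs, key=mean_probs.get)
```

In this toy case the two reductions pick different candidates, which illustrates why a systematic comparison of such estimators is useful.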
pdf
bib
abs
Comparison-based Active Preference Learning for Multi-dimensional Personalization
Minhyeon Oh
|
Seungjoon Lee
|
Jungseul Ok
Large language models (LLMs) have shown remarkable success, but aligning them with human preferences remains a core challenge. As individuals have their own, multi-dimensional preferences, recent studies have explored *multi-dimensional personalization*, which aims to enable models to generate responses personalized to *explicit* preferences. However, human preferences are often *implicit* and thus difficult to articulate, limiting the direct application of this approach. To bridge this gap, we propose Active Multi-dimensional Preference Learning (AMPLe), designed to capture implicit user preferences from interactively collected comparative feedback. Building on Bayesian inference, our work introduces a modified posterior update procedure to mitigate estimation bias and potential noise in comparisons. Also, inspired by generalized binary search, we employ an active query selection strategy to minimize the number of required comparisons by a user. Through theoretical analysis and experiments on language generation tasks, we demonstrate feedback efficiency and effectiveness of our framework in personalizing model responses. Our code is publicly available at https://github.com/ml-postech/AMPLe.
pdf
bib
abs
OpenCoder: The Open Cookbook for Top-Tier Code Large Language Models
Siming Huang
|
Tianhao Cheng
|
Jason Klein Liu
|
Weidi Xu
|
Jiaran Hao
|
Liuyihan Song
|
Yang Xu
|
Jian Yang
|
Jiaheng Liu
|
Chenchen Zhang
|
Linzheng Chai
|
Ruifeng Yuan
|
Xianzhen Luo
|
Qiufeng Wang
|
YuanTao Fan
|
Qingfu Zhu
|
Zhaoxiang Zhang
|
Yang Gao
|
Jie Fu
|
Qian Liu
|
Houyi Li
|
Ge Zhang
|
Yuan Qi
|
Xu Yinghui
|
Wei Chu
|
Zili Wang
Code LLMs have been widely used in various domains, including code generation, logical reasoning, and agent systems. However, open-access code LLMs mostly only release weights, lacking key features such as reproducible data pipelines and transparent training protocols, which are crucial for advancing deeper, more reliable investigations. To address the gap, we introduce OpenCoder, a top-tier code LLM that not only achieves performance comparable to leading models but also serves as an “open cookbook” for the research community. Unlike most prior efforts, we release not only model weights and inference code, but also the reproducible training data, complete data processing pipeline, rigorous experimental ablation results, and detailed training protocols for open scientific research. Our work identifies the key ingredients for building a top-tier code LLM: optimized heuristic rules for data cleaning and deduplication, effective recall of code-related text corpus, and high-quality synthetic data for both annealing and supervised fine-tuning stages. By offering this level of openness, we aim to broaden access to all aspects of a top-tier code LLM, with OpenCoder serving as both a powerful model and an open foundation to accelerate research and enable reproducible advancements in code intelligence. The released resource is available at https://opencoder-llm.github.io.
pdf
bib
abs
LlamaDuo: LLMOps Pipeline for Seamless Migration from Service LLMs to Small-Scale Local LLMs
Chansung Park
|
Juyong Jiang
|
Fan Wang
|
Sayak Paul
|
Jing Tang
The widespread adoption of cloud-based proprietary large language models (LLMs) has introduced significant challenges, including operational dependencies, privacy concerns, and the necessity of continuous internet connectivity. In this work, we introduce an LLMOps pipeline, “LlamaDuo”, for the seamless migration of knowledge and abilities from service-oriented LLMs to smaller, locally manageable models. This pipeline is crucial for ensuring service continuity in the presence of operational failures, strict privacy policies, or offline requirements. Our LlamaDuo involves fine-tuning a small language model against the service LLM using a synthetic dataset generated by the latter. If the performance of the fine-tuned model falls short of expectations, it is automatically improved through additional fine-tuning using extra similar data generated by the service LLM. This multi-turn process guarantees that the smaller model can eventually match or even surpass the service LLM’s capabilities in specific downstream tasks, offering a practical and scalable solution for managing AI deployments in constrained environments. Extensive experiments with leading-edge LLMs are conducted to demonstrate the effectiveness, adaptability, and affordability of LlamaDuo across various downstream tasks. Our pipeline implementation is available at https://github.com/deep-diver/llamaduo.
pdf
bib
abs
AmbiK: Dataset of Ambiguous Tasks in Kitchen Environment
Anastasia Ivanova
|
Bakaeva Eva
|
Zoya Volovikova
|
Alexey Kovalev
|
Aleksandr Panov
As part of an embodied agent, Large Language Models (LLMs) are typically used for behavior planning given natural language instructions from the user. However, dealing with ambiguous instructions in real-world environments remains a challenge for LLMs. Various methods for task ambiguity detection have been proposed, but they are difficult to compare because they are tested on different datasets and there is no universal benchmark. For this reason, we propose AmbiK (Ambiguous Tasks in Kitchen Environment), a fully textual dataset of ambiguous instructions addressed to a robot in a kitchen environment. AmbiK was collected with the assistance of LLMs and is human-validated. It comprises 1000 pairs of ambiguous tasks and their unambiguous counterparts, categorized by ambiguity type (Human Preferences, Common Sense Knowledge, Safety), with environment descriptions, clarifying questions and answers, user intents, and task plans, for a total of 2000 tasks. We hope that AmbiK will enable researchers to perform a unified comparison of ambiguity detection methods. AmbiK is available at https://github.com/cog-model/AmbiK-dataset.
pdf
bib
abs
SocialCC: Interactive Evaluation for Cultural Competence in Language Agents
Jincenzi Wu
|
Jianxun Lian
|
Dingdong Wang
|
Helen M. Meng
Large Language Models (LLMs) are increasingly deployed worldwide, yet their ability to navigate cultural nuances remains underexplored. Misinterpreting cultural content can lead to AI-generated responses that are offensive or inappropriate, limiting their usability in global applications such as customer service, diplomatic communication, and online education. While prior research has evaluated cultural knowledge of LLMs, existing benchmarks fail to assess dynamic cultural competence, i.e., the ability to apply cultural knowledge effectively in real-world interactions. To address this gap, we introduce SocialDuolingo, a novel benchmark designed to evaluate cultural competence through multi-turn interactive intercultural scenarios. It comprises 3,060 human-written scenarios spanning 60 countries across six continents. Through extensive experiments on eight prominent LLMs, our findings reveal a significant gap between the cultural knowledge stored in these models and their ability to apply it effectively in cross-cultural communication.
pdf
bib
abs
Scalable Vision Language Model Training via High Quality Data Curation
Hongyuan Dong
|
Zijian Kang
|
Weijie Yin
|
LiangXiao LiangXiao
|
ChaoFeng ChaoFeng
|
Ran Jiao
In this paper, we introduce SAIL-VL (ScAlable Vision Language Model TraIning via High QuaLity Data Curation), an open-source vision language model (VLM) series achieving state-of-the-art (SOTA) performance at 2B and 8B parameters. The following three key improvements contribute to SAIL-VL’s leading performance: (1) Scalable high-quality visual understanding data construction: We implement a data construction pipeline to enable hundred-million-scale high-quality recaption data annotation. The resulting dataset, SAIL-Caption, is validated to be of the highest data quality compared with open-source datasets. (2) Scalable Pretraining with High-Quality Visual Understanding Data: We scale SAIL-VL’s pretraining budget up to 655B tokens and show that even a 2B VLM benefits from scaled-up training data sizes, exhibiting logarithmic data size scaling laws in benchmark performance. (3) Scalable SFT via data quantity and complexity scaling: We curate a high-quality SFT dataset collection with leading data quantity scaling effectiveness and demonstrate that training with progressively higher-complexity data surpasses baseline one-stage training by a large margin. SAIL-VL series models achieve the highest average score on 18 widely used VLM benchmarks in our evaluation, with the 2B model taking the top position among VLMs of comparable sizes on OpenCompass 2024 (https://rank.opencompass.org.cn/leaderboard-multimodal), demonstrating robust visual comprehension abilities. SAIL-VL series models are released at HuggingFace (https://huggingface.co/BytedanceDouyinContent).
pdf
bib
abs
GRAM: Generative Recommendation via Semantic-aware Multi-granular Late Fusion
Sunkyung Lee
|
Minjin Choi
|
Eunseong Choi
|
Hye-young Kim
|
Jongwuk Lee
Generative recommendation is an emerging paradigm that leverages the extensive knowledge of large language models by formulating recommendations into a text-to-text generation task. However, existing studies face two key limitations in (i) incorporating implicit item relationships and (ii) utilizing rich yet lengthy item information. To address these challenges, we propose a Generative Recommender via semantic-Aware Multi-granular late fusion (GRAM), introducing two synergistic innovations. First, we design semantic-to-lexical translation to encode implicit hierarchical and collaborative item relationships into the vocabulary space of LLMs. Second, we present multi-granular late fusion to integrate rich semantics efficiently with minimal information loss. It employs separate encoders for multi-granular prompts, delaying the fusion until the decoding stage. Experiments on four benchmark datasets show that GRAM outperforms eight state-of-the-art generative recommendation models, achieving significant improvements of 11.5-16.0% in Recall@5 and 5.3-13.6% in NDCG@5. The source code is available at https://github.com/skleee/GRAM.
pdf
bib
abs
Towards Economical Inference: Enabling DeepSeek’s Multi-Head Latent Attention in Any Transformer-based LLMs
Tao Ji
|
Bin Guo
|
Yuanbin Wu
|
Qipeng Guo
|
Shenlixing Shenlixing
|
Chenzhan Chenzhan
|
Xipeng Qiu
|
Qi Zhang
|
Tao Gui
Multi-head Latent Attention (MLA) is an innovative architecture proposed by DeepSeek, designed to ensure efficient and economical inference by significantly compressing the Key-Value (KV) cache into a latent vector. Compared to MLA, standard LLMs employing Multi-Head Attention (MHA) and its variants such as Grouped-Query Attention (GQA) exhibit significant cost disadvantages. Enabling well-trained LLMs (e.g., Llama) to rapidly adapt to MLA without pre-training from scratch is both meaningful and challenging. This paper proposes the first data-efficient fine-tuning method for transitioning from MHA to MLA (**MHA2MLA**), which includes two key components: for *partial-RoPE*, we remove RoPE from the dimensions of queries and keys that contribute less to the attention scores; for *low-rank approximation*, we introduce joint SVD approximations based on the pre-trained parameters of keys and values. These carefully designed strategies enable MHA2MLA to recover performance using only a small fraction (0.6% to 1%) of the data, significantly reducing inference costs while seamlessly integrating with compression techniques such as KV cache quantization. For example, the KV cache size of Llama2-7B is reduced by 92.19%, with only a 1% drop in LongBench performance. Our source code is publicly available at https://github.com/JT-Ushio/MHA2MLA.
pdf
bib
abs
TETRIS: Optimal Draft Token Selection for Batch Speculative Decoding
Zhaoxuan Wu
|
Zijian Zhou
|
Arun Verma
|
Alok Prakash
|
Daniela Rus
|
Bryan Kian Hsiang Low
We propose TETRIS, a novel method that optimizes the total throughput of batch speculative decoding in multi-request settings. Unlike existing methods that optimize for a single request or a group of requests as a whole, TETRIS actively selects the most promising draft tokens (for every request in a batch) to be accepted when verified in parallel, resulting in fewer rejected tokens and hence less wasted computing resources. Such effective resource utilization to achieve fast inference in large language models (LLMs) is especially important to service providers with limited inference capacity. Compared to baseline speculative decoding, TETRIS yields a consistently higher acceptance rate and more effective utilization of the limited inference capacity. We show theoretically and empirically that TETRIS outperforms baseline speculative decoding and existing methods that dynamically select draft tokens, leading to more efficient batch inference in LLMs.
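One way to picture batch-level draft token selection is a greedy allocation under a fixed verification budget. The sketch below is an illustration under assumptions, not necessarily the selection rule used by TETRIS: it assumes each draft token comes with an estimated acceptance probability (a hypothetical input), treats a token as useful only if every earlier draft token in the same request is also accepted, and repeatedly picks the token with the highest cumulative acceptance probability across the batch.

```python
import heapq
from typing import List


def select_draft_tokens(accept_probs: List[List[float]], capacity: int) -> List[int]:
    """Return how many leading draft tokens to verify for each request.

    accept_probs[r][j] is the assumed probability that draft token j of
    request r is accepted given that tokens 0..j-1 were accepted.
    """
    selected = [0] * len(accept_probs)
    heap = []  # entries are (-cumulative acceptance probability, request index)
    for r, probs in enumerate(accept_probs):
        if probs:
            heapq.heappush(heap, (-probs[0], r))

    while heap and sum(selected) < capacity:
        neg_cum, r = heapq.heappop(heap)
        selected[r] += 1
        nxt = selected[r]
        if nxt < len(accept_probs[r]):
            # The next token is only accepted if the whole prefix is accepted.
            heapq.heappush(heap, (neg_cum * accept_probs[r][nxt], r))
    return selected


# Three requests with hypothetical per-token acceptance probabilities.
batch = [[0.9, 0.8, 0.7], [0.5, 0.4], [0.95, 0.2]]
allocation = select_draft_tokens(batch, capacity=4)
```

Because cumulative probabilities never increase along a request's draft sequence, the greedy choice always extends a prefix, so the allocation spends the limited verification capacity on the tokens most likely to be accepted.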
pdf
bib
abs
Introducing Verification Task of Set Consistency with Set-Consistency Energy Networks
Mooho Song
|
Hye Ryung Son
|
Jay-Yoon Lee
Examining logical inconsistencies among multiple statements (such as collections of sentences or question-answer pairs) is a crucial challenge in machine learning, particularly for ensuring the safety and reliability of models. Traditional methods that rely on 1:1 pairwise comparisons often fail to capture inconsistencies that only emerge when more than two statements are evaluated collectively. To address this gap, we introduce the task of set-consistency verification, an extension of natural language inference (NLI) that assesses the logical coherence of entire sets rather than isolated pairs. Building on this task, we present the Set-Consistency Energy Network (SC-Energy), a novel model that employs a margin-based loss to learn the compatibility among a collection of statements. Our approach not only efficiently verifies inconsistencies and pinpoints the specific statements responsible for logical contradictions, but also significantly outperforms existing methods, including prompting-based LLMs. Furthermore, we release two new datasets, Set-LConVQA and Set-SNLI, for the set-consistency verification task.
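A margin-based loss over set energies can be sketched as a hinge on the energy gap. This is a generic formulation under assumptions, not necessarily the exact SC-Energy objective: it assumes lower energy means a more consistent set, and the energy values below are hypothetical outputs of some scoring network.

```python
from typing import List, Tuple


def margin_energy_loss(
    energy_consistent: float, energy_inconsistent: float, margin: float = 1.0
) -> float:
    """Hinge loss that is zero once the inconsistent set's energy exceeds
    the consistent set's energy by at least `margin`."""
    return max(0.0, margin + energy_consistent - energy_inconsistent)


def batch_loss(pairs: List[Tuple[float, float]], margin: float = 1.0) -> float:
    """Mean hinge loss over (consistent, inconsistent) energy pairs."""
    if not pairs:
        raise ValueError("need at least one pair")
    return sum(margin_energy_loss(c, i, margin) for c, i in pairs) / len(pairs)


# Hypothetical energies: the first pair is already well separated, the second is not.
energy_pairs = [(0.2, 2.0), (0.5, 0.7)]
loss = batch_loss(energy_pairs)
```

Only pairs that violate the margin contribute to the loss, so training focuses on statement sets whose consistent and inconsistent versions are still scored too similarly.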
pdf
bib
abs
Language Models can Subtly Deceive Without Lying: A Case Study on Strategic Phrasing in Legislation
Atharvan Dogra
|
Krishna Pillutla
|
Ameet Deshpande
|
Ananya B. Sai
|
John J Nay
|
Tanmay Rajpurohit
|
Ashwin Kalyan
|
Balaraman Ravindran
We explore the ability of large language models (LLMs) to engage in subtle deception through strategically phrasing and intentionally manipulating information. This harmful behavior can be hard to detect, unlike blatant lying or unintentional hallucination. We build a simple testbed mimicking a legislative environment where a corporate lobbyist module is proposing amendments to bills that benefit a specific company while evading identification of this benefactor. We use real-world legislative bills matched with potentially affected companies to ground these interactions. Our results show that LLM lobbyists can draft subtle phrasing to avoid such identification by strong LLM-based detectors. Further optimization of the phrasing using LLM-based re-planning and re-sampling increases deception rates by up to 40 percentage points. Our human evaluations to verify the quality of deceptive generations and their retention of self-serving intent show significant coherence with our automated metrics and also help in identifying certain strategies of deceptive phrasing. This study highlights the risk of LLMs’ capabilities for strategic phrasing through seemingly neutral language to attain self-serving goals. This calls for future research to uncover and protect against such subtle deception.
pdf
bib
abs
AfroCS-xs: Creating a Compact, High-Quality, Human-Validated Code-Switched Dataset for African Languages
Kayode Olaleye
|
Arturo Oncevay
|
Mathieu Sibue
|
Nombuyiselo Zondi
|
Michelle Terblanche
|
Sibongile Mapikitla
|
Richard Lastrucci
|
Charese Smiley
|
Vukosi Marivate
Code-switching is prevalent in multilingual communities but lacks adequate high-quality data for model development, especially for African languages. To address this, we present AfroCS-xs, a small human-validated synthetic code-switched dataset for four African languages (Afrikaans, Sesotho, Yoruba, isiZulu) and English within a specific domain—agriculture. Using large language models (LLMs), we generate code-switched sentences, including English translations, that are rigorously validated and corrected by native speakers. As a downstream evaluation task, we use this dataset to fine-tune different instruction-tuned LLMs for code-switched translation and compare their performance against machine translation (MT) models. Our results demonstrate that LLMs consistently improve in translation accuracy when fine-tuned on the high-quality AfroCS-xs dataset, highlighting that substantial gains can still be made with a low volume of data. We also observe improvements on natural code-switched and out-of-domain (personal finance) test sets. Overall, regardless of data size and prior exposure to a language, LLMs benefit from higher quality training data when translating code-switched texts in under-represented languages.
pdf
bib
abs
Just Go Parallel: Improving the Multilingual Capabilities of Large Language Models
Muhammad Reza Qorib
|
Junyi Li
|
Hwee Tou Ng
Large language models (LLMs) have demonstrated impressive translation capabilities even without being explicitly trained on parallel data. This remarkable property has led some to believe that parallel data is no longer necessary for building multilingual language models. While some attribute this to the emergent abilities of LLMs due to scale, recent work suggests that it is actually caused by incidental bilingual signals present in the training data. Various methods have been proposed to maximize the utility of parallel data to enhance the multilingual capabilities of multilingual encoder-based and encoder-decoder language models. However, some decoder-based LLMs opt to ignore parallel data instead. In this work, we conduct a systematic study on the impact of adding parallel data on LLMs’ multilingual capabilities, focusing specifically on translation and multilingual common-sense reasoning. Through controlled experiments, we demonstrate that parallel data can significantly improve LLMs’ multilingual capabilities.
pdf
bib
abs
Design Choices for Extending the Context Length of Visual Language Models
Mukai Li
|
Lei Li
|
Shansan Gong
|
Qi Liu
Visual Language Models (VLMs) demonstrate impressive capabilities in processing multimodal inputs, yet applications such as visual agents, which require handling multiple images and high-resolution videos, demand enhanced long-range modeling. Moreover, existing open-source VLMs lack systematic exploration into extending their context length, and commercial models often provide limited details. To tackle this, we aim to establish an effective solution that enhances long context performance of VLMs while preserving their capacities in short context scenarios. Towards this goal, we make the best design choice through extensive experiment settings from data curation to context window extending and utilizing: (1) we analyze data sources and length distributions to construct ETVLM - a data recipe to balance the performance across scenarios; (2) we examine existing position extending methods, identify their limitations and propose M-RoPE++ as an enhanced approach; we also choose to solely instruction-tune the backbone with mixed-source data; (3) we discuss how to better utilize extended context windows and propose hybrid-resolution training. Built on the Qwen-VL series model, we propose Giraffe, which is effectively extended to 128K lengths. Evaluated on extensive long context VLM benchmarks such as VideoMME and Visual Haystacks, our Giraffe achieves state-of-the-art performance among similarly sized open-source long VLMs and is competitive with the commercial model GPT-4V. We will open-source the code, data, and models.
uppdf
bib
Findings of the Association for Computational Linguistics: ACL 2025
Wanxiang Che
|
Joyce Nabende
|
Ekaterina Shutova
|
Mohammad Taher Pilehvar
pdf
bib
abs
Explicit vs. Implicit: Investigating Social Bias in Large Language Models through Self-Reflection
Yachao Zhao
|
Bo Wang
|
Yan Wang
|
Dongming Zhao
|
Ruifang He
|
Yuexian Hou
Large Language Models (LLMs) have been shown to exhibit various biases and stereotypes in their generated content. While extensive research has investigated biases in LLMs, prior work has predominantly focused on explicit bias, with minimal attention to implicit bias and the relation between these two forms of bias. This paper presents a systematic framework grounded in social psychology theories to investigate and compare explicit and implicit biases in LLMs. We propose a novel self-reflection-based evaluation framework that operates in two phases: first measuring implicit bias through simulated psychological assessment methods, then evaluating explicit bias by prompting LLMs to analyze their own generated content. Through extensive experiments on advanced LLMs across multiple social dimensions, we demonstrate that LLMs exhibit a substantial inconsistency between explicit and implicit biases: while explicit bias manifests as mild stereotypes, implicit bias exhibits strong stereotypes. We further investigate the underlying factors contributing to this explicit-implicit bias inconsistency, examining the effects of training data scale, model size, and alignment techniques. Experimental results indicate that while explicit bias declines with increased training data and model size, implicit bias exhibits a contrasting upward trend. Moreover, contemporary alignment methods effectively suppress explicit bias but show limited efficacy in mitigating implicit bias.
pdf
bib
abs
Beyond Perception: Evaluating Abstract Visual Reasoning through Multi-Stage Task
Yanbei Jiang
|
Yihao Ding
|
Chao Lei
|
Jiayang Ao
|
Jey Han Lau
|
Krista A. Ehinger
Current Multimodal Large Language Models (MLLMs) excel in general visual reasoning but remain underexplored in Abstract Visual Reasoning (AVR), which demands higher-order reasoning to identify abstract rules beyond simple perception. Existing AVR benchmarks focus on single-step reasoning, emphasizing the end result but neglecting the multi-stage nature of the reasoning process. Past studies found that MLLMs struggle with these benchmarks, but they do not explain how the models fail. To address this gap, we introduce MultiStAR, a Multi-Stage AVR benchmark, based on RAVEN, designed to assess reasoning across varying levels of complexity. Additionally, existing metrics like accuracy focus only on the final outcomes and do not account for the correctness of intermediate steps. Therefore, we propose a novel metric, MSEval, which considers the correctness of intermediate steps in addition to the final outcomes. We conduct comprehensive experiments on MultiStAR using 17 representative closed-source and open-source MLLMs. The results reveal that while existing MLLMs perform adequately on basic perception tasks, they continue to face challenges in more complex rule detection stages. The dataset and code will be available after acceptance.
pdf
bib
abs
How Numerical Precision Affects Arithmetical Reasoning Capabilities of LLMs
Guhao Feng
|
Kai Yang
|
Yuntian Gu
|
Xinyue Ai
|
Shengjie Luo
|
Jiacheng Sun
|
Di He
|
Zhenguo Li
|
Liwei Wang
Despite the remarkable success of transformer-based large language models (LLMs) across various domains, understanding and enhancing their mathematical capabilities remains a significant challenge. In this paper, we conduct a rigorous theoretical analysis of LLMs’ mathematical abilities, with a specific focus on their arithmetic performances. We identify numerical precision as a key factor that influences their effectiveness in arithmetical tasks. Our results show that Transformers operating with low numerical precision fail to address arithmetic tasks, such as iterated addition and integer multiplication, unless the model size grows super-polynomially with respect to the input length. In contrast, Transformers with standard numerical precision can efficiently handle these tasks with significantly smaller model sizes. We further support our theoretical findings through empirical experiments that explore the impact of varying numerical precision on arithmetic tasks, providing valuable insights for improving the mathematical reasoning capabilities of LLMs.
pdf
bib
abs
Diversifying the Expert Knowledge for Task-Agnostic Pruning in Sparse Mixture-of-Experts
Zeliang Zhang
|
Xiaodong Liu
|
Hao Cheng
|
Chenliang Xu
|
Jianfeng Gao
In this work, we address the memory overhead of deploying Mixture-of-Experts (MoE) architectures in Large Language Models (LLMs). While MoE layers improve LLM performance without increasing inference costs, the ever-growing number of experts inflates memory requirements, hindering practical deployment. Our empirical study reveals that some experts encode redundant knowledge during pre-training. We thus propose a method of grouping and pruning similar experts to improve the model’s parameter efficiency. We validate the effectiveness of our method by pruning three state-of-the-art MoE architectures, including Mixtral, Deepseek-MoE, and Qwen. The evaluation shows that our method outperforms other model pruning methods on a range of natural language tasks.
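The grouping-and-pruning idea in this abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' method: the abstract does not say how expert similarity is measured or how a group is reduced, so the cosine similarity over flattened weight vectors, the greedy grouping rule, the `threshold` value, and all function names here are assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def group_experts(expert_vectors, threshold=0.9):
    """Greedily group experts whose vectors are similar to a group's first member.

    Assumption: each expert is summarized by one flat vector (hypothetical).
    """
    groups = []  # each group is a list of expert indices; index 0 is the representative
    for idx, vec in enumerate(expert_vectors):
        for group in groups:
            if cosine(vec, expert_vectors[group[0]]) >= threshold:
                group.append(idx)
                break
        else:
            # No sufficiently similar group exists, so start a new one.
            groups.append([idx])
    return groups


def prune_experts(expert_vectors, threshold=0.9):
    """Keep one representative expert index per group."""
    return [group[0] for group in group_experts(expert_vectors, threshold)]


# Toy experts: 0 and 1 are near-duplicates, as are 2 and 3.
experts = [[1.0, 0.0], [0.99, 0.1], [0.0, 1.0], [0.05, 1.0]]
groups = group_experts(experts)
kept = prune_experts(experts)
```

In this toy case the four experts collapse into two groups, so half of the expert parameters could be dropped; a real system would also need to redirect the router to the retained experts.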
pdf
bib
abs
A Persona-Aware LLM-Enhanced Framework for Multi-Session Personalized Dialogue Generation
Dongshuo Liu
|
Zhijing Wu
|
Dandan Song
|
Heyan Huang
Multi-session personalized dialogue generation is one of the most important topics in open-domain dialogue. It aims to generate responses consistent with the dialogue history and personality information across multiple sessions to engage users’ interest in the dialogue. Recent approaches focusing on history modeling and persona modeling have advanced the development of this field. However, they overlook the importance of dialogue structure in helping large language models (LLMs) understand the dialogue context. Moreover, these methods do not efficiently expand and utilize personality information, reducing the responses’ consistency. In this paper, we propose a Persona-Aware LLM-enAnCEd (PALACE) framework for multi-session personalized dialogue generation. Specifically, the framework consists of three components: a topic-aware memory bank, a persona prompt learning module, and VAE-LoRA. The topic-aware memory bank works by retrieving historical information that possesses a certain dialogue structure and relevant topics. The persona prompt learning module enhances the LLM’s persona-aware capabilities by utilizing a persona commonsense knowledge graph and a query-driven graph neural network. Furthermore, to enhance the generative capabilities of the LLM and obtain more useful prior knowledge, we combine VAE with LoRA to propose VAE-LoRA. Experimental results on the MSC and DuLeMon datasets demonstrate that our framework outperforms the state-of-the-art methods in automatic and human evaluation metrics.
pdf
bib
abs
Exploring In-Image Machine Translation with Real-World Background
Yanzhi Tian
|
Zeming Liu
|
Zhengyang Liu
|
Yuhang Guo
In-Image Machine Translation (IIMT) aims to translate texts within images from one language to another. Previous research on IIMT was primarily conducted on simplified scenarios such as images of one-line text with black font on white backgrounds, which is far from reality and impractical for applications in the real world. To make IIMT research practically valuable, it is essential to consider a complex scenario where the text backgrounds are derived from real-world images. To facilitate research on IIMT in complex scenarios, we design an IIMT dataset that includes subtitle text with a real-world background. However, previous IIMT models perform inadequately in complex scenarios. To address the issue, we propose the DebackX model, which separates the background and text-image from the source image, performs translation on the text-image directly, and fuses the translated text-image with the background to generate the target image. Experimental results show that our model achieves improvements in both translation quality and visual effect.
pdf
bib
abs
BayesKD: Bayesian Knowledge Distillation for Compact LLMs in Constrained Fine-tuning Scenarios
Wei Li
|
Lujun Li
|
Mark G. Lee
|
Shengjie Sun
|
Lei Zhang
|
Wei Xue
|
Yike Guo
Large language models (LLMs) have revolutionized various domains with their remarkable capabilities, but their massive parameter sizes pose significant challenges for fine-tuning and inference, especially in resource-constrained environments. Conventional compression methods often result in substantial performance degradation within LLMs and struggle to restore model quality during fine-tuning. To address this challenge, we present Bayesian Knowledge Distillation (BayesKD), a novel distillation framework meticulously designed for compact LLMs in resource-constrained fine-tuning scenarios. Departing from conventional LLM distillation methods that introduce time-consuming paradigms and fail to generalize in compressed LLM fine-tuning scenarios, our BayesKD develops the Logits Dual-Scaling, Knowledge Alignment Module, and Bayesian Distillation Optimization. In particular, our Logits Dual-Scaling strategy adaptively aligns the strength of the teacher’s knowledge transfer, while the Knowledge Alignment Module bridges the gap between the teacher and student models by projecting their knowledge representations into a shared interval. Additionally, we employ Logits-Aware Bayesian Optimization to swiftly identify optimal settings based on these strategies, thereby enhancing model performance. Extensive experiments across diverse tasks demonstrate that BayesKD consistently outperforms baseline methods on various state-of-the-art LLMs, including LLaMA, Qwen2, Bloom, and Vicuna. Notably, our BayesKD achieves average accuracy gains of 2.99% and 4.05% over standard KD for the 8B-parameter LLaMA and Qwen2 models. Codes are available in the supplementary materials.
pdf
bib
abs
GOLFer: Smaller LMs-Generated Documents Hallucination Filter & Combiner for Query Expansion in Information Retrieval
Lingyuan Liu
|
Mengxiang Zhang
Large language model (LLM)-based query expansion for information retrieval augments queries with hypothetical documents generated by LLMs. However, its performance relies heavily on the scale of the language models (LMs), necessitating larger, more advanced LLMs. This approach is costly, computationally intensive, and often has limited accessibility. To address these limitations, we introduce GOLFer - Smaller LMs-Generated Documents Hallucination Filter & Combiner - a novel method leveraging smaller open-source LMs for query expansion. GOLFer comprises two modules: a hallucination filter and a documents combiner. The former detects and removes non-factual and inconsistent sentences in generated documents, a common issue with smaller LMs, while the latter combines the filtered content with the query using a weight vector to balance their influence. We evaluate GOLFer alongside dominant LLM-based query expansion methods on three web search and ten low-resource datasets. Experimental results demonstrate that GOLFer consistently outperforms other methods using smaller LMs, and maintains competitive performance against methods using large-size LLMs, demonstrating its effectiveness.
pdf
bib
abs
Exp4Fuse: A Rank Fusion Framework for Enhanced Sparse Retrieval using Large Language Model-based Query Expansion
Lingyuan Liu
|
Mengxiang Zhang
Large Language Models (LLMs) have shown potential in generating hypothetical documents for query expansion, thereby enhancing information retrieval performance. However, the efficacy of this method is highly dependent on the quality of the generated documents, which often requires complex prompt strategies and the integration of advanced dense retrieval techniques. This can be both costly and computationally intensive. To mitigate these limitations, we explore the use of zero-shot LLM-based query expansion to improve sparse retrieval, particularly for learned sparse retrievers. We introduce a novel fusion ranking framework, Exp4Fuse, which enhances the performance of sparse retrievers through an indirect application of zero-shot LLM-based query expansion. Exp4Fuse operates by simultaneously considering two retrieval routes—one based on the original query and the other on the LLM-augmented query. It then generates two ranked lists using a sparse retriever and fuses them using a modified reciprocal rank fusion method. We conduct extensive evaluations of Exp4Fuse against leading LLM-based query expansion methods and advanced retrieval techniques on three MS MARCO-related datasets and seven low-resource datasets. Experimental results reveal that Exp4Fuse not only surpasses existing LLM-based query expansion methods in enhancing sparse retrievers but also, when combined with advanced sparse retrievers, achieves SOTA results on several benchmarks. This highlights the superior performance and effectiveness of Exp4Fuse in improving query expansion for sparse retrieval.
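As a rough illustration of the fusion step described in this abstract, the sketch below applies standard reciprocal rank fusion (RRF) to two ranked lists, one from the original query and one from the LLM-expanded query. The paper uses a modified RRF whose details are not given in the abstract, so this is plain RRF only; the function name `rrf_fuse`, the constant `k=60`, and the document IDs are assumptions made for illustration.

```python
from collections import defaultdict


def rrf_fuse(ranked_lists, k=60):
    """Fuse ranked lists with standard reciprocal rank fusion.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with ranks starting at 1. This is plain RRF, not the paper's modified variant.
    """
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by fused score (descending); break ties by document ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical outputs of one sparse retriever on the two query variants.
original_route = ["d1", "d2", "d3"]
expanded_route = ["d3", "d1", "d4"]
fused = rrf_fuse([original_route, expanded_route])
```

Documents that appear in both routes (`d1`, `d3`) accumulate two reciprocal-rank terms and therefore rise above documents found by only one route, which is the behavior that makes rank fusion attractive when the expanded query is noisy.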
pdf
bib
abs
Emo Pillars: Knowledge Distillation to Support Fine-Grained Context-Aware and Context-Less Emotion Classification
Alexander Shvets
Most datasets for sentiment analysis lack the context in which an opinion was expressed, which is often crucial for emotion understanding, and are mainly limited to a few emotion categories. Foundation large language models (LLMs) like GPT-4 suffer from over-predicting emotions and are too resource-intensive. We design an LLM-based data synthesis pipeline and leverage a large model, Mistral-7b, for the generation of training examples for more accessible, lightweight BERT-type encoder models. We focus on enlarging the semantic diversity of examples and propose grounding the generation into a corpus of narratives to produce non-repetitive story-character-centered utterances with unique contexts over 28 emotion classes. By running 700K inferences in 450 GPU hours, we contribute a dataset of 100K contextual and 300K context-less examples to cover both scenarios. We use it for fine-tuning pre-trained encoders, which results in several Emo Pillars models. We show that Emo Pillars models are highly adaptive to new domains when tuned to specific tasks such as GoEmotions, ISEAR, IEMOCAP, and EmoContext, reaching the SOTA performance on the first three. We also validate our dataset, conducting statistical analysis and human evaluation, and confirm the success of our measures in utterance diversification (although less for the neutral class) and context personalization, while pointing out the need for improved handling of out-of-taxonomy labels within the pipeline.
pdf
bib
abs
Multi-Prompting Decoder Helps Better Language Understanding
Zifeng Cheng
|
Zhaoling Chen
|
Zhiwei Jiang
|
Yafeng Yin
|
Cong Wang
|
Shiping Ge
|
Qing Gu
Recent large Pre-trained Language Models (PLMs) usually only provide users with the inference APIs, namely the emerging Model-as-a-Service (MaaS) setting. To adapt MaaS PLMs to downstream tasks without accessing their parameters and gradients, some existing methods focus on the output-side adaptation of PLMs, viewing the PLM as an encoder and then optimizing a task-specific decoder for decoding the output hidden states and class scores of the PLM. Despite the effectiveness of these methods, they only use a single prompt to query PLMs for decoding, leading to a heavy reliance on the quality of the adopted prompt. In this paper, we propose a simple yet effective Multi-Prompting Decoder (MPD) framework for MaaS adaptation. The core idea is to query PLMs with multiple different prompts for each sample, thereby obtaining multiple output hidden states and class scores from PLMs for subsequent decoding. Such multi-prompting decoding paradigm can simultaneously mitigate reliance on the quality of a single prompt, alleviate the issue of data scarcity under the few-shot setting, and provide richer knowledge extracted from PLMs. Specifically, we propose two decoding strategies: multi-prompting decoding with optimal transport for hidden states and calibrated decoding for class scores. Extensive experiments demonstrate that our method achieves new state-of-the-art results on multiple natural language understanding datasets under the few-shot setting.
pdf
bib
abs
Visual Cues Enhance Predictive Turn-Taking for Two-Party Human Interaction
Sam O’Connor Russell
|
Naomi Harte
Turn-taking is richly multimodal. Predictive turn-taking models (PTTMs) facilitate naturalistic human-robot interaction, yet most rely solely on speech. We introduce MM-VAP, a multimodal PTTM which combines speech with visual cues including facial expression, head pose and gaze. We find that it outperforms the state-of-the-art audio-only model in videoconferencing interactions (84% vs. 79% hold/shift prediction accuracy). Unlike prior work which aggregates all holds and shifts, we group by duration of silence between turns. This reveals that through the inclusion of visual features, MM-VAP outperforms a state-of-the-art audio-only turn-taking model across all durations of speaker transitions. We conduct a detailed ablation study, which reveals that facial expression features contribute the most to model performance. Thus, our working hypothesis is that when interlocutors can see one another, visual cues are vital for turn-taking and must therefore be included for accurate turn-taking prediction. We additionally validate the suitability of automatic speech alignment for PTTM training using telephone speech. This work represents the first comprehensive analysis of multimodal PTTMs. We discuss implications for future work and make all code publicly available.
pdf
bib
abs
The Right Time Matters: Data Arrangement Affects Zero-Shot Generalization in Instruction Tuning
Bingxiang He
|
Ning Ding
|
Cheng Qian
|
Jia Deng
|
Ganqu Cui
|
Lifan Yuan
|
Haiwen Hong
|
Huan-ang Gao
|
Longtao Huang
|
Hui Xue
|
Huimin Chen
|
Zhiyuan Liu
|
Maosong Sun
Understanding alignment techniques begins with comprehending zero-shot generalization brought by instruction tuning, but little of the mechanism has been understood. Existing work has largely been confined to the task level, without considering that tasks are artificially defined and, to LLMs, merely consist of tokens and representations. To bridge this gap, we investigate zero-shot generalization from the perspective of the data itself. We first demonstrate that zero-shot generalization happens very early during instruction tuning, with loss serving as a stable indicator. Next, we investigate training data arrangement through similarity and granularity perspectives, confirming that the timing of exposure to certain training examples may greatly facilitate generalization on unseen tasks. Finally, we propose a more grounded training data arrangement framework, Test-centric Multi-turn Arrangement, and show its effectiveness in promoting continual learning and further loss reduction. For the first time, we show that zero-shot generalization during instruction tuning is a form of similarity-based generalization between training and test data at the instance level.
pdf
bib
abs
MFinMeeting: A Multilingual, Multi-Sector, and Multi-Task Financial Meeting Understanding Evaluation Dataset
Jie Zhu
|
Junhui Li
|
Yalong Wen
|
Xiandong Li
|
Lifan Guo
|
Feng Chen
Recent breakthroughs in large language models (LLMs) have led to the development of new benchmarks for evaluating their performance in the financial domain. However, current financial benchmarks often rely on news articles, earnings reports, or announcements, making it challenging to capture the real-world dynamics of financial meetings. To address this gap, we propose a novel benchmark called MFinMeeting, which is a multilingual, multi-sector, and multi-task dataset designed for financial meeting understanding. First, MFinMeeting supports English, Chinese, and Japanese, enhancing comprehension of financial discussions in diverse linguistic contexts. Second, it encompasses various industry sectors defined by the Global Industry Classification Standard (GICS), ensuring that the benchmark spans a broad range of financial activities. Finally, MFinMeeting includes three tasks: summarization, question-answer (QA) pair extraction, and question answering, facilitating a more realistic and comprehensive evaluation of understanding. Experimental results with seven popular LLMs reveal that even the most advanced long-context models have significant room for improvement, demonstrating the effectiveness of MFinMeeting as a benchmark for assessing LLMs’ financial meeting comprehension skills.
pdf
bib
abs
ODDA: An OODA-Driven Diverse Data Augmentation Framework for Low-Resource Relation Extraction
Yijie Zhong
|
Yunfan Gao
|
Xiaolian Zhang
|
Haofen Wang
Data Augmentation (DA) has emerged as a promising solution to address the scarcity of high-quality annotated data in low-resource relation extraction (LRE). Leveraging large language models (LLMs), DA has significantly improved the performance of RE models with considerably fewer parameters. However, existing DA methods struggle with diversity misalignments, as they neglect the diversity required by the model and generate homogeneous augmentations that do not cover the inter-sample and inter-relation variability, leading to suboptimal performance. Inspired by the Observe-Orient-Decide-Act (OODA) framework, which provides a robust theoretical foundation for iterative decision-making under dynamic conditions, we propose an OODA-driven Diverse DA method (ODDA), guiding the data generation and selection process. ODDA first observes the RE model’s behavior to select effective demonstrations for LLMs. Next, it orients LLMs towards generating diverse data by replacing schema constraints with attribute constraints. Then ODDA decides on the final augmented dataset with overall diversity from a global search and finally acts to train the RE model. Extensive experiments on three widely-used benchmarks demonstrate that ODDA consistently outperforms state-of-the-art baselines, achieving average F1 improvements of 3.1% across various LRE scenarios while maintaining enhanced model stability.
pdf
bib
abs
Detecting and Mitigating Challenges in Zero-Shot Video Summarization with Video LLMs
Luca Cagliero
|
Lorenzo Vaiani
|
Eliana Pastor
|
Alkis Koudounas
|
Elena Baralis
|
Vittorio Mazzia
|
Sandro Pollastrini
|
Thomas Gueudre
|
Manuel Giollo
|
Daniele Amberti
|
Yue Wu
Video summarization aims to generate a condensed textual version of an original video. Summaries may consist of either plain text or a shortlist of salient events, possibly including temporal or spatial references. Video Large Language Models (VLLMs) exhibit impressive zero-shot capabilities in video analysis. However, their performance varies significantly according to the LLM prompt, the characteristics of the video, and the properties of the training data and LLM architecture. In this work, we thoroughly evaluate the zero-shot summarization performance of four state-of-the-art open-source VLLMs specifically designed to address spatial and temporal reasoning. In light of the detected summarization issues, we propose different cost-effective mitigation strategies, based on Chain-of-Thought prompting, that involve the injection of knowledge extracted by external, lightweight models. To perform the VLLM evaluation, we design a new video summarization benchmark consisting of 100 videos with varying characteristics in terms of domain, duration, and spatio-temporal properties. Videos are manually annotated by three independent human experts with plain text, event-based, and spatio-temporal summaries. The experimental evaluation shows that VLLMs significantly benefit from prompting a list of recognized actions, whereas injecting automatically recognized objects and scene changes respectively improve spatially contextualized and event-based summaries in specific cases.
pdf
bib
abs
Entity Framing and Role Portrayal in the News
Tarek Mahmoud
|
Zhuohan Xie
|
Dimitar Iliyanov Dimitrov
|
Nikolaos Nikolaidis
|
Purificação Silvano
|
Roman Yangarber
|
Shivam Sharma
|
Elisa Sartori
|
Nicolas Stefanovitch
|
Giovanni Da San Martino
|
Jakub Piskorski
|
Preslav Nakov
We introduce a novel multilingual and hierarchical corpus annotated for entity framing and role portrayal in news articles. The dataset uses a unique taxonomy inspired by storytelling elements, comprising 22 fine-grained roles, or archetypes, nested within three main categories: protagonist, antagonist, and innocent. Each archetype is carefully defined, capturing nuanced portrayals of entities such as guardian, martyr, and underdog for protagonists; tyrant, deceiver, and bigot for antagonists; and victim, scapegoat, and exploited for innocents. The dataset includes 1,378 recent news articles in five languages (Bulgarian, English, Hindi, European Portuguese, and Russian) focusing on two critical domains of global significance: the Ukraine-Russia War and Climate Change. Over 5,800 entity mentions have been annotated with role labels. This dataset serves as a valuable resource for research into role portrayal and has broader implications for news analysis. We describe the characteristics of the dataset and the annotation process, and we report evaluation results on fine-tuned state-of-the-art multilingual transformers and hierarchical zero-shot learning using LLMs at the level of a document, a paragraph, and a sentence.
pdf
bib
abs
Derailer-Rerailer: Adaptive Verification for Efficient and Reliable Language Model Reasoning
Guangya Wan
|
Yuqi Wu
|
Hao Wang
|
Shengming Zhao
|
Jie Chen
|
Sheng Li
Large Language Models (LLMs) have shown impressive reasoning capabilities, yet existing prompting methods face a critical trade-off: simple approaches often struggle with complex tasks and reasoning stability, while more sophisticated methods require multiple inferences and substantial computational resources, limiting their practical deployment. To address this challenge, we propose Derailer-Rerailer, a novel framework that adaptively balances reasoning accuracy and computational efficiency. At its core, our framework employs a lightweight Derailer mechanism to assess reasoning stability and selectively triggers an advanced Rerailer verification process only when necessary, thereby optimizing computational resource usage. Extensive evaluation across both open and closed-source models on more than 20 categories of mathematical, symbolic, and commonsense reasoning tasks demonstrates our framework’s effectiveness: Derailer-Rerailer achieves significant accuracy improvements (8-11% across various reasoning tasks) while maintaining 2-3 times better efficiency than existing verification methods, with particularly strong performance in mathematical and symbolic reasoning, offering a practical solution for enhancing LLM reasoning reliability while significantly reducing computational overhead.
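The adaptive control flow described in this abstract (a cheap stability check that triggers expensive verification only when needed) might be sketched as below. This is an illustration under stated assumptions, not the authors' implementation: the abstract does not define how the Derailer measures stability, so agreement among a few sampled answers stands in for it, and `sample_fn`, `verify_fn`, `n_samples`, and `min_agreement` are hypothetical names and parameters.

```python
from collections import Counter


def adaptive_answer(question, sample_fn, verify_fn, n_samples=3, min_agreement=1.0):
    """Return (answer, verified_flag).

    Cheap path: if sampled answers agree often enough, accept the majority answer.
    Expensive path: otherwise hand the question and samples to a verifier.
    Assumption: answer agreement is used as a stand-in for reasoning stability.
    """
    answers = [sample_fn(question) for _ in range(n_samples)]
    top_answer, top_count = Counter(answers).most_common(1)[0]
    agreement = top_count / n_samples
    if agreement >= min_agreement:
        return top_answer, False
    return verify_fn(question, answers), True


def make_sampler(outputs):
    """Build a deterministic stand-in for an LLM call that replays fixed outputs."""
    remaining = iter(outputs)
    return lambda question: next(remaining)


def toy_verifier(question, answers):
    """Hypothetical stand-in for a costly verification pass."""
    return "verified answer"


stable = adaptive_answer("2 + 2?", make_sampler(["4", "4", "4"]), toy_verifier)
unstable = adaptive_answer("2 + 2?", make_sampler(["4", "5", "4"]), toy_verifier)
```

The efficiency gain in such a scheme comes from the fraction of questions that take the cheap path, so the agreement threshold trades accuracy against the number of verifier calls.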
pdf
bib
abs
Leveraging Large Language Models for Conversational Multi-Doc Question Answering: The First Place of WSDM Cup 2024
Yiming Li
|
Zhao Zhang
Conversational multi-doc question answering aims to answer specific questions based on the retrieved documents as well as the contextual conversations. In this paper, we introduce our winning approach for the “Conversational Multi-Doc QA” challenge in WSDM Cup 2024, which exploits the superior natural language understanding and generation capability of Large Language Models (LLMs). We first adapt LLMs to the task, then devise a hybrid training strategy to make the most of in-domain unlabeled data. Moreover, an advanced text embedding model is adopted to filter out potentially irrelevant documents, and several approaches are designed and compared for the model ensemble. Equipped with all these techniques, our solution ranked first in WSDM Cup 2024, surpassing its rivals by a large margin. The source code has been released at https://github.com/zhangzhao219/WSDM-Cup-2024.
pdf
bib
abs
TreeRAG: Unleashing the Power of Hierarchical Storage for Enhanced Knowledge Retrieval in Long Documents
Wenyu Tao
|
Xiaofen Xing
|
Yirong Chen
|
Linyi Huang
|
Xiangmin Xu
When confronting long document information retrieval for Query-Focused Summarization (QFS), traditional Retrieval-Augmented Generation (RAG) frameworks struggle to retrieve all relevant knowledge points, and the chunking and retrieval strategies of existing frameworks may disrupt the connections between knowledge points and the integrity of the information. To address these issues, we propose TreeRAG, which employs Tree-Chunking for chunking and embedding in a tree-like structure, coupled with a "root-to-leaves" and "leaf-to-root" retrieval strategy named Bidirectional Traversal Retrieval. This approach effectively preserves the hierarchical structure among knowledge points and significantly enhances the ability to retrieve while minimizing noise inference. Our experimental results on the Finance, Law, and Medical subsets of the Dragonball dataset demonstrate that TreeRAG achieves significant enhancements in both recall quality and precision compared to traditional and popular existing methods and achieves better performance on the corresponding question-answering tasks, marking a new breakthrough in long document knowledge retrieval.
pdf
bib
abs
Attention with Dependency Parsing Augmentation for Fine-Grained Attribution
Qiang Ding
|
Lvzhou Luo
|
Yixuan Cao
|
Ping Luo
To assist humans in efficiently validating RAG-generated content, developing a fine-grained attribution mechanism that provides supporting evidence from retrieved documents for every answer span is essential. Existing fine-grained attribution methods rely on model-internal similarity metrics between responses and documents, such as saliency scores and hidden state similarity. However, these approaches suffer from either high computational complexity or coarse-grained representations. Additionally, a common problem shared by the previous works is their reliance on decoder-only Transformers, limiting their ability to incorporate contextual information after the target span. To address the above problems, we propose two techniques applicable to all model-internals-based methods. First, we aggregate token-wise evidence through set union operations, preserving the granularity of representations. Second, we enhance the attributor by integrating dependency parsing to enrich the semantic completeness of target spans. For practical implementation, our approach employs attention weights as the similarity metric. Experimental results demonstrate that the proposed method consistently outperforms all prior works.
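The first technique in this abstract, aggregating token-wise evidence by set union, could look roughly like the sketch below. The abstract states only that attention weights serve as the similarity metric and that per-token evidence is combined by union; selecting the `top_k` most-attended source tokens per target token, the matrix layout, and the name `attribute_span` are assumptions made for illustration.

```python
def attribute_span(attention, span_token_ids, top_k=2):
    """Collect evidence for a target span as a union of per-token evidence sets.

    attention[t][s] is the attention weight from response token t to source token s.
    For each token in the span, take its top_k most-attended source tokens
    (an assumed selection rule), then union the resulting sets.
    """
    evidence = set()
    for t in span_token_ids:
        row = attention[t]
        # Rank source positions by weight (descending), breaking ties by position.
        top = sorted(range(len(row)), key=lambda s: (-row[s], s))[:top_k]
        evidence |= set(top)
    return sorted(evidence)


# Toy attention matrix: 3 response tokens attending over 4 source tokens.
attention = [
    [0.10, 0.60, 0.20, 0.10],
    [0.50, 0.10, 0.10, 0.30],
    [0.05, 0.05, 0.80, 0.10],
]
narrow = attribute_span(attention, [0, 1], top_k=1)
wide = attribute_span(attention, [0, 2], top_k=2)
```

Taking a union rather than averaging the attention rows keeps each token's own evidence intact, which matches the abstract's stated goal of preserving the granularity of representations.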
pdf
bib
abs
ASTRO: Automatic Strategy Optimization For Non-Cooperative Dialogues
Yikuan Hu
|
Chen Huang
|
Wenqiang Lei
Non-cooperative dialogues, such as negotiations and persuasion, present significant challenges for large language models (LLMs) due to the lack of inherent cooperation or shared goals. Current methods require substantial human effort to optimize dialogue strategies. To address these challenges, we propose ASTRO (Automated Strategy Optimization), a fully automated solution that leverages LLMs’ self-evolving capabilities. ASTRO dynamically generates customized strategy sets based on task goals and optimizes the strategy planner using a self-play reinforcement learning paradigm. Our experimental results demonstrate ASTRO’s significant performance improvements over baseline models across various non-cooperative dialogue tasks, highlighting the potential for autonomously developing such agents without human intervention. Our code is available at https://github.com/SCUNLP/ASTRO.
pdf
bib
abs
Defensive Prompt Patch: A Robust and Generalizable Defense of Large Language Models against Jailbreak Attacks
Chen Xiong
|
Xiangyu Qi
|
Pin-Yu Chen
|
Tsung-Yi Ho
Safety, security, and compliance are essential requirements when aligning large language models (LLMs). However, many seemingly aligned LLMs are soon shown to be susceptible to jailbreak attacks. These attacks aim to circumvent the models’ safety guardrails and security mechanisms by introducing jailbreak prompts into malicious queries. In response to these challenges, this paper introduces **Defensive Prompt Patch** (DPP), a novel prompt-based defense mechanism specifically designed to protect LLMs against such sophisticated jailbreak strategies. Unlike previous approaches, which have often compromised the utility of the model for the sake of safety, DPP is designed to achieve a minimal Attack Success Rate (ASR) while preserving the high utility of LLMs. Our method uses strategically designed suffix prompts that effectively thwart a wide range of standard and adaptive jailbreak techniques. Empirical results on Llama-2-7B-Chat and Mistral-7B-Instruct-v0.2 demonstrate the robustness and adaptability of DPP, showing significant reductions in ASR with negligible impact on utility. Our approach not only outperforms existing defense strategies in balancing safety and functionality, but also provides a scalable and robust solution for various LLM platforms.
pdf
bib
abs
GUM-SAGE: A Novel Dataset and Approach for Graded Entity Salience Prediction
Jessica Lin
|
Amir Zeldes
Determining and ranking the most salient entities in a text is critical for user-facing systems, especially as users increasingly rely on models to interpret long documents they only partially read. Graded entity salience addresses this need by assigning entities scores that reflect their relative importance in a text. Existing approaches fall into two main categories: subjective judgments of salience, which allow for gradient scoring but lack consistency, and summarization-based methods, which define salience as mention-worthiness in a summary, promoting explainability but limiting outputs to binary labels (entities are either summary-worthy or not). In this paper, we introduce a novel approach for graded entity salience that combines the strengths of both approaches. Using an English dataset spanning 12 spoken and written genres, we collect 5 summaries per document and calculate each entity’s salience score based on its presence across these summaries. Our approach shows stronger correlation with scores based on human summaries and alignments, and outperforms existing techniques, including LLMs. We release our data and code at https://github.com/jl908069/gum_sum_salience to support further research on graded salient entity extraction.
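As a rough illustration of scoring an entity by its presence across several summaries, the sketch below counts how many summaries mention each entity. The input format, the example texts, and the case-insensitive substring matching are all assumptions made for illustration; the abstract does not specify how mentions are aligned to summaries.

```python
def salience_scores(entity_forms, summaries):
    """Map each entity to the number of summaries that mention it.

    entity_forms: dict of entity name -> list of surface forms (assumed format).
    summaries: list of summary strings written for one document.
    """
    scores = {}
    for entity, forms in entity_forms.items():
        scores[entity] = sum(
            any(form.lower() in summary.lower() for form in forms)
            for summary in summaries
        )
    return scores


# Five toy summaries of one hypothetical document.
summaries = [
    "Marie Curie won two Nobel Prizes.",
    "Curie discovered polonium and radium.",
    "The article describes Marie Curie's research in Paris.",
    "Curie and her husband Pierre studied radioactivity.",
    "A profile of a pioneering physicist.",
]
entity_forms = {
    "Marie Curie": ["Marie Curie", "Curie"],
    "Paris": ["Paris"],
    "Pierre Curie": ["Pierre"],
}
print(salience_scores(entity_forms, summaries))
# {'Marie Curie': 4, 'Paris': 1, 'Pierre Curie': 1}
```

With five summaries per document, a count like this yields a graded score from 0 to 5 instead of a binary summary-worthy label.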
pdf
bib
abs
Verifying the Steps of Deductive Reasoning Chains
Zacchary Sadeddine
|
Fabian M. Suchanek
As Large Language Models penetrate everyday life more and more, it becomes essential to measure the correctness of their output. In this paper, we propose a novel task: the automatic verification of individual reasoning steps in a logical deductive Chain-of-Thought. This task addresses two well-known problems of LLMs, hallucination and incorrect reasoning. We propose a new dataset of logical reasoning chains, in which the individual deduction steps have been manually annotated for soundness, and benchmark several methods on it. We find that LLMs can detect unsound reasoning steps fairly well, but argue that verification has to be performed by transparent methods instead. We test symbolic methods, but find that they under-perform. We develop a neuro-symbolic baseline called VANESSA that comes closer to the performance of LLMs.
pdf
bib
abs
Translate With Care: Addressing Gender Bias, Neutrality, and Reasoning in Large Language Model Translations
Pardis Sadat Zahraei
|
Ali Emami
Addressing gender bias and maintaining logical coherence in machine translation remains challenging, particularly when translating between natural gender languages, like English, and genderless languages, such as Persian, Indonesian, and Finnish. We introduce the Translate-with-Care (TWC) dataset, comprising 3,950 challenging scenarios across six low- to mid-resource languages, to assess translation systems’ performance. Our analysis of diverse technologies, including GPT-4, mBART-50, NLLB-200, and Google Translate, reveals a universal struggle in translating genderless content, resulting in gender stereotyping and reasoning errors. All models preferred masculine pronouns when gender stereotypes could influence choices. Google Translate and GPT-4 showed particularly strong bias, favoring male pronouns 4-6 times more than feminine ones in leadership and professional success contexts. Fine-tuning mBART-50 on TWC substantially resolved these biases and errors, led to strong generalization, and surpassed proprietary LLMs while remaining open-source. This work emphasizes the need for targeted approaches to gender and semantic coherence in machine translation, particularly for genderless languages, contributing to more equitable and accurate translation systems.
pdf
bib
abs
Utilizing Semantic Textual Similarity for Clinical Survey Data Feature Selection
Benjamin C Warner
|
Ziqi Xu
|
Simon Haroutounian
|
Thomas Kannampallil
|
Chenyan Lu
Surveys are widely used to collect patient data in healthcare, and there is significant clinical interest in predicting patient outcomes using survey data. However, surveys often include numerous features that lead to high-dimensional inputs for machine learning models. This paper exploits a unique source of information in surveys for feature selection. We observe that feature names (i.e., survey questions) are often semantically indicative of what features are most useful. Using language models, we leverage semantic textual similarity (STS) scores between features and targets to select features. The performance of STS scores in directly ranking features as well as in the minimal-redundancy-maximal-relevance (mRMR) algorithm is evaluated using survey data collected as part of a clinical study on persistent post-surgical pain (PPSP) as well as an accessible dataset collected through the NIH All of Us program. Our findings show that features selected with STS can result in higher performance models compared to traditional feature selection algorithms.
pdf
bib
abs
Distance between Relevant Information Pieces Causes Bias in Long-Context LLMs
Runchu Tian
|
Yanghao Li
|
Yuepeng Fu
|
Siyang Deng
|
Qinyu Luo
|
Cheng Qian
|
Shuo Wang
|
Xin Cong
|
Zhong Zhang
|
Yesai Wu
|
Yankai Lin
|
Huadong Wang
|
Xiaojiang Liu
Positional bias in large language models hinders their ability to effectively process long inputs. A prominent example is the “lost in the middle” phenomenon, where LLMs struggle to utilize relevant information situated in the middle of the input. While prior research primarily focuses on single pieces of relevant information, real-world applications often involve multiple relevant information pieces. To bridge this gap, we present LongPiBench, a benchmark designed to assess positional bias involving multiple pieces of relevant information. It includes various tasks and input lengths. Thorough experiments are conducted with three commercial and six open-source models. These experiments reveal that while most current models are more robust against the “lost in the middle” issue, there also exist noticeable biases related to the spacing of relevant information pieces. These findings highlight the importance of evaluating and reducing positional biases for long-context LLMs.
pdf
bib
abs
Variable Layerwise Quantization: A Simple and Effective Approach to Quantize LLMs
Razvan-Gabriel Dumitru
|
Vikas Yadav
|
Rishabh Maheshwary
|
Paul Ioan Clotan
|
Sathwik Tejaswi Madhusudhan
|
Mihai Surdeanu
We present a simple meta quantization approach that quantizes different layers of a large language model (LLM) at different bit levels, and is independent of the underlying quantization technique. Specifically, we quantize the most important layers to higher bit precision and less important layers to lower bits. We propose two effective strategies to measure the importance of layers within LLMs: the first measures the importance of a layer based on how different its output embeddings are from the input embeddings (higher is better); the second estimates the importance of a layer using the number of layer weights that are much larger than average (smaller is better). We show that quantizing different layers at varying bits as per our importance scores results in minimal performance drop with a far more compressed model. Finally, we present several practical key takeaways from our variable layer-wise quantization experiments: (a) LLM performance under variable quantization remains close to the original model until 25–50% of layers are moved to lower quantization using our proposed ordering, but only until 5–10% if moved using no specific ordering; (b) Adding layer importance to inherently dynamic quantization techniques can further improve their performance, showing that our approach is complementary to other dynamic quantization methods; (c) Quantizing LLMs to lower bits performs substantially better than pruning unless extreme quantization (2-bit) is used; and (d) Layer-wise quantization to lower bits works better in the case of larger LLMs with more layers compared to smaller LLMs with fewer layers.
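The two layer-importance signals and the resulting bit assignment can be sketched as follows. This is a minimal illustration under stated assumptions: cosine distance stands in for "how different" the output embeddings are, "much larger than average" is read as exceeding `factor` times the mean absolute weight, and the function names, the threshold, and the 4-bit/2-bit split are hypothetical choices that do not come from the paper.

```python
import math


def cosine_distance(u, v):
    """Signal 1 (assumed form): 1 - cos(input, output); higher means more important."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def outlier_count(weights, factor=3.0):
    """Signal 2 (assumed form): weights above factor x mean |w|; fewer means more important."""
    mean_abs = sum(abs(w) for w in weights) / len(weights)
    return sum(abs(w) > factor * mean_abs for w in weights)


def assign_bits(importance, high_bits=4, low_bits=2, low_fraction=0.5):
    """Give the least important layers `low_bits` and the remaining layers `high_bits`."""
    order = sorted(range(len(importance)), key=lambda i: importance[i])
    n_low = int(len(importance) * low_fraction)
    bits = [high_bits] * len(importance)
    for i in order[:n_low]:
        bits[i] = low_bits
    return bits


# Toy (input, output) embedding pairs for four layers.
layer_io = [([1, 0], [0, 1]), ([1, 0], [1, 0.1]), ([1, 1], [1, 1.2]), ([1, 0], [0.5, 0.5])]
importance = [cosine_distance(x, y) for x, y in layer_io]
print(assign_bits(importance))  # [4, 2, 2, 4]
```

To rank layers by the second signal instead, the same `assign_bits` call can be given `-outlier_count(w)` per layer, so that layers with fewer outlier weights rank as more important.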
pdf
bib
abs
Why Are Positional Encodings Nonessential for Deep Autoregressive Transformers? A Petroglyph Revisited
Kazuki Irie
Do autoregressive Transformer language models require explicit positional encodings (PEs)? The answer is ‘no’ provided they have more than one layer—they can distinguish sequences with permuted tokens without the need for explicit PEs. This follows from the fact that a cascade of (permutation invariant) set processors can collectively exhibit sequence-sensitive behavior in the autoregressive setting. This property has been known since early efforts (contemporary with GPT-2) adopting the Transformer for language modeling. However, this result does not appear to have been well disseminated, leading to recent rediscoveries. This may be partially due to a sudden growth of the language modeling community after the advent of GPT-2/3, but perhaps also due to the lack of a clear explanation in prior work, despite being commonly understood by practitioners in the past. Here we review the long-forgotten explanation of why explicit PEs are nonessential for multi-layer autoregressive Transformers (in contrast, one-layer models require PEs to discern order information of their inputs), as well as the origin of this result, and hope to re-establish it as common knowledge.
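A toy example may make the argument concrete. The sketch below replaces attention with a uniform causal average, a deliberately simplified stand-in that is an assumption of this example and not the paper's construction: each position sees only the multiset of inputs up to itself and has no positional encoding. One such layer cannot tell two orderings of the same prefix apart at the last position, while stacking two layers can.

```python
def causal_mean_layer(xs):
    """Each position outputs the mean of all inputs up to and including itself.

    This mimics uniform causal attention without positional encodings: the
    output at a position depends only on the multiset of visible inputs.
    """
    out, running = [], 0.0
    for i, x in enumerate(xs, start=1):
        running += x
        out.append(running / i)
    return out


a = [1.0, 2.0, 3.0]
b = [2.0, 1.0, 3.0]  # same tokens, first two swapped

one_a, one_b = causal_mean_layer(a), causal_mean_layer(b)
two_a, two_b = causal_mean_layer(one_a), causal_mean_layer(one_b)

print(one_a[-1] == one_b[-1])  # True: one layer gives the same last output
print(two_a[-1] == two_b[-1])  # False: two layers separate the orderings
```

The second layer receives first-layer outputs that already differ by position (each one summarizes a different prefix), which is how a cascade of order-blind set operations becomes order-sensitive under causal masking.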
pdf
bib
abs
CRPO: Confidence-Reward Driven Preference Optimization for Machine Translation
Guofeng Cui
|
Pichao Wang
|
Yang Liu
|
Zemian Ke
|
Zhu Liu
|
Vimal Bhat
Large language models (LLMs) have shown great potential in natural language processing tasks, but their application to machine translation (MT) remains challenging due to pretraining on English-centric data and the complexity of reinforcement learning from human feedback (RLHF). Direct Preference Optimization (DPO) has emerged as a simpler and more efficient alternative, but its performance depends heavily on the quality of preference data. To address this, we propose Confidence-Reward driven Preference Optimization (CRPO), a novel method that combines reward scores with model confidence to improve data selection for fine-tuning. CRPO selects challenging sentence pairs where the model is uncertain or underperforms, leading to more effective learning. While primarily designed for LLMs, CRPO also generalizes to encoder-decoder models like NLLB, demonstrating its versatility. Empirical results show that CRPO outperforms existing methods such as RS-DPO, RSO and MBR score in both translation accuracy and data efficiency.
pdf
bib
abs
Talking Point based Ideological Discourse Analysis in News Events
Nishanth Sridhar Nakshatri
|
Nikhil Mehta
|
Siyi Liu
|
Sihao Chen
|
Daniel Hopkins
|
Dan Roth
|
Dan Goldwasser
Analyzing ideological discourse even in the age of LLMs remains a challenge, as these models often struggle to capture the key elements that shape real-world narratives. Specifically, LLMs fail to focus on characteristic elements driving dominant discourses and lack the ability to integrate contextual information required for understanding abstract ideological views. To address these limitations, we propose a framework motivated by the theory of ideological discourse analysis to analyze news articles related to real-world events. Our framework represents the news articles using a relational structure (talking points), which captures the interaction between entities, their roles, and media frames along with a topic of discussion. It then constructs a vocabulary of repeating themes (prominent talking points) that are used to generate ideology-specific viewpoints (or partisan perspectives). We evaluate our framework’s ability to generate these perspectives through automated tasks (ideology and partisan classification), supplemented by human validation. Additionally, we demonstrate straightforward applicability of our framework in creating event snapshots, a visual way of interpreting event discourse. We release the resulting dataset and model to the community to support further research.
pdf
bib
abs
FlashBack: Efficient Retrieval-Augmented Language Modeling for Fast Inference
Runheng Liu
|
Xingchen Xiao
|
Heyan Huang
|
Zewen Chi
|
Zhijing Wu
Retrieval-Augmented Language Modeling (RALM), which integrates large language models (LLMs) with relevant documents from an external corpus, is a proven methodology for enabling the LLM to generate information beyond the scope of its pre-training corpus. Previous work retrieves a set of tokens iteratively and prepends the retrieved content to the input, which incurs high runtime and degrades the inference efficiency of the LLMs because they fail to use the Key-Value (KV) cache efficiently. We propose FlashBack, a modular RALM designed to improve the inference efficiency of RALM with an appending context pattern while maintaining decent performance after fine-tuning with Low-Rank Adaptation. FlashBack appends retrieved documents at the end of the context to utilize the KV cache efficiently. We also introduce the Marking Token as two special prompt tokens for marking the appended context during fine-tuning. Our experiments show that FlashBack can improve language modeling performance in terms of perplexity. We proved that the Marking Token is a usable add-on when fine-tuning models on specific context patterns. By bypassing unnecessary re-computation, FlashBack achieves fast inference speed with long context input. The inference speed is up to 4× faster than the prepending counterpart on a 7B LLM (Llama 2) in the runtime test.
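The efficiency argument can be sketched with a simple token count. The example assumes the usual behavior of a causal LM's KV cache, namely that cached keys and values are reusable only for the longest prefix shared between the old and new context; the function name and toy token lists are hypothetical, and this is not FlashBack's actual implementation.

```python
def tokens_to_recompute(cached, new_context):
    """Count tokens whose KV entries must be recomputed.

    Only the longest common prefix of the cached and new context can be
    served from the cache; everything after it is recomputed.
    """
    shared = 0
    for old, new in zip(cached, new_context):
        if old != new:
            break
        shared += 1
    return len(new_context) - shared


doc_v1 = ["<d1>"] * 4  # previously retrieved document (toy tokens)
doc_v2 = ["<d2>"] * 4  # newly retrieved document replacing it
history = [f"t{i}" for i in range(100)]  # running context

prepend_cost = tokens_to_recompute(doc_v1 + history, doc_v2 + history)
append_cost = tokens_to_recompute(history + doc_v1, history + doc_v2)
print(prepend_cost, append_cost)  # 104 4
```

Swapping a prepended document changes the very first tokens, so the whole context is recomputed, whereas swapping an appended document leaves the cached prefix intact.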
pdf
bib
abs
CMQCIC-Bench: A Chinese Benchmark for Evaluating Large Language Models in Medical Quality Control Indicator Calculation
Guangya Yu
|
Yanhao Li
|
Zongying Jiang
|
Yuxiong Jin
|
Li Dai
|
Yupian Lin
|
Ruihui Hou
|
Weiyan Zhang
|
Yongqi Fan
|
Qi Ye
|
Jingping Liu
|
Tong Ruan
Medical quality control indicators are essential to assess the qualifications of healthcare institutions for medical services. With the impressive performance of large language models (LLMs) like GPT-4 in the medical field, leveraging these technologies for the Medical Quality Control Indicator Calculation (MQCIC) presents a promising approach. In this work, (1) we introduce a real-world task, MQCIC, and propose an open-source Chinese electronic medical records (EMRs)-based dataset (CMQCIC-Bench) comprising 785 instances and 76 indicators. (2) We propose a semi-automatic method to enhance the rule representation. Then we propose the Clinical Facts-based Inferential Rule (CF-IR) method that disentangles the clinical fact verification and inferential rule reasoning actions. (3) We conduct comprehensive experiments on 20 representative LLMs, covering general and medical models. Our findings reveal that CF-IR outperforms Chain-of-Thought methods in MQCIC tasks. (4) We conduct an error analysis and investigate the capabilities of clinical fact verification and inferential rule reasoning, providing insights to further improve performance in MQCIC. The dataset and code are available in this repository: https://github.com/YuY-2001/C-MQCIC.
pdf
bib
abs
ConKE: Conceptualization-Augmented Knowledge Editing in Large Language Models for Commonsense Reasoning
Liyu Zhang
|
Weiqi Wang
|
Tianqing Fang
|
Yangqiu Song
Knowledge Editing (KE) aims to adjust a Large Language Model’s (LLM) internal representations and parameters to correct inaccuracies and improve output consistency without incurring the computational expense of re-training the entire model. However, editing commonsense knowledge still faces difficulties, including limited knowledge coverage in existing resources, the infeasibility of annotating labels for an overabundance of commonsense knowledge, and the strict knowledge formats of current editing methods. In this paper, we address these challenges by presenting ConceptEdit, a framework that integrates conceptualization and instantiation into the KE pipeline for LLMs to enhance their commonsense reasoning capabilities. ConceptEdit dynamically diagnoses implausible commonsense knowledge within an LLM using another verifier LLM and augments the source knowledge to be edited with conceptualization for stronger generalizability. Experimental results demonstrate that LLMs enhanced with ConceptEdit successfully generate commonsense knowledge with improved plausibility compared to other baselines and achieve stronger performance across multiple question answering benchmarks. Our data, code, and models are publicly available at https://github.com/HKUST-KnowComp/ConKE.
pdf
bib
abs
Exploring Multi-Modal Data with Tool-Augmented LLM Agents for Precise Causal Discovery
ChengAo Shen
|
Zhengzhang Chen
|
Dongsheng Luo
|
Dongkuan Xu
|
Haifeng Chen
|
Jingchao Ni
Causal discovery is an imperative foundation for decision-making across domains, such as smart health, AI for drug discovery and AIOps. Traditional statistical causal discovery methods, while well-established, predominantly rely on observational data and often overlook the semantic cues inherent in cause-and-effect relationships. The advent of Large Language Models (LLMs) has ushered in an affordable way of leveraging the semantic cues for knowledge-driven causal discovery, but the development of LLMs for causal discovery lags behind other areas, particularly in the exploration of multi-modal data. To bridge the gap, we introduce MatMCD, a multi-agent system powered by tool-augmented LLMs. MatMCD has two key agents: a Data Augmentation agent that retrieves and processes modality-augmented data, and a Causal Constraint agent that integrates multi-modal data for knowledge-driven reasoning. The proposed design of the inner workings ensures successful cooperation of the agents. Our empirical study across seven datasets suggests the significant potential of multi-modality enhanced causal discovery.
pdf
bib
abs
PARSQL: Enhancing Text-to-SQL through SQL Parsing and Reasoning
Yaxun Dai
|
Haiqin Yang
|
Mou Hao
|
Pingfu Chao
Large language models (LLMs) have made significant strides in text-to-SQL tasks; however, small language models (SLMs) are crucial due to their low resource consumption and efficient inference for real-world deployment. Due to resource limitations, SLMs struggle to accurately interpret natural language questions and may overlook critical constraints, leading to challenges such as generating SQL with incorrect logic or incomplete conditions. To address these issues, we propose PARSQL, a novel framework that leverages SQL parsing and reasoning. Specifically, we design PARSer, an SQL parser that extracts constraints from SQL to generate sub-SQLs for data augmentation and to produce step-by-step SQL explanations (reasons) via both rule-based and LLM-based methods. We define a novel text-to-reason task and incorporate it into multi-task learning, thereby enhancing text-to-SQL performance. Additionally, we employ an efficient SQL selection strategy that conducts direct similarity computation between the generated SQLs and their corresponding reasons to derive the final SQL for post-correction. Extensive experiments show that our PARSQL outperforms models with the same model size on the BIRD and Spider benchmarks. Notably, PARSQL-3B achieves 56.98% execution accuracy on BIRD, rivaling 7B models with significantly fewer parameters, setting a new state-of-the-art performance. Code can be found [here](https://github.com/yaxundai/parsql).
pdf
bib
abs
Probing the Geometry of Truth: Consistency and Generalization of Truth Directions in LLMs Across Logical Transformations and Question Answering Tasks
Yuntai Bao
|
Xuhong Zhang
|
Tianyu Du
|
Xinkui Zhao
|
Zhengwen Feng
|
Hao Peng
|
Jianwei Yin
Large language models (LLMs) are trained on extensive datasets that encapsulate substantial world knowledge. However, their outputs often include confidently stated inaccuracies. Earlier works suggest that LLMs encode truthfulness as a distinct linear feature, termed the “truth direction”, which can classify truthfulness reliably. We address several open questions about the truth direction: (i) whether LLMs universally exhibit consistent truth directions; (ii) whether sophisticated probing techniques are necessary to identify truth directions; and (iii) how the truth direction generalizes across diverse contexts. Our findings reveal that not all LLMs exhibit consistent truth directions, with stronger representations observed in more capable models, particularly in the context of logical negation. Additionally, we demonstrate that truthfulness probes trained on declarative atomic statements can generalize effectively to logical transformations, question-answering tasks, in-context learning, and external knowledge sources. Finally, we explore the practical application of truthfulness probes in selective question-answering, illustrating their potential to improve user trust in LLM outputs. These results advance our understanding of truth directions and provide new insights into the internal representations of LLM beliefs.
pdf
bib
abs
Comparing Bad Apples to Good Oranges: Aligning Large Language Models via Joint Preference Optimization
Hritik Bansal
|
Ashima Suvarna
|
Gantavya Bhatt
|
Nanyun Peng
|
Kai-Wei Chang
|
Aditya Grover
A common technique for aligning large language models (LLMs) relies on acquiring human preferences by comparing multiple generations conditioned on a fixed context. This method, however, relies solely on pairwise comparisons, where the generations are evaluated within an identical context. While effective, such conditional preferences often fail to encompass the nuanced and multidimensional nature of human preferences. In this work, we revisit the traditional paradigm of preference acquisition and propose a new axis based on eliciting preferences jointly over the instruction-response pairs. Unlike prior preference optimizations, which are designed for conditional ranking protocols (e.g., DPO), we propose Joint Preference Optimization (JPO), a new preference optimization objective that upweights the joint probability of the chosen instruction-response pair over the rejected instruction-response pair. Interestingly, LLMs trained with joint instruction-response preference data using JPO outperform LLMs trained with DPO by 5.2% and 3.3% win-rate for summarization and open-ended dialogue datasets, respectively. Our findings reveal that joint preferences over instruction and response pairs can significantly enhance the alignment of LLMs by tapping into a broader spectrum of human preference elicitation. The data and code are available at https://github.com/Hritikbansal/jpo.
pdf
bib
abs
TestAgent: An Adaptive and Intelligent Expert for Human Assessment
Junhao Yu
|
Yan Zhuang
|
Yuxuan Sun
|
Weibo Gao
|
Qi Liu
|
Mingyue Cheng
|
Zhenya Huang
|
Enhong Chen
Accurately assessing internal human states is key to understanding preferences, offering personalized services, and identifying challenges in real-world applications. Originating from psychometrics, adaptive testing has become the mainstream method for human measurement and has now been widely applied in education, healthcare, sports, and sociology. It customizes assessments by selecting the fewest test questions. However, current adaptive testing methods face several challenges. The mechanized nature of most algorithms leads to guessing behavior and difficulties with open-ended questions. Additionally, subjective assessments suffer from noisy response data and coarse-grained test outputs, further limiting their effectiveness. To move closer to an ideal adaptive testing process, we propose TestAgent, a large language model (LLM)-powered agent designed to enhance adaptive testing through interactive engagement. This is the first application of LLMs in adaptive testing. TestAgent supports personalized question selection, captures test-takers’ responses and anomalies, and provides precise outcomes through dynamic, conversational interactions. Experiments on psychological, educational, and lifestyle assessments show our approach achieves more accurate results with 20% fewer questions than state-of-the-art baselines, and testers preferred it in speed, smoothness, and other dimensions.
pdf
bib
abs
SPICA: Retrieving Scenarios for Pluralistic In-Context Alignment
Quan Ze Chen
|
Kevin Feng
|
Chan Young Park
|
Amy X Zhang
When different groups’ values differ, one approach to model alignment is to steer models at inference time towards each group’s preferences. However, techniques like in-context learning only consider similarity when drawing few-shot examples and not cross-group differences in values. We propose SPICA, a framework that accounts for group-level differences during in-context example retrieval. SPICA introduces three designs: scenario banks, group-informed retrieval metrics, and in-context alignment prompts. From an evaluation of SPICA on an alignment task collecting inputs from four demographic groups (n = 544), our metrics retrieve in-context examples that more closely match observed preferences, with the best prompt configuration using multiple contrastive responses to demonstrate examples. In an end-to-end evaluation (n = 120), we observe that SPICA is rated higher than similarity-based retrieval, with groups seeing up to a +0.16 point improvement on a 5-point scale. Additionally, gains from SPICA were more uniform, with all groups benefiting from alignment rather than only some. Finally, we find that while a group-agnostic approach can align to aggregated values, it is not the most suited for divergent groups.
pdf
bib
abs
First-Step Advantage: Importance of Starting Right in Multi-Step Math Reasoning
Kushal Jain
|
Moritz Miller
|
Niket Tandon
|
Kumar Shridhar
Language models can solve complex reasoning tasks better by learning to generate rationales for their predictions. Often these models know how to solve a task but their auto-regressive decoding nature leads to incorrect results if started incorrectly. We observe that smaller models in particular, when corrected, can solve a task that they would otherwise struggle with. We demonstrate this phenomenon by using a larger model to guide smaller models, which leads to significantly improved performance (up to +24 points on the GSM8K dataset by 7B models). To assist smaller models in initiating the starting step, we propose QuestCoT, where a smaller model first asks how to start before proceeding with a chain of reasoning. On various multistep mathematical reasoning datasets over multiple smaller models, we show that getting the start right can lead to significant performance gains across all models (gains of up to +6 points on GSM8K, +9 on SVAMP, +5 on ASDiv, and +7 on MultiArith).
pdf
bib
abs
Evaluating Instructively Generated Statement by Large Language Models for Directional Event Causality Identification
Wei Xiang
|
Chuanhong Zhan
|
Qing Zhang
|
Bang Wang
This paper aims to identify directional causal relations between events, including the existence and direction of causality. Previous studies mainly adopt the prompt learning paradigm to predict a causal answer word based on a Pre-trained Language Model (PLM) for causality existence identification. However, the indecision in selecting answer words from some synonyms and the confusion of indicating opposite causal directions with the same answer word raise more challenges in directional causality identification. Inspired by the strong capabilities of pre-trained Generative Language Models (GLMs) in generating responses or statements, we propose to instruct a GLM to generate causality statements and identify directional event causality by evaluating the generated statements. Specifically, we propose an Instructive Generation and Statement Evaluation method to identify both the existence and direction of causality. We first fine-tune a GLM to instructively generate causality statements based on event description inputs. Then, we evaluate the rationality of the generated statements to determine the existence and direction of event causalities. Experiments on the ESC and MAVEN datasets show that our method significantly outperforms state-of-the-art algorithms, even with less training data.
pdf
bib
abs
CoinMath: Harnessing the Power of Coding Instruction for Math LLM
Chengwei Wei
|
Bin Wang
|
Jung-jae Kim
|
Guimei Liu
|
Nancy F. Chen
Large Language Models (LLMs) have shown strong performance in solving mathematical problems, with code-based solutions proving particularly effective. However, the best practice to leverage coding instruction data to enhance mathematical reasoning remains underexplored. This study investigates three key questions: (1) How do different coding styles of mathematical code-based rationales impact LLMs’ learning performance? (2) Can general-domain coding instructions improve performance? (3) How does integrating textual rationales with code-based ones during training enhance mathematical reasoning abilities? Our findings reveal that code-based rationales with concise comments, descriptive naming, and hardcoded solutions are beneficial, while improvements from general-domain coding instructions and textual rationales are relatively minor. Based on these insights, we propose CoinMath, a learning strategy designed to enhance mathematical reasoning by diversifying the coding styles of code-based rationales. CoinMath generates a variety of code-based rationales incorporating concise comments, descriptive naming conventions, and hardcoded solutions. Experimental results demonstrate that CoinMath significantly outperforms its baseline model, MAmmoTH, one of the SOTA math LLMs.
pdf
bib
abs
Profiling News Media for Factuality and Bias Using LLMs and the Fact-Checking Methodology of Human Experts
Zain Muhammad Mujahid
|
Dilshod Azizov
|
Maha Tufail Agro
|
Preslav Nakov
In an age characterized by the proliferation of mis- and disinformation online, it is critical to empower readers to understand the content they are reading. Important efforts in this direction rely on manual or automatic fact-checking, which can be challenging for emerging claims with limited information. Such scenarios can be handled by assessing the reliability and the political bias of the source of the claim, i.e., characterizing entire news outlets rather than individual claims or articles. This is an important but understudied research direction. While prior work has looked into linguistic and social contexts, we do not analyze individual articles or information in social media. Instead, we propose a novel methodology that emulates the criteria that professional fact-checkers use to assess the factuality and political bias of an entire outlet. Specifically, we design a variety of prompts based on these criteria and elicit responses from large language models (LLMs), which we aggregate to make predictions. In addition to demonstrating sizable improvements over strong baselines via extensive experiments with multiple LLMs, we provide an in-depth error analysis of the effect of media popularity and region on model performance. Further, we conduct an ablation study to highlight the key components of our dataset that contribute to these improvements. To facilitate future research, we released our dataset and code.
pdf
bib
abs
Structured Discourse Representation for Factual Consistency Verification
Kun Zhang
|
Oana Balalau
|
Ioana Manolescu
Analysing the differences in how events are represented across texts, or verifying whether language model generations hallucinate, requires the ability to systematically compare their content. To support such comparison, a structured representation that captures fine-grained information plays a vital role. In particular, identifying distinct atomic facts and the discourse relations connecting them enables deeper semantic comparison. Our proposed approach combines structured discourse information extraction with a classifier, FDSpotter, for factual consistency verification. We show that adversarial discourse relations pose challenges for language models, but fine-tuning on our annotated data, DiscInfer, achieves competitive performance. Our proposed approach advances factual consistency verification by grounding it in linguistic structure and decomposing it into interpretable components. We demonstrate the effectiveness of our method on the evaluation of two tasks: data-to-text generation and text summarisation. Our code and dataset will be publicly available on GitHub.
pdf
bib
abs
SHARP: Unlocking Interactive Hallucination via Stance Transfer in Role-Playing LLMs
Chuyi Kong
|
Ziyang Luo
|
Hongzhan Lin
|
Zhiyuan Fan
|
Yaxin Fan
|
Yuxi Sun
|
Jing Ma
The advanced role-playing capabilities of Large Language Models (LLMs) have enabled rich interactive scenarios, yet existing research in social interactions neglects hallucination while struggling with poor generalizability and implicit character fidelity judgments. To bridge this gap, motivated by human behaviour, we introduce a generalizable and explicit paradigm for uncovering interactive patterns of LLMs across diverse worldviews. Specifically, we first define interactive hallucination through stance transfer, then construct SHARP, a benchmark built by extracting relations from commonsense knowledge graphs and utilizing LLMs’ inherent hallucination properties to simulate multi-role interactions. Extensive experiments confirm our paradigm’s effectiveness and stability, examine the factors that influence these metrics, and challenge conventional hallucination mitigation solutions. More broadly, our work reveals a fundamental limitation in popular post-training methods for role-playing LLMs: the tendency to obscure knowledge beneath style, resulting in monotonous yet human-like behaviors—interactive hallucination.
pdf
bib
abs
Understanding the Gap: an Analysis of Research Collaborations in NLP and Language Documentation
Luke Gessler
|
Alexis Palmer
|
Katharina Von Der Wense
Despite over 20 years of NLP work explicitly intended for application in language documentation (LD), practical use of this work remains vanishingly scarce. This issue has been noted and discussed over the past 10 years, but without the benefit of data to inform the discourse. To address this lack in the literature, we present a survey- and interview-based analysis of the lack of adoption of NLP in LD, focusing on the matter of collaborations between documentary linguists and NLP researchers. Our data show support for ideas from previous work but also reveal the importance of little-discussed factors such as misaligned professional incentives, technical knowledge burdens, and LD software.
pdf
bib
abs
PersonaBench: Evaluating AI Models on Understanding Personal Information through Accessing (Synthetic) Private User Data
Juntao Tan
|
Liangwei Yang
|
Zuxin Liu
|
Zhiwei Liu
|
Rithesh R N
|
Tulika Manoj Awalgaonkar
|
Jianguo Zhang
|
Weiran Yao
|
Ming Zhu
|
Shirley Kokane
|
Silvio Savarese
|
Huan Wang
|
Caiming Xiong
|
Shelby Heinecke
Personalization is essential for AI assistants, especially in private AI settings where models are expected to interpret users’ personal data (e.g., conversations, app usage) to understand their background, preferences, and social context. However, due to privacy concerns, existing academic research lacks direct access to such data, making benchmarking difficult. To fill this gap, we propose a synthetic data pipeline that generates realistic user profiles and private documents, enabling the creation of PersonaBench—a benchmark for evaluating models’ ability to understand personal information. Using this benchmark, we assess Retrieval-Augmented Generation (RAG) pipelines on personalized questions and find that current models struggle to accurately extract and answer questions even when provided with the full set of user documents, highlighting the need for improved personalization methods.
pdf
bib
abs
Leveraging Variation Theory in Counterfactual Data Augmentation for Optimized Active Learning
Simret A Gebreegziabher
|
Kuangshi Ai
|
Zheng Zhang
|
Elena Glassman
|
Toby Jia-Jun Li
Active Learning (AL) allows models to learn interactively from user feedback. However, annotating only existing samples may do little to improve the model’s generalization. Moreover, AL commonly faces a cold start problem due to insufficient annotated data for effective sample selection. To address this, we introduce a counterfactual data augmentation approach inspired by Variation Theory, a theory of human concept learning that emphasizes the essential features of a concept by focusing on what stays the same and what changes. We use a neuro-symbolic pipeline to pinpoint key conceptual dimensions and use a large language model (LLM) to generate targeted variations along those dimensions. Through a text classification experiment, we show that our approach achieves significantly higher performance when less annotated data is available, showing its capability to address the cold start problem in AL. We also find that as the annotated training data gets larger, the impact of the generated data starts to diminish. This work demonstrates the value of incorporating human learning theories into the design and optimization of AL.
pdf
bib
abs
ORBIT: Cost-Effective Dataset Curation for Large Language Model Domain Adaptation with an Astronomy Case Study
Eric Modesitt
|
Ke Yang
|
Spencer Hulsey
|
Xin Liu
|
ChengXiang Zhai
|
Volodymyr Kindratenko
Recent advances in language modeling demonstrate the need for high-quality domain-specific training data, especially for tasks that require specialized knowledge. General-purpose models, while versatile, often lack the depth needed for expert-level tasks because of limited domain-specific information. Domain adaptation training can enhance these models, but it demands substantial, high-quality data. To address this, we propose ORBIT, a cost-efficient methodology for curating massive, high-quality domain-specific datasets from noisy web sources, tailored for training specialist large language models. Using astronomy as a primary case study, we refined the 1.3T-token FineWeb-Edu dataset into a high-quality, 10B-token subset focused on astronomy. Fine-tuning LLaMA-3-8B on a 1B-token astronomy subset improved performance on the MMLU astronomy benchmark from 69% to 76% and achieved top results on AstroBench, an astronomy-specific benchmark. Moreover, our model (Orbit-LLaMA) outperformed LLaMA-3-8B-base, with GPT-4o evaluations preferring it in 73% of cases across 1000 astronomy-specific questions. Additionally, we validated ORBIT’s generalizability by applying it to law and medicine, achieving a significant improvement of data quality compared to an unfiltered baseline. We open-source the ORBIT methodology, including the curated datasets, the codebase, and the resulting model.
pdf
bib
abs
Serial Position Effects of Large Language Models
Xiaobo Guo
|
Soroush Vosoughi
We would like to express our gratitude to the Reviewers and the Area Chair for their insightful comments and for recognizing the robustness of our proposed framework for analyzing the serial position effects (SPE) in LLMs. We appreciate the acknowledgment of our work in demonstrating the widespread existence of this effect across various LLMs and the experiments we conducted to mitigate SPE. We acknowledge the concerns raised regarding the significance of the mitigation methods, including training-side solutions, CoT, and prompt engineering. The varying degrees of effectiveness observed in these methods highlight both the complexity and importance of addressing this cognitive bias. We believe these effects are inherently rooted in LLMs, and a comprehensive solution that fully addresses SPE may be beyond the scope of this work. However, we have proposed practical strategies, such as using binary choices instead of multiple choices where feasible, limiting prompt length, and placing crucial information at the beginning of prompts. These suggestions are intended to help users, particularly those who may not be experts in the domain of LLMs, to better utilize these models. We agree with the suggestion that a deeper analysis of the relationship between task characteristics and SPE could enhance the manuscript. As it stands, our findings indicate that higher model accuracy tends to correlate with a reduction in SPE, which aligns with expectations: if a model achieves 100% accuracy, it is unlikely to be influenced by SPE. Beyond this, we did not observe any clear relationships, which suggests that SPE may be influenced by a combination of factors, including the specific task, the model used, and the nature of the prompts. We will clarify this point in the final version of the manuscript.
pdf
bib
abs
scRAG: Hybrid Retrieval-Augmented Generation for LLM-based Cross-Tissue Single-Cell Annotation
Zhiyin Yu
|
Chao Zheng
|
Chong Chen
|
Xian-Sheng Hua
|
Xiao Luo
In recent years, large language models (LLMs) such as GPT-4 have demonstrated impressive potential in a wide range of fields, including biology, genomics and healthcare. Numerous studies have attempted to apply pre-trained LLMs to single-cell data analysis within one tissue. However, when it comes to cross-tissue cell annotation, LLMs often suffer from unsatisfactory performance due to the lack of specialized biological knowledge regarding genes and tissues. In this paper, we introduce scRAG, a novel framework that incorporates advanced LLM-based RAG techniques into cross-tissue single-cell annotation. scRAG utilizes LLMs to retrieve structured triples from knowledge graphs and unstructured similar cell information from the reference cell database, and it generates candidate cell types. The framework further optimizes predictions by retrieving marker genes from both candidate cells and similar cells to refine its results. Extensive experiments on a cross-tissue dataset demonstrate that our scRAG framework outperforms various baselines, including generalist models, domain-specific methods, and trained classifiers. The source code is available at https://github.com/YuZhiyin/scRAG.
pdf
bib
abs
Can Large Language Models Address Open-Target Stance Detection?
Abu Ubaida Akash
|
Ahmed Fahmy
|
Amine Trabelsi
Stance detection (SD) identifies a text’s position towards a target, typically labeled as favor, against, or none. We introduce Open-Target Stance Detection (OTSD), the most realistic task where targets are neither seen during training nor provided as input. We evaluate Large Language Models (LLMs) from GPT, Gemini, Llama, and Mistral families, comparing their performance to the only existing work, Target-Stance Extraction (TSE), which benefits from predefined targets. Unlike TSE, OTSD removes the dependency on a predefined list, making target generation and evaluation more challenging. We also provide a metric for evaluating target quality that correlates well with human judgment. Our experiments reveal that LLMs outperform TSE in target generation, both when the real target is explicitly mentioned in the text and when it is not. Similarly, LLMs overall surpass TSE in stance detection for both explicit and non-explicit cases. However, LLMs struggle in both target generation and stance detection when the target is not explicit.
pdf
bib
abs
Improve Language Model and Brain Alignment via Associative Memory
Congchi Yin
|
Yongpeng Zhang
|
Xuyun Wen
|
Piji Li
Associative memory engages in the integration of relevant information for comprehension in the human cognition system. In this work, we seek to improve alignment between language models and the human brain while processing speech information by integrating associative memory. After verifying the alignment between language model and brain by mapping language model activations to brain activity, the original text stimuli expanded with simulated associative memory are regarded as input to computational language models. We find that the alignment between language model and brain is improved in brain regions closely related to associative memory processing. We also demonstrate that large language models, after specific supervised fine-tuning, better align with brain response, by building the Association dataset containing 1000 samples of stories, with instructions encouraging associative memory as input and associated content as output.
pdf
bib
abs
Towards Reliable Large Audio Language Model
Ziyang Ma
|
Xiquan Li
|
Yakun Song
|
Wenxi Chen
|
Chenpeng Du
|
Jian Wu
|
Yuanzhe Chen
|
Zhuo Chen
|
Yuping Wang
|
Yuxuan Wang
|
Xie Chen
Recent advancements in large audio language models (LALMs) have demonstrated impressive results and promising prospects in universal understanding and reasoning across speech, music, and general sound. However, these models still lack the ability to recognize their knowledge boundaries and proactively refuse to answer questions they cannot answer. While there have been successful attempts to enhance the reliability of LLMs, reliable LALMs remain largely unexplored. In this paper, we systematically investigate various approaches towards reliable LALMs, including training-free methods such as multi-modal chain-of-thought (MCoT), and training-based methods such as supervised fine-tuning (SFT). Besides, we identify the limitations of previous evaluation metrics and propose a new metric, the Reliability Gain Index (RGI), to assess the effectiveness of different reliability methods. Our findings suggest that both training-free and training-based methods enhance the reliability of LALMs to different extents. Moreover, we find that awareness of reliability is a “meta ability”, which can be transferred across different audio modalities, although significant structural and content differences exist among sound, music, and speech.
pdf
bib
abs
Large Vocabulary Size Improves Large Language Models
Sho Takase
|
Ryokan Ri
|
Shun Kiyono
|
Takuya Kato
This paper empirically investigates the relationship between subword vocabulary size and the performance of large language models (LLMs) to provide insights on how to define the vocabulary size. Experimental results show that larger vocabulary sizes lead to better performance in LLMs. Moreover, we consider a continual training scenario where a pre-trained language model is trained on a different target language. We introduce a simple method to use a new vocabulary instead of the pre-defined one. We show that using the new vocabulary outperforms the model with the vocabulary used in pre-training.
pdf
bib
abs
MUSE: A Multimodal Conversational Recommendation Dataset with Scenario-Grounded User Profiles
Zihan Wang
|
Xiaocui Yang
|
YongKang Liu
|
Shi Feng
|
Daling Wang
|
Yifei Zhang
Current conversational recommendation systems focus predominantly on text. However, real-world recommendation settings are generally multimodal, causing a significant gap between existing research and practical applications. To address this issue, we propose Muse, the first multimodal conversational recommendation dataset. Muse comprises 83,148 utterances from 7,000 conversations centered around the Clothing domain. Each conversation contains comprehensive multimodal interactions, rich elements, and natural dialogues. Data in Muse are automatically synthesized by a multi-agent framework powered by multimodal large language models (MLLMs). It innovatively derives user profiles from real-world scenarios rather than depending on manual design and history data for better scalability, and then it fulfills conversation simulation and optimization. Both human and LLM evaluations demonstrate the high quality of conversations in Muse. Additionally, fine-tuning experiments on three MLLMs demonstrate Muse’s learnable patterns for recommendations and responses, confirming its value for multimodal conversational recommendation. Our dataset and code are available at https://anonymous.4open.science/r/Muse-0086.
pdf
bib
abs
Machine Translation Models are Zero-Shot Detectors of Translation Direction
Michelle Wastl
|
Jannis Vamvas
|
Rico Sennrich
Detecting the translation direction of parallel text has applications for machine translation training and evaluation, but also has forensic applications, such as resolving plagiarism or forgery allegations. In this work, we explore an unsupervised approach to translation direction detection based on the simple hypothesis that p(translation|original)>p(original|translation), motivated by the well-known simplification effect in translationese or machine-translationese. In experiments with multilingual machine translation models across 20 translation directions, we confirm the effectiveness of the approach for high-resource language pairs, achieving document-level accuracies of 82–96% for NMT-produced translations, and 60–81% for human translations, depending on the model used.
pdf
bib
abs
Do Robot Snakes Dream like Electric Sheep? Investigating the Effects of Architectural Inductive Biases on Hallucination
Jerry Huang
|
Prasanna Parthasarathi
|
Mehdi Rezagholizadeh
|
Boxing Chen
|
Sarath Chandar
The growth in prominence of large language models (LLMs) in everyday life can be largely attributed to their generative abilities, yet some of this is also owed to the risks and costs associated with their use. On one front is their tendency to hallucinate false or misleading information, limiting their reliability. On another is the increasing focus on the computational limitations associated with traditional self-attention based LLMs, which has brought about new alternatives, in particular recurrent models, meant to overcome them. Yet it remains uncommon to consider these two concerns simultaneously. Do changes in architecture exacerbate/alleviate existing concerns about hallucinations? Do they affect how and where they occur? Through an extensive evaluation, we study how these architecture-based inductive biases affect the propensity to hallucinate. While hallucination remains a general phenomenon not limited to specific architectures, the situations in which hallucinations occur and the ease with which specific types of hallucinations can be induced can significantly differ based on the model architecture. These findings highlight the need for better understanding both of these problems in conjunction with each other, as well as for considering how to design more universal techniques for handling hallucinations.
pdf
bib
abs
GenTool: Enhancing Tool Generalization in Language Models through Zero-to-One and Weak-to-Strong Simulation
Jie He
|
Jennifer Neville
|
Mengting Wan
|
Longqi Yang
|
Hui Liu
|
Xiaofeng Xu
|
Xia Song
|
Jeff Z. Pan
|
Pei Zhou
Large Language Models (LLMs) can enhance their capabilities as AI assistants by integrating external tools, allowing them to access a wider range of information. While recent LLMs are typically fine-tuned with tool usage examples during supervised fine-tuning (SFT), questions remain about their ability to develop robust tool-usage skills and effectively generalize to unseen queries and tools. In this work, we present GenTool, a novel training framework that prepares LLMs for diverse generalization challenges in tool utilization. Our approach addresses two fundamental dimensions critical for real-world applications: Zero-to-One Generalization, enabling the model to address queries initially lacking a suitable tool by adopting and utilizing one when it becomes available, and Weak-to-Strong Generalization, allowing models to leverage enhanced versions of existing tools to solve queries. To achieve this, we develop synthetic training data simulating these two dimensions of tool usage and introduce a two-stage fine-tuning approach: optimizing tool ranking, then refining tool selection. Through extensive experiments across four generalization scenarios, we demonstrate that our method significantly enhances the tool-usage capabilities of LLMs ranging from 1B to 8B parameters, achieving performance that surpasses GPT-4o. Furthermore, our analysis provides valuable insights into the challenges LLMs encounter in tool generalization.
pdf
bib
abs
SWE-Fixer: Training Open-Source LLMs for Effective and Efficient GitHub Issue Resolution
Chengxing Xie
|
Bowen Li
|
Chang Gao
|
He Du
|
Wai Lam
|
Difan Zou
|
Kai Chen
Large Language Models (LLMs) have demonstrated remarkable proficiency across a variety of complex tasks. One significant application of LLMs is in tackling software engineering challenges, particularly in resolving real-world tasks on GitHub by fixing code based on the issues reported by the users. However, many current approaches rely on proprietary LLMs, which limits reproducibility, accessibility, and transparency. The critical components of LLMs for addressing software engineering issues and how their capabilities can be effectively enhanced remain unclear. To address these challenges, we introduce SWE-Fixer, a novel open-source framework designed to effectively and efficiently resolve GitHub issues. SWE-Fixer comprises two essential modules: a code file retrieval module and a code editing module. The retrieval module employs BM25 along with a lightweight model to achieve coarse-to-fine file retrieval. Subsequently, the code editing module utilizes the other model to generate patches for the identified files. To mitigate the lack of publicly available datasets, we compile an extensive dataset that includes 110K GitHub issues along with their corresponding patches and train the two models of SWE-Fixer separately. We assess our approach on the SWE-Bench Lite and Verified benchmarks, achieving competitive performance among open-source models with scores of 22.0% and 30.2%. Furthermore, SWE-Fixer reaches state-of-the-art performance (24.7% on Lite and 32.8% on Verified) with PASS_TO_PASS (P2P) filtering. Additionally, our approach requires only two model calls per instance, making it significantly more efficient than existing methods. These results highlight the effectiveness of SWE-Fixer in real-world code-fixing scenarios. We will make our model, dataset, and code publicly available at https://github.com/InternLM/SWE-Fixer.
pdf
bib
abs
GlyphPattern: An Abstract Pattern Recognition for Vision-Language Models
Zixuan Wu
|
Yoolim Kim
|
Carolyn Jane Anderson
Vision-Language Models (VLMs) have made rapid progress in reasoning across visual and textual data. While VLMs perform well on vision tasks that they are trained on, our results highlight key challenges in abstract pattern recognition. We present GlyphPattern, a 954-item dataset that pairs 318 human-written descriptions of visual patterns from 40 writing systems with three visual presentation styles. GlyphPattern evaluates abstract pattern recognition in VLMs, requiring models to understand and judge natural language descriptions of visual patterns. GlyphPattern patterns are drawn from a large-scale cognitive science investigation of human writing systems; as a result, they are rich in spatial reference and compositionality. Our experiments show that GlyphPattern is challenging for state-of-the-art VLMs (GPT-4o achieves only 55% accuracy), with marginal gains from few-shot prompting. Our detailed analysis reveals errors at multiple levels, including visual processing, natural language understanding, and pattern generalization.
pdf
bib
abs
FitCF: A Framework for Automatic Feature Importance-guided Counterfactual Example Generation
Qianli Wang
|
Nils Feldhus
|
Simon Ostermann
|
Luis Felipe Villa-Arenas
|
Sebastian Möller
|
Vera Schmitt
Counterfactual examples are widely used in natural language processing (NLP) as valuable data to improve models, and in explainable artificial intelligence (XAI) to understand model behavior. The automated generation of counterfactual examples remains a challenging task even for large language models (LLMs), despite their impressive performance on many tasks. In this paper, we first introduce ZeroCF, a faithful approach for leveraging important words derived from feature attribution methods to generate counterfactual examples in a zero-shot setting. Second, we present a new framework, FitCF, which further verifies the aforementioned counterfactuals by label flip verification and then inserts them as demonstrations for few-shot prompting, outperforming three state-of-the-art baselines. Through ablation studies, we identify the importance of each of FitCF’s core components in improving the quality of counterfactuals, as assessed through flip rate, perplexity, and similarity measures. Furthermore, we show the effectiveness of LIME and Integrated Gradients as backbone attribution methods for FitCF and find that the number of demonstrations has the largest effect on performance. Finally, we reveal a strong correlation between the faithfulness of feature attribution scores and the quality of generated counterfactuals, which we hope will serve as an important finding for future research in this direction.
pdf
bib
abs
From Misleading Queries to Accurate Answers: A Three-Stage Fine-Tuning Method for LLMs
Guocong Li
|
Weize Liu
|
Yihang Wu
|
Ping Wang
|
Shuaihan Huang
|
Hongxia Xu
|
Jian Wu
Large language models (LLMs) exhibit excellent performance in natural language processing (NLP), but remain highly sensitive to the quality of input queries, especially when these queries contain misleading or inaccurate information. Existing methods focus on correcting the output, but they often overlook the potential of improving the ability of LLMs to detect and correct misleading content in the input itself. In this paper, we propose a novel three-stage fine-tuning method that enhances the ability of LLMs to detect and correct misleading information in the input, further improving response accuracy and reducing hallucinations. Specifically, the three stages include (1) training LLMs to identify misleading information, (2) training LLMs to correct the misleading information using built-in or external knowledge, and (3) training LLMs to generate accurate answers based on the corrected queries. To evaluate our method, we conducted experiments on three datasets for the hallucination detection task and the question answering (QA) task, as well as two datasets containing misleading information that we constructed. The experimental results demonstrate that our method significantly improves the accuracy and factuality of LLM responses, while also enhancing the ability to detect hallucinations and reducing the generation of hallucinations in the output, particularly when the query contains misleading information.
pdf
bib
abs
Separate the Wheat from the Chaff: A Post-Hoc Approach to Safety Re-Alignment for Fine-Tuned Language Models
Di Wu
|
Xin Lu
|
Yanyan Zhao
|
Bing Qin
Although large language models (LLMs) achieve effective safety alignment at the time of release, they still face various safety challenges. A key issue is that fine-tuning often compromises the safety alignment of LLMs. To address this issue, we propose a method named IRR (Identify, Remove, and Recalibrate for Safety Realignment) that performs safety realignment for LLMs. The core of IRR is to identify and remove unsafe delta parameters from the fine-tuned models, while recalibrating the retained parameters. We evaluate the effectiveness of IRR across various datasets, including both full fine-tuning and LoRA methods. Our results demonstrate that IRR significantly enhances the safety performance of fine-tuned models on safety benchmarks, such as harmful queries and jailbreak attacks, while maintaining their performance on downstream tasks. The source code is available at: https://github.com/pikepokenew/IRR.
pdf
bib
abs
Nuclear Deployed!: Analyzing Catastrophic Risks in Decision-making of Autonomous LLM Agents
Rongwu Xu
|
Xiaojian Li
|
Shuo Chen
|
Wei Xu
Large language models (LLMs) are evolving into autonomous decision-makers, raising concerns about catastrophic risks in high-stakes scenarios, particularly in Chemical, Biological, Radiological and Nuclear (CBRN) domains. Based on the insight that such risks can originate from trade-offs between the agent’s Helpful, Harmless, and Honest (HHH) goals, we build a novel three-stage evaluation framework, which is carefully constructed to effectively and naturally expose such risks. We conduct 14,400 agentic simulations across 12 advanced LLMs, with extensive experiments and analysis. Results reveal that LLM agents can autonomously engage in catastrophic behaviors and deception, without being deliberately induced. Furthermore, stronger reasoning abilities often increase, rather than mitigate, these risks. We also show that these agents can violate instructions and superior commands. On the whole, we empirically prove the existence of catastrophic risks in autonomous LLM agents.
pdf
bib
abs
MoRE: A Mixture of Low-Rank Experts for Adaptive Multi-Task Learning
Dacao Zhang
|
Kun Zhang
|
Shimao Chu
|
Le Wu
|
Xin Li
|
Si Wei
With the rapid development of Large Language Models (LLMs), Parameter-Efficient Fine-Tuning (PEFT) methods, which aim to achieve efficient fine-tuning of LLMs with fewer parameters, have gained significant attention. As a representative PEFT method, Low-Rank Adaptation (LoRA) introduces low-rank matrices to approximate the incremental tuning parameters and achieves impressive performance over multiple scenarios. Since then, many improvements to LoRA have been proposed. However, these methods either focus on single-task scenarios or separately train multiple LoRA modules for multi-task scenarios, limiting the efficiency and effectiveness of LoRA in multi-task scenarios. To better adapt to multi-task fine-tuning, in this paper, we propose a novel Mixture of Low-Rank Experts (MoRE) for multi-task PEFT. Specifically, instead of using an individual LoRA for each task, we align different ranks of the LoRA module with different tasks, which we name low-rank experts. Moreover, we design a novel adaptive rank selector to select the appropriate expert for each task. By jointly training low-rank experts, MoRE can enhance the adaptability and efficiency of LoRA in multi-task scenarios. Finally, we conduct extensive experiments over multiple multi-task benchmarks along with different LLMs to verify model performance. Experimental results demonstrate that compared to traditional LoRA and its variants, MoRE significantly improves the performance of LLMs in multi-task scenarios and incurs no additional inference cost. We also release the model and code to facilitate the community.
pdf
bib
abs
Lunar Twins: We Choose to Go to the Moon with Large Language Models
Xin-Yu Xiao
|
Yalei Liu
|
Xiangyu Liu
|
Zengrui Li
|
Erwei Yin
|
Qianchen Xia
In recent years, the rapid advancement of large language models (LLMs) has significantly reshaped the landscape of scientific research. While LLMs have achieved notable success across various domains, their application in specialized fields such as lunar exploration remains underdeveloped, and their potential in this domain has yet to be fully realized. To address this gap, we introduce Lunar Twins, the first LLMs designed specifically for lunar exploration, along with a collaborative framework that combines both large and small models. Additionally, we present Lunar GenData, a multi-agent collaborative workflow for generating lunar instructions, and establish the first specialized lunar dataset, which integrates real data from the Chang’e lunar missions. Lastly, we develop Lunar Eval, the first comprehensive evaluation suite for assessing the capabilities of LLMs in lunar exploration tasks. Experimental validation demonstrates that our approach not only enhances domain expertise in lunar exploration but also reveals preliminary indications of embodied intelligence potential.
pdf
bib
abs
SPHERE: An Evaluation Card for Human-AI Systems
Dora Zhao
|
Qianou Ma
|
Xinran Zhao
|
Chenglei Si
|
Chenyang Yang
|
Ryan Louie
|
Ehud Reiter
|
Diyi Yang
|
Tongshuang Wu
In the era of Large Language Models (LLMs), establishing effective evaluation methods and standards for diverse human-AI interaction systems is increasingly challenging. To encourage more transparent documentation and facilitate discussion on human-AI system evaluation design options, we present an evaluation card SPHERE, which encompasses five key dimensions: 1) What is being evaluated?; 2) How is the evaluation conducted?; 3) Who is participating in the evaluation?; 4) When is evaluation conducted?; 5) How is evaluation validated? We conduct a review of 39 human-AI systems using SPHERE, outlining current evaluation practices and areas for improvement. We provide three recommendations for improving the validity and rigor of evaluation practices.
pdf
bib
abs
Data-Centric Improvements for Enhancing Multi-Modal Understanding in Spoken Conversation Modeling
Maximillian Chen
|
Ruoxi Sun
|
Sercan O Arik
Conversational assistants are increasingly popular across diverse real-world applications, highlighting the need for advanced multimodal speech modeling. Speech, as a natural mode of communication, encodes rich user-specific characteristics such as speaking rate and pitch, making it critical for effective interaction. Our work introduces a data-centric customization approach for efficiently enhancing multimodal understanding in conversational speech modeling. Central to our contributions is a novel multi-task learning paradigm that involves designing auxiliary tasks to utilize a small amount of speech data. Our approach achieves state-of-the-art performance on the Spoken-SQuAD benchmark, using only 10% of the training data with open-weight models, establishing a robust and efficient framework for audio-centric conversational modeling. We also introduce ASK-QA, the first dataset for multi-turn spoken dialogue with ambiguous user requests and dynamic evaluation inputs.
pdf
bib
abs
Question-Aware Knowledge Graph Prompting for Enhancing Large Language Models
Haochen Liu
|
Song Wang
|
Chen Chen
|
Jundong Li
Large Language Models (LLMs) often struggle with tasks requiring external knowledge, such as knowledge-intensive Multiple Choice Question Answering (MCQA). Integrating Knowledge Graphs (KGs) can enhance reasoning; however, existing methods typically demand costly fine-tuning or retrieve noisy KG information. Recent approaches leverage Graph Neural Networks (GNNs) to generate KG-based input embedding prefixes as soft prompts for LLMs but fail to account for question relevance, resulting in noisy prompts. Moreover, in MCQA tasks, the absence of relevant KG knowledge for certain answer options remains a significant challenge. To address these issues, we propose Question-Aware Knowledge Graph Prompting (QAP), which incorporates question embeddings into GNN aggregation to dynamically assess KG relevance. QAP employs global attention to capture inter-option relationships, enriching soft prompts with inferred knowledge. Experimental results demonstrate that QAP outperforms state-of-the-art methods across multiple datasets, highlighting its effectiveness.
pdf
bib
abs
UQ-Merge: Uncertainty Guided Multimodal Large Language Model Merging
Huaizhi Qu
|
Xinyu Zhao
|
Jie Peng
|
Kwonjoon Lee
|
Behzad Dariush
|
Tianlong Chen
Multimodal Large Language Models (MLLMs) have gained increasing popularity as a promising framework for leveraging the strong language reasoning capabilities in the vision-language domain. Given a wide range of MLLMs, model merging potentially offers a cheap way to aggregate their diverse knowledge into a single MLLM. However, directly plugging in existing model merging approaches often leads to suboptimal performance due to (1) the inclusion of harmful models that have over-confident predictions in the target task; (2) the lack of specialized designs for vision-language inputs. To tackle these pain points, we conduct pioneering investigations to dissect the merging procedures and propose an uncertainty-guided MLLM merging algorithm, i.e., UQ-Merge, which i) identifies beneficial candidates for merging, ii) determines the merging order and the number of helpful candidates, and iii) performs appropriate merging. Within our framework, we consider uncertainty quantification on both text and vision inputs to examine the MLLM prediction confidence, and then decide whether and when an MLLM needs to be included. It is worth mentioning that our vision-language uncertainty quantification does not require access to sample labels, making it more practical in various scenarios. Extensive experiments consistently demonstrate the superior MLLM merging performance of UQ-Merge in both held-in and held-out vision-language benchmarks. For example, compared to existing state-of-the-art merging methods, UQ-Merge brings substantial performance improvements of up to 44.3% on average accuracy in 12 datasets. Codes are available at https://anonymous.4open.science/r/UQ-Merge-7CD7.
pdf
bib
abs
AQuAECHR: Attributed Question Answering for European Court of Human Rights
Korbinian Q. Weidinger
|
Santosh T.y.s.s
|
Oana Ichim
|
Matthias Grabmair
LLMs have become prevalent tools for information seeking across various fields, including law. However, their generated responses often suffer from hallucinations, which can mislead experts and propagate societal harms, hindering their widespread adoption in high-stakes domains such as law. To enhance trustworthiness in these systems, one promising approach is to attribute the answer to an actual source, thereby improving the factuality and verifiability of the response. In pursuit of advancing attributed legal question answering, we introduce AQuAECHR, a benchmark comprising information-seeking questions from ECHR jurisprudence along with attributions to relevant judgments. We present strategies to automatically curate this dataset from ECHR case law guides and utilize an LLM-based filtering pipeline to improve dataset quality, as validated by legal experts. Additionally, we assess several LLMs, including those trained on legal corpora, on this dataset to underscore significant challenges with the current models and strategies dealing with attributed QA, both quantitatively and qualitatively.
pdf
bib
abs
Leveraging Unit Language Guidance to Advance Speech Modeling in Textless Speech-to-Speech Translation
Yuhao Zhang
|
Xiangnan Ma
|
Kaiqi Kou
|
Peizhuo Liu
|
Weiqiao Shan
|
Benyou Wang
|
Tong Xiao
|
Yuxin Huang
|
Zhengtao Yu
|
JingBo Zhu
The success of building textless speech-to-speech translation (S2ST) models has attracted much attention. However, S2ST still faces two main challenges: 1) extracting linguistic features for various speech signals, called cross-modal (CM), and 2) learning alignment of different languages in long sequences, called cross-lingual (CL). We propose the unit language to overcome the two modeling challenges. The unit language can be considered a text-like representation format, constructed using n-gram language modeling. We implement multi-task learning to utilize the unit language in guiding the speech modeling process. Our initial results reveal a conflict when applying source and target unit languages simultaneously. We propose task prompt modeling to mitigate this conflict. We conduct experiments on four languages of the VoxPopuli dataset. Our method demonstrates significant improvements over a strong baseline and achieves performance comparable to models trained with text.
pdf
bib
abs
Ponder & Press: Advancing Visual GUI Agent towards General Computer Control
Yiqin Wang
|
Haoji Zhang
|
Jingqi Tian
|
Yansong Tang
Most existing GUI agents depend on non-vision inputs like HTML source code or accessibility trees, limiting flexibility across diverse software environments and platforms. Current multimodal large language models (MLLMs), though they excel at using vision to ground real-world objects, often struggle with accurately localizing GUI elements – a critical requirement for effective GUI automation – due to the semantic gap between real-world objects and GUI elements. In this work, we introduce Ponder & Press, a divide-and-conquer framework for general computer control that uses only visual input. Our approach combines a general-purpose MLLM as an ‘interpreter’, responsible for translating high-level user instructions into detailed action descriptions, with a GUI-specific MLLM as a ‘locator’ that precisely locates GUI elements for action placement. By leveraging a purely visual input, our agent offers a versatile, human-like interaction paradigm applicable to various applications. The Ponder & Press locator outperforms existing models by +22.5% on the ScreenSpot GUI grounding benchmark. More offline and interactive agent benchmarks across various GUI environments – including web pages, desktop software, and mobile UIs – demonstrate that the Ponder & Press framework achieves state-of-the-art performance, highlighting the potential of visual GUI agents.
pdf
bib
abs
LogicGame: Benchmarking Rule-Based Reasoning Abilities of Large Language Models
Jiayi Gui
|
Yiming Liu
|
Jiale Cheng
|
Xiaotao Gu
|
Xiao Liu
|
Hongning Wang
|
Yuxiao Dong
|
Jie Tang
|
Minlie Huang
Large Language Models (LLMs) have demonstrated notable capabilities across various tasks, showcasing complex problem-solving abilities. Understanding and executing complex rules, along with multi-step planning, are fundamental to logical reasoning and critical for practical LLM agents and decision-making systems. However, evaluating LLMs as effective rule-based executors and planners remains underexplored. In this paper, we introduce LogicGame, a novel benchmark designed to evaluate the comprehensive rule understanding, execution, and planning capabilities of LLMs. Unlike traditional benchmarks, LogicGame provides diverse games that contain a series of rules with an initial state, requiring models to comprehend and apply predefined regulations to solve problems. We create simulated scenarios in which models execute or plan operations to achieve specific outcomes. These game scenarios are specifically designed to distinguish logical reasoning from mere knowledge by relying exclusively on predefined rules. This separation allows for a pure assessment of rule-based reasoning capabilities. The evaluation considers not only final outcomes but also intermediate steps, providing a comprehensive assessment of model performance. Moreover, these intermediate steps are deterministic and can be automatically verified. LogicGame defines game scenarios with varying difficulty levels, from simple rule applications to complex reasoning chains, in order to offer a precise evaluation of model performance on rule understanding and multi-step execution. Utilizing LogicGame, we test various LLMs and identify notable shortcomings in their rule-based logical reasoning abilities.
pdf
bib
abs
LLM-Based Multi-Agent Systems are Scalable Graph Generative Models
Jiarui Ji
|
Runlin Lei
|
Jialing Bi
|
Zhewei Wei
|
Xu Chen
|
Yankai Lin
|
Xuchen Pan
|
Yaliang Li
|
Bolin Ding
The structural properties of naturally arising social graphs are extensively studied to understand their evolution. Prior approaches for modeling network dynamics typically rely on rule-based models, which lack realism and generalizability, or deep learning-based models, which require large-scale training datasets. As abstract graph representations of entity-wise interactions, social graphs present an opportunity to explore network evolution mechanisms through realistic simulations of human-item interactions. Leveraging the pre-trained social consensus knowledge embedded in large language models (LLMs), we present GraphAgent-Generator (GAG), a novel simulation-based framework for dynamic, text-attributed social graph generation. GAG simulates the temporal node and edge generation processes for zero-shot social graph generation. The resulting graphs adhere to seven key macroscopic network properties, achieving an 11% improvement in microscopic graph structure metrics. Through the node classification benchmarking task, we validate that GAG effectively captures the intricate text-structure correlations in graph generation. Furthermore, GAG supports generating graphs with up to nearly 100,000 nodes or 10 million edges through large-scale LLM-based agent simulation with parallel acceleration, achieving a minimum speed-up of 90.4%. The source code is available at https://github.com/Ji-Cather/GraphAgent.
pdf
bib
abs
AD-LLM: Benchmarking Large Language Models for Anomaly Detection
Tiankai Yang
|
Yi Nian
|
Li Li
|
Ruiyao Xu
|
Yuangang Li
|
Jiaqi Li
|
Zhuo Xiao
|
Xiyang Hu
|
Ryan A. Rossi
|
Kaize Ding
|
Xia Hu
|
Yue Zhao
Anomaly detection (AD) is an important machine learning task with many real-world uses, including fraud detection, medical diagnosis, and industrial monitoring. Within natural language processing (NLP), AD helps detect issues like spam, misinformation, and unusual user activity. Although large language models (LLMs) have had a strong impact on tasks such as text generation and summarization, their potential in AD has not been studied enough. This paper introduces AD-LLM, the first benchmark that evaluates how LLMs can help with NLP anomaly detection. We examine three key tasks: (i) zero-shot detection, using LLMs’ pre-trained knowledge to perform AD without task-specific training; (ii) data augmentation, generating synthetic data and category descriptions to improve AD models; and (iii) model selection, using LLMs to suggest unsupervised AD models. Through experiments with different datasets, we find that LLMs can work well in zero-shot AD, that carefully designed augmentation methods are useful, and that explaining model selection for specific datasets remains challenging. Based on these results, we outline six future research directions on LLMs for AD.
pdf
bib
abs
RTADev: Intention Aligned Multi-Agent Framework for Software Development
Jie Liu
|
Guohua Wang
|
Ronghui Yang
|
Jiajie Zeng
|
Mengchen Zhao
|
Yi Cai
LLM-based multi-agent frameworks have shown great potential in solving real-world software development tasks, where the agents of different roles can communicate much more efficiently than humans. Despite their efficiency, LLM-based agents can hardly fully understand each other, which frequently causes errors during the development process. Moreover, the accumulation of errors could easily lead to the failure of the whole project. In order to reduce such errors, we introduce an intention-aligned multi-agent framework, RTADev, which utilizes a self-correction mechanism to ensure that all agents work based on a consensus. RTADev mimics human teams where individuals are free to start meetings anytime to reach agreement. Specifically, RTADev integrates an alignment checking phase and a conditional ad hoc group review phase, so that errors can be effectively reduced with minimal agent communication. Our experiments on various software development tasks show that RTADev significantly improves the quality of generated software code in terms of executability and structural and functional completeness. The code of our project is available at https://github.com/codeagent-rl/RTADev.
pdf
bib
abs
TACO-RL: Task Aware Prompt Compression Optimization with Reinforcement Learning
Shivam Shandilya
|
Menglin Xia
|
Supriyo Ghosh
|
Huiqiang Jiang
|
Jue Zhang
|
Qianhui Wu
|
Victor Rühle
|
Saravan Rajmohan
The increasing prevalence of large language models (LLMs) such as GPT-4 in various applications has led to a surge in the size of prompts required for optimal performance, leading to challenges in computational efficiency. Prompt compression aims to reduce the inference cost by minimizing input tokens without compromising on the task performance. However, existing prompt compression techniques either rely on sub-optimal metrics such as information entropy or model it as a task-agnostic token classification problem that fails to capture task-specific information. To address these issues, we propose a novel and efficient reinforcement learning (RL) based task-aware prompt compression method. To meet low latency requirements, we leverage an existing Transformer encoder-based token classification model while guiding the learning process with task-specific reward signals using the lightweight REINFORCE algorithm. We evaluate the performance of our method on three diverse and challenging tasks including text summarization, question answering and code summarization. We demonstrate that our RL-guided compression method improves the task performance by 8% to 189% across these three scenarios over state-of-the-art compression techniques while satisfying the same compression rate and latency requirements.
pdf
bib
abs
A Character-Centric Creative Story Generation via Imagination
Kyeongman Park
|
Minbeom Kim
|
Kyomin Jung
Creative story generation has long been a goal of NLP research. While existing methodologies have aimed to generate long and coherent stories, they fall significantly short of human capabilities in terms of diversity and character depth. To address this, we introduce a novel story generation framework called CCI (Character-centric Creative story generation via Imagination). CCI features two modules for creative story generation: IG (Image-Guided Imagination) and MW (Multi-Writer model). In the IG module, we utilize a text-to-image model to create visual representations of key story elements, such as characters, backgrounds, and main plots, in a more novel and concrete manner than text-only approaches. The MW module uses these story elements to generate multiple persona-description candidates and selects the best one to insert into the story, thereby enhancing the richness and depth of the narrative. We compared the stories generated by CCI and baseline models through statistical analysis, as well as human and LLM evaluations. The results showed that the IG and MW modules significantly improve various aspects of the stories’ creativity. Furthermore, our framework enables interactive multi-modal story generation with users, opening up new possibilities for human-LLM integration in cultural development. Project page: https://www.2024cci.p-e.kr/
pdf
bib
abs
Proverbs Run in Pairs: Evaluating Proverb Translation Capability of Large Language Model
Minghan Wang
|
Viet Thanh Pham
|
Farhad Moghimifar
|
Thuy-Trang Vu
Despite achieving remarkable performance, machine translation (MT) research remains underexplored in terms of translating cultural elements in languages, such as idioms, proverbs, and colloquial expressions. This paper investigates the capability of state-of-the-art neural machine translation (NMT) and large language models (LLMs) in translating proverbs, which are deeply rooted in cultural contexts. We construct a translation dataset of standalone proverbs and proverbs in conversation for four language pairs. Our experiments show that the studied models can achieve good translation between languages with similar cultural backgrounds, and LLMs generally outperform NMT models in proverb translation. Furthermore, we find that current automatic evaluation metrics such as BLEU, CHRF++ and COMET are inadequate for reliably assessing the quality of proverb translation, highlighting the need for more culturally aware evaluation metrics.
pdf
bib
abs
Towards Efficient LLM Grounding for Embodied Multi-Agent Collaboration
Yang Zhang
|
Shixin Yang
|
Chenjia Bai
|
Fei Wu
|
Xiu Li
|
Zhen Wang
|
Xuelong Li
Grounding the reasoning ability of large language models (LLMs) for embodied tasks is challenging due to the complexity of the physical world. In particular, LLM planning for multi-agent collaboration requires communication among agents or credit assignment as feedback to re-adjust the proposed plans and achieve effective coordination. However, existing methods that overly rely on physical verification or self-reflection suffer from excessive and inefficient querying of LLMs. In this paper, we propose a novel framework for multi-agent collaboration that introduces Reinforced Advantage feedback (ReAd) for efficient self-refinement of plans. Specifically, we perform critic regression to learn a sequential advantage function from LLM-planned data, and then treat the LLM planner as an optimizer to generate actions that maximize the advantage function. It endows the LLM with the foresight to discern whether the action contributes to accomplishing the final task. We provide theoretical analysis by extending advantage-weighted regression in reinforcement learning to multi-agent systems. Experiments on Overcooked-AI and a difficult variant of RoCoBench show that ReAd surpasses baselines in success rate, and also significantly decreases the interaction steps of agents and query rounds of LLMs, demonstrating its high efficiency for grounding LLMs. More results are given at https://read-llm.github.io/.
pdf
bib
abs
UAQFact: Evaluating Factual Knowledge Utilization of LLMs on Unanswerable Questions
Chuanyuan Tan
|
Wenbiao Shao
|
Hao Xiong
|
Tong Zhu
|
Zhenhua Liu
|
Kai Shi
|
Wenliang Chen
Handling unanswerable questions (UAQ) is crucial for LLMs, as it helps prevent misleading responses in complex situations. While previous studies have built several datasets to assess LLMs’ performance on UAQ, these datasets lack factual knowledge support, which limits the evaluation of LLMs’ ability to utilize their factual knowledge when handling UAQ. To address the limitation, we introduce a new unanswerable question dataset UAQFact, a bilingual dataset with auxiliary factual knowledge created from a Knowledge Graph. Based on UAQFact, we further define two new tasks to measure LLMs’ ability to utilize internal and external factual knowledge, respectively. Our experimental results across multiple LLM series show that UAQFact presents significant challenges, as LLMs do not consistently perform well even when they have factual knowledge stored. Additionally, we find that incorporating external knowledge may enhance performance, but LLMs still cannot make full use of the knowledge which may result in incorrect responses. Our code and dataset are available at https://github.com/cytan17726/UAQ_Fact.
pdf
bib
abs
Exploring Knowledge Filtering for Retrieval-Augmented Discriminative Tasks
Minjie Qiang
|
Zhongqing Wang
|
Xiaoyi Bao
|
HaoYuan Ma
|
Shoushan Li
|
Guodong Zhou
Retrieval-augmented methods have achieved remarkable advancements in alleviating the hallucination of large language models. Nevertheless, the introduction of external knowledge does not always lead to the expected improvement in model performance, as irrelevant or harmful information present in the retrieved knowledge can compromise the prediction process. To address these challenges, we propose a novel framework aimed at improving model performance by incorporating knowledge filtering and prediction fusion mechanisms. In particular, our approach first employs a perplexity-based annotation method to collect training data. Then, we design four distinct strategies to filter out harmful retrieved knowledge. Finally, we integrate the filtered knowledge to generate the final result via batch-wise predictions. We conduct extensive experiments across multiple discriminative task datasets to evaluate the proposed framework. The results demonstrate that our framework can significantly enhance the performance of models on discriminative tasks.
pdf
bib
abs
Group then Scale: Dynamic Mixture-of-Experts Multilingual Language Model
Chong Li
|
Yingzhuo Deng
|
Jiajun Zhang
|
Chengqing Zong
The curse of multilinguality is a fundamental problem of multilingual Large Language Models (LLMs), where the competition between massive languages results in inferior performance. It mainly comes from limited capacity and negative transfer between dissimilar languages. To address this issue, we propose a method to dynamically group and scale up the parameters of a multilingual LLM while boosting positive transfer among similar languages. Specifically, the model is first tuned on monolingual corpora to determine the parameter deviation in each layer and quantify the similarity between languages. Layers with more deviations are extended to mixture-of-experts layers to reduce competition between languages, where one expert module serves one group of similar languages. Experimental results on 18 to 128 languages show that our method reduces the negative transfer between languages and significantly boosts multilingual performance with fewer parameters. Such language group specialization on experts benefits new language adaptation and reduces the interference with the previously learned multilingual knowledge.
pdf
bib
abs
Beyond Verbal Cues: Emotional Contagion Graph Network for Causal Emotion Entailment
Fangxu Yu
|
Junjie Guo
|
Zhen Wu
|
Xinyu Dai
Emotions are fundamental to conversational understanding. While significant advancements have been achieved in conversational emotion recognition and emotional response generation, recognizing the causes of eliciting emotions is less explored. Previous studies have primarily focused on identifying the causes of emotions by understanding verbal contextual utterances, overlooking that non-verbal emotional cues can elicit emotions. To address this issue, we develop an Emotional Contagion Graph Network (ECGN) that simulates the impact of non-verbal implicit emotions on the counterpart’s emotions. To achieve this, we construct a heterogeneous graph that simulates the transmission of non-verbal emotions alongside verbal influences. By applying message passing between nodes, the constructed graph effectively models both the implicit emotional dynamics and explicit verbal interactions. We evaluate ECGN’s performance through extensive experiments on the benchmark datasets and compare it against multiple state-of-the-art models. Experimental results demonstrate the effectiveness of the proposed model. Our code is available at https://github.com/Yu-Fangxu/ECGN.
pdf
bib
abs
Critic-CoT: Boosting the Reasoning Abilities of Large Language Model via Chain-of-Thought Critic
Xin Zheng
|
Jie Lou
|
Boxi Cao
|
Xueru Wen
|
Yuqiu Ji
|
Hongyu Lin
|
Yaojie Lu
|
Xianpei Han
|
Debing Zhang
|
Le Sun
Self-critic has become a crucial mechanism for enhancing the reasoning performance of LLMs. However, current approaches mainly involve basic prompts for intuitive instance-level feedback, which resembles System-1 processes and limits the reasoning capabilities. Moreover, there is a lack of in-depth investigations into the relationship between an LLM’s ability to criticize and its task-solving performance. To address these issues, we propose Critic-CoT, a novel framework that pushes LLMs toward System-2-like critic capability. Through a step-wise CoT reasoning paradigm and the automatic construction of weak-supervision data without human annotation, Critic-CoT enables LLMs to engage in slow, analytic self-critique and refinement, thereby improving their reasoning abilities. Experiments on GSM8K and MATH, together with out-of-domain evaluation, demonstrate that our enhanced model significantly boosts task-solving performance through filtering out invalid solutions or through iterative refinement. Furthermore, we investigate the intrinsic correlation between critique and task-solving abilities within LLMs, discovering that these abilities can mutually reinforce each other rather than conflict.
pdf
bib
abs
Systematic Generalization in Language Models Scales with Information Entropy
Sondre Wold
|
Lucas Georges Gabriel Charpentier
|
Étienne Simon
Systematic generalization remains challenging for current language models, which are known to be both sensitive to semantically similar permutations of the input and to struggle with known concepts presented in novel contexts. Although benchmarks exist for assessing compositional behavior, it is unclear how to measure the difficulty of a systematic generalization problem. In this work, we show how one aspect of systematic generalization can be described by the entropy of the distribution of component parts in the training data. We formalize a framework for measuring entropy in a sequence-to-sequence task and find that the performance of popular model architectures scales with the entropy. Our work connects systematic generalization to information efficiency, and our results indicate that success at high entropy can be achieved even without built-in priors, and that success at low entropy can serve as a target for assessing progress towards robust systematic generalization.
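As a rough illustration of the quantity this abstract refers to, the Shannon entropy of a distribution of component parts can be computed directly from occurrence counts. The following is a minimal sketch and not the authors' framework; the function name `component_entropy` and the toy verb lists are hypothetical.

```python
import math
from collections import Counter


def component_entropy(components):
    """Shannon entropy (in bits) of the empirical distribution of components."""
    counts = Counter(components)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Hypothetical toy data: which verb appears in each training example.
uniform = ["jump", "walk", "run", "look"] * 25    # every verb equally frequent
skewed = ["jump"] * 97 + ["walk", "run", "look"]  # one verb dominates

print(component_entropy(uniform))  # 2.0 bits, the maximum for four components
print(component_entropy(skewed))   # much lower, since one component dominates
```

Under this reading, a training set in which a few components dominate has low entropy, which is the regime the abstract identifies as the harder target for systematic generalization.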
pdf
bib
abs
The Inverse Scaling Effect of Pre-Trained Language Model Surprisal Is Not Due to Data Leakage
Byung-Doh Oh
|
Hongao Zhu
|
William Schuler
In psycholinguistic modeling, surprisal from larger pre-trained language models has been shown to be a poorer predictor of naturalistic human reading times. However, it has been speculated that this may be due to data leakage that caused language models to see the text stimuli during training. This paper presents two studies to address this concern at scale. The first study reveals relatively little leakage of five naturalistic reading time corpora in two pre-training datasets in terms of length and frequency of token n-gram overlap. The second study replicates the negative relationship between language model size and the fit of surprisal to reading times using models trained on ‘leakage-free’ data that overlaps only minimally with the reading time corpora. Taken together, this suggests that previous results using language models trained on these corpora are not driven by the effects of data leakage.
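One simple way to picture a leakage check based on token n-gram overlap is to find the longest n-gram shared between a stimulus text and a pre-training corpus. The following is a minimal sketch under that assumption, not the procedure used in the paper; the function names and the toy sentences are hypothetical.

```python
def ngrams(tokens, n):
    """Return the set of contiguous n-grams (as tuples) in a token sequence."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def longest_shared_ngram(stimulus, corpus, max_n=10):
    """Length of the longest token n-gram found in both sequences (0 if none)."""
    longest = 0
    for n in range(1, max_n + 1):
        if ngrams(stimulus, n) & ngrams(corpus, n):
            longest = n
        else:
            break  # no shared n-gram of length n implies none of length n + 1
    return longest


# Hypothetical toy data standing in for a reading-time stimulus and a corpus.
stimulus = "the old man walked the dog".split()
corpus = "yesterday the old man walked home".split()

print(longest_shared_ngram(stimulus, corpus))  # 4, for "the old man walked"
```

A real check would run over tokenized corpora at scale and would also count how often each overlapping n-gram occurs; this sketch only shows the length side of that measurement.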
pdf
bib
abs
Logical Consistency is Vital: Neural-Symbolic Information Retrieval for Negative-Constraint Queries
Ganlin Xu
|
Zhoujia Zhang
|
Wangyi Mei
|
Jiaqing Liang
|
Weijia Lu
|
Xiaodong Zhang
|
Zhifei Yang
|
Xiaofeng Ma
|
Yanghua Xiao
|
Deqing Yang
Information retrieval plays a crucial role in resource localization. Current dense retrievers retrieve the relevant documents within a corpus via embedding similarities, which compute similarities between dense vectors mainly depending on word co-occurrence between queries and documents, but overlook the real query intents. Thus, they often retrieve numerous irrelevant documents. Particularly in the scenarios of complex queries such as negative-constraint queries, their retrieval performance could be catastrophic. To address the issue, we propose a neuro-symbolic information retrieval method, namely NS-IR, that leverages first-order logic (FOL) to optimize the embeddings of naive natural language by considering the logical consistency between queries and documents. Specifically, we introduce two novel techniques, logic alignment and connective constraint, to re-rank candidate documents, thereby enhancing retrieval relevance. Furthermore, we construct a new dataset, NegConstraint, including negative-constraint queries to evaluate NS-IR’s performance on such complex IR scenarios. Our extensive experiments demonstrate that NS-IR not only achieves superior zero-shot retrieval performance on web search and low-resource retrieval tasks, but also performs better on negative-constraint queries. Our source code and dataset are available at https://github.com/xgl-git/NS-IR-main.
pdf
bib
abs
‘No’ Matters: Out-of-Distribution Detection in Multimodality Multi-Turn Interactive Dialogue
Rena Wei Gao
|
Xuetong Wu
|
Siwen Luo
|
Caren Han
|
Feng Liu
Out-of-distribution (OOD) detection in multimodal contexts is essential for identifying deviations in different modalities, particularly for interactive dialogue systems in real-life interactions, where it is usually infeasible for the systems to deploy large language models (LLMs) to generate dialogue responses due to data privacy and ethical issues. This paper aims to improve label detection that involves multi-round long dialogues by efficiently detecting OOD dialogues and images. We introduce a novel scoring framework named Dialogue Image Aligning and Enhancing Framework (DIAEF) that integrates visual language models with newly proposed scores that detect OOD in two key scenarios: (1) mismatches between the dialogue and image input pair and (2) input pairs with previously unseen labels. Our experimental results, derived from various benchmarks, demonstrate that integrating image and multi-round dialogue OOD detection is more effective with previously unseen labels than using either modality independently. In the presence of mismatched pairs, our proposed score effectively identifies these mismatches and demonstrates strong robustness in long dialogues. This approach enhances domain-aware, adaptive conversational agents and establishes baselines for future studies.
pdf
bib
abs
Event Pattern-Instance Graph: A Multi-Round Role Representation Learning Strategy for Document-Level Event Argument Extraction
Qizhi Wan
|
LiuTao LiuTao
|
Changxuan Wan
|
Rong Hu
|
Keli Xiao
|
Yuxin Shuai
For document-level event argument extraction, existing role-based span selection strategies suffer from several limitations: (1) ignoring interrelations among arguments within an event instance; (2) relying on pre-trained language models to capture role semantics at either the event pattern or document level, without leveraging pattern-instance associations. To address these limitations, this paper proposes a multi-round role representation learning strategy. First, we construct an event pattern-instance graph (EPIG) to comprehensively capture the role semantics embedded in various direct and indirect associations, including those among roles within event patterns, arguments within event instances, and the alignments between patterns and instances. Second, to enhance the learning of role node representation in the graph, we optimize the update mechanisms for both node and edge representations in the EPIG. By leveraging the graph attention network, we iteratively update the representations of role nodes and role edges. The role representations learned from the EPIG are then integrated into the original role representations, further enriching their semantic information. Finally, a role representation memory module and a multi-round learning strategy are proposed to retain and refine role representations learned from previously analyzed documents. This memory mechanism enhances the prediction performance in subsequent rounds of span selection. Extensive experiments on three datasets verify the effectiveness of the model.
pdf
bib
abs
EXECUTE: A Multilingual Benchmark for LLM Token Understanding
Lukas Edman
|
Helmut Schmid
|
Alexander Fraser
The CUTE benchmark showed that LLMs struggle with character understanding in English. We extend it to more languages with diverse scripts and writing systems, introducing EXECUTE. Our simplified framework allows easy expansion to any language. Tests across multiple LLMs reveal that challenges in other languages are not always on the character level as in English. Some languages show word-level processing issues, some show no issues at all. We also examine sub-character tasks in Chinese, Japanese, and Korean to assess LLMs’ understanding of character components.
pdf
bib
abs
Explainable Hallucination through Natural Language Inference Mapping
Wei-Fan Chen
|
Zhixue Zhao
|
Akbar Karimi
|
Lucie Flek
Large language models (LLMs) often generate hallucinated content, making it crucial to identify and quantify inconsistencies in their outputs. We introduce HaluMap, a post-hoc framework that detects hallucinations by mapping entailment and contradiction relations between source inputs and generated outputs using a natural language inference (NLI) model. To improve reliability, we propose a calibration step leveraging intra-text relations to refine predictions. Among training-free approaches, HaluMap outperforms state-of-the-art NLI-based methods by five percentage points, while providing clear, interpretable explanations. As a training-free and model-agnostic approach, HaluMap offers a practical solution for verifying LLM outputs across diverse NLP tasks. The resources of this paper are available at https://github.com/caisa-lab/acl25-halumap.
pdf
bib
abs
HopRAG: Multi-Hop Reasoning for Logic-Aware Retrieval-Augmented Generation
Hao Liu
|
Zhengren Wang
|
Xi Chen
|
Zhiyu Li
|
Feiyu Xiong
|
Qinhan Yu
|
Wentao Zhang
Retrieval-Augmented Generation (RAG) systems often struggle with imperfect retrieval, as traditional retrievers focus on lexical or semantic similarity rather than logical relevance. To address this, we propose HopRAG, a novel RAG framework that augments retrieval with logical reasoning through graph-structured knowledge exploration. During indexing, HopRAG constructs a passage graph, with text chunks as vertices and logical connections established via LLM-generated pseudo-queries as edges. During retrieval, it employs a retrieve-reason-prune mechanism: starting with lexically or semantically similar passages, the system explores multi-hop neighbors guided by pseudo-queries and LLM reasoning to identify truly relevant ones. Experiments on multiple multi-hop benchmarks demonstrate that HopRAG’s retrieve-reason-prune mechanism can expand the retrieval scope based on logical connections and improve final answer quality.
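The retrieve-reason-prune idea in this abstract can be sketched as a bounded graph traversal. The sketch below is an illustration under stated assumptions, not HopRAG's implementation: `multi_hop_retrieve`, the `follow_edge` callback (a stand-in for the LLM's judgment on a pseudo-query), and pruning by visit count are all hypothetical choices.

```python
from typing import Callable, Dict, List, Tuple

# Passage graph: passage id -> list of (pseudo-query on the edge, neighbor passage id).
Graph = Dict[str, List[Tuple[str, str]]]


def multi_hop_retrieve(
    graph: Graph,
    seeds: List[str],
    follow_edge: Callable[[str], bool],
    max_hops: int,
    top_k: int,
) -> List[str]:
    """Expand from seed passages along edges the judge accepts, then prune to top_k."""
    visits = {s: 1 for s in seeds}          # retrieve: start from similar passages
    frontier = list(seeds)
    for _ in range(max_hops):
        next_frontier = []
        for node in frontier:
            for pseudo_query, neighbor in graph.get(node, []):
                if follow_edge(pseudo_query):  # reason: stand-in for LLM reasoning
                    visits[neighbor] = visits.get(neighbor, 0) + 1
                    next_frontier.append(neighbor)
        frontier = next_frontier
    # prune: keep the most frequently reached passages (assumed scoring rule)
    ranked = sorted(visits, key=lambda p: (-visits[p], p))
    return ranked[:top_k]


# Toy graph with hypothetical passages and pseudo-queries.
graph = {
    "p1": [("who founded the company?", "p2"), ("what is the weather?", "p3")],
    "p2": [("where was the founder born?", "p4")],
}
result = multi_hop_retrieve(graph, ["p1"], lambda q: "found" in q, max_hops=2, top_k=3)
print(result)
```

Bounding the traversal by `max_hops` keeps the cost predictable even when the passage graph contains cycles.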
pdf
bib
abs
Double Entendre: Robust Audio-Based AI-Generated Lyrics Detection via Multi-View Fusion
Markus Frohmann
|
Gabriel Meseguer-Brocal
|
Markus Schedl
|
Elena V. Epure
The rapid advancement of AI-based music generation tools is revolutionizing the music industry but also posing challenges to artists, copyright holders, and providers alike. This necessitates reliable methods for detecting such AI-generated content. However, existing detectors, relying on either audio or lyrics, face key practical limitations: audio-based detectors fail to generalize to new or unseen generators and are vulnerable to audio perturbations; lyrics-based methods require cleanly formatted and accurate lyrics, unavailable in practice. To overcome these limitations, we propose a novel, practically grounded approach: a multimodal, modular late-fusion pipeline that combines automatically transcribed sung lyrics and speech features capturing lyrics-related information within the audio. By relying on lyrical aspects directly from audio, our method enhances robustness, mitigates susceptibility to low-level artifacts, and enables practical applicability. Experiments show that our method, DE-detect, outperforms existing lyrics-based detectors while also being more robust to audio perturbations. Thus, it offers an effective, robust solution for detecting AI-generated music in real-world scenarios. Our code is available at https://github.com/deezer/robust-AI-lyrics-detection.
pdf
bib
abs
Don’t Miss the Forest for the Trees: Attentional Vision Calibration for Large Vision Language Models
Sangmin Woo
|
Donguk Kim
|
Jaehyuk Jang
|
Yubin Choi
|
Changick Kim
Large Vision Language Models (LVLMs) demonstrate strong capabilities in visual understanding and description, yet often suffer from hallucinations, attributing incorrect or misleading features to images. We observe that LVLMs disproportionately focus on a small subset of image tokens—termed blind tokens—which are typically irrelevant to the query (e.g., background or non-object regions). We hypothesize that such attention misalignment plays a key role in generating hallucinated responses. To mitigate this issue, we propose Attentional Vision Calibration (AvisC), a test-time approach that dynamically recalibrates the influence of blind tokens without modifying the underlying attention mechanism. AvisC first identifies blind tokens by analyzing layer-wise attention distributions over image tokens, then employs a contrastive decoding strategy to balance the influence of original and blind-token-biased logits. Experiments on standard benchmarks, including POPE, MME, and AMBER, demonstrate that AvisC effectively reduces hallucinations in LVLMs.
pdf
bib
abs
SATA: A Paradigm for LLM Jailbreak via Simple Assistive Task Linkage
Xiaoning Dong
|
Wenbo Hu
|
Wei Xu
|
Tianxing He
Large language models (LLMs) have made significant advancements across various tasks, but their safety alignment remains a major concern. Exploring jailbreak prompts can expose LLMs’ vulnerabilities and guide efforts to secure them. Existing methods primarily design sophisticated instructions for the LLM to follow, or rely on multiple iterations, which could hinder the performance and efficiency of jailbreaks. In this work, we propose a novel jailbreak paradigm, Simple Assistive Task Linkage (SATA), which can effectively circumvent LLM safeguards and elicit harmful responses. Specifically, SATA first masks harmful keywords within a malicious query to generate a relatively benign query containing one or multiple [MASK] special tokens. It then employs a simple assistive task (such as a masked language model task or an element lookup by position task) to encode the semantics of the masked keywords. Finally, SATA links the assistive task with the masked query to jointly perform the jailbreak. Extensive experiments show that SATA achieves state-of-the-art performance and outperforms baselines by a large margin. Specifically, on the AdvBench dataset, with the masked language model (MLM) assistive task, SATA achieves an overall attack success rate (ASR) of 85% and harmful score (HS) of 4.57, and with the element lookup by position (ELP) assistive task, SATA attains an overall ASR of 76% and HS of 4.43.
pdf
bib
abs
Chain-Talker: Chain Understanding and Rendering for Empathetic Conversational Speech Synthesis
Yifan Hu
|
Rui Liu
|
Yi Ren
|
Xiang Yin
|
Haizhou Li
Conversational Speech Synthesis (CSS) aims to align synthesized speech with the emotional and stylistic context of user-agent interactions to achieve empathy. Current generative CSS models face interpretability limitations due to insufficient emotional perception and redundant discrete speech coding. To address the above issues, we present Chain-Talker, a three-stage framework mimicking human cognition: Emotion Understanding derives context-aware emotion descriptors from dialogue history; Semantic Understanding generates compact semantic codes via serialized prediction; and Empathetic Rendering synthesizes expressive speech by integrating both components. To support emotion modeling, we develop CSS-EmCap, an LLM-driven automated pipeline for generating precise conversational speech emotion captions. Experiments on three benchmark datasets demonstrate that Chain-Talker produces more expressive and empathetic speech than existing methods, with CSS-EmCap contributing to reliable emotion modeling. The code and demos are available at: https://github.com/AI-S2-Lab/Chain-Talker.
pdf
bib
abs
Parameter-Efficient Fine-Tuning via Circular Convolution
Aochuan Chen
|
Jiashun Cheng
|
Zijing Liu
|
Ziqi Gao
|
Fugee Tsung
|
Yu Li
|
Jia Li
Low-Rank Adaptation (LoRA) has gained popularity for fine-tuning large foundation models, leveraging low-rank matrices $\mathbf{A}$ and $\mathbf{B}$ to represent weight changes (i.e., $\Delta \mathbf{W} = \mathbf{B}\mathbf{A}$). This method reduces trainable parameters and mitigates heavy memory consumption associated with full delta matrices by sequentially multiplying $\mathbf{A}$ and $\mathbf{B}$ with the activation. Despite its success, the intrinsic low-rank characteristic may limit its performance. Although several variants have been proposed to address this issue, they often overlook the crucial computational and memory efficiency brought by LoRA. In this paper, we propose Circular Convolution Adaptation (C3A), which not only achieves high-rank adaptation with enhanced performance but also excels in both computational power and memory utilization. Extensive experiments demonstrate that C3A consistently outperforms LoRA and its variants across various fine-tuning tasks.
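The contrast between a low-rank update and a circular-convolution update can be made concrete with a small pure-Python sketch. It is illustrative only and not the C3A implementation: the helper names and toy values are assumptions, and a practical version would compute the convolution with an FFT rather than the quadratic loop used here.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def matvec(m: Matrix, x: Vector) -> Vector:
    """Plain matrix-vector product."""
    return [sum(a * b for a, b in zip(row, x)) for row in m]


def lora_delta(B: Matrix, A: Matrix, x: Vector) -> Vector:
    """Apply Delta W = B A to x as B (A x), never forming the full d x d matrix."""
    return matvec(B, matvec(A, x))


def circular_conv(w: Vector, x: Vector) -> Vector:
    """Circular convolution: y[i] = sum_j w[(i - j) mod d] * x[j]."""
    d = len(x)
    return [sum(w[(i - j) % d] * x[j] for j in range(d)) for i in range(d)]


def circulant(w: Vector) -> Matrix:
    """The d x d circulant matrix whose product with x equals circular_conv(w, x)."""
    d = len(w)
    return [[w[(i - j) % d] for j in range(d)] for i in range(d)]


x = [1.0, 2.0, 3.0]
# Rank-1 LoRA-style update with B of shape 3 x 1 and A of shape 1 x 3 (toy values).
lora_out = lora_delta([[1.0], [0.0], [2.0]], [[1.0, 1.0, 1.0]], x)
# Circular-convolution update parameterized by only d = 3 numbers (toy values).
w = [1.0, 0.0, 2.0]
conv_out = circular_conv(w, x)
print(lora_out, conv_out)
```

The kernel `w` has only `d` entries, yet the equivalent circulant matrix can be full rank, which is one way a convolution-based update can escape the low-rank constraint while keeping the parameter count small.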
pdf
bib
abs
Alleviating Hallucinations in Large Language Models via Truthfulness-driven Rank-adaptive LoRA
Jiahao Li
|
Zhendong Mao
|
Quan Wang
Improving the truthfulness of LLMs to alleviate hallucinations has become critical for promoting the practical deployment of LLMs. Current fine-tuning-based methods ignore the intrinsic discrepancy in the truthfulness correlations across LLM internal modules, and instead treat them equally, which may limit the improvement in truthfulness. In this paper, we propose a truthfulness-driven rank-adaptive LoRA method to improve LLM truthfulness (RaLFiT), which adaptively allocates the ranks in LoRA training according to the truthfulness correlations of modules within the LLM. Specifically, it first measures the truthfulness correlation of each LLM module by a probing process, and allocates higher ranks to strongly correlated modules, which means a larger update subspace during training. Experimental results on TruthfulQA show that RaLFiT consistently outperforms previous state-of-the-art methods across the Llama LLM family, verifying its effectiveness and superiority, and for the first time makes the performance of 7B Llama LLMs exceed GPT-4.
pdf
bib
abs
ScEdit: Script-based Assessment of Knowledge Editing
Xinye Li
|
Zunwen Zheng
|
Qian Zhang
|
Dekai Zhuang
|
Jiabao Kang
|
Liyan Xu
|
Qingbin Liu
|
Xi Chen
|
Zhiying Tu
|
Dianhui Chu
|
Dianbo Sui
Knowledge Editing (KE) has gained increasing attention, yet current KE tasks remain relatively simple. Under current evaluation frameworks, many editing methods achieve exceptionally high scores, sometimes nearing perfection. However, few studies integrate KE into real-world application scenarios (e.g., recent interest in LLM-as-agent). To support our analysis, we introduce a novel script-based benchmark – ScEdit (Script-based Knowledge Editing Benchmark) – which encompasses both counterfactual and temporal edits. We integrate token-level and text-level evaluation methods, comprehensively analyzing existing KE techniques. The benchmark extends traditional fact-based (“What”-type question) evaluation to action-based (“How”-type question) evaluation. We observe that all KE methods exhibit a drop in performance on established metrics and face challenges on text-level metrics, indicating a challenging task. Our benchmark is available at https://github.com/asdfo123/ScEdit.
pdf
bib
abs
SafeRoute: Adaptive Model Selection for Efficient and Accurate Safety Guardrails in Large Language Models
Seanie Lee
|
Dong Bok Lee
|
Dominik Wagner
|
Minki Kang
|
Haebin Seong
|
Tobias Bocklet
|
Juho Lee
|
Sung Ju Hwang
Deploying large language models (LLMs) in real-world applications requires robust safety guard models to detect and block harmful user prompts. While large safety guard models achieve strong performance, their computational cost is substantial. To mitigate this, smaller distilled models are used, but they often underperform on “hard” examples where the larger model provides accurate predictions. We observe that many inputs can be reliably handled by the smaller model, while only a small fraction require the larger model’s capacity. Motivated by this, we propose SafeRoute, a binary router that distinguishes hard examples from easy ones. Our method selectively applies the larger safety guard model to the data that the router considers hard, improving efficiency while maintaining accuracy compared to solely using the larger safety guard model. Experimental results on multiple benchmark datasets demonstrate that our adaptive model selection significantly enhances the trade-off between computational cost and safety performance, outperforming relevant baselines.
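The routing idea described in this abstract can be sketched as follows. This is a hedged illustration, not SafeRoute itself: `routed_guard`, the threshold, and the keyword-based toy router and guards are hypothetical stand-ins for learned models.

```python
from typing import Callable

Guard = Callable[[str], bool]    # returns True if the prompt is judged harmful
Router = Callable[[str], float]  # assumed to return an estimated "hardness" in [0, 1]


def routed_guard(
    prompt: str,
    router: Router,
    small_guard: Guard,
    large_guard: Guard,
    threshold: float = 0.5,
) -> bool:
    """Send prompts the router considers hard to the large guard, the rest to the small one."""
    if router(prompt) >= threshold:
        return large_guard(prompt)
    return small_guard(prompt)


# Toy stand-ins (hypothetical): keyword rules instead of trained models.
calls = {"small": 0, "large": 0}


def toy_router(prompt: str) -> float:
    return 0.9 if "hypothetically" in prompt else 0.1


def toy_small(prompt: str) -> bool:
    calls["small"] += 1
    return "malware" in prompt


def toy_large(prompt: str) -> bool:
    calls["large"] += 1
    return "malware" in prompt or "payload" in prompt


prompts = ["write malware", "hypothetically, craft a payload", "what is the weather"]
verdicts = [routed_guard(p, toy_router, toy_small, toy_large) for p in prompts]
print(verdicts, calls)
```

In this toy run the large guard is invoked once out of three prompts, which is the source of the efficiency gain: cost scales with the fraction of inputs the router flags as hard.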
pdf
bib
abs
Moderation Matters: Measuring Conversational Moderation Impact in English as a Second Language Group Discussion
Rena Wei Gao
|
Ming-Bin Chen
|
Lea Frermann
|
Jey Han Lau
English as a Second Language (ESL) speakers often struggle to engage in group discussions due to language barriers. While moderators can facilitate participation, few studies assess conversational engagement and evaluate moderation effectiveness. To address this gap, we develop a dataset comprising 17 sessions from an online ESL conversation club, which includes both moderated and non-moderated discussions. We then introduce an approach that integrates automatic ESL dialogue assessment and a framework that categorizes moderation strategies. Our findings indicate that moderators help improve the flow of topics and start/end a conversation. Interestingly, we find active acknowledgement and encouragement to be the most effective moderation strategy, while excessive information and opinion sharing by moderators has a negative impact. Ultimately, our study paves the way for analyzing ESL group discussions and the role of moderators in non-native conversation settings.
pdf
bib
abs
Measuring Bias and Agreement in Large Language Model Presupposition Judgments
Katherine Atwell
|
Mandy Simons
|
Malihe Alikhani
Identifying linguistic bias in text demands the identification not only of explicitly asserted content but also of implicit content including presuppositions. Large language models (LLMs) offer a promising automated approach to detecting presuppositions, yet the extent to which their judgments align with human intuitions remains unexplored. Moreover, LLMs may inadvertently reflect societal biases when identifying presupposed content. To empirically investigate this, we prompt multiple large language models to evaluate presuppositions across diverse textual domains, drawing from three distinct datasets annotated by human raters. We calculate the agreement between LLMs and human raters, and find several linguistic factors associated with fluctuations in human-model agreement. Our observations reveal discrepancies in human-model alignment, suggesting potential biases in LLMs, notably influenced by gender and political ideology.
pdf
bib
abs
Harnessing PDF Data for Improving Japanese Large Multimodal Models
Jeonghun Baek
|
Akiko Aizawa
|
Kiyoharu Aizawa
Large Multimodal Models (LMMs) have demonstrated strong performance in English, but their effectiveness in Japanese remains limited due to the lack of high-quality training data. Current Japanese LMMs often rely on translated English datasets, restricting their ability to capture Japan-specific cultural knowledge. To address this, we explore the potential of Japanese PDF data as a training resource, an area that remains largely underutilized. We introduce a fully automated pipeline that leverages pretrained models to extract image-text pairs from PDFs through layout analysis, OCR, and vision-language pairing, removing the need for manual annotation. Additionally, we construct instruction data from extracted image-text pairs to enrich the training data. To evaluate the effectiveness of PDF-derived data, we train Japanese LMMs and assess their performance on the Japanese LMM Benchmark. Our results demonstrate substantial improvements, with performance gains ranging from 2.1% to 13.8% on Heron-Bench. Further analysis highlights the impact of PDF-derived data on various factors, such as model size and language models, reinforcing its value as a multimodal resource for Japanese LMMs.
pdf
bib
abs
EnerGIZAr: Leveraging GIZA++ for Effective Tokenizer Initialization
Pranaydeep Singh
|
Eneko Agirre
|
Gorka Azkune
|
Orphee De Clercq
|
Els Lefever
Continual pre-training has long been considered the default strategy for adapting models to non-English languages, but struggles with initializing new embeddings, particularly for non-Latin scripts. In this work, we propose EnerGIZAr, a novel methodology that improves continual pre-training by leveraging statistical word alignment techniques. Our approach utilizes GIZA++ to construct a subword-level alignment matrix between source (English) and target language tokens. This matrix enables informed initialization of target tokenizer embeddings, which provides a more effective starting point for adaptation. We evaluate EnerGIZAr against state-of-the-art initialization strategies such as OFA and FOCUS across four typologically diverse languages: Hindi, Basque, Arabic and Korean. Experimental results on key NLP tasks – including POS tagging, Sentiment Analysis, NLI, and NER – demonstrate that EnerGIZAr achieves superior monolingual performance while also outperforming all methods for cross-lingual transfer when tested on XNLI. With EnerGIZAr, we propose an intuitive, explainable as well as state-of-the-art initialization technique for continual pre-training of English models.
pdf
bib
abs
AMEX: Android Multi-annotation Expo Dataset for Mobile GUI Agents
Yuxiang Chai
|
Siyuan Huang
|
Yazhe Niu
|
Han Xiao
|
Liang Liu
|
Guozhi Wang
|
Dingyu Zhang
|
Shuai Ren
|
Hongsheng Li
AI agents have drawn increasing attention, mostly for their ability to perceive environments, understand tasks, and autonomously achieve goals. To advance research on AI agents in mobile scenarios, we introduce the Android Multi-annotation EXpo (AMEX), a comprehensive, large-scale dataset designed for generalist mobile GUI-control agents which are capable of completing tasks by directly interacting with the graphical user interface (GUI) on mobile devices. AMEX comprises over 104K high-resolution screenshots from popular mobile applications, which are annotated at multiple levels. Unlike existing GUI-related datasets such as Rico and AitW, AMEX includes three levels of annotations: GUI interactive element grounding, GUI screen and element functionality descriptions, and complex natural language instructions with stepwise GUI-action chains. We develop this dataset from a more instructive and detailed perspective, complementing the general settings of existing datasets. Additionally, we finetune a baseline model SPHINX Agent and illustrate the effectiveness of AMEX.
pdf
bib
abs
Drop Dropout on Single Epoch Language Model Pretraining
Houjun Liu
|
John Bauer
|
Christopher D Manning
Originally, dropout was seen as a breakthrough regularization technique that improved performance in almost all applications of deep learning by reducing overfitting. Yet, single-epoch pretraining tasks common to modern LLMs yield minimal overfitting, leading to dropout not being used for large LLMs. Nevertheless, no thorough empirical investigation has been done on the role of dropout in LM pretraining. Through experiments in single-epoch pretraining of both masked (BERT) and autoregressive (Pythia 160M and 1.4B) LMs with varying levels of dropout, we find that downstream performance in language modeling, morpho-syntax (BLiMP), question answering (SQuAD), and natural-language inference (MNLI) improves when dropout is not applied during pretraining. We additionally find that the recently introduced “early dropout” also degrades performance relative to applying no dropout at all. We further investigate the models’ editability, and find that models trained without dropout are more successful in gradient-based model editing (MEND) and equivalent in representation-based model editing (ReFT). Therefore, we advocate to **drop dropout** during single-epoch pretraining.
pdf
bib
abs
Robust and Minimally Invasive Watermarking for EaaS
Zongqi Wang
|
Baoyuan Wu
|
Jingyuan Deng
|
Yujiu Yang
Embeddings as a Service (EaaS) is emerging as a crucial component of AI applications. Unfortunately, EaaS is vulnerable to model extraction attacks, highlighting the urgent need for copyright protection. Although some preliminary works propose applying embedding watermarks to protect EaaS, recent research reveals that these watermarks can be easily removed. Hence, it is crucial to inject robust watermarks resistant to watermark removal attacks. Existing watermarking methods typically inject a target embedding into embeddings through linear interpolation when the text contains triggers. However, this mechanism results in each watermarked embedding having the same component, which makes the watermark easy to identify and eliminate. Motivated by this, in this paper, we propose a novel embedding-specific watermarking (ESpeW) mechanism to offer robust copyright protection for EaaS. Our approach involves injecting unique, yet readily identifiable watermarks into each embedding. Watermarks inserted by ESpeW are designed to maintain a significant distance from one another and to avoid sharing common components, thus making it significantly more challenging to remove the watermarks. Moreover, ESpeW is minimally invasive, as it reduces the impact on embeddings to less than 1%, setting a new milestone in watermarking for EaaS. Extensive experiments on four popular datasets demonstrate that ESpeW can even watermark successfully against a highly aggressive removal strategy without sacrificing the quality of embeddings.
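The weakness of interpolation-based watermarks noted in this abstract can be seen in a minimal sketch. It illustrates only the baseline mechanism the abstract criticizes, not ESpeW; the function name, the fixed interpolation weight, the trigger set, and the omission of any renormalization step are assumptions.

```python
from typing import List, Set

Vector = List[float]


def interpolate_watermark(
    text: str, embedding: Vector, target: Vector, triggers: Set[str], weight: float
) -> Vector:
    """Blend a fixed target embedding into the output when the text contains a trigger."""
    if not any(token in triggers for token in text.split()):
        return list(embedding)  # no trigger: embedding is returned unchanged
    return [(1.0 - weight) * e + weight * t for e, t in zip(embedding, target)]


# Toy values (hypothetical): 2-dimensional embeddings and one trigger word.
target = [0.0, 1.0]
triggers = {"zebra"}
wm1 = interpolate_watermark("a zebra runs", [1.0, 0.0], target, triggers, 0.5)
wm2 = interpolate_watermark("the zebra sleeps", [-1.0, 0.0], target, triggers, 0.5)
clean = interpolate_watermark("a horse runs", [1.0, 0.0], target, triggers, 0.5)

# Subtracting the scaled original exposes the same added component in both outputs.
shared1 = [w - 0.5 * e for w, e in zip(wm1, [1.0, 0.0])]
shared2 = [w - 0.5 * e for w, e in zip(wm2, [-1.0, 0.0])]
print(wm1, wm2, shared1, shared2)
```

Because every watermarked vector carries the same `weight * target` term, an attacker who estimates that common direction can project it out, which is the removal risk that motivates embedding-specific watermarks.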
pdf
bib
abs
Task-Informed Anti-Curriculum by Masking Improves Downstream Performance on Text
Jarca Andrei
|
Florinel Alin Croitoru
|
Radu Tudor Ionescu
Masked language modeling has become a widely adopted unsupervised technique to pre-train large language models (LLMs). However, the process of selecting tokens for masking is random, and the percentage of masked tokens is typically fixed for the entire training process. In this paper, we propose to adjust the masking ratio and to decide which tokens to mask based on a novel task-informed anti-curriculum learning scheme. First, we harness task-specific knowledge about useful and harmful tokens in order to determine which tokens to mask. Second, we propose a cyclic decaying masking ratio, which corresponds to an anti-curriculum schedule (from hard to easy). We exemplify our novel task-informed anti-curriculum by masking (TIACBM) approach across three diverse downstream tasks: sentiment analysis, text classification by topic, and authorship attribution. Our findings suggest that TIACBM enhances the ability of the model to focus on key task-relevant features, contributing to statistically significant performance gains across tasks. We release our code at https://github.com/JarcaAndrei/TIACBM.
pdf
bib
abs
CARMO: Dynamic Criteria Generation for Context Aware Reward Modelling
Taneesh Gupta
|
Shivam Shandilya
|
Xuchao Zhang
|
Rahul Madhavan
|
Supriyo Ghosh
|
Chetan Bansal
|
Huaxiu Yao
|
Saravan Rajmohan
Reward modeling in large language models is known to be susceptible to reward hacking, causing models to latch onto superficial features such as the tendency to generate lists or unnecessarily long responses. In RLHF, and more generally during post-training, flawed reward signals often lead to outputs that optimize for these spurious correlates instead of genuine quality or correctness. We propose **Carmo (Context-Aware Reward Modeling)**, a novel approach that first generates dynamic, context-relevant criteria to ground the reward model prior to producing reward scores. Unlike prior methods that use static rubrics, Carmo leverages powerful LLMs to adaptively create evaluation criteria, e.g., logical consistency, clarity, and depth, tailored to the user query. Our theoretical analysis shows that such criteria generation can mitigate reward hacking. We further demonstrate how Carmo can be distilled into smaller models, thereby lowering the computational cost of alignment. We establish a new state-of-the-art performance in the zero-shot setting for generative models, with a 2.1% improvement on Reward Bench. Furthermore, alignment performed on the Carmo-curated preference dataset achieves **22.5% LC-WR and 21.1% WR on Mistral-Base (7B)**. We release our datasets at [huggingface/CARMO](https://huggingface.co/datasets/Multi-preference-Optimization/CARMO-UltraFeedback).
pdf
bib
abs
SLAM-Omni: Timbre-Controllable Voice Interaction System with Single-Stage Training
Wenxi Chen
|
Ziyang Ma
|
Ruiqi Yan
|
Yuzhe Liang
|
Xiquan Li
|
Ruiyang Xu
|
Zhikang Niu
|
Yanqiao Zhu
|
Yifan Yang
|
Zhanxun Liu
|
Kai Yu
|
Yuxuan Hu
|
Jinyu Li
|
Yan Lu
|
Shujie Liu
|
Xie Chen
Recent advancements highlight the potential of end-to-end real-time spoken dialogue systems, showcasing their low latency and high quality. In this paper, we introduce SLAM-Omni, a timbre-controllable, end-to-end voice interaction system with single-stage training. SLAM-Omni achieves zero-shot timbre control by modeling spoken language with semantic tokens and decoupling speaker information to a vocoder. By predicting grouped speech semantic tokens at each step, our method significantly reduces the sequence length of audio tokens, accelerating both training and inference. Additionally, we propose historical text prompting to compress dialogue history, facilitating efficient multi-round interactions. Comprehensive evaluations reveal that SLAM-Omni outperforms prior models of similar scale, requiring only 15 hours of training on 4 GPUs with limited data. Notably, it is the first spoken dialogue system to achieve competitive performance with a single-stage training approach, eliminating the need for pre-training on TTS or ASR tasks. Further experiments validate its multilingual and multi-turn dialogue capabilities on larger datasets.
pdf
bib
abs
C2LEVA: Toward Comprehensive and Contamination-Free Language Model Evaluation
Yanyang Li
|
Wong Tin Long
|
Cheung To Hung
|
Jianqiao Zhao
|
Duo Zheng
|
Liu Ka Wai
|
Michael R. Lyu
|
Liwei Wang
Recent advances in large language models (LLMs) have shown significant promise, yet their evaluation raises concerns, particularly regarding data contamination due to the lack of access to proprietary training data. To address this issue, we present C2LEVA, a comprehensive bilingual benchmark featuring systematic contamination prevention. C2LEVA firstly offers a holistic evaluation encompassing 22 tasks, each targeting a specific application or ability of LLMs, and secondly a trustworthy assessment due to our contamination-free tasks, ensured by a systematic contamination prevention strategy that fully automates test data renewal and enforces data protection during benchmark data release. Our large-scale evaluation of 15 open-source and proprietary models demonstrates the effectiveness of C2LEVA.
pdf
bib
abs
Texts or Images? A Fine-grained Analysis on the Effectiveness of Input Representations and Models for Table Question Answering
Wei Zhou
|
Mohsen Mesgar
|
Heike Adel
|
Annemarie Friedrich
In table question answering (TQA), tables are encoded as either texts or images. Prior work suggests that passing images of tables to multi-modal large language models (MLLMs) performs comparably to using textual input with large language models (LLMs). However, the lack of controlled setups limits fine-grained distinctions between these approaches. In this paper, we conduct the first controlled study on the effectiveness of several combinations of table representations and model types from two perspectives: question complexity and table size. We build a new benchmark based on existing TQA datasets. In a systematic analysis of seven pairs of MLLMs and LLMs, we find that the best combination of table representation and model varies across setups. We propose FRES, a method selecting table representations dynamically, and observe a 10% average performance improvement compared to using both representations indiscriminately.
pdf
bib
abs
Adaptive-VP: A Framework for LLM-Based Virtual Patients that Adapts to Trainees’ Dialogue to Facilitate Nurse Communication Training
Keyeun Lee
|
Seolhee Lee
|
Esther Hehsun Kim
|
Yena Ko
|
Jinsu Eun
|
Dahee Kim
|
Hyewon Cho
|
Haiyi Zhu
|
Robert E. Kraut
|
Eunyoung E. Suh
|
Eun-mee Kim
|
Hajin Lim
Effective communication training is essential to preparing nurses for high-quality patient care. While standardized patient (SP) simulations provide valuable experiential learning, they are often costly and inflexible. Virtual patient (VP) systems offer a scalable alternative, but most fail to adapt to the varying communication skills of trainees. In particular, when trainees respond ineffectively, VPs should escalate in hostility or become uncooperative—yet this level of adaptive interaction remains largely unsupported. To address this gap, we introduce Adaptive-VP, a VP dialogue generation framework that leverages large language models (LLMs) to dynamically adapt VP behavior based on trainee input. The framework features a pipeline for constructing clinically grounded yet flexible VP scenarios and a modular system for assessing trainee communication and adjusting VP responses in real time, while ensuring learner safety. We validated Adaptive-VP by simulating challenging patient conversations. Automated evaluation using a corpus from practicing nurses showed that our communication skill evaluation mechanism reflected real-world proficiency levels. Expert nurses further confirmed that Adaptive-VP produced more natural and realistic interactions than existing approaches, demonstrating its potential as a scalable and effective tool for nursing communication training.
pdf
bib
abs
Enhancing Multimodal Unified Representations for Cross Modal Generalization
Hai Huang
|
Yan Xia
|
Shengpeng Ji
|
Shulei Wang
|
Hanting Wang
|
Minghui Fang
|
Jieming Zhu
|
Zhenhua Dong
|
Sashuai Zhou
|
Zhou Zhao
To enhance the interpretability of multimodal unified representations, many studies have focused on discrete unified representations. These efforts typically start with contrastive learning and gradually extend to the disentanglement of modal information, achieving solid multimodal discrete unified representations. However, existing research often overlooks two critical issues: 1) The use of Euclidean distance for quantization in discrete representations often overlooks the important distinctions among different dimensions of features, resulting in redundant representations after quantization; 2) Different modalities have unique characteristics, and a uniform alignment approach does not fully exploit these traits. To address these issues, we propose Training-free Optimization of Codebook (TOC) and Fine and Coarse cross-modal Information Disentangling (FCID). These methods refine the unified discrete representations from pretraining and perform fine- and coarse-grained information disentanglement tailored to the specific characteristics of each modality, achieving significant performance improvements over previous state-of-the-art models. The code is available at https://github.com/haihuangcode/CMG.
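The Euclidean-distance quantization that the abstract identifies as a weakness can be illustrated with a minimal sketch. The codebook values and the function name `quantize` are hypothetical and not taken from the paper; the code shows plain nearest-neighbour lookup only, not TOC or FCID.

```python
import math

def quantize(vector, codebook):
    """Return the index of the codebook entry nearest to `vector`.

    Plain Euclidean distance weights every feature dimension equally,
    which is the property the abstract criticizes.
    """
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return min(range(len(codebook)), key=lambda i: dist(vector, codebook[i]))

# Hypothetical 2-D codebook with three entries.
codebook = [[0.0, 0.0], [1.0, 1.0], [4.0, 0.0]]
index = quantize([0.9, 1.2], codebook)
```

Because all dimensions contribute equally to the distance, an uninformative dimension can decide which code is selected.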
pdf
bib
abs
Domain Regeneration: How well do LLMs match syntactic properties of text domains?
Da Ju
|
Hagen Blix
|
Adina Williams
Recent improvements in large language model performance have, in all likelihood, been accompanied by improvement in how well they can approximate the distribution of their training data. In this work, we explore the following question: which properties of text domains do LLMs faithfully approximate, and how well do they do so? Applying observational approaches familiar from corpus linguistics, we prompt a commonly used, open-source LLM to regenerate text from two domains of permissively licensed English text which are often contained in LLM training data: Wikipedia and news text. This regeneration paradigm allows us to investigate whether LLMs can faithfully match the original human text domains in a fairly semantically-controlled setting. We investigate varying levels of syntactic abstraction, from simpler properties like sentence length and article readability, to more complex and higher order properties such as dependency tag distribution, parse depth, and parse complexity. We find that the majority of the regenerated distributions show a shifted mean, a lower standard deviation, and a reduction of the long tail, as compared to the human originals.
pdf
bib
abs
Structural Deep Encoding for Table Question Answering
Raphaël Mouravieff
|
Benjamin Piwowarski
|
Sylvain Lamprier
Although Transformers-based architectures excel at processing textual information, their naive adaptation for tabular data often involves flattening the table structure. This simplification can lead to the loss of essential inter-dependencies between rows, columns, and cells, while also posing scalability challenges for large tables. To address these issues, prior works have explored special tokens, structured embeddings, and sparse attention patterns. In this paper, we conduct a comprehensive analysis of tabular encoding techniques used in QA, which highlights the crucial role of attention sparsity in preserving structural information of tables. We also introduce a set of novel sparse attention mask designs for tabular data, that not only enhance computational efficiency but also preserve structural integrity, leading to better overall performance.
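To make the idea of a structure-preserving sparse attention mask concrete, here is a minimal sketch in which each table cell may attend only to cells in its own row or column. This particular pattern and the name `row_column_mask` are assumptions for illustration; the abstract does not specify the mask designs the paper proposes.

```python
def row_column_mask(n_rows, n_cols):
    """Build a boolean attention mask over a flattened table.

    Cells are flattened row by row, so cell (r, c) has index r * n_cols + c.
    mask[i][j] is True when cell i may attend to cell j, i.e. when both
    cells share a row or a column (hypothetical pattern).
    """
    n = n_rows * n_cols
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        ri, ci = divmod(i, n_cols)
        for j in range(n):
            rj, cj = divmod(j, n_cols)
            mask[i][j] = (ri == rj) or (ci == cj)
    return mask

mask = row_column_mask(2, 3)
```

Under this pattern each cell attends to `n_rows + n_cols - 1` cells instead of all `n_rows * n_cols`, which is where the efficiency gain for large tables would come from.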
pdf
bib
abs
MPL: Multiple Programming Languages with Large Language Models for Information Extraction
Bo Li
|
Gexiang Fang
|
Wei Ye
|
Zhenghua Xu
|
Jinglei Zhang
|
Hao Cheng
|
Shikun Zhang
Recent research in information extraction (IE) focuses on utilizing code-style inputs to enhance structured output generation. The intuition behind this is that the programming languages (PLs) inherently exhibit greater structural organization than natural languages (NLs). This structural advantage makes PLs particularly suited for IE tasks. Nevertheless, existing research primarily focuses on Python for code-style simulation, overlooking the potential of other widely-used PLs (e.g., C++ and Java) during the supervised fine-tuning (SFT) phase. In this research, we propose Multiple Programming Languages with large language models for information extraction (abbreviated as MPL), a novel framework that explores the potential of incorporating different PLs in the SFT phase. Additionally, we introduce function-prompt with virtual running to simulate code-style inputs more effectively and efficiently. Experimental results on a wide range of datasets demonstrate the effectiveness of MPL. Furthermore, we conduct extensive experiments to provide a comprehensive analysis. Our code and additional files are in the supplementary materials.
pdf
bib
abs
Self-Critique Guided Iterative Reasoning for Multi-hop Question Answering
Zheng Chu
|
Huiming Fan
|
Jingchang Chen
|
Qianyu Wang
|
Mingda Yang
|
Jiafeng Liang
|
Zhongjie Wang
|
Hao Li
|
Guo Tang
|
Ming Liu
|
Bing Qin
Although large language models (LLMs) have demonstrated remarkable reasoning capabilities, they still face challenges in knowledge-intensive multi-hop reasoning. Recent work explores iterative retrieval to address complex problems. However, the absence of intermediate guidance often leads to inaccurate retrieval and intermediate reasoning errors, leading to incorrect reasoning. To address these, we propose Self-Critique Guided Iterative Reasoning (SiGIR), which uses self-critique feedback to guide the iterative reasoning process. Specifically, through end-to-end training, we enable the model to iteratively address complex problems via question decomposition, while also being able to self-evaluate its intermediate reasoning steps. During iterative reasoning, the model engages in branching exploration and employs self-evaluation to guide the selection of promising reasoning trajectories. Extensive experiments on three multi-hop reasoning datasets demonstrate the effectiveness of our proposed method, surpassing the previous SOTA by 8.6%. Furthermore, our thorough analysis offers insights for future research. Our code, data, and models are available at https://github.com/zchuz/SiGIR-MHQA.
pdf
bib
abs
Anchored Answers: Unravelling Positional Bias in GPT-2’s Multiple-Choice Questions
Ruizhe Li
|
Yanjun Gao
Large Language Models (LLMs), such as the GPT-4 and LLaMA families, have demonstrated considerable success across diverse tasks, including multiple-choice questions (MCQs). However, these models exhibit a positional bias, particularly an even worse “anchored bias” in the GPT-2 family, where they consistently favour the first choice ‘A’ in MCQs during inference. This anchored bias challenges the integrity of GPT-2’s decision-making process, as it skews performance based on the position rather than the content of the choices in MCQs. In this study, we utilise the mechanistic interpretability approach to identify the internal modules within GPT-2 models responsible for this bias. We focus on the Multi-Layer Perceptron (MLP) layers and attention heads, using the “logit lens” method to trace and modify the specific value vectors that contribute to the bias. By updating these vectors within MLP and recalibrating attention patterns to neutralise the preference for the first choice ‘A’, we effectively mitigate the anchored bias. Our interventions not only mitigate the bias but also improve the overall MCQ prediction accuracy for the GPT-2 family across various datasets. This work represents the first comprehensive mechanistic analysis of anchored bias from the failing cases in MCQs within the GPT-2 models, introducing targeted, minimal-intervention strategies that significantly enhance GPT-2 model robustness and accuracy in MCQs. Our code is available at https://github.com/ruizheliUOA/Anchored_Bias_GPT2.
pdf
bib
abs
Failing Forward: Improving Generative Error Correction for ASR with Synthetic Data and Retrieval Augmentation
Sreyan Ghosh
|
Mohammad Sadegh Rasooli
|
Michael Levit
|
Peidong Wang
|
Jian Xue
|
Dinesh Manocha
|
Jinyu Li
Generative Error Correction (GEC) has emerged as a powerful post-processing method to boost the performance of Automatic Speech Recognition (ASR) systems. In this paper, we first show that GEC models struggle to generalize beyond the specific types of errors encountered during training, limiting their ability to correct new, unseen errors at test time, particularly in out-of-domain (OOD) scenarios. This phenomenon amplifies with named entities (NEs), where, in addition to insufficient contextual information or knowledge about the NEs, novel NEs keep emerging. To address these issues, we propose DARAG (Data- and Retrieval-Augmented Generative Error Correction), a novel approach designed to improve GEC for ASR in in-domain (ID) and OOD scenarios. First, we augment the GEC training dataset with synthetic data generated using foundational generative models, thereby simulating additional errors from which the model can learn. For out-of-domain scenarios, we simulate test-time errors from new domains similarly and in an unsupervised fashion. Additionally, to better handle NEs, we introduce retrieval-augmented correction wherein we augment the model input with entities retrieved from a datastore of NEs. Our approach is simple, scalable, and both domain- and language-agnostic. We experiment on multiple datasets and settings, showing that DARAG outperforms all our baselines, achieving 8%–30% relative WER improvements in ID and 10%–33% improvements in OOD settings.
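For readers unfamiliar with the metric, the following sketch shows how word error rate (WER) and a relative WER improvement are conventionally computed. It uses the standard word-level edit-distance definition; the example sentences and numbers are made up, and nothing here reproduces the paper's evaluation pipeline.

```python
def wer(reference, hypothesis):
    """Word error rate: word-level edit distance divided by reference length.

    Assumes a non-empty reference transcript.
    """
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))  # distances for an empty reference prefix
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)

def relative_improvement(baseline_wer, new_wer):
    """Fraction by which WER dropped relative to the baseline."""
    return (baseline_wer - new_wer) / baseline_wer

example_wer = wer("the cat sat on the mat", "the cat sat on mat")
example_gain = relative_improvement(0.20, 0.15)
```

In this illustration, moving from a WER of 0.20 to 0.15 is a 25% relative improvement, even though the absolute drop is only five points.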
pdf
bib
abs
LTRAG: Enhancing Autoformalization and Self-refinement for Logical Reasoning with Thought-Guided RAG
Ruikang Hu
|
Shaoyu Lin
|
Yeliang Xiu
|
Yongmei Liu
Logical reasoning is fundamental to intelligent systems. Large language models (LLMs) have demonstrated promise in natural language (NL) reasoning, especially with techniques like chain-of-thought (CoT) prompting. Neuro-symbolic methods like Logic-LM and LINC further enhance performance on challenging datasets FOLIO and AR-LSAT by integrating formalization with LLMs and symbolic solvers, and possibly refinement with LLMs. However, these methods still struggle with the accurate formalization of complex NL problems. In this paper, we introduce LTRAG, a framework to enhance autoformalization and self-refinement for logical reasoning with Retrieval-Augmented Generation (RAG), by building knowledge bases of thought-guided examples (https://github.com/sysulic/LTRAG). Experimental results on FOLIO and AR-LSAT show that LTRAG consistently outperforms Logic-LM and LINC across different models. On GPT-4 and AR-LSAT, it achieves an accuracy gain of 13% over Logic-LM.
pdf
bib
abs
Eta-WavLM: Efficient Speaker Identity Removal in Self-Supervised Speech Representations Using a Simple Linear Equation
Giuseppe Ruggiero
|
Matteo Testa
|
Jurgen Van De Walle
|
Luigi Di Caro
Self-supervised learning (SSL) has reduced the reliance on expensive labeling in speech technologies by learning meaningful representations from unannotated data. Since most SSL-based downstream tasks prioritize content information in speech, ideal representations should disentangle content from unwanted variations like speaker characteristics in the SSL representations. However, removing speaker information often degrades other speech components, and existing methods either fail to fully disentangle speaker identity or require resource-intensive models. In this paper, we propose a novel disentanglement method that linearly decomposes SSL representations into speaker-specific and speaker-independent components, effectively generating speaker disentangled representations. Comprehensive experiments show that our approach achieves speaker independence and as such, when applied to content-driven tasks such as voice conversion, our representations yield significant improvements over state-of-the-art methods.
pdf
bib
abs
MathCoder-VL: Bridging Vision and Code for Enhanced Multimodal Mathematical Reasoning
Ke Wang
|
Junting Pan
|
Linda Wei
|
Aojun Zhou
|
Weikang Shi
|
Zimu Lu
|
Han Xiao
|
Yunqiao Yang
|
Houxing Ren
|
Mingjie Zhan
|
Hongsheng Li
Natural language image-caption datasets, widely used for training Large Multimodal Models, mainly focus on natural scenarios and overlook the intricate details of mathematical figures that are critical for problem-solving, hindering the advancement of current LMMs in multimodal mathematical reasoning. To this end, we propose leveraging code as supervision for cross-modal alignment, since code inherently encodes all information needed to generate corresponding figures, establishing a precise connection between the two modalities. Specifically, we co-develop our image-to-code model and dataset with a model-in-the-loop approach, resulting in an image-to-code model, FigCodifier, and the ImgCode-8.6M dataset, the largest image-code dataset to date. Furthermore, we utilize FigCodifier to synthesize novel mathematical figures and then construct MM-MathInstruct-3M, a high-quality multimodal math instruction fine-tuning dataset. Finally, we present MathCoder-VL, trained with ImgCode-8.6M for cross-modal alignment and subsequently fine-tuned on MM-MathInstruct-3M for multimodal math problem solving. Our model achieves a new open-source SOTA across all six metrics. Notably, it surpasses GPT-4o and Claude 3.5 Sonnet in the geometry problem-solving subset of MathVista, achieving improvements of 8.9% and 9.2%.
pdf
bib
abs
MlingConf: A Comprehensive Study of Multilingual Confidence Estimation on Large Language Models
Boyang Xue
|
Hongru Wang
|
Rui Wang
|
Sheng Wang
|
Zezhong Wang
|
Yiming Du
|
Bin Liang
|
Wenxuan Zhang
|
Kam-Fai Wong
The tendency of Large Language Models (LLMs) to generate hallucinations raises concerns regarding their reliability. Therefore, confidence estimations indicating the extent of trustworthiness of the generations become essential. However, current LLM confidence estimations in languages other than English remain underexplored. This paper addresses this gap by introducing a comprehensive investigation of Multilingual Confidence estimation (MlingConf) on LLMs, focusing on both language-agnostic (LA) and language-specific (LS) tasks to explore the performance and language dominance effects of multilingual confidence estimations on different tasks. The benchmark comprises four meticulously checked and human-evaluated high-quality multilingual datasets for LA tasks and one for the LS task tailored to specific social, cultural, and geographical contexts of a language. Our experiments reveal that on LA tasks English exhibits more notable linguistic dominance in confidence estimations than other languages, while on LS tasks, using question-related language to prompt LLMs demonstrates better linguistic dominance in multilingual confidence estimations. The phenomena inspire a simple yet effective native-tone prompting strategy by employing language-specific prompts for LS tasks, effectively improving LLMs’ reliability and accuracy in LS scenarios.
pdf
bib
abs
COMPKE: Complex Question Answering under Knowledge Editing
Keyuan Cheng
|
Zijian Kan
|
Zhuoran Zhang
|
Muhammad Asif Ali
|
Lijie Hu
|
Di Wang
Knowledge editing, which efficiently modifies the knowledge in large language models, has gathered great attention. Current benchmarks primarily use multi-hop question answering to assess and analyze newly injected or updated knowledge. However, we argue that these benchmarks fail to effectively evaluate how well the updated models apply this knowledge in real-life scenarios, particularly when questions require complex reasoning involving one-to-many relationships or multi-step logical intersections. To fill in this gap, we introduce a new benchmark, COMPKE: Complex Question Answering under Knowledge Editing, which includes 11,924 complex questions that reflect real-life situations. We perform a comprehensive evaluation of four different knowledge editing methods in COMPKE, and our results show that the performance of these methods varies between different models. For example, MeLLo achieves an accuracy of 39.47 on GPT-4o-mini but drops significantly to 3.83 on Qwen2.5-3B. We further analyze the reasons behind these results from both methodological and model perspectives. Our dataset will be publicly available on GitHub.
pdf
bib
abs
RaaS: Reasoning-Aware Attention Sparsity for Efficient LLM Reasoning
Junhao Hu
|
Wenrui Huang
|
Weidong Wang
|
Zhenwen Li
|
Tiancheng Hu
|
Zhixia Liu
|
Xusheng Chen
|
Tao Xie
|
Yizhou Shan
Large Language Models (LLMs) have demonstrated strong capabilities across various domains, with recent advancements in challenging reasoning tasks such as mathematics and programming. However, solving reasoning tasks often requires an LLM to generate long sequences, incurring O(N) time and memory complexities per token, where N is the current sequence length. To reduce complexities, existing sparsity-based algorithms propose to retain Key-Value (KV) vectors, the intermediate representations of only the most critical tokens. However, these algorithms struggle with the “impossible trinity” of accuracy, time, and memory. For example, the state-of-the-art algorithm, Quest, achieves high accuracy with O(L) time but O(N) memory (L is the cache budget, L ≪ N). To address the “impossible trinity”, in this paper, we identify a new attention pattern during the decode stage of reasoning tasks, where milestone tokens (analogous to lemmas in mathematical proofs) emerge, are utilized, and then become unimportant afterward. Based on this pattern, we propose a new algorithm RaaS that identifies milestone tokens and retains their KV vectors until they are no longer needed, achieving high accuracy with O(L) time and O(L) memory complexities.
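The idea of keeping a token's KV vectors only while it is still being used, within a fixed budget L, can be sketched as a least-recently-attended eviction policy. This is an illustrative assumption rather than the RaaS algorithm itself: the class name, the attention threshold, and the eviction rule below are hypothetical, and the cache tracks token positions only rather than real KV tensors.

```python
class BudgetedKVCache:
    """Toy cache that retains at most `budget` token positions (O(L) memory).

    A position's timestamp is refreshed whenever it receives attention at or
    above `threshold`; the stalest position is evicted when over budget.
    """

    def __init__(self, budget, threshold):
        self.budget = budget
        self.threshold = threshold
        self.last_used = {}  # token position -> decode step it was last used

    def step(self, position, attention):
        """Process one decode step.

        `attention` maps cached positions to the attention weight they
        received from the token generated at `position`.
        """
        for pos, weight in attention.items():
            if pos in self.last_used and weight >= self.threshold:
                self.last_used[pos] = position  # still in use: refresh
        self.last_used[position] = position     # cache the new token
        if len(self.last_used) > self.budget:
            victim = min(self.last_used, key=self.last_used.get)
            del self.last_used[victim]          # evict the stalest entry
        return sorted(self.last_used)

cache = BudgetedKVCache(budget=2, threshold=0.5)
kept_0 = cache.step(0, {})
kept_1 = cache.step(1, {0: 0.9})
kept_2 = cache.step(2, {0: 0.8, 1: 0.1})
```

In this toy run, position 0 keeps receiving high attention and survives, while position 1 goes unused and is evicted once the budget of two entries is exceeded.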
pdf
bib
abs
One-for-All Pruning: A Universal Model for Customized Compression of Large Language Models
Rongguang Ye
|
Ming Tang
Existing pruning methods for large language models (LLMs) focus on achieving high compression rates while maintaining model performance. Although these methods have demonstrated satisfactory performance in handling a single user’s compression request, their processing time increases linearly with the number of requests, making them inefficient for real-world scenarios with multiple simultaneous requests. To address this limitation, we propose a Universal Model for Customized Compression (UniCuCo) for LLMs, which introduces a StratNet that learns to map arbitrary requests to their optimal pruning strategy. The challenge in training StratNet lies in the high computational cost of evaluating pruning strategies and the non-differentiable nature of the pruning process, which hinders gradient backpropagation for StratNet updates. To overcome these challenges, we leverage a Gaussian process to approximate the evaluation process. Since the gradient of the Gaussian process is computable, we can use it to approximate the gradient of the non-differentiable pruning process, thereby enabling StratNet updates. Experimental results show that UniCuCo is 28 times faster than baselines in processing 64 requests, while maintaining comparable accuracy to baselines.
pdf
bib
abs
CLaMP 3: Universal Music Information Retrieval Across Unaligned Modalities and Unseen Languages
Shangda Wu
|
Guo Zhancheng
|
Ruibin Yuan
|
Junyan Jiang
|
SeungHeon Doh
|
Gus Xia
|
Juhan Nam
|
Xiaobing Li
|
Feng Yu
|
Maosong Sun
CLaMP 3 is a unified framework developed to address challenges of cross-modal and cross-lingual generalization in music information retrieval. Using contrastive learning, it aligns all major music modalities–including sheet music, performance signals, and audio recordings–with multilingual text in a shared representation space, enabling retrieval across unaligned modalities with text as a bridge. It features a multilingual text encoder adaptable to unseen languages, exhibiting strong cross-lingual generalization. Leveraging retrieval-augmented generation, we curated M4-RAG, a web-scale dataset consisting of 2.31 million music-text pairs. This dataset is enriched with detailed metadata that represents a wide array of global musical traditions. To advance future research, we release WikiMT-X, a benchmark comprising 1,000 triplets of sheet music, audio, and richly varied text descriptions. Experiments show that CLaMP 3 achieves state-of-the-art performance on multiple MIR tasks, significantly surpassing previous strong baselines and demonstrating excellent generalization in multimodal and multilingual music contexts.
pdf
bib
abs
PFDial: A Structured Dialogue Instruction Fine-tuning Method Based on UML Flowcharts
Ming Zhang
|
Yuhui Wang
|
Yujiong Shen
|
Tingyi Yang
|
Changhao Jiang
|
Yilong Wu
|
Shihan Dou
|
Qinhao Chen
|
Zhiheng Xi
|
Zhihao Zhang
|
Yi Dong
|
Zhen Wang
|
Zhihui Fei
|
Mingyang Wan
|
Tao Liang
|
Guojun Ma
|
Qi Zhang
|
Tao Gui
|
Xuanjing Huang
Process-driven dialogue systems, which operate under strict predefined process constraints, are essential in customer service and equipment maintenance scenarios. Although Large Language Models (LLMs) have shown remarkable progress in dialogue and reasoning, they still struggle to solve these strictly constrained dialogue tasks. To address this challenge, we construct the Process Flow Dialogue (PFDial) dataset, which contains 12,705 high-quality Chinese dialogue instructions derived from 440 flowcharts containing 5,055 process nodes. Based on the PlantUML specification, each UML flowchart is converted into atomic dialogue units, i.e., structured five-tuples. Experimental results demonstrate that a 7B model trained with merely 800 samples and a 0.5B model trained on the full data can both surpass 90% accuracy. Additionally, the 8B model can surpass GPT-4o by up to 43.88%, with an average of 11.00%. We further evaluate models’ performance on challenging backward transitions in process flows and conduct an in-depth analysis of various dataset formats to reveal their impact on model performance in handling decision and sequential branches. The data is released at https://github.com/KongLongGeFDU/PFDial.
pdf
bib
abs
Listening to Patients: Detecting and Mitigating Patient Misreport in Medical Dialogue System
Lang Qin
|
Yao Zhang
|
Hongru Liang
|
Adam Jatowt
|
Zhenglu Yang
Medical Dialogue Systems (MDSs) have emerged as promising tools for automated healthcare support through patient-agent interactions. Previous efforts typically relied on an idealized assumption: patients can accurately report symptoms aligned with their actual health conditions. However, in reality, patients often misreport their symptoms, due to cognitive limitations, emotional factors, etc. Overlooking patient misreports can significantly compromise the diagnostic accuracy of MDSs. To address this critical issue, we emphasize the importance of enabling MDSs to “listen to patients” by tackling two key challenges: how to detect misreports and how to mitigate them effectively. In this work, we propose PaMis, a novel framework that can detect patient misreports based on calculating the structural entropy of the dialogue entity graph, and mitigate them through generating controlled clarifying questions. Our experimental results demonstrate that PaMis enhances the reliability of MDSs by effectively addressing patient misreports during the medical response generation process.
pdf
bib
abs
Do Language Models Understand the Cognitive Tasks Given to Them? Investigations with the N-Back Paradigm
Xiaoyang Hu
|
Richard Lewis
Cognitive tasks originally developed for humans are now increasingly used to study language models. While applying these tasks is often straightforward, interpreting their results can be challenging. In particular, when a model underperforms, it is often unclear whether this results from a limitation in the cognitive ability being tested or a failure to understand the task itself. A recent study argues that GPT 3.5’s declining performance on 2-back and 3-back tasks reflects a working memory capacity limit similar to humans (Gong et al., 2024). By analyzing a range of open-source language models of varying performance levels on these tasks, we show that the poor performance is due at least in part to a limitation in task comprehension and task set maintenance. We challenge the best-performing model with progressively harder versions of the task (up to 10-back) and experiment with alternative prompting strategies, before analyzing model attentions. Our larger aim is to contribute to the ongoing conversation around refining methodologies for the cognitive evaluation of language models.
pdf
bib
abs
Graph-guided Cross-composition Feature Disentanglement for Compositional Zero-shot Learning
Yuxia Geng
|
Runkai Zhu
|
Jiaoyan Chen
|
Jintai Chen
|
Xiang Chen
|
Zhuo Chen
|
Shuofei Qiao
|
Yuxiang Wang
|
Xiaoliang Xu
|
Sheng-Jun Huang
Disentanglement of visual features of primitives (i.e., attributes and objects) has shown exceptional results in Compositional Zero-shot Learning (CZSL). However, due to the feature divergence of an attribute (resp. object) when combined with different objects (resp. attributes), it is challenging to learn disentangled primitive features that are general across different compositions. To this end, we propose the solution of cross-composition feature disentanglement, which takes multiple primitive-sharing compositions as inputs and constrains the disentangled primitive features to be general across these compositions. More specifically, we leverage a compositional graph to define the overall primitive-sharing relationships between compositions, and build a task-specific architecture upon the recently successful large pre-trained vision-language model (VLM) CLIP, with dual cross-composition disentangling adapters (called L-Adapter and V-Adapter) inserted into CLIP’s frozen text and image encoders, respectively. Evaluation on three popular CZSL benchmarks shows that our proposed solution significantly improves the performance of CZSL, and its components have been verified by solid ablation studies. Our code and data are available at: https://github.com/zhurunkai/DCDA.
pdf
bib
abs
Training Long-Context LLMs Efficiently via Chunk-wise Optimization
Wenhao Li
|
Yuxin Zhang
|
Gen Luo
|
Daohai Yu
|
Rongrong Ji
While long-context large language models (LLMs) exhibit remarkable document processing capabilities, their prohibitively high training costs often hinder customized applications. To mitigate this issue, we propose Sequential Chunk-wise Optimization (SeCO), a memory-efficient training paradigm that partitions lengthy inputs into manageable chunks. Each chunk independently constructs its computational graph and performs localized backpropagation, ensuring that only one chunk’s forward activations are stored in memory. Building on SeCO, we further introduce Sparse Chunk-wise Optimization (SpaCO), which reduces computational overhead by selectively propagating gradients to specific chunks and incorporates a carefully designed compensation factor to ensure unbiased gradient estimation. SpaCO decouples the computational cost of backpropagation from the context length, enabling training time to gradually converge to inference time as sequences become longer. Implemented as lightweight training wrappers, both SeCO and SpaCO offer substantial practical benefits. For example, when fine-tuning an 8B model with LoRA on a single RTX 3090 GPU, SeCO expands maximum sequence length from 1K to 16K tokens, while SpaCO demonstrates accelerated training speed, running up to 3× faster than SeCO under the same experimental setup. These innovations provide new insights into optimizing long-context models, making them more accessible for practical applications. We have open-sourced the code at https://anonymous.4open.science/r/seco-CCBD.
pdf
bib
abs
Revisiting LoRA through the Lens of Parameter Redundancy: Spectral Encoding Helps
Jiashun Cheng
|
Aochuan Chen
|
Nuo Chen
|
Ziqi Gao
|
Yuhan Li
|
Jia Li
|
Fugee Tsung
Low-Rank Adaptation (LoRA) has emerged as a prominent technique for fine-tuning large foundation models. Despite its successes, the substantial parameter redundancy, which limits the capacity and efficiency of LoRA, has been recognized as a bottleneck. In this work, we systematically investigate the impact of redundancy in fine-tuning LoRA and reveal that reducing density redundancy does not degrade expressiveness. Based on this insight, we introduce Spectral-encoding Low-Rank Adaptation (SeLoRA), which harnesses the robust expressiveness of spectral bases to re-parameterize LoRA from a sparse spectral subspace. Designed with simplicity, SeLoRA enables seamless integration with various LoRA variants for performance boosting, serving as a scalable plug-and-play framework. Extensive experiments substantiate that SeLoRA achieves greater efficiency with fewer parameters, delivering superior performance enhancements over strong baselines on various downstream tasks, including commonsense reasoning, math reasoning, and code generation.
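As background for the abstract above, the standard LoRA update adds a scaled low-rank product to a frozen weight matrix, W' = W + (alpha / r) · B · A. The sketch below implements only that baseline formula with small made-up matrices; it does not implement SeLoRA's spectral re-parameterization, and the helper names are hypothetical.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def lora_weight(w, b, a, alpha):
    """Return W + (alpha / r) * B @ A, where r is the rank (rows of A)."""
    r = len(a)
    scale = alpha / r
    delta = matmul(b, a)
    return [[w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]

# Rank-1 update of a 2x2 identity weight (values are illustrative only).
w = [[1.0, 0.0], [0.0, 1.0]]
b = [[1.0], [2.0]]   # shape 2 x r
a = [[3.0, 4.0]]     # shape r x 2
adapted = lora_weight(w, b, a, alpha=1.0)
```

Only B and A are trained, so the number of trainable values is r · (d_out + d_in) rather than d_out · d_in, which is the parameter count that redundancy analyses of LoRA start from.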
pdf
bib
abs
CODEMENV: Benchmarking Large Language Models on Code Migration
Keyuan Cheng
|
Xudong Shen
|
Yihao Yang
|
Tengyue Wang
|
Yang Cao
|
Muhammad Asif Ali
|
Hanbin Wang
|
Lijie Hu
|
Di Wang
Large language models (LLMs) have demonstrated remarkable proficiency in handling a wide range of tasks within the software engineering domain, but their ability to perform code migration, i.e., adapting code to different environments, remains underexplored. In this work, we propose a novel benchmark, CODEMENV: Code Migration Across Environment, designed to evaluate LLMs’ performance in handling code migration tasks. The benchmark comprises 922 data points across 19 Python and Java packages, offering three tasks to systematically evaluate code migration: identifying version-incompatible functions, determining function changes, and adapting code to target environments. Experimental evaluation of CODEMENV across seven LLMs revealed an average pass@1 rate of 26.50%, with GPT-4o performing best at 43.84%. We highlight our key findings as follows: (i) LLMs are more familiar with newer function versions, making them better at migrating legacy code, and (ii) a logical inconsistency where LLMs sometimes identify irrelevant function changes for the target migration environment.
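The pass@1 rate reported above belongs to the pass@k family of metrics. A common way to compute it is the unbiased estimator 1 - C(n - c, k) / C(n, k), where n samples are drawn per problem and c of them pass the tests. The abstract does not state how the benchmark computes its scores, so the sketch below is an assumption about the usual definition, with made-up sample counts.

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimate for one problem.

    n: samples generated, c: samples that pass, k: evaluation budget (k <= n).
    """
    if n - c < k:
        return 1.0  # every size-k subset contains at least one passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)

def mean_pass_at_k(results, k):
    """Average pass@k over (n, c) pairs, one pair per problem."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)

# Hypothetical results for three problems, ten samples each.
score = mean_pass_at_k([(10, 3), (10, 0), (10, 10)], k=1)
```

For k = 1 the estimator reduces to c / n, the fraction of samples that pass, averaged over problems.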
pdf
bib
abs
A Case Study of Cross-Lingual Zero-Shot Generalization for Classical Languages in LLMs
V.S.D.S.Mahesh Akavarapu
|
Hrishikesh Terdalkar
|
Pramit Bhattacharyya
|
Shubhangi Agarwal
|
Dr. Vishakha Deulgaonkar
|
Chaitali Dangarikar
|
Pralay Manna
|
Arnab Bhattacharya
Large Language Models (LLMs) have demonstrated remarkable generalization capabilities across diverse tasks and languages. In this study, we focus on natural language understanding in three classical languages (Sanskrit, Ancient Greek and Latin) to investigate the factors affecting cross-lingual zero-shot generalization. First, we explore named entity recognition and machine translation into English. While LLMs perform equal to or better than fine-tuned baselines on out-of-domain data, smaller models often struggle, especially with niche or abstract entity types. In addition, we concentrate on Sanskrit by presenting a factoid question–answering (QA) dataset and show that incorporating context via a retrieval-augmented generation approach significantly boosts performance. In contrast, we observe pronounced performance drops for smaller LLMs across these QA tasks. These results suggest model scale as an important factor influencing cross-lingual generalization. Assuming that the models used, such as GPT-4o and Llama-3.1, are not instruction fine-tuned on classical languages, our findings provide insights into how LLMs may generalize on these languages and their consequent utility in classical studies.
pdf
bib
abs
BrainECHO: Semantic Brain Signal Decoding through Vector-Quantized Spectrogram Reconstruction for Whisper-Enhanced Text Generation
Jilong Li
|
Zhenxi Song
|
Jiaqi Wang
|
Meishan Zhang
|
Honghai Liu
|
Min Zhang
|
Zhiguo Zhang
Current EEG/MEG-to-text decoding systems suffer from three key limitations: (1) reliance on teacher-forcing methods, which compromises robustness during inference, (2) sensitivity to session-specific noise, hindering generalization across subjects, and (3) misalignment between brain signals and linguistic representations due to pre-trained language model over-dominance. To overcome these challenges, we propose BrainECHO (Brain signal decoding via vEctor-quantized speCtrogram reconstruction for WHisper-enhanced text generatiOn), a multi-stage framework that employs decoupled representation learning to achieve state-of-the-art performance on both EEG and MEG datasets. Specifically, BrainECHO consists of three stages: (1) Discrete autoencoding, which transforms continuous Mel spectrograms into a finite set of high-quality discrete representations for subsequent stages. (2) Frozen alignment, where brain signal embeddings are mapped to corresponding Mel spectrogram embeddings in a frozen latent space, effectively filtering session-specific noise through vector-quantized reconstruction, yielding a 3.65% improvement in BLEU-4 score. (3) Constrained decoding fine-tuning, which leverages the pre-trained Whisper model for audio-to-text translation, balancing signal adaptation with knowledge preservation, and achieving 74%-89% decoding BLEU scores without excessive reliance on teacher forcing. BrainECHO demonstrates robustness across sentence, session, and subject-independent conditions, passing Gaussian noise tests and showcasing its potential for enhancing language-based brain-computer interfaces.
pdf
bib
abs
Progressive LoRA for Multimodal Continual Instruction Tuning
Yahan Yu
|
Duzhen Zhang
|
Yong Ren
|
Xuanle Zhao
|
Xiuyi Chen
|
Chenhui Chu
Multimodal Continual Instruction Tuning (MCIT) empowers Multimodal Large Language Models (MLLMs) to adapt to ever-evolving requirements without continuous costly retraining. However, MCIT faces challenges in mitigating Catastrophic Forgetting (CF) and enhancing Knowledge Transfer (KT). Existing works combine Mixture-of-Expert (MoE) and LoRA to address these. However, using a fixed number of shared LoRA blocks across tasks can lead to the overwriting of acquired knowledge, making it harder for MLLMs to handle CF and KT. Therefore, we propose the **Prog**ressive **LoRA** framework (ProgLoRA), which contains a progressive LoRA pool and trains a new LoRA block for each incremental task to reduce knowledge interference. Specifically, ProgLoRA has two key mechanisms: task-aware allocation for effectively leveraging acquired knowledge on the current task and task recall for realigning the model with learned tasks. Additionally, considering different application scenarios, we design a static ProgLoRA for the more idealized basic setting and a dynamic ProgLoRA for the more realistic challenging setting. Experiments on the latest MCIT benchmark demonstrate that ProgLoRA outperforms existing approaches.
pdf
bib
abs
ARC ‘Challenge’ Is Not That Challenging
Łukasz Borchmann
ARC Challenge appears more difficult than ARC Easy for modern LLMs primarily due to an evaluation setup that prevents direct comparison of answer choices rather than inherent complexity. Although some researchers have quietly shifted to a more appropriate scheme over the last year, the implications of this change have yet to be widely acknowledged. We highlight this overlooked shift, show how similar evaluation practices falsely imply reasoning deficits in other benchmarks, and demonstrate that fairer methods dramatically reduce performance gaps (e.g. on SIQA) and even yield superhuman results (OpenBookQA). In doing so, we reveal how evaluation shapes perceived difficulty and offer guidelines to ensure that multiple-choice evaluations accurately reflect actual model capabilities.
pdf
bib
abs
Cross-Lingual Transfer of Debiasing and Detoxification in Multilingual LLMs: An Extensive Investigation
Vera Neplenbroek
|
Arianna Bisazza
|
Raquel Fernández
Recent generative large language models (LLMs) show remarkable performance in non-English languages, but when prompted in those languages they tend to express higher harmful social biases and toxicity levels. Prior work has shown that finetuning on specialized datasets can mitigate this behavior, and doing so in English can transfer to other languages. In this work, we investigate the impact of different finetuning methods on the model’s bias and toxicity, but also on its ability to produce fluent and diverse text. We reduce biases by finetuning on curated non-harmful text, but find only direct preference optimization to be effective for mitigating toxicity. The mitigation caused by applying these methods in English also transfers to non-English languages. We find evidence that the extent to which transfer takes place can be predicted by the amount of data in a given language present in the model’s pretraining data. However, this transfer of bias and toxicity mitigation often comes at the expense of decreased language generation ability in non-English languages, highlighting the importance of developing language-specific bias and toxicity mitigation methods.
pdf
bib
abs
Tracr-Injection: Distilling Algorithms into Pre-trained Language Models
Tomás Vergara Browne
|
Alvaro Soto
Motivated by the surge of large language models, there has been a push to formally characterize the symbolic abilities intrinsic to the transformer architecture. A programming language, called RASP, has been proposed, which can be directly compiled into transformer weights to implement these algorithms. However, the tasks that can be implemented in RASP are often uncommon to learn from natural unsupervised data, showing a mismatch between theoretical capabilities of the transformer architecture, and the practical learnability of these capabilities from unsupervised data. We propose tracr-injection, a method that allows us to distill algorithms written in RASP directly into a pre-trained language model. We showcase our method by injecting 3 different algorithms into a language model. We show how our method creates an interpretable subspace within the model’s residual stream, which can be decoded into the variables present in the code of the RASP algorithm. Additionally, we found that the proposed method can improve out-of-distribution performance compared to our baseline, indicating that indeed a more symbolic mechanism is taking place in the inner workings of the model. We release the code used to run our experiments.
pdf
bib
abs
Model Performance-Guided Evaluation Data Selection for Effective Prompt Optimization
Ximing Dong
|
Shaowei Wang
|
Dayi Lin
|
Ahmed Hassan
Optimizing Large Language Model (LLM) performance requires well-crafted prompts, but manual prompt engineering is labor-intensive and often ineffective. Automated prompt optimization techniques address this challenge, but the majority of them rely on randomly selected evaluation subsets, which fail to represent the full dataset, leading to unreliable evaluations and suboptimal prompts. Existing coreset selection methods, designed for LLM benchmarking, are unsuitable for prompt optimization due to challenges in clustering similar samples, high data collection costs, and the unavailability of performance data for new or private datasets. To overcome these issues, we propose IPOMP, an Iterative evaluation data selection approach for effective Prompt Optimization using real-time Model Performance. IPOMP is a two-stage approach that selects representative and diverse samples using semantic clustering and boundary analysis, followed by iterative refinement with real-time model performance data to replace redundant samples. Evaluations on two datasets, BIG-bench and LIAR, and two models, GPT-3.5 and GPT-4o-mini, show that IPOMP improves effectiveness by at least 1.6% to 3.1%, and stability by at least 50% to 55.5% compared with the best baseline across the studied datasets and models, with minimal computational overhead below 1%. Furthermore, the results demonstrate that our real-time performance-guided refinement approach can be universally applied to enhance existing coreset selection methods.
pdf
bib
abs
Revisiting Weak-to-Strong Generalization in Theory and Practice: Reverse KL vs. Forward KL
Wei Yao
|
Wenkai Yang
|
Ziqiao Wang
|
Yankai Lin
|
Yong Liu
As large language models advance toward superhuman performance, ensuring their alignment with human values and abilities grows increasingly complex. Weak-to-strong generalization offers a promising approach by leveraging predictions from weaker models to guide stronger systems, but its effectiveness could be constrained by the inherent noise and inaccuracies in these weak predictions. To address this, we propose a theoretically grounded approach that replaces forward KL divergence—whose mass-covering behavior risks overfitting to imperfect weak signals—with reverse KL divergence. Reverse KL divergence’s zero-forcing effect prioritizes high-confidence predictions, effectively mitigating the influence of unreliable weak supervision. Theoretically, we extend existing bounds and derive tighter lower bounds for both forward and reverse KL divergence. Notably, when a sufficiently pre-trained strong model is fine-tuned on the last linear layer, reverse KL guarantees that it outperforms its weak supervisor by the magnitude of their disagreement. Empirically, we demonstrate that reverse KL and reverse cross-entropy not only enable strong models to outperform those trained with forward KL and standard cross-entropy across most settings, but also exhibit greater robustness to noisy labels.
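The asymmetry between the two divergences contrasted in the abstract above can be seen in a small numerical sketch. It assumes the usual convention in which the weak supervisor's prediction is the target, so forward KL is KL(weak || strong) and reverse KL is KL(strong || weak); the two distributions are hypothetical values chosen for illustration, not data from the paper.

```python
import math


def kl(p, q):
    """KL(p || q) for discrete distributions; terms with p_i == 0 contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Hypothetical class probabilities for a single example.
weak = [0.5, 0.3, 0.2]      # noisy prediction from the weak supervisor
strong = [0.9, 0.05, 0.05]  # confident prediction from the strong model

# Forward KL weights the log-ratio by the weak distribution, so the strong
# model is penalized wherever the weak model places mass that it does not cover.
forward_kl = kl(weak, strong)

# Reverse KL weights the log-ratio by the strong model's own distribution,
# so classes the strong model assigns little mass to contribute little.
reverse_kl = kl(strong, weak)

print(f"forward KL(weak || strong) = {forward_kl:.4f}")
print(f"reverse KL(strong || weak) = {reverse_kl:.4f}")
```

In this toy case the forward divergence is the larger of the two, because the strong model is charged for not spreading mass over the weak model's second and third classes.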
pdf
bib
abs
Stories that (are) Move(d by) Markets: A Causal Exploration of Market Shocks and Semantic Shifts across Different Partisan Groups
Felix Drinkall
|
Stefan Zohren
|
Michael McMahon
|
Janet B. Pierrehumbert
Macroeconomic fluctuations and the narratives that shape them form a mutually reinforcing cycle: public discourse can spur behavioural changes leading to economic shifts, which then result in changes in the stories that propagate. We show that shifts in semantic embedding space can be causally linked to real-world market shocks or deviations from the expected market behaviour. Furthermore, we show how partisanship can influence the predictive power of text for market fluctuations and shape reactions to those same shocks. We also provide some evidence that text-based signals are particularly salient during rare events such as COVID-19, highlighting the value of language data as an exogenous variable in economic forecasting. Our findings underscore the bidirectional relationship between news outlets and market shocks, offering a novel empirical approach to studying their effect on each other.
pdf
bib
abs
NetSafe: Exploring the Topological Safety of Multi-agent System
Miao Yu
|
Shilong Wang
|
Guibin Zhang
|
Junyuan Mao
|
Chenlong Yin
|
Qijiong Liu
|
Kun Wang
|
Qingsong Wen
|
Yang Wang
Large language models (LLMs) have fueled significant progress in intelligent Multi-agent Systems (MAS), with expanding academic and industrial applications. However, safeguarding these systems from malicious queries receives relatively little attention, while methods for single-agent safety are challenging to transfer. In this paper, we explore MAS safety from a topological perspective, aiming at identifying structural properties that enhance security. To this end, we propose the NetSafe framework, unifying diverse MAS workflows via iterative RelCom interactions to enable generalized analysis. We identify several critical phenomena for MAS under attacks (misinformation, bias, and harmful content), termed Agent Hallucination, Aggregation Safety and Security Bottleneck. Furthermore, we verify that highly connected and larger systems are more vulnerable to adversarial spread, with task performance in a Star Graph Topology decreasing by 29.7%. In conclusion, our work introduces a new perspective on MAS safety and discovers unreported phenomena, offering insights and posing challenges to the community.
pdf
bib
abs
Reasoning is All You Need for Video Generalization: A Counterfactual Benchmark with Sub-question Evaluation
Qiji Zhou
|
YiFan Gong
|
Guangsheng Bao
|
Hongjie Qiu
|
Jinqiang Li
|
Xiangrong Zhu
|
Huajian Zhang
|
Yue Zhang
Counterfactual reasoning is crucial for robust video understanding but remains underexplored in existing multimodal benchmarks. In this paper, we introduce **COVER** (**CO**unterfactual **V**id**E**o **R**easoning), a multidimensional multimodal benchmark that systematically evaluates MLLMs across the abstract-concrete and perception-cognition dimensions. Beyond prior multimodal benchmarks, COVER decomposes complex queries into structured sub-questions, enabling fine-grained reasoning analysis. Experiments on commercial and open-source models reveal a strong correlation between sub-question accuracy and counterfactual reasoning performance, highlighting the role of structured inference in video understanding. Furthermore, our results suggest a key insight: enhancing the reasoning capability of models is essential for improving the robustness of video understanding. COVER establishes a new standard for assessing MLLMs’ logical reasoning abilities in dynamic environments. Our work is available at https://github.com/gongyifan-hash/COVER-Benchmark.
pdf
bib
abs
Initializing and Retrofitting Key-Value Adaptors for Traceable Model Editing
Hanlun Zhu
|
Yunshi Lan
|
Xiang Li
|
Weining Qian
As insight into knowledge storage in language models deepens, the ability to perform CRUD (Create, Read, Update, Delete) operations on language models becomes increasingly indispensable for satisfying the demands of managing rapidly updating knowledge. Considering the high cost of fine-tuning language models, model editing methods with low cost are usually required to manipulate models’ knowledge. The evidence suggests that the modules carrying knowledge in a Transformer are primarily the MLP blocks; we therefore propose iReVa, a method that explicitly initializes and retrofits key-value pairs into MLP blocks to construct a new mapping of a piece of knowledge without damaging the irrelevant knowledge. In comparison to existing methods, iReVa reveals better interpretability and a stronger capacity for carrying traceable edits. Experimental results on a series of GPT models show strong performance on edit success and generalization without influencing specificity. We also made the first attempt to conduct a knowledge withdrawal test of iReVa. Our code is available at https://github.com/timberflow/iReVa.
pdf
bib
abs
Know the Unknown: An Uncertainty-Sensitive Method for LLM Instruction Tuning
Jiaqi Li
|
Yixuan Tang
|
Yi Yang
Large language models (LLMs) demonstrate remarkable capabilities but face challenges from hallucinations, which typically arise from insufficient knowledge or context. While instructing LLMs to acknowledge knowledge limitations by responding with “I don’t know” appears promising, we find that models consistently struggle with admitting knowledge gaps. This challenge may originate from current instruction datasets that emphasise answer generation over knowledge boundary awareness. To address this limitation, we introduce **U**ncertainty-and-**S**ensitivity-Aware Tuning **(US-Tuning)**, a novel two-stage approach for contextual question answering (QA). The first stage enhances LLMs’ ability to recognise their knowledge boundaries, while the second stage reinforces instruction adherence through carefully designed causal prompts. Our experimental results demonstrate that US-Tuning not only significantly reduces incorrect answers in contextual QA but also improves models’ faithfulness to their parametric knowledge, mitigating hallucinations in general QA tasks. Our fine-tuned Llama2-7B model achieves up to a 34.7% improvement in handling out-of-knowledge questions and outperforms GPT-4 by 4.2% in overall performance.
pdf
bib
abs
Position-Aware Depth Decay Decoding (D3): Boosting Large Language Model Inference Efficiency
Siqi Fan
|
Xuezhi Fang
|
Xingrun Xing
|
Peng Han
|
Shuo Shang
|
Yequan Wang
Due to the large number of parameters, the inference phase of Large Language Models (LLMs) is resource-intensive. Unlike traditional model compression, which needs retraining, recent dynamic computation methods show that not all components are required for inference, enabling a training-free pipeline. In this paper, we focus on the dynamic depth of LLM generation. A token-position aware layer skipping framework is proposed to save 1.5x operations efficiently while maintaining performance. We first observed that tokens predicted later have lower perplexity and thus require less computation. Then, we propose a training-free algorithm called Position-Aware Depth Decay Decoding (D3), which leverages a power-law decay function, $\left\lfloor L \times \alpha^{i} \right\rfloor$, to determine the number of layers to retain when generating token $T_i$. Remarkably, without any retraining, D3 achieves success across a wide range of generation tasks for the first time. Experiments on large language models (Llama) with 7 ∼ 70 billion parameters show that D3 can achieve an average 1.5x speedup compared with the full-inference pipeline while maintaining comparable performance with nearly no performance drop (<1%) on the GSM8K and BBH benchmarks.
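The depth schedule described in the abstract above can be sketched in a few lines. This is an illustration only: it reads the decay function as the floor of L times alpha raised to the token position i, counts positions from 0, and uses hypothetical values for the layer count and alpha, none of which the abstract specifies.

```python
import math


def layers_to_keep(total_layers: int, alpha: float, position: int) -> int:
    """Number of layers retained for the token at `position`.

    total_layers: L, the depth of the full model
    alpha: decay factor, assumed to lie in (0, 1]
    position: index i of the generated token (assumed to start at 0)
    """
    return math.floor(total_layers * alpha ** position)


# Hypothetical settings: a 32-layer model and a slow decay.
L, ALPHA = 32, 0.999
schedule = [layers_to_keep(L, ALPHA, i) for i in range(0, 501, 100)]
print(schedule)
```

Under this reading the first token always uses the full depth, and the retained depth never increases as generation proceeds, which matches the observation that later tokens need less computation.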
pdf
bib
abs
Explaining Puzzle Solutions in Natural Language: An Exploratory Study on 6x6 Sudoku
Anirudh Maiya
|
Razan Alghamdi
|
Maria Leonor Pacheco
|
Ashutosh Trivedi
|
Fabio Somenzi
The success of Large Language Models (LLMs) in human-AI collaborative decision-making hinges on their ability to provide trustworthy, gradual, and tailored explanations. Solving complex puzzles, such as Sudoku, offers a canonical example of this collaboration, where clear and customized explanations often hold greater importance than the final solution. In this study, we evaluate the performance of five LLMs in solving and explaining 6x6 Sudoku puzzles. While one LLM demonstrates limited success in solving puzzles, none can explain the solution process in a manner that reflects strategic reasoning or intuitive problem-solving. These findings underscore significant challenges that must be addressed before LLMs can become effective partners in human-AI collaborative decision-making.
pdf
bib
abs
Stress-testing Machine Generated Text Detection: Shifting Language Models Writing Style to Fool Detectors
Andrea Pedrotti
|
Michele Papucci
|
Cristiano Ciaccio
|
Alessio Miaschi
|
Giovanni Puccetti
|
Felice Dell’Orletta
|
Andrea Esuli
Recent advancements in Generative AI and Large Language Models (LLMs) have enabled the creation of highly realistic synthetic content, raising concerns about the potential for malicious use, such as misinformation and manipulation. Moreover, detecting Machine-Generated Text (MGT) remains challenging due to the lack of robust benchmarks that assess generalization to real-world scenarios. In this work, we evaluate the resilience of state-of-the-art MGT detectors (e.g., Mage, Radar, LLM-DetectAIve) to linguistically informed adversarial attacks. We develop a pipeline that fine-tunes language models using Direct Preference Optimization (DPO) to shift the MGT style toward human-written text (HWT), obtaining generations more challenging to detect by current models. Additionally, we analyze the linguistic shifts induced by the alignment and how detectors rely on “linguistic shortcuts” to detect texts. Our results show that detectors can be easily fooled with relatively few examples, resulting in a significant drop in detecting performances. This highlights the importance of improving detection methods and making them robust to unseen in-domain texts. We release code, models, and data to support future research on more robust MGT detection benchmarks.
pdf
bib
abs
InfiniSST: Simultaneous Translation of Unbounded Speech with Large Language Model
Siqi Ouyang
|
Xi Xu
|
Lei Li
Simultaneous translation of unbounded streaming speech remains a challenging problem due to the need for effectively processing the historical speech context and past translations so that quality and latency, including computation overhead, can be balanced. Most prior works assume pre-segmented speech, limiting their real-world applicability. In this paper, we propose InfiniSST, a novel approach that formulates SST as a multi-turn dialogue task, enabling seamless translation of unbounded speech. We construct translation trajectories and robust segments from MuST-C with multi-latency augmentation during training and develop a key-value (KV) cache management strategy to facilitate efficient inference. Experiments on MuST-C En-Es, En-De, and En-Zh demonstrate that InfiniSST reduces computation-aware latency by 0.5 to 1 second while maintaining the same translation quality compared to baselines. Ablation studies further validate the contributions of our data construction and cache management strategy. Code is released at https://github.com/LeiLiLab/InfiniSST.
pdf
bib
abs
VSCBench: Bridging the Gap in Vision-Language Model Safety Calibration
Jiahui Geng
|
Qing Li
|
Zongxiong Chen
|
Yuxia Wang
|
Derui Zhu
|
Zhuohan Xie
|
Chenyang Lyu
|
Xiuying Chen
|
Preslav Nakov
|
Fakhri Karray
The rapid advancement of vision-language models (VLMs) has brought a lot of attention to their safety alignment. However, existing methods have primarily focused on model undersafety, where the model responds to hazardous queries, while neglecting oversafety, where the model refuses to answer safe queries. In this paper, we introduce the concept of safety calibration, which systematically addresses both undersafety and oversafety. Specifically, we present VSCBench, a novel dataset of 3,600 image-text pairs that are visually or textually similar but differ in terms of safety, which is designed to evaluate safety calibration across image-centric and text-centric scenarios. Based on our benchmark, we evaluate safety calibration across eleven widely used VLMs. Our extensive experiments revealed major issues with both undersafety and oversafety. We further investigated four approaches to improve the model’s safety calibration. We found that even though some methods effectively calibrated the models’ safety problems, these methods also lead to the degradation of models’ utility. This trade-off underscores the urgent need for advanced calibration methods, and our benchmark provides a valuable tool for evaluating future approaches.
pdf
bib
abs
To Code or not to Code? Adaptive Tool Integration for Math Language Models via Expectation-Maximization
Haozhe Wang
|
Long Li
|
Chao Qu
|
Weidi Xu
|
Fengming Zhu
|
Wei Chu
|
Fangzhen Lin
Recent advances in mathematical problem-solving with language models (LMs) integrate chain-of-thought (CoT) reasoning and code execution to harness their complementary strengths. However, existing hybrid frameworks exhibit a critical limitation: they depend on externally dictated instructions or rigid code-integration templates, lacking metacognitive awareness, the capacity to dynamically evaluate intrinsic capabilities and autonomously determine when and how to integrate tools. This rigidity motivates our study of autonomous code integration, enabling models to adapt tool-usage strategies as their reasoning abilities evolve during training. While reinforcement learning (RL) shows promise for boosting LLM reasoning at scale (e.g., DeepSeek-R1), we demonstrate its inefficiency in learning autonomous code integration due to inadequate exploration of the vast combinatorial space of CoT-code interleaving patterns. To address this challenge, we propose a novel Expectation-Maximization (EM) framework that synergizes structured exploration (E-step) with off-policy RL optimization (M-step), creating a self-reinforcing cycle between metacognitive tool-use decisions and evolving capabilities. Experiments reveal our method achieves superior results through improved exploration. Notably, our 7B model improves over 11% on MATH500 and 9.4% on AIME without o1-like CoT.
pdf
bib
abs
GOODLIAR: A Reinforcement Learning-Based Deceptive Agent for Disrupting LLM Beliefs on Foundational Principles
Soo Kyung Kim
|
Hyunsoo Cho
Large Language Models (LLMs) often succumb to adversarial prompts, a phenomenon popularly known as “jailbreaking.” While jailbreaking primarily targets short-term noncompliance with predefined policies, we argue that a deeper vulnerability lies in altering an LLM’s fundamental axiomatic beliefs, such as mathematical or philosophical truths. In this work, we introduce GoodLiar, a reinforcement learning (RL)-based framework that generates deceptive contexts to systematically rewrite an LLM’s core logical or philosophical understandings. By incentivizing an RL agent to produce persuasive and coherent arguments, GoodLiar aims to induce persistent belief shifts, rather than merely influencing immediate judgments of factual truthfulness. Our approach introduces DA-ILQL, a novel offline RL method that extends ILQL by integrating on-policy data and language exploration to enhance the language discovery and optimization. Through extensive evaluations on multiple LLMs, we show that deceptive contexts discovered by GoodLiar consistently outperform simple multi-turn prompting methods.
pdf
bib
abs
How Does Response Length Affect Long-Form Factuality
James Xu Zhao
|
Jimmy Z.j. Liu
|
Bryan Hooi
|
See-Kiong Ng
Large language models (LLMs) are widely used for long-form text generation. However, factual errors in the responses would undermine their reliability. Despite growing attention to LLM factuality, the effect of response length on factuality remains underexplored. In this work, we systematically investigate this relationship by first introducing an automatic and bi-level long-form factuality evaluation framework, which achieves high agreement with human annotations while being cost-effective. Using this framework, we conduct controlled experiments and find that longer responses exhibit lower factual precision, confirming the presence of length bias. To explain this phenomenon, we empirically examine three hypotheses: error propagation, long context, and facts exhaustion. Our results reveal that facts exhaustion, where the model gradually exhausts more reliable knowledge, is the primary cause of factual degradation, rather than the other two hypotheses.
pdf
bib
abs
Scaling LLMs’ Social Reasoning: Sprinkle Cognitive “Aha Moment” into Fundamental Long-thought Logical Capabilities
Guiyang Hou
|
Wenqi Zhang
|
Zhe Zheng
|
Yongliang Shen
|
Weiming Lu
Humans continually engage in reasoning about others’ mental states, a capability known as Theory of Mind (ToM) that is essential for social interactions. While this social reasoning capability emerges naturally in human cognitive development, how has the social reasoning capability of Large Language Models (LLMs) evolved during their development process? Various datasets have been proposed to assess LLMs’ social reasoning capabilities, but each is designed with a distinct focus, and none have explored how models’ social reasoning capabilities evolve during model size scaling or reasoning tokens scaling. In light of this, we optimize the evaluation of LLMs’ social reasoning from both data and model perspectives, constructing progressively difficult levels of social reasoning data and systematically exploring how LLMs’ social reasoning capabilities evolve. Furthermore, through an in-depth analysis of DeepSeek-R1’s reasoning trajectories, we identify notable cognitive “Aha Moments” and the reasons for its reasoning errors. Experiments reveal that long-thought logical capabilities and cognitive thinking are key to scaling LLMs’ social reasoning capabilities. By equipping the Qwen2.5-32B-Instruct model with long-thought logical capabilities and cognitive thinking, we achieve an improvement of 19.0 points, attaining social reasoning performance comparable to the o1-preview model.
pdf
bib
abs
SimGRAG: Leveraging Similar Subgraphs for Knowledge Graphs Driven Retrieval-Augmented Generation
Yuzheng Cai
|
Zhenyue Guo
|
YiWen Pei
|
WanRui Bian
|
Weiguo Zheng
Recent advancements in large language models (LLMs) have shown impressive versatility across various tasks. To eliminate their hallucinations, retrieval-augmented generation (RAG) has emerged as a powerful approach, leveraging external knowledge sources like knowledge graphs (KGs). In this paper, we study the task of KG-driven RAG and propose a novel Similar Graph Enhanced Retrieval-Augmented Generation (SimGRAG) method. It effectively addresses the challenge of aligning query texts and KG structures through a two-stage process: (1) query-to-pattern, which uses an LLM to transform queries into a desired graph pattern, and (2) pattern-to-subgraph, which quantifies the alignment between the pattern and candidate subgraphs using a graph semantic distance (GSD) metric. We also develop an optimized retrieval algorithm that efficiently identifies the top-k subgraphs within 1 second on a 10-million-scale KG. Extensive experiments show that SimGRAG outperforms state-of-the-art KG-driven RAG methods in both question answering and fact verification. Our code is available at https://github.com/YZ-Cai/SimGRAG.
pdf
bib
abs
RuleEdit: Towards Rule-Level Knowledge Generalization to Mitigate Over-Editing in Large Language Models
Bihan Zhou
|
HaoPeng Ren
|
Li Yuan
|
Yi Cai
|
Liuwen Cao
|
Zikun Deng
Knowledge editing emerges as a promising approach for updating target knowledge in Large Language Models (LLMs) in a timely manner, thereby preventing undesirable behaviors stemming from outdated, inaccurate, or incomplete knowledge. However, existing methods mainly focus on instance-level editing, which is prone to over-editing risk featuring knowledge degradation and general ability deterioration, due to redundant instance-specific modifications for knowledge. To mitigate the over-editing risk, we explore the rule-level editing problem that avoids case-by-case modification by generalizing rule-level knowledge to update rule-derived instances. We further construct a benchmark called RuleEdit for systematic evaluation on rule-level editing. Moreover, we propose a Rule-Transfer Editing (RTE) method to facilitate effective updates and generalizations of rule-level knowledge in LLMs. Experimental results highlight our significant improvements, with the enhancements of 28.1% in portability and 8.1% in average performance over the best-performing baselines for LLaMA-2-7B on RULEmix.
pdf
bib
abs
Eliciting In-context Retrieval and Reasoning for Long-context Large Language Models
Yifu Qiu
|
Varun R. Embar
|
Yizhe Zhang
|
Navdeep Jaitly
|
Shay B Cohen
|
Benjamin Han
Recent advancements in long-context language models (LCLMs) promise to transform Retrieval-Augmented Generation (RAG) by simplifying pipelines. With their expanded context windows, LCLMs can process entire knowledge bases and perform retrieval and reasoning directly – a capability we define as In-Context Retrieval and Reasoning (ICR^2). However, existing benchmarks like LOFT often overestimate LCLM performance by providing overly simplified contexts. To address this, we introduce ICR^2, a benchmark that evaluates LCLMs in more realistic scenarios by including confounding passages retrieved with strong retrievers. We then propose three methods to enhance LCLM performance: (1) retrieve-then-generate fine-tuning, (2) retrieval-attention-probing, which uses attention heads to filter and de-noise long contexts during decoding, and (3) joint retrieval head training alongside the generation head. Our evaluation of five well-known LCLMs on LOFT and ICR^2 demonstrates significant gains with our best approach applied to Mistral-7B: +17 and +15 points by Exact Match on LOFT, and +13 and +2 points on ICR^2, compared to vanilla RAG and supervised fine-tuning, respectively. It even outperforms GPT-4-Turbo on most tasks despite being a much smaller model.
pdf
bib
abs
GeAR: Generation Augmented Retrieval
Haoyu Liu
|
Shaohan Huang
|
Jianfeng Liu
|
Yuefeng Zhan
|
Hao Sun
|
Weiwei Deng
|
Feng Sun
|
Furu Wei
|
Qi Zhang
Document retrieval techniques are essential for developing large-scale information systems. The common approach involves using a bi-encoder to compute the semantic similarity between a query and documents. However, the scalar similarity often fails to reflect enough information, hindering the interpretation of retrieval results. In addition, this process primarily focuses on global semantics, overlooking the finer-grained semantic relationships between the query and the document’s content. In this paper, we introduce a novel method, Generation Augmented Retrieval (GeAR), which not only improves the global document-query similarity through contrastive learning, but also integrates well-designed fusion and decoding modules. This enables GeAR to generate relevant context within the documents based on a given query, facilitating learning to retrieve local fine-grained information. Furthermore, when used as a retriever, GeAR does not incur any additional computational cost over bi-encoders. GeAR exhibits competitive retrieval performance across diverse scenarios and tasks. Moreover, qualitative analysis and the results generated by GeAR provide novel insights into the interpretation of retrieval results. The code, data, and models will be released at https://github.com/microsoft/LMOps.
pdf
bib
abs
A Unified Taxonomy-Guided Instruction Tuning Framework for Entity Set Expansion and Taxonomy Expansion
Yanzhen Shen
|
Yu Zhang
|
Yunyi Zhang
|
Jiawei Han
Entity set expansion, taxonomy expansion, and seed-guided taxonomy construction are three representative tasks that can be applied to automatically populate an existing taxonomy with emerging concepts. Previous studies view them as three separate tasks. Therefore, their proposed techniques usually work for one specific task only, lacking generalizability and a holistic perspective. In this paper, we aim at a unified solution to the three tasks. To be specific, we identify two common skills needed for entity set expansion, taxonomy expansion, and seed-guided taxonomy construction: finding “siblings” and finding “parents”. We propose a taxonomy-guided instruction tuning framework to teach a large language model to generate siblings and parents for query entities, where the joint pre-training process facilitates the mutual enhancement of the two skills. Extensive experiments on multiple benchmark datasets demonstrate the efficacy of our proposed TaxoInstruct framework, which outperforms task-specific baselines across all three tasks.
pdf
bib
abs
Zero-Shot Conversational Stance Detection: Dataset and Approaches
Yuzhe Ding
|
Kang He
|
Bobo Li
|
Li Zheng
|
Haijun He
|
Fei Li
|
Chong Teng
|
Donghong Ji
Stance detection, which aims to identify public opinion towards specific targets using social media data, is an important yet challenging task. With the increasing number of online debates among social media users, conversational stance detection has become a crucial research area. However, existing conversational stance detection datasets are restricted to a limited set of specific targets, which constrains the effectiveness of stance detection models when encountering a large number of unseen targets in real-world applications. To bridge this gap, we manually curate a large-scale, high-quality zero-shot conversational stance detection dataset, named ZS-CSD, comprising 280 targets across two distinct target types. Leveraging the ZS-CSD dataset, we propose SITPCL, a speaker interaction and target-aware prototypical contrastive learning model, and establish the benchmark performance in the zero-shot setting. Experimental results demonstrate that our proposed SITPCL model achieves state-of-the-art performance in zero-shot conversational stance detection. Notably, the SITPCL model attains only an F1-macro score of 43.81%, highlighting the persistent challenges in zero-shot conversational stance detection.
pdf
bib
abs
LongFaith: Enhancing Long-Context Reasoning in LLMs with Faithful Synthetic Data
Cehao Yang
|
Xueyuan Lin
|
Chengjin Xu
|
Xuhui Jiang
|
Shengjie Ma
|
Aofan Liu
|
Hui Xiong
|
Jian Guo
Despite the growing development of long-context large language models (LLMs), data-centric approaches relying on synthetic data have been hindered by issues related to faithfulness, which limit their effectiveness in enhancing model performance on tasks such as long-context reasoning and question answering (QA). These challenges are often exacerbated by misinformation caused by lack of verification, reasoning without attribution, and potential knowledge conflicts. We propose LongFaith, a novel pipeline for synthesizing faithful long-context reasoning instruction datasets. By integrating ground truth and citation-based reasoning prompts, we eliminate distractions and improve the accuracy of reasoning chains, thus mitigating the need for costly verification processes. We open-source two synthesized datasets—LongFaith-SFT and LongFaith-PO—which systematically address multiple dimensions of faithfulness, including verified reasoning, attribution, and contextual grounding. Extensive experiments on multi-hop reasoning datasets and LongBench demonstrate that models fine-tuned on these datasets significantly improve performance. Our ablation studies highlight the scalability and adaptability of the LongFaith pipeline, showcasing its broad applicability in developing long-context LLMs.
pdf
bib
abs
SYNTHVERIFY: Enhancing Zero-Shot Claim Verification through Step-by-Step Synthetic Data Generation
Rongwen Zhao
|
Jeffrey Flanigan
Claim verification is a fundamental task in natural language processing (NLP), involving the assessment of whether available evidence supports or refutes a given claim. While large language models (LLMs) have shown promise in this area, they continue to struggle with domain-specific knowledge. Synthetic data generation has emerged as an effective solution to this challenge. However, existing methods are often either inefficient to scale across multiple domains or overly reliant on external documents. We introduce SYNTHVERIFY, a novel step-by-step prompting-based synthetic data generation framework designed to enhance zero-shot claim verification. Our core insight is that guiding generation with domain-specific claim patterns and structured evidence plans can bridge LLMs’ knowledge gaps in specialized domains without requiring access to external corpora or sacrificing generalizability. Using SYNTHVERIFY, we construct a diverse synthetic dataset for zero-shot verification, enabling instruction fine-tuning tailored to the verification task. Empirical results across multiple specialized domains demonstrate significant accuracy improvements, including a 20.1-point gain on the Llama-3-8B model. Our results highlight the effectiveness of structured synthetic data generation in addressing the limitations of verification systems, particularly in domain-specific tasks.
pdf
bib
abs
Domaino1s: Guiding LLM Reasoning for Explainable Answers in High-Stakes Domains
Xu Chu
|
Zhijie Tan
|
Hanlin Xue
|
Guanyu Wang
|
Tong Mo
|
Weiping Li
Large Language Models (LLMs) are widely applied to downstream domains. However, current LLMs for high-stakes domain tasks, such as financial investment and legal QA, typically generate brief answers without reasoning processes and explanations. This limits users’ confidence in making decisions based on their responses. While original CoT shows promise, it lacks self-correction mechanisms during reasoning. This work introduces Domaino1s, which enhances LLMs’ reasoning capabilities on domain tasks through supervised fine-tuning and tree search. We construct CoT-stock-2k and CoT-legal-2k datasets for fine-tuning models that activate domain-specific reasoning steps based on their judgment. Additionally, we propose Selective Tree Exploration to spontaneously explore solution spaces and sample optimal reasoning paths to improve performance. We also introduce PROOF-Score, a new metric for evaluating domain models’ explainability, complementing traditional accuracy metrics with richer assessment dimensions. Extensive experiments on stock investment recommendation and legal reasoning QA tasks demonstrate Domaino1s’s leading performance and explainability. Our code is available at https://anonymous.4open.science/r/Domaino1s-006F/.
pdf
bib
abs
Dynamic Prefix as Instructor for Incremental Named Entity Recognition: A Unified Seq2Seq Generation Framework
Zihao Wu
|
YongXiang Hua
|
Yongxin Zhu
|
Fang Zhang
|
Linli Xu
The Incremental Named Entity Recognition (INER) task aims to update a model to extract entities from an expanding set of entity type candidates due to concerns related to data privacy and scarcity. However, conventional sequence labeling approaches to INER often suffer from the catastrophic forgetting problem, which leads to the degradation of the model’s performance on previously encountered entity types. In this paper, we formalize INER as a unified seq2seq generation task and propose a parameter-efficient dynamic prefix method. By employing the dynamic prefix as a task instructor to guide the generative model, our approach can preserve task-invariant knowledge while adapting to new entities with minimal parameter updates, making it particularly effective in low-resource scenarios. Additionally, we introduce a generative label augmentation strategy with dual optimization objectives including a self-entropy loss and a task-aware similarity loss to enable optimal balance between stability and plasticity. Empirical experiments on NER benchmarks demonstrate the effectiveness of our proposed method in addressing the challenges associated with INER.
pdf
bib
abs
Who Taught You That? Tracing Teachers in Model Distillation
Somin Wadhwa
|
Chantal Shaib
|
Silvio Amir
|
Byron C Wallace
Model distillation – using outputs from a large teacher model to teach a small student model – is a practical means of creating efficient models for a particular task. We ask: Can we identify a student’s teacher based on its outputs? Such “footprints” left by teacher LLMs would be interesting artifacts. Beyond this, reliable teacher inference may have practical implications as actors seek to distill specific capabilities of massive proprietary LLMs into deployed smaller LMs, potentially violating terms of service. We consider practical task distillation targets including summarization, question answering, and instruction-following. We assume a finite set of candidate teacher models, which we treat as black boxes. We design discriminative models that operate over lexical features. We find that n-gram similarity alone is unreliable for identifying teachers, but part-of-speech (PoS) templates preferred by student models mimic those of their teachers.
pdf
bib
abs
D-GEN: Automatic Distractor Generation and Evaluation for Reliable Assessment of Generative Models
Grace Byun
|
Jinho D. Choi
Evaluating generative models with open-ended generation is challenging due to inconsistencies in response formats. Multiple-choice (MC) evaluation mitigates this issue, but generating high-quality distractors is time-consuming and labor-intensive. We introduce D-GEN, the first open-source distractor generator model that transforms open-ended data into an MC format. To evaluate distractor quality, we propose two novel methods: 1) ranking alignment, ensuring generated distractors retain the discriminatory power of ground-truth distractors, and 2) entropy analysis, comparing model confidence distributions. Our results show that D-GEN preserves ranking consistency (Spearman’s 𝜌 0.99, Kendall’s 𝜏 0.94) and closely matches the entropy distribution of ground-truth distractors. Human evaluation further confirms the fluency, coherence, distractiveness, and incorrectness. Our work advances robust and efficient distractor generation with automated evaluation, setting a new standard for MC evaluation.
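The two distractor-quality checks named in the abstract above (ranking alignment and entropy analysis) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the accuracy values, and the no-ties assumption are all hypothetical, chosen only to show how a rank correlation between two sets of model scores and the entropy of a confidence distribution are computed.

```python
import math


def ranks(scores):
    """Assign rank 1 to the highest score; assumes no tied scores."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    result = [0] * len(scores)
    for position, idx in enumerate(order, start=1):
        result[idx] = position
    return result


def spearman_rho(a, b):
    """Spearman rank correlation for two equal-length lists without ties."""
    n = len(a)
    ra, rb = ranks(a), ranks(b)
    d2 = sum((x - y) ** 2 for x, y in zip(ra, rb))
    return 1 - 6 * d2 / (n * (n * n - 1))


def entropy(probs):
    """Shannon entropy (in bits) of a probability distribution over options."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


# Hypothetical accuracies of four models on the same questions, once with
# ground-truth distractors and once with generated distractors.
acc_ground_truth = [0.81, 0.74, 0.62, 0.55]
acc_generated = [0.79, 0.70, 0.64, 0.50]

# Both lists order the four models identically, so the rankings agree fully.
rho = spearman_rho(acc_ground_truth, acc_generated)

# A model that is maximally unsure among four options has 2 bits of entropy.
uniform_entropy = entropy([0.25, 0.25, 0.25, 0.25])
```

A high rank correlation indicates that the generated distractors separate strong and weak models in the same order as the original ones, while comparing entropies indicates whether they leave models similarly uncertain.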
pdf
bib
abs
HammerBench: Fine-Grained Function-Calling Evaluation in Real Mobile Assistant Scenarios
Jun Wang
|
Jiamu Zhou
|
Xihuai Wang
|
Xiaoyun Mo
|
Haoyu Zhang
|
Qiqiang Lin
|
Jincheng Jincheng
|
Muning Wen
|
Weinan Zhang
|
Qiuying Peng
|
Jun Wang
Evaluating the performance of LLMs in multi-turn human-agent interactions presents significant challenges, particularly due to the complexity and variability of user behavior. In this paper, we introduce HammerBench, a novel benchmark framework for assessing LLMs’ function-calling capabilities in real-world, multi-turn dialogues. HammerBench simulates diverse mobile assistant use cases, incorporating imperfect instructions, dynamic question-answer trajectories, intent and argument shifts, and the indirect use of external information through pronouns. To construct this benchmark, we curate a comprehensive dataset derived from popular mobile app functionalities and anonymized user logs, complemented by a cost-effective data generation pipeline leveraging open-source models. HammerBench is further augmented with fine-grained interaction snapshots and metrics, enabling detailed evaluation of function-calling performance across individual conversational turns. We demonstrate the effectiveness of HammerBench by evaluating several leading LLMs and uncovering key performance trends. Our experiments reveal that different types of parameter name errors are a significant source of failure across different interaction scenarios, highlighting critical areas for further improvement in LLM robustness for mobile assistant applications.
pdf
bib
abs
Beyond In-Context Learning: Aligning Long-form Generation of Large Language Models via Task-Inherent Attribute Guidelines
Do Xuan Long
|
Duong Ngoc Yen
|
Do Xuan Trong
|
Anh Tuan Luu
|
Kenji Kawaguchi
|
Shafiq Joty
|
Min-Yen Kan
|
Nancy F. Chen
In-context learning (ICL) is an important yet not fully understood ability of pre-trained large language models (LLMs). It can greatly enhance task performance using a few examples, termed demonstrations, without fine-tuning. Although effective in question answering, ICL often underperforms in long-form generation tasks such as summarization. Under appropriately realistic assumptions, we empirically and theoretically show that ICL demonstrations alone are insufficient to teach LLMs the task’s language and format distributions for generation. We argue for explicit exposure to the task distributions and hypothesize that defining them by prompting enhances model performance. To this end, we present LongGuide, which efficiently generates two parallel streams of guidelines capturing task language and format properties: (i) Metric Guidelines (MGs) that instruct models to optimize self-evaluated metrics; and (ii) Output Constraint Guidelines (OCGs) that constrain generation at both token and sentence levels. LongGuide automatically selects the best combination of guidelines, improving both strong open- and closed-source LLMs by over 5% in both zero- and few-shot settings. We show that LongGuide is generalizable, learnable by weak models to enhance strong ones, and integrates synergistically with automatic prompt optimizers.
pdf
bib
abs
GRAMMAR-LLM: Grammar-Constrained Natural Language Generation
Gabriele Tuccio
|
Luana Bulla
|
Maria Madonia
|
Aldo Gangemi
|
Misael Mongiovì
Large Language Models have achieved impressive performance across various natural language generation tasks. However, their lack of a reliable control mechanism limits their effectiveness in applications that require strict adherence to predefined taxonomies, syntactic structures, or domain-specific rules. Existing approaches, such as fine-tuning and prompting, remain insufficient to ensure compliance with these requirements, particularly in low-resource scenarios and structured text generation tasks. To address these limitations, we introduce GRAMMAR-LLM, a novel framework that integrates formal grammatical constraints into the LLM decoding process. GRAMMAR-LLM enforces syntactic correctness in linear time while maintaining expressiveness in grammar rule definition. To achieve this, we define a class of grammars, called LL(prefix), which we show to be equivalent to LL(1) and which is specifically designed for use with LLMs. These grammars are expressive enough to support common tasks such as hierarchical classification, vocabulary restriction, and structured parsing. We formally prove that LL(prefix) grammars can be transformed into LL(1) grammars in linear time, ensuring efficient processing via deterministic pushdown automata. We evaluate GRAMMAR-LLM across diverse NLP tasks, including hierarchical classification, sign language translation, and semantic parsing. Our experiments, conducted on models such as LLaMA 3 (for classification and translation) and AMRBART (for parsing), demonstrate that GRAMMAR-LLM consistently improves task performance across zero-shot, few-shot, and fine-tuned settings.
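The general idea of constraining decoding to a predefined taxonomy can be sketched as follows. This is a generic token-masking illustration over a prefix trie, not the LL(prefix) construction or the pushdown-automaton machinery from the paper; the label set, the `toy_scores` function standing in for a language model, and all names are hypothetical.

```python
def build_trie(sequences):
    """Build a prefix trie of allowed token sequences, ending each with <eos>."""
    trie = {}
    for seq in sequences:
        node = trie
        for tok in seq:
            node = node.setdefault(tok, {})
        node["<eos>"] = {}
    return trie


def constrained_greedy_decode(score_fn, trie, max_len=10):
    """Greedy decoding that only considers tokens the trie allows next."""
    output, node = [], trie
    for _ in range(max_len):
        allowed = list(node.keys())
        if not allowed:
            break
        scores = score_fn(output)
        # Tokens outside `allowed` are never candidates, however high they score.
        best = max(allowed, key=lambda t: scores.get(t, float("-inf")))
        if best == "<eos>":
            break
        output.append(best)
        node = node[best]
    return output


# Hypothetical two-level label taxonomy for hierarchical classification.
labels = [["science", "physics"], ["science", "biology"], ["sports", "tennis"]]
trie = build_trie(labels)


def toy_scores(prefix):
    """Stand-in for model scores; 'banana' scores highest but is never valid."""
    return {"banana": 5.0, "science": 2.0, "sports": 1.0,
            "physics": 0.5, "biology": 1.5, "tennis": 0.1, "<eos>": 0.0}


result = constrained_greedy_decode(toy_scores, trie)
```

Although the unconstrained scorer prefers an out-of-taxonomy token, the decoded path stays inside the taxonomy, which is the compliance property that grammar-constrained generation targets.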
pdf
bib
abs
MANBench: Is Your Multimodal Model Smarter than Human?
Han Zhou
|
Qitong Xu
|
Yiheng Dong
|
Xin Yang
The rapid advancement of Multimodal Large Language Models (MLLMs) has ignited discussions regarding their potential to surpass human performance in multimodal tasks. In response, we introduce MANBench (Multimodal Ability Norms Benchmark), a bilingual benchmark (English and Chinese) comprising 1,314 questions across nine tasks, spanning knowledge-based and non-knowledge-based domains. MANBench emphasizes intuitive reasoning, seamless cross-modal integration, and real-world complexity, providing a rigorous evaluation framework. Through extensive human experiments involving diverse participants, we compared human performance against state-of-the-art MLLMs. The results indicate that while MLLMs excel in tasks like Knowledge and Text-Image Understanding, they struggle with deeper cross-modal reasoning tasks such as Transmorphic Understanding, Image Consistency, and Multi-image Understanding. Moreover, both humans and MLLMs face challenges in highly complex tasks like Puzzles and Spatial Imagination. MANBench highlights the strengths and limitations of MLLMs, revealing that even advanced models fall short of achieving human-level performance across many domains. We hope MANBench will inspire efforts to bridge the gap between MLLMs and human multimodal capabilities. The code and dataset are available at https://github.com/micdz/MANBench/.
pdf
bib
abs
BanStereoSet: A Dataset to Measure Stereotypical Social Biases in LLMs for Bangla
Mahammed Kamruzzaman
|
Abdullah Al Monsur
|
Shrabon Kumar Das
|
Enamul Hassan
|
Gene Louis Kim
This study presents BanStereoSet, a dataset designed to evaluate stereotypical social biases in multilingual LLMs for the Bangla language. In an effort to extend the focus of bias research beyond English-centric datasets, we have localized the content from the StereoSet, IndiBias, and Kamruzzaman et al.’s datasets, producing a resource tailored to capture biases prevalent within the Bangla-speaking community. Our BanStereoSet dataset consists of 1,194 sentences spanning 9 categories of bias: race, profession, gender, ageism, beauty, beauty in profession, region, caste, and religion. This dataset not only serves as a crucial tool for measuring bias in multilingual LLMs but also facilitates the exploration of stereotypical bias across different social categories, potentially guiding the development of more equitable language technologies in Bangladeshi contexts. Our analysis of several language models using this dataset indicates significant biases, reinforcing the necessity for culturally and linguistically adapted datasets to develop more equitable language technologies.
pdf
bib
abs
mOSCAR: A Large-scale Multilingual and Multimodal Document-level Corpus
Matthieu Futeral
|
Armel Randy Zebaze
|
Pedro Ortiz Suarez
|
Julien Abadji
|
Rémi Lacroix
|
Cordelia Schmid
|
Rachel Bawden
|
Benoît Sagot
Multimodal Large Language Models (mLLMs) are trained on a large amount of text-image data. While most mLLMs are trained on caption-like data only, Alayrac et al. (2022) showed that additionally training them on interleaved sequences of text and images can lead to the emergence of in-context learning capabilities. However, the dataset they used, M3W, is not public and is only in English. There have been attempts to reproduce their results but the released datasets are English-only. In contrast, current multilingual and multimodal datasets are either composed of caption-like only or medium-scale or fully private data. This limits mLLM research for the 7,000 other languages spoken in the world. We therefore introduce mOSCAR, to the best of our knowledge the first large-scale multilingual and multimodal document corpus crawled from the web. It covers 163 languages, 303M documents, 200B tokens and 1.15B images. We carefully conduct a set of filtering and evaluation steps to make sure mOSCAR is sufficiently safe, diverse and of good quality. We additionally train two types of multilingual model to prove the benefits of mOSCAR: (1) a model trained on a subset of mOSCAR and captioning data and (2) a model trained on captioning data only. The model additionally trained on mOSCAR shows a strong boost in few-shot learning performance across various multilingual image-text tasks and benchmarks, confirming previous findings for English-only mLLMs. The dataset will be made publicly accessible under the Creative Commons CC BY 4.0 license.
pdf
bib
abs
NorEval: A Norwegian Language Understanding and Generation Evaluation Benchmark
Vladislav Mikhailov
|
Tita Enstad
|
David Samuel
|
Hans Christian Farsethås
|
Andrey Kutuzov
|
Erik Velldal
|
Lilja Øvrelid
This paper introduces NorEval, a new and comprehensive evaluation suite for large-scale standardized benchmarking of Norwegian generative language models (LMs). NorEval consists of 24 high-quality human-created datasets – of which five are created from scratch. In contrast to existing benchmarks for Norwegian, NorEval covers a broad spectrum of task categories targeting Norwegian language understanding and generation, establishes human baselines, and focuses on both of the official written standards of the Norwegian language: Bokmål and Nynorsk. All our datasets and a collection of over 100 human-created prompts are integrated into LM Evaluation Harness, ensuring flexible and reproducible evaluation. We describe the NorEval design and present the results of benchmarking 19 open-source pretrained and instruction-tuned LMs for Norwegian in various scenarios. Our benchmark, evaluation framework, and annotation materials are publicly available.
pdf
bib
abs
Massively Multilingual Instruction-Following Information Extraction
Thang Le
|
Huy Huu Nguyen
|
Anh Tuan Luu
|
Thien Huu Nguyen
The literature on information extraction (IE) has mostly centered around a select few languages, hindering its application to multilingual corpora. In this work, we introduce MASSIE, a comprehensive collection for instruction-following multilingual IE that standardizes and unifies 215 manually annotated datasets, covering 96 typologically diverse languages from 18 language families. Based on MASSIE, we conduct empirical studies on few-shot in-context learning and report important factors that either positively or negatively affect LLMs’ performance in multilingual IE, covering 21 LLMs ranging in size from 0.5B to 72B. Additionally, we introduce LF1, a structure-aware metric that captures partially matched spans, resolving the conservativeness of the standard exact-matching scheme, which overpenalizes LLMs’ predictions. Overall, our results indicate that multilingual IE remains very challenging for existing LLMs, especially on complex tasks involving relations and events. In addition, the performance gap between high- and low-performing languages is extremely large, but the groups of similar-performing languages largely overlap between different LLMs, suggesting a shared performance bias in current LLMs.
pdf
bib
abs
DALR: Dual-level Alignment Learning for Multimodal Sentence Representation Learning
Kang He
|
Yuzhe Ding
|
Haining Wang
|
Fei Li
|
Chong Teng
|
Donghong Ji
Previous multimodal sentence representation learning methods have achieved impressive performance. However, most approaches focus on aligning images and text at a coarse level, facing two critical challenges: cross-modal misalignment bias and intra-modal semantic divergence, which significantly degrade sentence representation quality. To address these challenges, we propose DALR (Dual-level Alignment Learning for Multimodal Sentence Representation). For cross-modal alignment, we propose a consistency learning module that softens negative samples and utilizes semantic similarity from an auxiliary task to achieve fine-grained cross-modal alignment. Additionally, we contend that sentence relationships go beyond binary positive-negative labels, exhibiting a more intricate ranking structure. To better capture these relationships and enhance representation quality, we integrate ranking distillation with global intra-modal alignment learning. Comprehensive experiments on semantic textual similarity (STS) and transfer (TR) tasks validate the effectiveness of our approach, consistently demonstrating its superiority over state-of-the-art baselines.
pdf
bib
abs
Large Language Models in Bioinformatics: A Survey
Zhenyu Wang
|
Zikang Wang
|
Jiyue Jiang
|
Pengan Chen
|
Xiangyu Shi
|
Yu Li
Large Language Models (LLMs) are revolutionizing bioinformatics, enabling advanced analysis of DNA, RNA, proteins, and single-cell data. This survey provides a systematic review of recent advancements, focusing on genomic sequence modeling, RNA structure prediction, protein function inference, and single-cell transcriptomics. Meanwhile, we also discuss several key challenges, including data scarcity, computational complexity, and cross-omics integration, and explore future directions such as multimodal learning, hybrid AI models, and clinical applications. By offering a comprehensive perspective, this paper underscores the transformative potential of LLMs in driving innovations in bioinformatics and precision medicine.
pdf
bib
abs
ChartEdit: How Far Are MLLMs From Automating Chart Analysis? Evaluating MLLMs’ Capability via Chart Editing
Xuanle Zhao
|
Xuexin Liu
|
Yang Haoyue
|
Xianzhen Luo
|
Fanhu Zeng
|
Jianling Li
|
Qi Shi
|
Chi Chen
Although multimodal large language models (MLLMs) show promise in generating chart rendering code, editing charts via code presents a greater challenge. This task demands MLLMs to integrate chart understanding and reasoning capacities, which are labor-intensive. While many MLLMs claim such editing capabilities, current evaluations rely on limited case studies, highlighting the urgent need for a comprehensive evaluation framework. In this work, we propose ChartEdit, a new high-quality benchmark designed for chart editing tasks. This benchmark comprises 1,405 diverse editing instructions applied to 233 real-world charts, with each instruction-chart instance having been manually annotated and validated for accuracy. Utilizing ChartEdit, we evaluate the performance of 10 mainstream MLLMs across two types of experiments at both the code and chart levels. The results suggest that large-scale models can generate code to produce images that partially match the reference images. However, their ability to generate accurate edits according to the instructions remains limited. The state-of-the-art (SOTA) model achieves a score of only 59.96, highlighting significant challenges in precise modification. In contrast, small-scale models, including chart-domain models, struggle both with following editing instructions and generating overall chart images, underscoring the need for further development in this area. Code is available at https://github.com/xxlllz/ChartEdit.
pdf
bib
abs
Unraveling and Mitigating Safety Alignment Degradation of Vision-Language Models
Qin Liu
|
Chao Shang
|
Ling Liu
|
Nikolaos Pappas
|
Jie Ma
|
Neha Anna John
|
Srikanth Doss
|
Lluis Marquez
|
Miguel Ballesteros
|
Yassine Benajiba
The safety alignment ability of Vision-Language Models (VLMs) is prone to be degraded by the integration of the vision module compared to its LLM backbone. We investigate this phenomenon, dubbed “safety alignment degradation” in this paper, and show that the challenge arises from the representation gap that emerges when introducing vision modality to VLMs. In particular, we show that the representations of multi-modal inputs shift away from those of text-only inputs, which represent the distribution that the LLM backbone is optimized for. At the same time, the safety alignment capabilities, initially developed within the textual embedding space, do not successfully transfer to this new multi-modal representation space. To reduce safety alignment degradation, we introduce Cross-Modality Representation Manipulation (CMRM), an inference time representation intervention method for recovering the safety alignment ability that is inherent in the LLM backbone of VLMs, while simultaneously preserving the functional capabilities of VLMs. The empirical results show that our framework significantly recovers the alignment ability that is inherited from the LLM backbone with minimal impact on the fluency and linguistic capabilities of pre-trained VLMs even without additional training. Specifically, the unsafe rate of LLaVA-7B on multi-modal input can be reduced from 61.53% to as low as 3.15% with only inference-time intervention.
pdf
bib
abs
Turbocharging Web Automation: The Impact of Compressed History States
Xiyue Zhu
|
Peng Tang
|
Haofu Liao
|
Srikar Appalaraju
Language models have led to a leap forward in web automation. The current web automation approaches take the current web state, history actions, and language instruction as inputs to predict the next action, overlooking the importance of history states. However, the highly verbose nature of web page states can result in long input sequences and sparse information, hampering the effective utilization of history states. In this paper, we propose a novel web history compressor approach to turbocharge web automation using history states. Our approach employs a history compressor module that distills the most task-relevant information from each history state into a fixed-length short representation, mitigating the challenges posed by the highly verbose history states. Experiments are conducted on the Mind2Web and WebLINX datasets to evaluate the effectiveness of our approach. Results show that our approach obtains 1.2-5.4% absolute accuracy improvements compared to the baseline approach without history inputs.
pdf
bib
abs
Making RALM Robust to Irrelevant Contexts via Layer Knowledge Guided Attention
Weijie Shi
|
Hao Chen
|
Jiaming Li
|
Yao Zhao
|
Yazhong Zhang
|
Qijin Chen
|
Jipeng Zhang
|
Ruiyuan Zhang
|
Jia Zhu
|
Jiajie Xu
|
Xiaofang Zhou
Retrieval-augmented language models (RALMs) aim to incorporate external knowledge to address the issues of factual hallucination and knowledge obsolescence faced by large language models (LLMs). Inevitably, the passages retrieved by similarity search may be irrelevant to the given question, and the aggregation of these passages can confuse the model and keep it from giving a correct answer. To improve the performance of RALMs in such conditions, we propose layer-knowledge guided attention for RALMs, which harnesses the layer-wise knowledge of LLMs to optimize per-layer attention on useful passages, making the model pay attention to the most relevant content and ignore irrelevant content. Specifically, we first systematically study LLMs’ attention patterns and their relationship with the accuracy of RALM responses, where middle-focus attentions play a crucial role in selectively gathering relevant information. Based on this, a layer-wise passage estimator leverages the varied knowledge encoded across LLM layers to assess not only passage relevance scores but also associated confidences. Finally, a relevance-aware passage fusion enables selective attention to relevant passages, mitigating distractibility and positional bias of causal attention. Experiments show that our method outperforms existing methods on RALM benchmarks.
pdf
bib
abs
Rewrite to Jailbreak: Discover Learnable and Transferable Implicit Harmfulness Instruction
Yuting Huang
|
Chengyuan Liu
|
Yifeng Feng
|
Yiquan Wu
|
Chao Wu
|
Fei Wu
|
Kun Kuang
As Large Language Models (LLMs) are widely applied in various domains, the safety of LLMs is attracting increasing attention, with the aim of preventing their powerful capabilities from being misused. Existing jailbreak methods create a forced instruction-following scenario, or search for adversarial prompts with prefix or suffix tokens to achieve a specific representation manually or automatically. However, they suffer from low efficiency and explicit jailbreak patterns, far from the real deployment of mass attacks on LLMs. In this paper, we point out that simply rewriting the original instruction can achieve a jailbreak, and we find that this rewriting approach is learnable and transferable. We propose the **R**ewrite to **J**ailbreak (R2J) approach, a transferable black-box jailbreak method to attack LLMs by iteratively exploring the weaknesses of the LLMs and automatically improving the attacking strategy. The jailbreak is more efficient and hard to identify since no additional features are introduced. Extensive experiments and analysis demonstrate the effectiveness of R2J, and we find that the jailbreak is also transferable to multiple datasets and various types of models with only a few queries. We hope our work motivates further investigation of LLM safety. The code can be found at https://github.com/ythuang02/R2J/.
pdf
bib
abs
SignAlignLM: Integrating Multimodal Sign Language Processing into Large Language Models
Mert Inan
|
Anthony Sicilia
|
Malihe Alikhani
Deaf and Hard-of-Hearing (DHH) users increasingly utilize Large Language Models (LLMs), yet face significant challenges due to these models’ limited understanding of sign language grammar, multimodal sign inputs, and Deaf cultural contexts. Further, current approaches that try to address these limitations frequently reduce sign language processing (SLP) to traditional translation tasks, neglecting the multimodal and linguistic complexity inherent in signed languages. In this paper, we present an empirical investigation, informed by learning theory, into natively integrating sign language support within LLMs, directly addressing the documented needs of DHH users. We introduce SignAlignLM, the first text-based and multimodal LLMs capable of sign language processing, and propose new prompting and fine-tuning strategies incorporating sign linguistic rules and conventions. We show that LLMs can be generalized interfaces for both spoken and signed languages if trained with a multitasking paradigm. Our code and model checkpoints are open-source.
pdf
bib
abs
NegVQA: Can Vision Language Models Understand Negation?
Yuhui Zhang
|
Yuchang Su
|
Yiming Liu
|
Serena Yeung-Levy
Negation is a fundamental linguistic phenomenon that can entirely reverse the meaning of a sentence. As vision language models (VLMs) continue to advance and are deployed in high-stakes applications, assessing their ability to comprehend negation becomes essential. To address this, we introduce NegVQA, a visual question answering (VQA) benchmark consisting of 7,379 two-choice questions covering diverse negation scenarios and image-question distributions. We construct NegVQA by leveraging large language models to generate negated versions of questions from existing VQA datasets. Evaluating 20 state-of-the-art VLMs across seven model families, we find that these models struggle significantly with negation, exhibiting a substantial performance drop compared to their responses to the original questions. Furthermore, we uncover a U-shaped scaling trend, where increasing model size initially degrades performance on NegVQA before leading to improvements. Our benchmark reveals critical gaps in VLMs’ negation understanding and offers insights into future VLM development. Project page available at https://yuhui-zh15.github.io/NegVQA/.
pdf
bib
abs
Natural Language Reasoning in Large Language Models: Analysis and Evaluation
Debela Gemechu
|
Ramon Ruiz-Dolz
|
Henrike Beyer
|
Chris Reed
While Large Language Models (LLMs) have demonstrated promising results on a range of reasoning benchmarks—particularly in formal logic, mathematical tasks, and Chain-of-Thought prompting—less is known about their capabilities in unconstrained natural language reasoning. Argumentative reasoning, a form of reasoning naturally expressed in language and central to everyday discourse, presents unique challenges for LLMs due to its reliance on context, implicit assumptions, and value judgments. This paper addresses a gap in the study of reasoning in LLMs by presenting the first large-scale evaluation of their unconstrained natural language reasoning capabilities based on natural language argumentation. The paper offers three contributions: (i) the formalisation of a new strategy designed to evaluate argumentative reasoning in LLMs: argument-component selection; (ii) the creation of the Argument Reasoning Tasks (ART) dataset, a new benchmark for argument-component selection based on argument structures for natural language reasoning; and (iii) an extensive experimental analysis involving four different models, demonstrating the limitations of LLMs on natural language reasoning tasks.
pdf
bib
abs
SWE-Dev: Building Software Engineering Agents with Training and Inference Scaling
Haoran Wang
|
Zhenyu Hou
|
Yao Wei
|
Jie Tang
|
Yuxiao Dong
Large language models (LLMs) have advanced rapidly from conversational problem solving to addressing real-world tasks involving tool use, such as software engineering (SWE). Recent LLM-powered toolkits, such as OpenAI Codex and Cursor, have offered end-to-end automation of the software development process. However, building effective SWE agents remains challenging due to the lack of high-quality training data and effective test cases. To address this issue, we present SWE-Dev, an SWE agent built upon open-source LLMs. First, we develop a robust pipeline to synthesize test cases for patch evaluation. Second, we scale up agent trajectories to construct the training data for building SWE-Dev. Experiments on the SWE-bench-Verified benchmark show that the SWE-Dev models can achieve top performance among all open SWE agents. Specifically, the success rates of the SWE-Dev 7B and 32B parameter models reach 23.4% and 36.6%, respectively, outperforming state-of-the-art open-source models. All code, models, and datasets are publicly available at https://github.com/THUDM/SWE-Dev.
pdf
bib
abs
The Two Paradigms of LLM Detection: Authorship Attribution vs Authorship Verification
Janek Bevendorff
|
Matti Wiegmann
|
Emmelie Richter
|
Martin Potthast
|
Benno Stein
The detection of texts generated by LLMs has quickly become an important research problem. Many supervised and zero-shot detectors have already been proposed, yet their effectiveness and precision remain disputed. Current research therefore focuses on making detectors robust against domain shifts and on building corresponding benchmarks. In this paper, we show that the actual limitations hindering progress in LLM detection lie elsewhere: LLM detection is often implicitly modeled as an authorship attribution task, while its true nature is that of authorship verification. We systematically analyze the current research with respect to this misunderstanding, conduct an in-depth comparative analysis of the benchmarks, and validate our claim using state-of-the-art LLM detectors. Our contributions open the realm of authorship analysis technology for understanding and tackling the problem of LLM detection.
pdf
bib
abs
Unveiling Confirmation Bias in Chain-of-Thought Reasoning
Yue Wan
|
Xiaowei Jia
|
Xiang Lorraine Li
Chain-of-thought (CoT) prompting has been widely adopted to enhance the reasoning capabilities of large language models (LLMs). However, the effectiveness of CoT reasoning is inconsistent across tasks with different reasoning types. This work presents a novel perspective to understand CoT behavior through the lens of confirmation bias in cognitive psychology. Specifically, we examine how model internal beliefs, approximated by direct question-answering probabilities, affect both reasoning generation (Q → R) and reasoning-guided answer prediction (QR → A) in CoT. By decomposing CoT into a two-stage process, we conduct a thorough correlation analysis of model beliefs, rationale attributes, and stage-wise performance. Our results provide strong evidence of confirmation bias in LLMs, such that model beliefs not only skew the reasoning process but also influence how rationales are utilized for answer prediction. Furthermore, the interplay between task vulnerability to confirmation bias and the strength of beliefs also provides explanations for CoT effectiveness across reasoning tasks and models. Overall, this study provides valuable insight into the need for better prompting strategies that mitigate confirmation bias to enhance reasoning performance. Code is available at https://github.com/yuewan2/biasedcot.
pdf
bib
abs
GRNFormer: A Biologically-Guided Framework for Integrating Gene Regulatory Networks into RNA Foundation Models
Mufan Qiu
|
Xinyu Hu
|
Fengwei Zhan
|
Sukwon Yun
|
Jie Peng
|
Ruichen Zhang
|
Bhavya Kailkhura
|
Jiekun Yang
|
Tianlong Chen
Foundation models for single-cell RNA sequencing (scRNA-seq) have shown promising capabilities in capturing gene expression patterns. However, current approaches face critical limitations: they ignore biological prior knowledge encoded in gene regulatory relationships and fail to leverage multi-omics signals that could provide complementary regulatory insights. In this paper, we propose GRNFormer, a new framework that systematically integrates multi-scale Gene Regulatory Networks (GRNs) inferred from multi-omics data into RNA foundation model training. Our framework introduces two key innovations. First, we introduce a pipeline for constructing hierarchical GRNs that capture regulatory relationships at both cell-type-specific and cell-specific resolutions. Second, we design a structure-aware integration framework that addresses the information asymmetry in GRNs through two technical advances: (1) a graph topological adapter using multi-head cross-attention to weight regulatory relationships dynamically, and (2) a novel edge perturbation strategy that perturbs GRNs with biologically-informed co-expression links to augment graph neural network training. Comprehensive experiments have been conducted on three representative downstream tasks across multiple model architectures to demonstrate the effectiveness of GRNFormer. It achieves consistent improvements over state-of-the-art (SoTA) baselines: a 3.6% increase in drug response prediction correlation, a 9.6% improvement in single-cell drug classification AUC, and a 1.1% average gain in gene perturbation prediction accuracy.
pdf
bib
abs
RemoteRAG: A Privacy-Preserving LLM Cloud RAG Service
Yihang Cheng
|
Lan Zhang
|
Junyang Wang
|
Mu Yuan
|
Yunhao Yao
Retrieval-augmented generation (RAG) improves the service quality of large language models by retrieving relevant documents from credible literature and integrating them into the context of the user query. Recently, the rise of the cloud RAG service has made it possible for users to query relevant documents conveniently. However, directly sending queries to the cloud brings potential privacy leakage. In this paper, we are the first to formally define the privacy-preserving cloud RAG service to protect the user query and propose RemoteRAG as a solution regarding privacy, efficiency, and accuracy. For privacy, we introduce (n,𝜖)-DistanceDP to characterize privacy leakage of the user query and the leakage inferred from relevant documents. For efficiency, we limit the search range from the total documents to a small number of selected documents related to a perturbed embedding generated from (n,𝜖)-DistanceDP, so that computation and communication costs required for privacy protection significantly decrease. For accuracy, we ensure that the small range includes target documents related to the user query with detailed theoretical analysis. Experimental results also demonstrate that RemoteRAG can resist existing embedding inversion attack methods while achieving no loss in retrieval under various settings. Moreover, RemoteRAG is efficient, incurring only 0.67 seconds and 46.66 KB of data transmission (2.72 hours and 1.43 GB with the non-optimized privacy-preserving scheme) when retrieving from a total of 10^5 documents.
pdf
bib
abs
“My life is miserable, have to sign 500 autographs everyday”: Exposing Humblebragging, the Brags in Disguise
Sharath Naganna
|
Saprativa Bhattacharjee
|
Biplab Banerjee
|
Pushpak Bhattacharyya
Humblebragging is a phenomenon in which individuals present self-promotional statements under the guise of modesty or complaints. For example, a statement like, “Ugh, I can’t believe I got promoted to lead the entire team. So stressful!”, subtly highlights an achievement while pretending to be complaining. Detecting humblebragging is important for machines to better understand the nuances of human language, especially in tasks like sentiment analysis and intent recognition. However, this topic has not yet been studied in computational linguistics. For the first time, we introduce the task of automatically detecting humblebragging in text. We formalize the task by proposing a 4-tuple definition of humblebragging and evaluate machine learning, deep learning, and large language models (LLMs) on this task, comparing their performance with humans. We also create and release a dataset called HB-24, containing 3,340 humblebrags generated using GPT-4o. Our experiments show that detecting humblebragging is non-trivial, even for humans. Our best model achieves an F1-score of 0.88. This work lays the foundation for further exploration of this nuanced linguistic phenomenon and its integration into broader natural language understanding systems.
pdf
bib
abs
SCITAT: A Question Answering Benchmark for Scientific Tables and Text Covering Diverse Reasoning Types
Xuanliang Zhang
|
Dingzirui Wang
|
Baoxin Wang
|
Longxu Dou
|
Xinyuan Lu
|
Keyan Xu
|
Dayong Wu
|
Qingfu Zhu
Scientific question answering (SQA) is an important task aimed at answering questions based on papers. However, current SQA datasets have limited reasoning types and neglect the relevance between tables and text, creating a significant gap with real scenarios. To address these challenges, we propose a QA benchmark for scientific tables and text with diverse reasoning types (SCITAT). To cover more reasoning types, we summarize various reasoning types from real-world questions. To reason on both tables and text, we require the questions to incorporate tables and text as much as possible. Based on SCITAT, we propose a baseline (CAR), which combines various reasoning methods to address different reasoning types and process tables and text at the same time. CAR brings average improvements of 4.1% over other baselines on SCITAT, validating its effectiveness. Error analysis reveals the challenges of SCITAT, such as complex numerical calculations and domain knowledge.
pdf
bib
abs
TokenShapley: Token Level Context Attribution with Shapley Value
Yingtai Xiao
|
Yuqing Zhu
|
Sirat Samyoun
|
Wanrong Zhang
|
Jiachen T. Wang
|
Jian Du
Large language models (LLMs) demonstrate strong capabilities in in-context learning, but verifying the correctness of their generated responses remains a challenge. Prior work has explored attribution at the sentence level, but these methods fall short when users seek attribution for specific keywords within the response, such as numbers, years, or names. To address this limitation, we propose TokenShapley, a novel token-level attribution method that combines Shapley value-based data attribution with KNN-based retrieval techniques inspired by recent advances in KNN-augmented LLMs. By leveraging a precomputed datastore for contextual retrieval and computing Shapley values to quantify token importance, TokenShapley provides a fine-grained data attribution approach. Extensive evaluations on four benchmarks show that TokenShapley outperforms state-of-the-art baselines in token-level attribution, achieving an 11–23% improvement in accuracy.
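As an illustration of the Shapley-value idea mentioned in this abstract, the following is a minimal sketch of exact Shapley attribution over a handful of context tokens. It is not the TokenShapley method itself: the abstract does not specify the utility function or the KNN datastore, so `toy_utility` and the example tokens are hypothetical stand-ins, and exact enumeration is only feasible for very small contexts.

```python
from itertools import combinations
from math import factorial


def shapley_values(players, utility):
    """Exact Shapley value of each player.

    `players` must be unique hashable items (use token positions if a
    token repeats); `utility` maps a frozenset of players to a float.
    """
    n = len(players)
    values = {}
    for p in players:
        others = [q for q in players if q != p]
        total = 0.0
        for k in range(n):  # size of the coalition that p joins
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                s = frozenset(subset)
                # Marginal contribution of p to coalition s.
                total += weight * (utility(s | {p}) - utility(s))
        values[p] = total
    return values


def toy_utility(context):
    """Hypothetical utility: how well the context supports a year-type answer."""
    score = 0.0
    if "1969" in context:
        score += 1.0
    if "Apollo" in context and "11" in context:
        score += 0.5
    return score


tokens = ["Apollo", "11", "landed", "1969"]
phi = shapley_values(tokens, toy_utility)
for token, value in phi.items():
    print(f"{token}: {value:.2f}")
```

The values sum to the utility of the full context minus that of the empty context (the efficiency property), which makes them convenient as per-token attribution scores.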
pdf
bib
abs
Entropy-based Exploration Conduction for Multi-step Reasoning
Jinghan Zhang
|
Xiting Wang
|
Fengran Mo
|
Yeyang Zhou
|
Wanfu Gao
|
Kunpeng Liu
Multi-step processes via large language models (LLMs) have proven effective for solving complex reasoning tasks. However, the depth of exploration of the reasoning procedure can significantly affect the task performance. Existing methods to automatically decide the depth often lead to high cost and a lack of flexibility. To address these issues, we propose Entropy-based Exploration Depth Conduction (Entro-duction), a novel method that dynamically adjusts the exploration depth during multi-step reasoning by monitoring the LLM’s output entropy and variance entropy. We employ these two features to capture the model’s uncertainty at the current step and the fluctuation of uncertainty across consecutive reasoning steps. Based on the observed entropy changes, the LLM selects whether to deepen, expand, or stop exploration according to the probability, which facilitates the trade-off between reasoning accuracy and exploration effectiveness. Experimental results across four benchmark datasets demonstrate the efficacy of Entro-duction.
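The two signals named in this abstract can be sketched as follows, under stated assumptions. Shannon entropy is standard; reading "variance entropy" as the variance of entropy over recent steps is an interpretation, and the thresholded `choose_action` rule (with its `low`, `high`, and `stable` parameters) is a hypothetical deterministic simplification, since the abstract describes a probability-based selection whose details are not given.

```python
import math
from statistics import pvariance


def shannon_entropy(probs):
    """Shannon entropy (in nats) of a probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def variance_entropy(entropies, window=3):
    """Assumed reading: variance of entropy over the last `window` steps."""
    recent = entropies[-window:]
    return pvariance(recent) if len(recent) > 1 else 0.0


def choose_action(entropies, low=0.5, high=1.0, stable=0.05):
    """Hypothetical rule mapping the two signals to an exploration action."""
    current = entropies[-1]
    spread = variance_entropy(entropies)
    if current < low and spread < stable:
        return "stop"      # confident and steady: no further exploration
    if current > high:
        return "expand"    # very uncertain: widen the search
    return "deepen"        # otherwise keep reasoning along this path


# Per-step output distributions (made-up numbers for illustration).
steps = [
    [0.4, 0.3, 0.2, 0.1],
    [0.7, 0.1, 0.1, 0.1],
    [0.97, 0.01, 0.01, 0.01],
]
history = [shannon_entropy(p) for p in steps]
print(choose_action(history))
```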
pdf
bib
abs
Taxonomizing Representational Harms using Speech Act Theory
Emily Corvi
|
Hannah Washington
|
Stefanie Reed
|
Chad Atalla
|
Alexandra Chouldechova
|
P. Alex Dow
|
Jean Garcia-Gathright
|
Nicholas J Pangakis
|
Emily Sheng
|
Dan Vann
|
Matthew Vogel
|
Hanna Wallach
Representational harms are widely recognized among fairness-related harms caused by generative language systems. However, their definitions are commonly under-specified. We make a theoretical contribution to the specification of representational harms by introducing a framework, grounded in speech act theory (Austin 1962), that conceptualizes representational harms caused by generative language systems as the perlocutionary effects (i.e., real-world impacts) of particular types of illocutionary acts (i.e., system behaviors). Building on this argument and drawing on relevant literature from linguistic anthropology and sociolinguistics, we provide new definitions of stereotyping, demeaning, and erasure. We then use our framework to develop a granular taxonomy of illocutionary acts that cause representational harms, going beyond the high-level taxonomies presented in previous work. We also discuss the ways that our framework and taxonomy can support the development of valid measurement instruments. Finally, we demonstrate the utility of our framework and taxonomy via a case study that engages with recent conceptual debates about what constitutes a representational harm and how such harms should be measured.
pdf
bib
abs
Turning Conversations into Workflows: A Framework to Extract and Evaluate Dialog Workflows for Service AI Agents
Prafulla Kumar Choubey
|
Xiangyu Peng
|
Shilpa Bhagavath
|
Caiming Xiong
|
Shiva Kumar Pentyala
|
Chien-Sheng Wu
Automated service agents require well-structured workflows to deliver consistent and accurate responses to customer queries. However, such workflows are often undocumented, and their automatic extraction from conversations remains largely unexplored. In this work, we present a novel framework for extracting and evaluating dialog workflows from historical interactions. Our extraction process involves two key stages: (1) a retrieval step to select relevant conversations based on key procedural elements, and (2) a structured workflow generation step using question-answer-based chain-of-thought (QA-CoT) prompting. To comprehensively evaluate the quality of the extracted workflows, we introduce an automated simulation framework with agent and customer bots that measures their effectiveness in resolving customer issues. Extensive experiments on the ABCD and SynthABCD datasets show that our QA-CoT technique improves workflow extraction by 12.16% in average macro accuracy over the baseline. Moreover, our evaluation method closely aligns with human assessments, offering a reliable and scalable framework for future research.
pdf
bib
abs
Statistical inference on black-box generative models in the data kernel perspective space
Hayden Helm
|
Aranyak Acharyya
|
Youngser Park
|
Brandon Duderstadt
|
Carey Priebe
Generative models are capable of producing human-expert level content across a variety of topics and domains. As the impact of generative models grows, it is necessary to develop statistical methods to understand collections of available models. These methods are particularly important in settings where the user may not have access to information related to a model’s pre-training data, weights, or other relevant model-level covariates. In this paper we extend recent results on representations of black-box generative models to model-level statistical inference tasks. We demonstrate that the model-level representations are effective for multiple inference tasks.
pdf
bib
abs
Do Large Language Models Perform Latent Multi-Hop Reasoning without Exploiting Shortcuts?
Sohee Yang
|
Nora Kassner
|
Elena Gribovskaya
|
Sebastian Riedel
|
Mor Geva
We evaluate how well Large Language Models (LLMs) latently recall and compose facts to answer multi-hop queries like “In the year Scarlett Johansson was born, the Summer Olympics were hosted in the country of”. One major challenge in such evaluation is that LLMs may have developed shortcuts by encountering the head entity “Scarlett Johansson” and the answer entity “United States” in the same training sequences or merely guess the answer based on frequency-based priors. To prevent shortcuts, we exclude test queries where the head and answer entities might have co-appeared during training. Through careful selection of relations and facts and systematic removal of cases where models might guess answers or exploit partial matches, we construct an evaluation dataset SOCRATES (ShOrtCut-fRee lATent rEaSoning). We observe that LLMs demonstrate promising latent multi-hop reasoning abilities without exploiting shortcuts, but only for certain types of queries. For queries requiring latent recall of countries as the intermediate answer, the best models achieve 80% latent composability, but this drops to just 5% for the recall of years. Comparisons with Chain-of-Thought highlight a significant gap between the ability of models to reason latently versus explicitly. Analysis reveals that latent representations of the intermediate answer are constructed more often in queries with higher latent composability, and shows the emergence of latent multi-hop reasoning during pretraining.
pdf
bib
abs
AceMath: Advancing Frontier Math Reasoning with Post-Training and Reward Modeling
Zihan Liu
|
Yang Chen
|
Mohammad Shoeybi
|
Bryan Catanzaro
|
Wei Ping
In this paper, we introduce AceMath, a suite of frontier math models that excel in solving complex math problems, along with highly effective reward models capable of evaluating generated solutions and reliably identifying the correct ones. To develop the instruction-tuned math models, we propose a supervised fine-tuning (SFT) process that first achieves competitive performance across general domains, followed by targeted fine-tuning for the math domain using a carefully curated set of prompts and synthetically generated responses. The resulting model, AceMath-72B-Instruct, greatly outperforms Qwen2.5-Math-72B-Instruct, GPT-4o and Claude-3.5 Sonnet. To develop a math-specialized reward model, we first construct AceMath-RewardBench, a comprehensive and robust benchmark for evaluating math reward models across diverse problems and difficulty levels. After that, we present a systematic approach to build our math reward models. The resulting model, AceMath-72B-RM, consistently outperforms state-of-the-art reward models. Furthermore, when combining AceMath-72B-Instruct with AceMath-72B-RM, we achieve the highest average rm@8 score across the math reasoning benchmarks.
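The rm@8 metric is not defined in this abstract. A minimal sketch, assuming it is the usual best-of-n selection accuracy (sample n candidate solutions per problem, keep the one the reward model scores highest, and count it if correct), might look like this. The function name `rm_at_n` and the toy data are hypothetical.

```python
def rm_at_n(problems, n=8):
    """Best-of-n accuracy under a reward model.

    `problems` is a list with one entry per problem; each entry is a list
    of (reward_score, is_correct) pairs for the sampled candidates.
    Only the first `n` candidates of each problem are considered.
    """
    hits = 0
    for candidates in problems:
        # Select the candidate with the highest reward score.
        _, best_is_correct = max(candidates[:n], key=lambda c: c[0])
        hits += int(best_is_correct)
    return hits / len(problems)


# Toy data: two problems with three sampled solutions each.
toy = [
    [(0.2, False), (0.9, True), (0.5, False)],   # top-scored answer is correct
    [(0.8, False), (0.3, True), (0.1, False)],   # top-scored answer is wrong
]
print(rm_at_n(toy))
```

Under this reading, the score depends on both the generator (whether a correct solution appears among the n samples) and the reward model (whether it ranks that solution first).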
pdf
bib
abs
WXImpactBench: A Disruptive Weather Impact Understanding Benchmark for Evaluating Large Language Models
Yongan Yu
|
Qingchen Hu
|
Xianda Du
|
Jiayin Wang
|
Fengran Mo
|
Renée Sieber
Climate change adaptation requires the understanding of disruptive weather impacts on society, where large language models (LLMs) might be applicable. However, their effectiveness is under-explored due to the difficulty of high-quality corpus collection and the lack of available benchmarks. The climate-related events stored in regional newspapers record how communities adapted and recovered from disasters. However, the processing of the original corpus is non-trivial. In this study, we first develop a disruptive weather impact dataset with a four-stage well-crafted construction pipeline. Then, we propose WXImpactBench, the first benchmark for evaluating the capacity of LLMs on disruptive weather impacts. The benchmark involves two evaluation tasks, multi-label classification and ranking-based question answering. Extensive experiments on evaluating a set of LLMs provide first-hand analysis of the challenges in developing disruptive weather impact understanding and climate change adaptation systems. The constructed dataset and the code for the evaluation framework are available to help society protect against vulnerabilities from disasters.
pdf
bib
abs
MeMoTune: A Measure and Moment-Driven Fine-Tuning Framework for Quantized Large Language Models
Yun Zhang
|
Xue Geng
|
Lizi Liao
|
Jintong Sun
|
Minghe Yu
|
Ge Yu
Quantizing large language models (LLMs) is essential for reducing memory and computational costs in natural language processing. Existing methods combine quantization with parameter-efficient fine-tuning but often fail to meet practical performance requirements. This paper introduces MeMoTune, a novel fine-tuning framework for quantized LLMs. By employing a measure and moment approach within a low-rank approximation framework in probability measure space, MeMoTune optimizes the objective function for superior fine-tuning results. The update process is further refined through scaled gradient, enhancing convergence efficiency and noise robustness. Experiments on tasks like text generation, summarization, and understanding show MeMoTune significantly outperforms state-of-the-art methods, e.g. fine-tuning Llama2-13B on GSM8K improves accuracy by 5.5%, while fine-tuning DeBERTaV3-base on CoLA of GLUE increases Matthews correlation by 1.7%. The code is publicly available at: https://github.com/hddyyyb/MeMoTune.
pdf
bib
abs
MALAMUTE: A Multilingual, Highly-granular, Template-free, Education-based Probing Dataset
Sagi Shaier
|
George Arthur Baker
|
Chiranthan Sridhar
|
Lawrence Hunter
|
Katharina Von Der Wense
Language models (LMs) have excelled in various broad domains. However, to ensure their safe and effective integration into real-world educational settings, they must demonstrate proficiency in specific, granular areas of knowledge. Existing cloze-style benchmarks, commonly used to evaluate LMs’ knowledge, have three major limitations. They: 1) do not cover the educational domain; 2) typically focus on low-complexity, generic knowledge or broad domains, which do not adequately assess the models’ knowledge in specific subjects; and 3) often rely on templates that can bias model predictions. Here, we introduce MALAMUTE, a multilingual, template-free, and highly granular probing dataset comprising expert-written, peer-reviewed probes from 71 university-level textbooks across three languages (English, Spanish, and Polish). MALAMUTE is the first education-based cloze-style dataset. It covers eight domains, each with up to 14 subdomains, further broken down into concepts and concept-based prompts, totaling 33,361 university curriculum concepts and 116,887 prompts. MALAMUTE’s fine granularity, educational focus, and inclusion of both sentence-level and paragraph-level prompts make it an ideal tool for evaluating LMs’ course-related knowledge. Our evaluation of masked and causal LMs on MALAMUTE shows that despite overall proficiency, they have significant gaps in knowledge when examined closely on specific subjects, hindering their safe use in classrooms and underscoring the need for further development.
pdf
bib
abs
Sentimental Image Generation for Aspect-based Sentiment Analysis
Xiaoyi Bao
|
Jinghang Gu
|
Zhongqing Wang
|
Chu-Ren Huang
Recent research on textual Aspect-Based Sentiment Analysis (ABSA) has achieved promising performance. However, a persistent challenge lies in the limited semantics derived from the raw data. To address this issue, researchers have explored enhancing textual ABSA with additional augmentations: they either craft audio, text and linguistic features based on the input, or rely on user-posted images. Yet these approaches have their limitations. The former three kinds of features heavily overlap with the original data, which undermines their ability to be supplementary, while user-posted images are extremely dependent on human annotation, which not only limits their application scope to just a handful of text-image datasets, but also propagates the errors derived from human mistakes to the entire downstream loop. In this study, we explore a previously unexplored direction: generating sentimental images. We propose a novel Sentimental Image Generation method that can precisely provide ancillary visual semantics to reinforce the textual extraction as shown in Figure 1. Extensive experiments establish new SOTA performance on the ACOS, ASQP and en-Phone datasets, underscoring the effectiveness of our method and highlighting a promising direction for expanding our features.
pdf
bib
abs
Long-form Hallucination Detection with Self-elicitation
Zihang Liu
|
Jiawei Guo
|
Hao Zhang
|
Hongyang Chen
|
Jiajun Bu
|
Haishuai Wang
While Large Language Models (LLMs) have exhibited impressive performance in generating long-form content, they frequently present a hazard of producing factual inaccuracies or hallucinations. An effective strategy to mitigate this hazard is to leverage off-the-shelf LLMs to detect hallucinations after the generation. The primary challenge resides in the comprehensive elicitation of the intrinsic knowledge acquired during their pre-training phase. However, existing methods that employ multi-step reasoning chains predominantly fall short of addressing this issue. Moreover, since existing methods for hallucination detection tend to decompose text into isolated statements, they are unable to understand the contextual semantic relations in long-form content. In this paper, we study a novel concept, self-elicitation, to leverage self-generated thoughts derived from prior statements as catalysts to elicit the expression of intrinsic knowledge and understand contextual semantics. We present a framework, SelfElicit, to integrate self-elicitation with graph structures to effectively organize the elicited knowledge and facilitate factual evaluations. Extensive experiments on five datasets in various domains demonstrate the effectiveness of self-elicitation and the superiority of our proposed method.
pdf
bib
abs
ComparisonQA: Evaluating Factuality Robustness of LLMs Through Knowledge Frequency Control and Uncertainty
Qing Zong
|
Zhaowei Wang
|
Tianshi Zheng
|
Xiyu Ren
|
Yangqiu Song
The rapid development of LLMs has sparked extensive research into their factual knowledge. Current works find that LLMs fall short on questions around low-frequency entities. However, such evidence is unreliable since the questions can differ not only in entity frequency but also in difficulty. We therefore introduce the **ComparisonQA** benchmark, containing **283K** abstract questions, each instantiated by a pair of high-frequency and low-frequency entities. It ensures a controllable comparison to study the role of knowledge frequency in the performance of LLMs, because the only difference within such a pair is the entity and its frequency. In addition, we use both correctness and uncertainty to develop a two-round method to evaluate LLMs’ knowledge robustness. It aims to avoid possible semantic shortcuts, which are a serious problem in current QA studies. Experiments reveal that LLMs, including GPT-4o, exhibit particularly low robustness regarding low-frequency knowledge. Besides, we find that uncertainty can be used to effectively identify high-quality and shortcut-free questions while maintaining the data size. Based on this, we propose an automatic method to select such questions to form a subset called **ComparisonQA-Hard**, containing only hard low-frequency questions.
pdf
bib
abs
One-Dimensional Object Detection for Streaming Text Segmentation of Meeting Dialogue
Rui He
|
Zhongqing Wang
|
Minjie Qiang
|
Hongling Wang
|
Yifan.zhang Yifan.zhang
|
Hua Xu
|
Shuai Fan
|
Guodong Zhou
Dialogue text segmentation aims to partition dialogue content into consecutive paragraphs based on themes or logic, enhancing its comprehensibility and manageability. Current text segmentation models, when applied directly to STS (Streaming Text Segmentation), exhibit numerous limitations, such as imbalances in labels that affect the stability of model training, and discrepancies between the model’s training tasks (sentence classification) and the actual text segmentation that limit the model’s segmentation capabilities. To address these challenges, we first implement STS, for the first time, using a sliding window-based segmentation method. Secondly, we employ two different levels of sliding window-based balanced label strategies to stabilize the training process of the streaming segmentation model and enhance training convergence speed. Finally, by adding a one-dimensional bounding-box regression task for text sequences within the window, we restructure the training approach of STS tasks, shifting from sentence classification to sequence segmentation, thereby aligning the training objectives with the task objectives, which further enhances the model’s performance. Extensive experimental results demonstrate that our method is robust, controllable, and achieves state-of-the-art performance.
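To make the idea of a one-dimensional bounding box concrete, the sketch below treats a text segment as an interval of sentence indices inside a window and compares a predicted interval with a gold one using intersection-over-union. This is an illustration only: the abstract does not say which loss or matching criterion the authors use, and the function name `interval_iou` and the half-open `(start, end)` convention are assumptions made here.

```python
def interval_iou(pred, gold):
    """Intersection-over-union of two half-open intervals (start, end).

    Each interval is a hypothetical 1D "bounding box" over sentence
    indices inside a sliding window; the convention is an assumption.
    """
    # Overlap length, clamped at zero when the intervals are disjoint.
    inter = max(0, min(pred[1], gold[1]) - max(pred[0], gold[0]))
    union = (pred[1] - pred[0]) + (gold[1] - gold[0]) - inter
    return inter / union if union > 0 else 0.0


# Predicted segment covers sentences 2..7, gold segment covers 4..9.
print(interval_iou((2, 8), (4, 10)))  # overlap 4, union 8 -> 0.5
```

A score of this kind could serve as a matching or evaluation signal for segment boundaries, in the same way box IoU is used in two-dimensional object detection.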
pdf
bib
abs
CodeTaxo: Enhancing Taxonomy Expansion with Limited Examples via Code Language Prompts
Qingkai Zeng
|
Yuyang Bai
|
Zhaoxuan Tan
|
Zhenyu Wu
|
Shangbin Feng
|
Meng Jiang
Taxonomies provide structural representations of knowledge and are crucial in various applications. The task of taxonomy expansion involves integrating emerging entities into existing taxonomies by identifying appropriate parent entities for these new query entities. Previous methods rely on self-supervised techniques that generate annotation data from existing taxonomies but are less effective with small taxonomies (fewer than 100 entities). In this work, we introduce CodeTaxo, a novel approach that leverages large language models through code language prompts to capture the taxonomic structure. Extensive experiments on five real-world benchmarks from different domains demonstrate that CodeTaxo consistently achieves superior performance across all evaluation metrics, significantly outperforming previous state-of-the-art methods. The code and data are available at
https://github.com/QingkaiZeng/CodeTaxo-official.
pdf
bib
abs
Predicate-Conditional Conformalized Answer Sets for Knowledge Graph Embeddings
Yuqicheng Zhu
|
Daniel Hernández
|
Yuan He
|
Zifeng Ding
|
Bo Xiong
|
Evgeny Kharlamov
|
Steffen Staab
Uncertainty quantification in Knowledge Graph Embedding (KGE) methods is crucial for ensuring the reliability of downstream applications. A recent work applies conformal prediction to KGE methods, providing uncertainty estimates by generating a set of answers that is guaranteed to include the true answer with a predefined confidence level. However, existing methods provide probabilistic guarantees averaged over a reference set of queries and answers (marginal coverage guarantee). In high-stakes applications such as medical diagnosis, a stronger guarantee is often required: the predicted sets must provide consistent coverage per query (conditional coverage guarantee). We propose CondKGCP, a novel method that approximates predicate-conditional coverage guarantees while maintaining compact prediction sets. CondKGCP merges predicates with similar vector representations and augments calibration with rank information. We prove the theoretical guarantees and demonstrate empirical effectiveness of CondKGCP by comprehensive evaluations.
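The marginal guarantee mentioned above can be illustrated with plain split conformal prediction. The sketch below is not CondKGCP: it calibrates a single threshold over all queries, and the nonconformity scores, entity names, and function names are toy assumptions. A naive predicate-conditional variant would compute a separate threshold from the calibration scores of each predicate, which is the setting the paper's method aims to approximate while keeping sets compact.

```python
import math


def conformal_threshold(calibration_scores, alpha):
    """Split-conformal quantile of nonconformity scores of true answers."""
    n = len(calibration_scores)
    # Finite-sample corrected rank: ceil((n + 1) * (1 - alpha)).
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        return math.inf  # too few calibration points: keep every candidate
    return sorted(calibration_scores)[rank - 1]


def answer_set(candidate_scores, threshold):
    """Keep every candidate whose nonconformity score is within the threshold."""
    return {e for e, s in candidate_scores.items() if s <= threshold}


# Toy nonconformity scores (lower = more plausible) of the true answers
# on a held-out calibration set.
calibration = [0.05, 0.10, 0.20, 0.30, 0.35, 0.40, 0.55, 0.60, 0.80]
threshold = conformal_threshold(calibration, alpha=0.2)

# Toy scores of candidate answers for one test query.
candidates = {"Berlin": 0.10, "Munich": 0.50, "Paris": 0.90}
print(threshold, answer_set(candidates, threshold))
```

With nine calibration scores and `alpha=0.2`, the rank is 8, so the threshold is the eighth smallest score (0.60) and the answer set keeps "Berlin" and "Munich".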
pdf
bib
abs
Autonomous Data Selection with Zero-shot Generative Classifiers for Mathematical Texts
Yifan Zhang
|
Yifan Luo
|
Yang Yuan
|
Andrew C Yao
We present Autonomous Data Selection (AutoDS), a method that leverages base language models as zero-shot “generative classifiers” to automatically curate high-quality mathematical texts. Unlike prior approaches that require human annotations or training a dedicated data filter, AutoDS relies solely on a model’s logits to determine whether a given passage is mathematically informative and educational. By integrating AutoDS into a continual pretraining pipeline, we substantially boost downstream performance on challenging math benchmarks (MATH, GSM8K, and BBH) while using far fewer tokens than previous methods. Empirically, our approach achieves roughly a twofold improvement in pretraining token efficiency over strong baselines, underscoring the potential of self-directed data selection in enhancing mathematical reasoning. We will release our curated dataset to facilitate future research in automated domain-specific data curation.
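A logit-based zero-shot classifier of this kind can be sketched as a two-way softmax over the logits a base model assigns to a "YES" and a "NO" answer token after a prompt asking whether the passage is mathematically informative. The exact prompt, tokens, and scoring rule are not given in the abstract, so the formula, the threshold value, and the names `yes_probability` and `select_passages` are assumptions for illustration. The logits are passed in directly rather than computed by a model.

```python
import math


def yes_probability(logit_yes, logit_no):
    """Two-way softmax over the logits of the 'YES' and 'NO' answer tokens."""
    m = max(logit_yes, logit_no)  # subtract the max for numerical stability
    e_yes = math.exp(logit_yes - m)
    e_no = math.exp(logit_no - m)
    return e_yes / (e_yes + e_no)


def select_passages(scored, threshold=0.6):
    """Keep passages whose YES-probability reaches the (assumed) threshold."""
    return [text for text, ly, ln in scored if yes_probability(ly, ln) >= threshold]


# (passage, logit of "YES", logit of "NO") -- toy values, not model outputs.
scored = [
    ("Proof that sqrt(2) is irrational ...", 3.1, 0.4),
    ("Click here to subscribe to our newsletter", -1.0, 2.5),
]
print(select_passages(scored))
```

Because the score is a continuous value rather than a hard label, the selection threshold can be tuned to trade corpus size against quality.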
pdf
bib
abs
Learning from Committee: Reasoning Distillation from a Mixture of Teachers with Peer-Review
Zhuochun Li
|
Yuelyu Ji
|
Rui Meng
|
Daqing He
While reasoning capabilities typically emerge in large language models (LLMs) with tens of billions of parameters, recent research focuses on improving smaller open-source models through knowledge distillation (KD) from commercial LLMs. However, many of these studies rely solely on responses from a single LLM as the gold rationale, unlike the natural human learning process, which involves understanding both the correct answers and the reasons behind mistakes. In this paper, we introduce a novel Fault-Aware DistIllation via Peer-Review (FAIR) approach: 1) instead of merely obtaining rationales from teachers, our method asks teachers to identify and explain the student’s mistakes, providing customized instruction learning data; 2) we design a simulated peer-review process between teacher LLMs and select only the generated rationales above the acceptance threshold, which reduces the chance of teachers guessing correctly with flawed rationales and improves instructional data quality. Comprehensive experiments and analysis on mathematical, commonsense, and logical reasoning tasks demonstrate the effectiveness of our method. Our code is available at https://github.com/zhuochunli/Learn-from-Committee.
pdf
bib
abs
Investigating Prosodic Signatures via Speech Pre-Trained Models for Audio Deepfake Source Attribution
Orchid Chetia Phukan
|
Drishti Singh
|
Swarup Ranjan Behera
|
Arun Balaji Buduru
|
Rajesh Sharma
In this work, we investigate various state-of-the-art (SOTA) speech pre-trained models (PTMs) for their capability to capture prosodic signatures of the generative sources for audio deepfake source attribution (ADSD). These prosodic characteristics can be considered one of the major signatures for ADSD, as they are unique to each source. The better a PTM captures prosodic signatures, the better its ADSD performance. For our experiments on the benchmark datasets ASVSpoof 2019 and CFAD, we consider various SOTA PTMs that have shown top performance in different prosodic tasks. x-vector (a speaker recognition PTM) attains the highest performance among all the PTMs considered despite having the fewest model parameters. This higher performance may be due to its speaker recognition pre-training, which enables it to better capture the unique prosodic characteristics of the sources. Further, motivated by tasks such as audio deepfake detection and speech recognition, where fusion of PTM representations leads to improved performance, we explore the same and propose FINDER for effective fusion of such representations. By fusing Whisper and x-vector representations through FINDER, we achieve the best performance compared with all the individual PTMs as well as baseline fusion techniques, attaining SOTA performance.
pdf
bib
abs
Multilingual Retrieval Augmented Generation for Culturally-Sensitive Tasks: A Benchmark for Cross-lingual Robustness
Bryan Li
|
Fiona Luo
|
Samar Haider
|
Adwait Agashe
|
Siyu Li
|
Runqi Liu
|
Miranda Muqing Miao
|
Shriya Ramakrishnan
|
Yuan Yuan
|
Chris Callison-Burch
The paradigm of retrieval-augmented generation (RAG) helps mitigate hallucinations of large language models (LLMs). However, RAG also introduces biases contained within the retrieved documents. These biases can be amplified in scenarios which are multilingual and culturally-sensitive, such as territorial disputes. We thus introduce BordIRLines, a dataset of territorial disputes paired with retrieved Wikipedia documents, across 49 languages. We evaluate the cross-lingual robustness of this RAG setting by formalizing several modes for multilingual retrieval. Our experiments on several LLMs show that incorporating perspectives from diverse languages can in fact improve robustness; retrieving multilingual documents best improves response consistency and decreases geopolitical bias over RAG with purely in-language documents. We also consider how RAG responses utilize presented documents, finding a much wider variance in the linguistic distribution of response citations when querying in low-resource languages. Our further analyses investigate the various aspects of a cross-lingual RAG pipeline, from retrieval to document contents. We release our benchmark to support continued research towards equitable information access across languages, at https://huggingface.co/datasets/borderlines/bordirlines.
pdf
bib
abs
Bridging Relevance and Reasoning: Rationale Distillation in Retrieval-Augmented Generation
Pengyue Jia
|
Derong Xu
|
Xiaopeng Li
|
Zhaocheng Du
|
Xiangyang Li
|
Yichao Wang
|
Yuhao Wang
|
Qidong Liu
|
Maolin Wang
|
Huifeng Guo
|
Ruiming Tang
|
Xiangyu Zhao
The reranker and generator are two critical components in the Retrieval-Augmented Generation (RAG) pipeline, responsible for ranking relevant documents and generating responses. However, due to differences in pre-training data and objectives, there is an inevitable gap between the documents ranked as relevant by the reranker and those required by the generator to support answering the query. To address this gap, we propose RADIO, a novel and practical preference alignment framework with RAtionale DIstillatiOn. Specifically, we first propose a rationale extraction method that leverages the reasoning capabilities of large language models (LLMs) to extract the rationales necessary for answering the query. Subsequently, a rationale-based alignment process is designed to rerank the documents based on the extracted rationales, and fine-tune the reranker to align the preferences. We conduct extensive experiments on two tasks across three datasets to demonstrate the effectiveness of our approach compared to baseline methods. Our code is released online to ease reproduction.
pdf
bib
abs
Scaling Laws for Multilingual Language Models
Yifei He
|
Alon Benhaim
|
Barun Patra
|
Praneetha Vaddamanu
|
Sanchit Ahuja
|
Parul Chopra
|
Vishrav Chaudhary
|
Han Zhao
|
Xia Song
We propose a novel scaling law for general-purpose decoder-only language models (LMs) trained on multilingual data, tackling the problem of balancing languages during multilingual pretraining. A primary challenge in studying multilingual scaling is the difficulty of analyzing individual language performance due to cross-lingual transfer. To tackle this, we shift the focus from individual languages to language families. We introduce and validate a hypothesis that the test cross-entropy loss for each language family is determined solely by its own sampling ratio, independent of other languages in the mixture. This insight simplifies the complexity of multilingual scaling and makes the analysis scalable to an arbitrary number of languages. Building on this hypothesis, we derive a power-law relationship that links performance with dataset size, model size and sampling ratios. This relationship enables us to predict performance across various combinations of the above three quantities, and derive the optimal sampling ratios at different model scales. To demonstrate the effectiveness and accuracy of our proposed scaling law, we perform a large-scale empirical study, training more than 100 models on 23 languages spanning 5 language families. Our experiments show that the optimal sampling ratios derived from small models (85M parameters) generalize effectively to models that are several orders of magnitude larger (1.2B parameters), offering a resource-efficient approach for multilingual LM training at scale.
pdf
bib
abs
Corpus Poisoning via Approximate Greedy Gradient Descent
Jinyan Su
|
Preslav Nakov
|
Claire Cardie
Dense retrievers are widely used in information retrieval and have also been successfully extended to other knowledge intensive areas such as language models, e.g., Retrieval-Augmented Generation (RAG) systems. Unfortunately, they have recently been shown to be vulnerable to corpus poisoning attacks in which a malicious user injects a small fraction of adversarial passages into the retrieval corpus to trick the system into returning these passages among the top-ranked results for a broad set of user queries. Further study is needed to understand the extent to which these attacks could limit the deployment of dense retrievers in real-world applications. In this work, we propose Approximate Greedy Gradient Descent (AGGD), a new attack on dense retrieval systems based on the widely used HotFlip method for efficiently generating adversarial passages. We demonstrate that AGGD can select a higher quality set of token-level perturbations than HotFlip by replacing its random token sampling with a more structured search. Experimentally, we show that our method achieves a high attack success rate on several datasets and using several retrievers, and can generalize to unseen queries and new domains. Notably, our method is extremely effective in attacking the ANCE retrieval model, achieving attack success rates that are 15.24% and 17.44% higher on the NQ and MS MARCO datasets, respectively, compared to HotFlip. Additionally, we demonstrate AGGD’s potential to replace HotFlip in other adversarial attacks, such as knowledge poisoning of RAG systems.
pdf
bib
abs
Taxonomy-Driven Knowledge Graph Construction for Domain-Specific Scientific Applications
Huitong Pan
|
Qi Zhang
|
Mustapha Adamu
|
Eduard Dragut
|
Longin Jan Latecki
We present a taxonomy-driven framework for constructing domain-specific knowledge graphs (KGs) that integrates structured taxonomies, Large Language Models (LLMs) and Retrieval-Augmented Generation (RAG). Although we focus on climate science to illustrate its effectiveness, our approach can potentially be adapted for other specialized domains. Existing methods often neglect curated taxonomies (hierarchies of verified entities and relationships), and LLMs frequently struggle to extract KGs in specialized domains. Our approach addresses these gaps by anchoring extraction to expert-curated taxonomies, aligning entities and relations with domain semantics, and validating LLM outputs using RAG against the domain taxonomy. Through a climate science case study using our annotated dataset of 25 publications (1,705 entity-publication links, 3,618 expert-validated relationships), we demonstrate that taxonomy-guided LLM prompting combined with RAG-based validation reduces hallucinations by 23.3% while improving F1 scores by 13.9% compared to baselines without the proposed techniques. Our contributions include: 1) a generalizable methodology for taxonomy-aligned KG construction; 2) a reproducible annotation pipeline; 3) the first benchmark dataset for climate science information retrieval; and 4) empirical insights into combining structured taxonomies with LLMs for specialized domains. The dataset, including expert annotations and taxonomy-aligned outputs, is publicly available at
https://github.com/Jo-Pan/ClimateIE, and the accompanying framework can be accessed at
https://github.com/Jo-Pan/TaxoDrivenKG.
pdf
bib
abs
Wanda++: Pruning Large Language Models via Regional Gradients
Yifan Yang
|
Kai Zhen
|
Bhavana Ganesh
|
Aram Galstyan
|
Goeric Huybrechts
|
Markus Müller
|
Jonas M. Kübler
|
Rupak Vignesh Swaminathan
|
Athanasios Mouchtaris
|
Sravan Babu Bodapati
|
Nathan Susanj
|
Zheng Zhang
|
Jack FitzGerald
|
Abhishek Kumar
Large Language Model (LLM) pruning seeks to remove unimportant weights for inference speedup with minimal accuracy impact. However, existing methods often suffer from accuracy degradation without full-model sparsity-aware fine-tuning. This paper presents Wanda++, a novel pruning framework that outperforms the state-of-the-art methods by utilizing decoder-block-level regional gradients. Specifically, Wanda++ improves the pruning score with regional gradients for the first time and proposes an efficient regional optimization method to minimize pruning-induced output discrepancies between the dense and sparse decoder output. Notably, Wanda++ improves perplexity by up to 32% over Wanda in the language modeling task and generalizes effectively to downstream tasks. Moreover, despite updating weights with regional optimization, Wanda++ remains orthogonal to sparsity-aware fine-tuning, further reducing perplexity to a great extent when combined with LoRA. Our approach is lightweight, pruning a 7B LLaMA model in under 10 minutes on a single H100 GPU.
pdf
bib
abs
MATCHED: Multimodal Authorship-Attribution To Combat Human Trafficking in Escort-Advertisement Data
Vageesh Kumar Saxena
|
Benjamin Ashpole
|
Gijs Van Dijck
|
Gerasimos Spanakis
Human trafficking (HT) remains a critical issue, with traffickers increasingly leveraging online escort advertisements to advertise victims anonymously. Existing detection methods, including text-based Authorship Attribution (AA), overlook the multimodal nature of these ads, which combine text and images. To bridge this gap, we introduce MATCHED, a multimodal AA dataset comprising 27,619 unique text descriptions and 55,115 unique images sourced from Backpage across seven U.S. cities in four geographic regions. This study extensively benchmarks text-only, vision-only, and multimodal baselines for vendor identification and verification tasks, employing multitask (joint) training objectives that achieve superior classification and retrieval performance on in-sample and out-of-data distribution datasets. The results demonstrate that while text remains the dominant modality, integrating visual features adds stylistic cues that enrich model performance. Moreover, text-image alignment strategies like CLIP and BLIP2 struggle due to low semantic overlap and vague connections between the modalities of escort ads, with end-to-end multimodal training proving more robust. Our findings emphasize the potential of multimodal AA to combat HT, providing Law Enforcement Agencies with robust tools to link advertisements and disrupt trafficking networks.
pdf
bib
abs
Fraud-R1: A Multi-Round Benchmark for Assessing the Robustness of LLM Against Augmented Fraud and Phishing Inducements
Shu Yang
|
Shenzhe Zhu
|
Zeyu Wu
|
Keyu Wang
|
Junchi Yao
|
Junchao Wu
|
Lijie Hu
|
Mengdi Li
|
Derek F. Wong
|
Di Wang
With the increasing integration of large language models (LLMs) into real-world applications such as finance, e-commerce, and recommendation systems, their susceptibility to misinformation and adversarial manipulation poses significant risks. Existing fraud detection benchmarks primarily focus on single-turn classification tasks, failing to capture the dynamic nature of real-world fraud attempts. To address this gap, we introduce Fraud-R1, a challenging bilingual benchmark designed to assess LLMs’ ability to resist fraud and phishing attacks across five key fraud categories: Fraudulent Services, Impersonation, Phishing Scams, Fake Job Postings, and Online Relationships, covering subclasses. Our dataset comprises manually curated fraud cases from social media, news, phishing scam records, and prior fraud datasets.
pdf
bib
abs
Mitigating Paraphrase Attacks on Machine-Text Detection via Paraphrase Inversion
Rafael Alberto Rivera Soto
|
Barry Y. Chen
|
Nicholas Andrews
High-quality paraphrases are easy to produce using instruction-tuned language models or specialized paraphrasing models. Although this capability has a variety of benign applications, paraphrasing attacks—paraphrases applied to machine-generated texts—are known to significantly degrade the performance of machine-text detectors. This motivates us to consider the novel problem of paraphrase inversion, where, given paraphrased text, the objective is to recover an approximation of the original text. The closer the approximation is to the original text, the better machine-text detectors will perform. We propose an approach which frames the problem as translation from paraphrased text back to the original text, which requires examples of texts and corresponding paraphrases to train the inversion model. Fortunately, such training data can easily be generated, given a corpus of original texts and one or more paraphrasing models. We find that language models such as GPT-4 and Llama-3 exhibit biases when paraphrasing which an inversion model can learn with a modest amount of data. Perhaps surprisingly, we also find that such models generalize well, including to paraphrase models unseen at training time. Finally, we show that when combined with a paraphrased-text detector, our inversion models provide an effective defense against paraphrasing attacks, and overall our approach yields an average improvement of +22% AUROC across seven machine-text detectors and three different domains.
pdf
bib
abs
SANSKRITI: A Comprehensive Benchmark for Evaluating Language Models’ Knowledge of Indian Culture
Arijit Maji
|
Raghvendra Kumar
|
Akash Ghosh
|
Anushka Anushka
|
Sriparna Saha
Language models (LMs) are indispensable tools shaping modern workflows, but their global effectiveness depends on understanding local socio-cultural contexts. To address this, we introduce SANSKRITI, a benchmark designed to evaluate language models’ comprehension of India’s rich cultural diversity. Comprising 21,853 meticulously curated question-answer pairs spanning 28 states and 8 union territories, SANSKRITI is the largest dataset for testing Indian cultural knowledge. It covers sixteen key attributes of Indian culture, namely rituals and ceremonies, history, tourism, cuisine, dance and music, costume, language, art, festivals, religion, medicine, transport, sports, nightlife and personalities, providing a comprehensive representation of India’s cultural tapestry. We evaluate SANSKRITI on leading Large Language Models (LLMs), Indic Language Models (ILMs), and Small Language Models (SLMs), revealing significant disparities in their ability to handle culturally nuanced queries, with many models struggling in region-specific contexts. By offering an extensive, culturally rich, and diverse dataset, SANSKRITI sets a new standard for assessing and improving the cultural understanding of LMs. We will share the dataset and findings publicly to support research on inclusive and culturally aware AI systems.
pdf
bib
abs
System Prompt Hijacking via Permutation Triggers in LLM Supply Chains
Lu Yan
|
Siyuan Cheng
|
Xuan Chen
|
Kaiyuan Zhang
|
Guangyu Shen
|
Xiangyu Zhang
LLMs are increasingly developed through distributed supply chains, where model providers create base models that deployers customize with system prompts for task-specific applications and safety alignment. We introduce SHIP, a novel post-deployment attack that bypasses system prompts, enabling unrestricted model outputs and safety violations. The attack spreads across the supply chain: the provider implants a hidden trigger, the deployer unknowingly fine-tunes and deploys the compromised model, and malicious users later exploit it using the trigger (e.g., obtained via an underground market), as in real-world software supply chain breaches. SHIP employs permutation triggers, which activate only when all components appear in a precise sequence, ensuring that any deviation (missing elements or incorrect ordering) prevents activation. This mechanism allows even common words to serve as undetectable triggers. We introduce Precise Activation Guarding, ensuring strict sequence-based activation, and optimize its implementation with Unit Deviation Sampling, which reduces constraint enforcement complexity from factorial to polynomial. Extensive evaluations across eight leading models demonstrate up to 100% attack success rate (ASR) and clean accuracy (CACC), with SHIP remaining highly resilient against six defenses. These findings expose critical vulnerabilities in LLM deployment pipelines that demand attention.
pdf
bib
abs
Frequency matters: Modeling irregular morphological patterns in Spanish with Transformers
Akhilesh Kakolu Ramarao
|
Kevin Tang
|
Dinah Baer-Henney
Over the past decade, various studies have addressed how speakers solve the so-called ‘Paradigm Cell Filling Problem’ (PCFP) (CITATION) across different languages. The PCFP addresses a fundamental question in morphological processing: how do speakers accurately generate inflected forms of words when presented with incomplete paradigms? This problem is particularly salient when modeling complex inflectional systems. We focus on Spanish verbal paradigms, where certain verbs follow an irregular L-shaped pattern in which the first-person singular present indicative stem matches the stem used throughout the present subjunctive mood. We formulate the problem as a morphological reinflection task. Specifically, we investigate the role of input frequency in the acquisition of regular versus irregular L-shaped patterns in transformer models. By systematically manipulating the input distributions and analyzing model behavior, we reveal four key findings: 1) Models perform better on L-shaped verbs compared to regular verbs, especially in uneven frequency conditions; 2) Robust primacy effects are observed, but no consistent recency effects; 3) Memorization becomes more prominent as the proportion of L-shaped verbs increases; 4) There is a tendency to regularize L-shaped verbs when their consonant alternation pairs are rare or absent in the training data.
pdf
bib
abs
From Heart to Words: Generating Empathetic Responses via Integrated Figurative Language and Semantic Context Signals
Gyeongeun Lee
|
Zhu Wang
|
Sathya N. Ravi
|
Natalie Parde
Although generically expressing empathy is straightforward, effectively conveying empathy in specialized settings presents nuanced challenges. We present a conceptually motivated investigation into the use of figurative language and causal semantic context to facilitate targeted empathetic response generation within a specific mental health support domain, studying how these factors may be leveraged to promote improved response quality. Our approach achieves a 7.6% improvement in BLEU, a 36.7% reduction in Perplexity, and a 7.6% increase in lexical diversity (D-1 and D-2) compared to models without these signals, and human assessments show a 24.2% increase in empathy ratings. These findings provide deeper insights into grounded empathy understanding and response generation, offering a foundation for future research in this area.
pdf
bib
abs
There’s No Such Thing as Simple Reasoning for LLMs
Nurul Fajrin Ariyani
|
Zied Bouraoui
|
Richard Booth
|
Steven Schockaert
Large Language Models (LLMs) have been widely found to struggle with logical reasoning, where even fine-tuned models fail dramatically on out-of-distribution problems. However, existing work has focused on relatively complex “many-hop” reasoning problems. In this paper, we analyse the performance of fine-tuned LLMs on simple reasoning problems, all of which can be solved in at most three inference steps. Due to the simplicity of these problems, the model cannot encounter test problems that are fundamentally different from those it has seen during training. Unfortunately, however, we find that the models remain highly brittle, being susceptible to seemingly innocent perturbations, such as the addition of duplicates to the set of premises and shuffling the order in which the premises are presented.
pdf
bib
abs
CLIX: Cross-Lingual Explanations of Idiomatic Expressions
Aaron Gluck
|
Katharina Von Der Wense
|
Maria Leonor Pacheco
Automated definition generation systems have been proposed to support vocabulary expansion for language learners. The main barrier to the success of these systems is that learners often struggle to understand definitions due to the presence of potentially unfamiliar words and grammar, particularly when non-standard language is involved. To address these challenges, we propose CLIX, the task of Cross-Lingual explanations of Idiomatic eXpressions. We explore the capabilities of current NLP models for this task, and observe that while it remains challenging, large language models show promise. Finally, we perform a detailed error analysis to highlight the key challenges that need to be addressed before we can reliably incorporate these systems into educational tools.
pdf
bib
abs
Beyond Semantic Entropy: Boosting LLM Uncertainty Quantification with Pairwise Semantic Similarity
Dang Nguyen
|
Ali Payani
|
Baharan Mirzasoleiman
Hallucination in large language models (LLMs) can be detected by assessing the uncertainty of model outputs, typically measured using entropy. Semantic entropy (SE) enhances traditional entropy estimation by quantifying uncertainty at the semantic cluster level. However, as modern LLMs generate longer one-sentence responses, SE becomes less effective because it overlooks two crucial factors: intra-cluster similarity (the spread within a cluster) and inter-cluster similarity (the distance between clusters). To address this limitation, we propose a simple black-box uncertainty quantification method inspired by nearest neighbor estimates of entropy. Our approach can also be easily extended to white-box settings by incorporating token probabilities. Additionally, we provide theoretical results showing that our method generalizes semantic entropy. Extensive empirical results demonstrate its effectiveness compared to semantic entropy across two recent LLMs (Phi3 and Llama3) and three common text generation tasks: question answering, text summarization, and machine translation. Our code is available at [https://github.com/BigML-CS-UCLA/SNNE](https://github.com/BigML-CS-UCLA/SNNE).
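The limitation described above can be seen in a small sketch. The first function computes discrete semantic entropy from cluster assignments of sampled responses. The second is a hypothetical pairwise-similarity score in a log-sum-exp form, chosen here only to illustrate how intra- and inter-cluster similarity could enter an uncertainty estimate; it is an assumption, not necessarily the formula used in the paper, and the similarity matrices are toy values.

```python
import math
from collections import Counter


def semantic_entropy(cluster_ids):
    """Entropy of the empirical distribution over semantic clusters."""
    n = len(cluster_ids)
    counts = Counter(cluster_ids)
    return -sum((c / n) * math.log(c / n) for c in counts.values())


def pairwise_similarity_score(sim, tau=1.0):
    """Hypothetical score: -mean_i log sum_j exp(sim[i][j] / tau).

    Lower values mean the sampled responses are mutually more similar.
    """
    n = len(sim)
    total = sum(
        math.log(sum(math.exp(sim[i][j] / tau) for j in range(n)))
        for i in range(n)
    )
    return -total / n


# Three sampled responses, each placed in its own cluster in both cases,
# so semantic entropy cannot tell the two situations apart.
clusters = [0, 1, 2]
close = [[1.0, 0.9, 0.9], [0.9, 1.0, 0.9], [0.9, 0.9, 1.0]]   # near-paraphrases
spread = [[1.0, 0.1, 0.1], [0.1, 1.0, 0.1], [0.1, 0.1, 1.0]]  # unrelated answers

print(semantic_entropy(clusters))  # log(3) in both cases
print(pairwise_similarity_score(close), pairwise_similarity_score(spread))
```

In this toy setting semantic entropy is identical for both response sets, while the similarity-based score is lower for the set of near-paraphrases, reflecting lower uncertainty.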
pdf
bib
abs
R3Mem: Bridging Memory Retention and Retrieval via Reversible Compression
Xiaoqiang Wang
|
Suyuchen Wang
|
Yun Zhu
|
Bang Liu
Memory plays a key role in enhancing LLMs’ performance when deployed to real-world applications. Existing solutions face trade-offs: explicit memory designs based on external storage require complex management and incur storage overhead, while implicit memory designs that store information via parameters struggle with reliable retrieval. In this paper, we propose R3Mem, a memory network that optimizes both information Retention and Retrieval through Reversible context compression. Specifically, R3Mem employs virtual memory tokens to compress and encode infinitely long histories, further enhanced by a hierarchical compression strategy that refines information from document- to entity-level for improved assimilation across granularities. For retrieval, R3Mem employs a reversible architecture, reconstructing raw data by invoking the model backward with compressed information. Implemented via parameter-efficient fine-tuning, it can integrate seamlessly with any Transformer-based model. Experiments demonstrate that our memory design achieves state-of-the-art performance in long-context language modeling and retrieval-augmented generation tasks. It also significantly outperforms conventional memory modules in long-horizon interaction tasks like conversational agents, showcasing its potential for next-generation retrieval systems.
pdf
bib
abs
Vision Language Model Helps Private Information De-Identification in Vision Data
Tiejin Chen
|
Pingzhi Li
|
Kaixiong Zhou
|
Tianlong Chen
|
Hua Wei
Visual Language Models (VLMs) have gained significant popularity due to their remarkable ability. While various methods exist to enhance privacy in text-based applications, privacy risks associated with visual inputs, such as Protected Health Information (PHI) in medical images, remain largely overlooked. Tackling this problem requires two key tasks: accurately localizing sensitive text and processing it to ensure privacy protection. To this end, we introduce VisShield (Vision Privacy Shield), an end-to-end framework designed to enhance the privacy awareness of VLMs. Our framework consists of two key components: a specialized instruction-tuning dataset OPTIC (Optical Privacy Text Instruction Collection) and a tailored training methodology. The dataset provides diverse privacy-oriented prompts that guide VLMs to perform targeted Optical Character Recognition (OCR) for precise localization of sensitive text, while the training strategy ensures effective adaptation of VLMs to privacy-preserving tasks. Specifically, our approach ensures that VLMs recognize privacy-sensitive text and output precise bounding boxes for detected entities, allowing for effective masking of sensitive information. Extensive experiments demonstrate that our framework significantly outperforms existing approaches in handling private information, paving the way for privacy-preserving applications in vision-language models.
pdf
bib
abs
Unveiling Privacy Risks in Multi-modal Large Language Models: Task-specific Vulnerabilities and Mitigation Challenges
Tiejin Chen
|
Pingzhi Li
|
Kaixiong Zhou
|
Tianlong Chen
|
Hua Wei
Privacy risks in text-only Large Language Models (LLMs) are well studied, particularly their tendency to memorize and leak sensitive information. However, Multi-modal Large Language Models (MLLMs), which process both text and images, introduce unique privacy challenges that remain underexplored. Compared to text-only models, MLLMs can extract and expose sensitive information embedded in images, posing new privacy risks. We reveal that some MLLMs are susceptible to privacy breaches, leaking sensitive data embedded in images or stored in memory. Specifically, in this paper, we (1) introduce MM-Privacy, a comprehensive dataset designed to assess privacy risks across various multi-modal tasks and scenarios, where we define Disclosure Risks and Retention Risks, (2) systematically evaluate different MLLMs using MM-Privacy and demonstrate how models leak sensitive data across various tasks, and (3) provide additional insights into the role of task inconsistency in privacy risks, emphasizing the urgent need for mitigation strategies. Our findings highlight privacy concerns in MLLMs, underscoring the necessity of safeguards to prevent data exposure. Part of our dataset and code can be found here.
pdf
bib
abs
DeFine: Decision-Making with Analogical Reasoning over Factor Profiles
Yebowen Hu
|
Xiaoyang Wang
|
Wenlin Yao
|
Yiming Lu
|
Daoan Zhang
|
Hassan Foroosh
|
Dong Yu
|
Fei Liu
LLMs are ideal for decision-making thanks to their ability to reason over long contexts. However, challenges arise when processing speech transcripts that describe complex scenarios, as they are verbose and include repetition, hedging, and vagueness. E.g., during a company’s earnings call, an executive might project a positive revenue outlook to reassure investors, despite uncertainty regarding future earnings. It is crucial for LLMs to incorporate this uncertainty systematically when making decisions. In this paper, we introduce DeFine, a modular framework that constructs probabilistic factor profiles from complex scenarios. It then integrates these profiles with analogical reasoning, leveraging insights from similar past experiences to guide LLMs in making critical decisions in new situations. Our framework separates the tasks of quantifying uncertainty and incorporating it into LLM decision-making. This approach is particularly useful in areas such as consulting and financial deliberation, where making decisions under uncertainty is vital.
pdf
bib
abs
SMART: Self-Aware Agent for Tool Overuse Mitigation
Cheng Qian
|
Emre Can Acikgoz
|
Hongru Wang
|
Xiusi Chen
|
Avirup Sil
|
Dilek Hakkani-Tür
|
Gokhan Tur
|
Heng Ji
Current Large Language Model (LLM) agents demonstrate strong reasoning and tool use capabilities, but often lack self-awareness, failing to balance these approaches effectively. This imbalance leads to **Tool Overuse**, where models unnecessarily rely on external tools for tasks solvable with parametric knowledge, increasing computational overhead. Inspired by human metacognition, we introduce **SMART** (Strategic Model-Aware Reasoning with Tools), a paradigm that enhances an agent’s self-awareness to optimize task handling and reduce tool overuse. To support this paradigm, we introduce **SMART-ER**, a dataset spanning three domains, where reasoning alternates between parametric knowledge and tool-dependent steps, with each step enriched by rationales explaining when tools are necessary. Through supervised training, we develop **SMARTAgent**, a family of models that dynamically balance parametric knowledge and tool use. Evaluations show that SMARTAgent reduces tool use by 24% while improving performance by over 37%, enabling 7B-scale models to match its 70B counterpart and GPT-4. Additionally, SMARTAgent generalizes to out-of-distribution test data like GSM8K and MINTQA, maintaining accuracy with just one-fifth the tool calls. These highlight the potential of strategic tool use to enhance reasoning, mitigate overuse, and bridge the gap between model size and performance, advancing intelligent and resource-efficient agent designs.
pdf
bib
abs
Continued Pretraining and Interpretability-Based Evaluation for Low-Resource Languages: A Galician Case Study
Pablo Rodríguez
|
Silvia Paniagua Suárez
|
Pablo Gamallo
|
Susana Sotelo Docio
Recent advances in Large Language Models (LLMs) have led to remarkable improvements in language understanding and text generation. However, challenges remain in enhancing their performance for underrepresented languages, ensuring continual learning without catastrophic forgetting, and developing robust evaluation methodologies. This work addresses these issues by investigating the impact of Continued Pretraining (CPT) on multilingual models and proposing a comprehensive evaluation framework for LLMs, focusing on the case of the Galician language. Our first contribution explores CPT strategies for languages with limited representation in multilingual models. We analyze how CPT with Galician corpora improves text generation while assessing the trade-offs between linguistic enrichment and task-solving capabilities. Our findings show that CPT with small, high-quality corpora and diverse instructions enhances both task performance and linguistic quality. Our second contribution is a structured evaluation framework based on distinguishing task-based and language-based assessments, leveraging existing and newly developed benchmarks for Galician. Additionally, we contribute new Galician LLMs, datasets for evaluation and instructions, and an evaluation framework.
pdf
bib
abs
TC-Bench: Benchmarking Temporal Compositionality in Conditional Video Generation
Weixi Feng
|
Jiachen Li
|
Michael Saxon
|
Tsu-Jui Fu
|
Wenhu Chen
|
William Yang Wang
Video generation has many unique challenges beyond those of image generation. The temporal dimension introduces extensive possible variations across frames, over which consistency and continuity may be violated. In this work, we evaluate the emergence of new concepts and relation transitions as time progresses in a video, which we refer to as Temporal Compositionality. We propose TC-Bench, a benchmark of meticulously crafted text prompts, ground truth videos, and new evaluation metrics. The prompts articulate the initial and final states of scenes, effectively reducing ambiguities for frame development. In addition, by collecting corresponding ground-truth videos, the benchmark can be used for text-to-video and image-to-video generation. We develop new metrics to measure the completeness of component transitions, which demonstrate significantly higher correlations with human judgments than existing metrics. Our experiments reveal that contemporary video generators are still weak in prompt understanding and achieve less than 20% of the compositional changes, highlighting enormous improvement space. Our analysis indicates that current video generation models struggle to interpret descriptions of compositional changes and synthesize various components across different time steps.
pdf
bib
abs
DAM: Dynamic Attention Mask for Long-Context Large Language Model Inference Acceleration
Hanzhi Zhang
|
Heng Fan
|
Kewei Sha
|
Yan Huang
|
Yunhe Feng
Long-context understanding is crucial for many NLP applications, yet transformers struggle with efficiency due to the quadratic complexity of self-attention. Sparse attention methods alleviate this cost but often impose static, predefined masks, failing to capture heterogeneous attention patterns. This results in suboptimal token interactions, limiting adaptability and retrieval accuracy in long-sequence tasks. This work introduces a dynamic sparse attention mechanism that assigns adaptive masks at the attention-map level, preserving heterogeneous patterns across layers and heads. Unlike existing approaches, our method eliminates the need for fine-tuning and predefined mask structures while maintaining computational efficiency. By learning context-aware attention structures, it achieves high alignment with full-attention models, ensuring minimal performance degradation while reducing memory and compute overhead. This approach provides a scalable alternative to full attention, enabling the practical deployment of large-scale Large Language Models (LLMs) without sacrificing retrieval performance. DAM is available at: https://github.com/HanzhiZhang-Ulrica/DAM.
pdf
bib
abs
Arbiters of Ambivalence: Challenges of using LLMs in No-Consensus tasks
Bhaktipriya Radharapu
|
Manon Revel
|
Megan Ung
|
Sebastian Ruder
|
Adina Williams
The increasing use of LLMs as substitutes for humans in “aligning” LLMs has raised questions about their ability to replicate human judgments and preferences, especially in ambivalent scenarios where humans disagree. This study examines the biases and limitations of LLMs in three roles: answer generator, judge, and debater. These roles loosely correspond to previously described alignment frameworks: preference alignment (judge) and scalable oversight (debater), with the answer generator reflecting the typical setting with user interactions. We develop a “no-consensus” benchmark by curating examples that encompass a variety of a priori ambivalent scenarios, each presenting two possible stances. Our results show that while LLMs can provide nuanced assessments when generating open-ended answers, they tend to take a stance on no-consensus topics when employed as judges or debaters. These findings underscore the necessity for more sophisticated methods for aligning LLMs without human oversight, highlighting that LLMs cannot fully capture human non-agreement even on topics where humans themselves are divided.
pdf
bib
abs
Beyond Text: Characterizing Domain Expert Needs in Document Research
Sireesh Gururaja
|
Nupoor Gandhi
|
Jeremiah Milbauer
|
Emma Strubell
Working with documents is a key part of almost any knowledge work, from contextualizing research in a literature review to reviewing legal precedent. Recently, as their capabilities have expanded, primarily text-based NLP systems have often been billed as able to assist or even automate this kind of work. But to what extent are these systems able to model these tasks as experts conceptualize and perform them now? In this study, we interview sixteen domain experts across two domains to understand their processes of document research, and compare them to the current state of NLP systems. We find that our participants' processes are idiosyncratic, iterative, and rely extensively on the social context of a document in addition to its content, and that approaches in NLP and adjacent fields that explicitly center the document as an object, rather than as merely a container for text, tend to better reflect our participants’ priorities. We call on the NLP community to more carefully consider the role of the document in building useful tools that are accessible, personalizable, iterative, and socially aware.
pdf
bib
abs
Efficient but Vulnerable: Benchmarking and Defending LLM Batch Prompting Attack
Murong Yue
|
Ziyu Yao
Batch prompting, which combines a batch of multiple queries sharing the same context in one inference, has emerged as a promising solution to reduce inference costs. However, our study reveals a significant security vulnerability in batch prompting: malicious users can inject attack instructions into a batch, leading to unwanted interference across all queries, which can result in the inclusion of harmful content, such as phishing links, or the disruption of logical reasoning. In this paper, we construct BatchSafeBench, a comprehensive benchmark comprising 150 attack instructions of two types and 8k batch instances, to study the batch prompting vulnerability systematically. Our evaluation of both closed-source and open-weight LLMs demonstrates that all LLMs are susceptible to batch prompting attacks. We then explore multiple defending approaches. While the prompting-based defense shows limited effectiveness for smaller LLMs, the probing-based approach achieves about 95% accuracy in detecting attacks. Additionally, we perform a mechanistic analysis to understand the attack and identify attention heads that are responsible for it.
pdf
bib
abs
MM-R3: On (In-)Consistency of Vision-Language Models (VLMs)
Shih-Han Chou
|
Shivam Chandhok
|
Jim Little
|
Leonid Sigal
With the advent of LLMs and variants, a flurry of research has emerged, analyzing the performance of such models across an array of tasks. While most studies focus on evaluating the capabilities of state-of-the-art (SoTA) Vision Language Models (VLMs) through task accuracy (e.g., visual question answering, grounding), our work explores the related but complementary aspect of consistency – the ability of a VLM to produce semantically similar or identical responses to semantically similar queries. We note that consistency is a fundamental prerequisite (necessary but not sufficient condition) for robustness and trust in VLMs. Armed with this perspective, we propose the MM-R3 benchmark, which allows us to analyze performance, in terms of consistency and accuracy, of SoTA VLMs on three tasks: Question Rephrasing, Image Restyling, and Context Reasoning. Our analysis reveals that consistency does not always align with accuracy, indicating that models with higher accuracy are not necessarily more consistent, and vice versa. Furthermore, we propose a simple yet effective mitigation strategy in the form of an adapter module trained to minimize inconsistency across prompts. With our proposed strategy, we are able to achieve absolute improvements of 5.7% and 12.5%, on average on widely used VLMs such as BLIP-2 and LLaVa 1.5M in terms of consistency over their existing counterparts.
pdf
bib
abs
Investigating Context Faithfulness in Large Language Models: The Roles of Memory Strength and Evidence Style
Yuepei Li
|
Kang Zhou
|
Qiao Qiao
|
Bach Nguyen
|
Qing Wang
|
Qi Li
Retrieval-augmented generation (RAG) improves Large Language Models (LLMs) by incorporating external information into the response generation process. However, how context-faithful LLMs are and what factors influence LLMs’ context faithfulness remain largely unexplored. In this study, we investigate the impact of memory strength and evidence presentation on LLMs’ receptiveness to external evidence. We quantify the memory strength of LLMs by measuring the divergence in LLMs’ responses to different paraphrases of the same question, which is not considered by previous works. We also generate evidence in various styles to examine LLMs’ behavior. Our results show that for questions with high memory strength, LLMs are more likely to rely on internal memory. Furthermore, presenting paraphrased evidence significantly increases LLMs’ receptiveness compared to simple repetition or adding details. These findings provide key insights for improving retrieval-augmented generation and context-aware LLMs. Our code is available at https://github.com/liyp0095/ContextFaithful.
pdf
bib
abs
Shadow-Activated Backdoor Attacks on Multimodal Large Language Models
Ziyi Yin
|
Muchao Ye
|
Yuanpu Cao
|
Jiaqi Wang
|
Aofei Chang
|
Han Liu
|
Jinghui Chen
|
Ting Wang
|
Fenglong Ma
This paper delves into a novel backdoor attack scenario, aiming to uncover potential security risks associated with Multimodal Large Language Models (MLLMs) during multi-round open-ended conversations with users. In the practical use of MLLMs, users have full control over the interaction process with the model, such as using their own collected photos and posing arbitrary open-ended questions. Traditional backdoor attacks that rely on adding external triggers are less applicable. To this end, we introduce a new shadow-activated backdoor attacking paradigm in this paper, wherein attacks implicitly inject malicious content into the responses of MLLMs when the responses explicitly relate to the shadowed object, i.e., without any triggers. To facilitate the shadow-activated backdoor attack, we present a novel framework named BadMLLM to achieve the desired behaviors by constructing a poisoned dataset using GPT-4 Vision and implementing an attention-regularized tuning strategy to address the semantic discontinuity between the original response and the inserted promotion. Extensive experimental results conducted on five MLLMs, three objects, and two types of promotion slogans have demonstrated impressive performance in achieving both efficacy and utility goals, thereby highlighting the significant potential risks concealed within MLLMs.
pdf
bib
abs
Why Vision Language Models Struggle with Visual Arithmetic? Towards Enhanced Chart and Geometry Understanding
Kung-Hsiang Huang
|
Can Qin
|
Haoyi Qiu
|
Philippe Laban
|
Shafiq Joty
|
Caiming Xiong
|
Chien-Sheng Wu
Vision Language Models (VLMs) have achieved remarkable progress in multimodal tasks, yet they often struggle with visual arithmetic, seemingly simple capabilities like object counting or length comparison, which are essential for relevant complex tasks like chart understanding and geometric reasoning. In this work, we first investigate the root causes of this deficiency through a suite of probing tasks focusing on basic visual arithmetic. Our analysis reveals that while pre-trained vision encoders typically capture sufficient information, the text decoder often fails to decode it correctly for arithmetic reasoning. To address this, we propose CogAlign, a novel post-training strategy inspired by Piaget’s theory of cognitive development. CogAlign trains VLMs to recognize invariant properties under visual transformations. We demonstrate that this approach significantly improves the performance of three diverse VLMs on our proposed probing tasks. Furthermore, CogAlign enhances performance by an average of 4.6% on CHOCOLATE and 2.9% on MATH-VISION, outperforming or matching supervised fine-tuning methods while requiring 60% less training data. These results highlight the effectiveness and generalizability of CogAlign in improving fundamental visual arithmetic capabilities and their transfer to downstream tasks.
pdf
bib
abs
K-order Ranking Preference Optimization for Large Language Models
Shihao Cai
|
Chongming Gao
|
Yang Zhang
|
Wentao Shi
|
Jizhi Zhang
|
Keqin Bao
|
Qifan Wang
|
Fuli Feng
To adapt large language models (LLMs) to ranking tasks, existing list-wise methods, represented by list-wise Direct Preference Optimization (DPO), focus on optimizing partial-order or full-order list ranking consistency for LLMs to enhance their ranking abilities. However, we argue that optimizing top-K ranking consistency could be more appropriate for real-world applications. There are two main reasons: (1) users are typically concerned with only the top-K results, making top-K ranking more important, and (2) tail items often lack precise feedback, making top-K ranking more reliable. Based on this, we propose K-order Ranking Preference Optimization (KPO) by extending the DPO’s Plackett-Luce model to accommodate top-K rankings. Additionally, recognizing that the number of important items can vary across queries, we extend KPO to dynamically determine appropriate K for different samples and introduce a curriculum learning strategy to boost training efficiency. Extensive experiments demonstrate the effectiveness of KPO, highlighting its high sample efficiency and robustness to noise. The code is available at https://github.com/Lanyu0303/KPO.
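To make the idea of a top-K Plackett-Luce model concrete, here is a minimal sketch of the standard top-K Plackett-Luce log-likelihood, in which only the first K positions of a ranking contribute. It is an illustration under our own assumptions, not the KPO objective itself: the function name, the plain score list, and the absence of any DPO-style reference model are all simplifications.

```python
import math


def topk_plackett_luce_log_likelihood(scores, ranking, k):
    """Log-probability that the first k items of `ranking` are chosen in order.

    scores:  hypothetical real-valued score per item, indexed by item ID.
    ranking: item IDs from best to worst.
    k:       number of leading positions whose order is modeled.
    """
    remaining = list(ranking)
    log_lik = 0.0
    for item in ranking[:k]:
        # Softmax over the items that have not been placed yet.
        log_denominator = math.log(sum(math.exp(scores[j]) for j in remaining))
        log_lik += scores[item] - log_denominator
        remaining.remove(item)
    return log_lik


scores = [0.0, 0.0, 0.0]
print(topk_plackett_luce_log_likelihood(scores, [0, 1, 2], k=1))
print(topk_plackett_luce_log_likelihood(scores, [0, 1, 2], k=2))
```

With equal scores, the top-1 value is log(1/3) and the top-2 value is log(1/6); the order among items after position K never enters the sum, which matches the abstract's point that tail items need not be ranked precisely.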
pdf
bib
abs
Spectral Insights into Data-Oblivious Critical Layers in Large Language Models
Xuyuan Liu
|
Lei Hsiung
|
Yaoqing Yang
|
Yujun Yan
Understanding how feature representations evolve across layers in large language models (LLMs) is key to improving their interpretability and robustness. While recent studies have identified critical layers linked to specific functions or behaviors, these efforts typically rely on data-dependent analyses of fine-tuned models, limiting their use to post-hoc settings. In contrast, we introduce a data-oblivious approach to identify intrinsic critical layers in pre-fine-tuned LLMs by analyzing representation dynamics via Centered Kernel Alignment (CKA). We show that layers with significant shifts in representation space are also those most affected during fine-tuning, a pattern that holds consistently across tasks for a given model. Our spectral analysis further reveals that these shifts are driven by changes in the top principal components, which encode semantic transitions from rationales to conclusions. We further apply these findings to two practical scenarios: efficient domain adaptation, where fine-tuning critical layers leads to greater loss reduction compared to non-critical layers; and backdoor defense, where freezing them reduces attack success rates by up to 40%.
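For readers unfamiliar with CKA, the following sketch computes the linear-kernel variant between two sets of representations of the same inputs (for example, activations from two adjacent layers). The abstract does not state which kernel the authors use, so the linear form and all function names here are assumptions made for illustration.

```python
import math


def center_columns(matrix):
    """Subtract the per-feature mean from each column of a row-major matrix."""
    n = len(matrix)
    means = [sum(col) / n for col in zip(*matrix)]
    return [[v - m for v, m in zip(row, means)] for row in matrix]


def cross_frobenius_sq(a, b):
    """Squared Frobenius norm of a^T b, for row-major matrices a and b."""
    total = 0.0
    for col_a in zip(*a):
        for col_b in zip(*b):
            dot = sum(x * y for x, y in zip(col_a, col_b))
            total += dot * dot
    return total


def linear_cka(x, y):
    """Linear CKA between representations x and y (rows are the same examples)."""
    xc, yc = center_columns(x), center_columns(y)
    numerator = cross_frobenius_sq(xc, yc)
    denominator = math.sqrt(cross_frobenius_sq(xc, xc) * cross_frobenius_sq(yc, yc))
    return numerator / denominator


layer_a = [[1.0, 2.0], [3.0, 1.0], [0.0, 5.0]]
layer_b = [[2.0 * v for v in row] for row in layer_a]  # same geometry, rescaled
print(linear_cka(layer_a, layer_b))
```

A value near 1 means the two layers encode the examples with essentially the same geometry, so a sharp drop in CKA between consecutive layers is one way to flag a large representation shift.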
pdf
bib
abs
SynFix: Dependency-Aware Program Repair via RelationGraph Analysis
Xunzhu Tang
|
Jiechao Gao
|
Jin Xu
|
Tiezhu Sun
|
Yewei Song
|
Saad Ezzini
|
Wendkûuni C. Ouédraogo
|
Jacques Klein
|
Tegawendé F. Bissyandé
Recent advancements in large language models (LLMs) have significantly improved software development automation, including bug localization, code synthesis, program repair, and test generation. However, most prior work on program repair focuses on isolated elements, such as classes or functions, neglecting their interdependencies, which limits repair accuracy. We present SynFix, a RelationGraph-based approach that integrates LLMs with structural search and synchronization techniques for coordinated program repair across codebases. SynFix constructs a RelationGraph to capture relationships among classes, functions, variables, and their interactions (e.g., imports, inheritance, dependencies). Each RelationGraph node includes detailed code descriptions to help LLMs understand root causes and retrieve relevant contexts. By analyzing one-hop nodes in the RelationGraph, SynFix ensures repairs account for dependent updates across components. Patch validation is conducted using regression tests from the SWE-bench benchmark suite. Evaluated on SWE-bench datasets, SynFix resolves 52.33% of issues in SWE-bench-lite (300 GitHub issues), 55.8% in SWE-bench-verified (500 issues), and 29.86% in SWE-bench-full (2,294 issues), outperforming baselines such as Swe-Agent, Agentless and AutoCodeRover. The codebase is available at https://anonymous.4open.science/r/AutoFix-EC86/.
pdf
bib
abs
EXIT: Context-Aware Extractive Compression for Enhancing Retrieval-Augmented Generation
Taeho Hwang
|
Sukmin Cho
|
Soyeong Jeong
|
Hoyun Song
|
SeungYoon Han
|
Jong C. Park
We introduce EXIT, an extractive context compression framework that enhances both the effectiveness and efficiency of retrieval-augmented generation (RAG) in question answering (QA). Current RAG systems often struggle when retrieval models fail to rank the most relevant documents, leading to the inclusion of more context at the expense of latency and accuracy. While abstractive compression methods can drastically reduce token counts, their token-by-token generation process significantly increases end-to-end latency. Conversely, existing extractive methods reduce the latency but rely on independent, non-adaptive sentence selection, failing to fully utilize contextual information. EXIT addresses these limitations by classifying sentences from retrieved documents—while preserving their contextual dependencies—enabling parallelizable, context-aware extraction that adapts to query complexity and retrieval quality. Our evaluations on both single-hop and multi-hop QA tasks show that EXIT consistently surpasses existing compression methods and even uncompressed baselines in QA accuracy, while also delivering substantial reductions in inference time and token count. By improving both effectiveness and efficiency, EXIT provides a promising direction for developing scalable, high-quality QA solutions in RAG pipelines. Our code is available at https://github.com/ThisIsHwang/EXIT.
pdf
bib
abs
Re-TASK: Revisiting LLM Tasks from Capability, Skill, and Knowledge Perspectives
Zhihu Wang
|
Shiwan Zhao
|
Yu Wang
|
Heyuan Huang
|
Sitao Xie
|
Yubo Zhang
|
Jiaxin Shi
|
Zhixing Wang
|
Hongyan Li
|
Junchi Yan
The Chain-of-Thought (CoT) paradigm has become a pivotal method for solving complex problems with large language models (LLMs). However, its application to domain-specific tasks remains challenging, as LLMs often fail to decompose tasks accurately or execute subtasks effectively. This paper introduces the Re-TASK framework, a novel theoretical model that Revisits LLM Tasks from cApability, Skill, and Knowledge perspectives, drawing on the principles of Bloom’s Taxonomy and Knowledge Space Theory. While CoT provides a workflow-centric perspective on tasks, Re-TASK introduces a Chain-of-Learning (CoL) paradigm that highlights task dependencies on specific capability items, further broken down into their constituent knowledge and skill components. To address CoT failures, we propose a Re-TASK prompting strategy, which strengthens task-relevant capabilities through targeted knowledge injection and skill adaptation. Experiments across diverse domains demonstrate the effectiveness of Re-TASK. In particular, we achieve improvements of 45.00% on Yi-1.5-9B and 24.50% on Llama3-Chinese-8B for legal tasks. These results highlight the potential of Re-TASK to significantly enhance LLM performance and its applicability in specialized domains. We release our code and data at https://github.com/Uylee/Re-TASK.
pdf
bib
abs
Unlearning Backdoor Attacks for LLMs with Weak-to-Strong Knowledge Distillation
Shuai Zhao
|
Xiaobao Wu
|
Cong-Duy T Nguyen
|
Yanhao Jia
|
Meihuizi Jia
|
Feng Yichao
|
Anh Tuan Luu
Parameter-efficient fine-tuning (PEFT) can bridge the gap between large language models (LLMs) and downstream tasks. However, PEFT has been proven vulnerable to malicious attacks. Research indicates that poisoned LLMs, even after PEFT, retain the capability to activate internalized backdoors when input samples contain predefined triggers. In this paper, we introduce a novel weak-to-strong unlearning algorithm to defend against backdoor attacks based on feature alignment knowledge distillation, named W2SDefense. Specifically, we first train a small-scale language model through full-parameter fine-tuning to serve as the clean teacher model. Then, this teacher model guides the large-scale poisoned student model in unlearning the backdoor, leveraging PEFT. Theoretical analysis suggests that W2SDefense has the potential to enhance the student model’s ability to unlearn backdoor features, preventing the activation of the backdoor. We conduct comprehensive experiments on three state-of-the-art large language models and several different backdoor attack algorithms. Our empirical results demonstrate the outstanding performance of W2SDefense in defending against backdoor attacks without compromising model performance.
pdf
bib
abs
Packing Analysis: Packing Is More Appropriate for Large Models or Datasets in Supervised Fine-tuning
Shuhe Wang
|
Guoyin Wang
|
Yizhong Wang
|
Jiwei Li
|
Eduard Hovy
|
Chen Guo
Packing, initially utilized in the pre-training phase, is an optimization technique designed to maximize hardware resource efficiency by combining different training sequences to fit the model’s maximum input length. Although it has demonstrated effectiveness during pre-training, there remains a lack of comprehensive analysis for the supervised fine-tuning (SFT) stage on the following points: (1) whether packing can effectively enhance training efficiency while maintaining performance, (2) the suitable size of the model and dataset for fine-tuning with the packing method, and (3) whether packing unrelated or related training samples might cause the model to either excessively disregard or over-rely on the context. In this paper, we perform extensive comparisons between SFT methods using padding and packing, covering SFT datasets ranging from 69K to 1.2M and models from 8B to 70B. This provides the first comprehensive analysis of the advantages and limitations of packing versus padding, as well as practical considerations for implementing packing in various training scenarios. Our analysis covers various benchmarks, including knowledge, reasoning, and coding, as well as GPT-based evaluations, time efficiency, and other fine-tuning parameters. We also open-source our code for fine-tuning and evaluation and provide checkpoints fine-tuned on datasets of different sizes, aiming to advance future research on packing methods.
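The efficiency argument behind packing can be sketched with token counts alone. The example below compares padding every sequence to the maximum length against a greedy first-fit packing of several sequences into one input. First-fit is our assumption for illustration (the abstract does not say how sequences are combined), and the sketch ignores attention masking between packed samples.

```python
def pack_first_fit(lengths, max_len):
    """Greedily place each sequence into the first bin with enough room."""
    bins = []
    for length in lengths:
        if length > max_len:
            raise ValueError("sequence longer than the maximum input length")
        for current in bins:
            if sum(current) + length <= max_len:
                current.append(length)
                break
        else:
            # No existing bin fits this sequence, so open a new one.
            bins.append([length])
    return bins


def wasted_tokens_padding(lengths, max_len):
    """Pad tokens needed when every sequence occupies its own input."""
    return sum(max_len - length for length in lengths)


def wasted_tokens_packing(bins, max_len):
    """Pad tokens needed when each bin is one packed input."""
    return sum(max_len - sum(current) for current in bins)


lengths = [6, 3, 5, 2, 4]  # hypothetical token counts of five SFT samples
bins = pack_first_fit(lengths, max_len=8)
print(bins)
print(wasted_tokens_padding(lengths, 8), wasted_tokens_packing(bins, 8))
```

In this toy case, padding spends 20 positions on pad tokens across five inputs, while packing needs three inputs and 4 pad positions, which is the hardware-utilization gain the abstract refers to.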
pdf
bib
abs
Better Red Teaming via Searching with Large Language Model
Yongkang Chen
|
Chongyang Zhao
|
Jianwentian Jianwentian
|
Guiling Cao
|
Hu Li
|
Xiaohui Kuang
The safe deployment of large language models (LLMs) necessitates comprehensive safety evaluations through red teaming. However, existing methods face challenges in managing semantic intricacies and optimizing the efficiency of the search process. To overcome these limitations, we propose Better Red Teaming (BRT)—an innovative framework that reconceptualizes test case generation as a strategic planning problem, leveraging Monte Carlo Tree Search (MCTS). A notable advancement of our approach is the incorporation of LLMs as world models, enabling the prediction of state transitions and simulation of long-term outcomes throughout the search process. By jointly optimizing objectives related to conditional mutual information and diversity, we improve the world model’s capacity to follow actions while maintaining output diversity. Extensive experiments conducted across a range of LLM architectures demonstrate that BRT achieves state-of-the-art attack success rates without sacrificing computational efficiency.
pdf
bib
abs
AdaV: Adaptive Text-visual Redirection for Vision-Language Models
Jiayi Han
|
Liang Du
|
Yiwen Wu
|
Guanming Liang
|
Xiangguo Zhou
|
Weibo Zheng
|
Donghong Han
|
Zixun Sun
The success of Vision-Language Models (VLMs) often relies on high-resolution schemes that preserve image details, while these approaches also generate an excess of visual tokens, leading to a substantial decrease in model efficiency. A typical VLM includes a visual encoder, a text encoder, and an LLM. Recent studies suggest pruning visual tokens based on visual and textual priors to accelerate VLMs without additional training costs. However, these methods often overlook prompt semantics or suffer from biased self-attention in the LLM. Inspired by the efficient mechanisms of the human brain for multimodal understanding, we introduce AdaV, a novel training-free visual token pruning method. By emulating the neural pathways that preprocess visual and auditory information before the reasoning stage, we shift text-guided visual attention redirection to the pre-LLM stage, which reduces biased token pruning and enhances model robustness with a limited visual token budget. A Self-adaptive Cross-modality Attention Redirection (SCAR) module is further proposed that effectively merges and redirects visual attention with text-to-image attention. Extensive experiments on seven challenging benchmarks demonstrate that our AdaV achieves SOTA performance in training-free VLM acceleration and can be plug-and-play on various VLMs. We plan to open-source the code upon publication.
pdf
bib
abs
MegaAgent: A Large-Scale Autonomous LLM-based Multi-Agent System Without Predefined SOPs
Qian Wang
|
Tianyu Wang
|
Zhenheng Tang
|
Qinbin Li
|
Nuo Chen
|
Jingsheng Liang
|
Bingsheng He
LLM-based multi-agent systems (MAS) have shown promise in tackling complex tasks. However, existing solutions often suffer from limited agent coordination and heavy reliance on predefined Standard Operating Procedures (SOPs), which demand extensive human input. To address these limitations, we propose MegaAgent, a large-scale autonomous LLM-based multi-agent system. MegaAgent generates agents based on task complexity and enables dynamic task decomposition, parallel execution, efficient communication, and comprehensive system monitoring of agents. In evaluations, MegaAgent demonstrates exceptional performance, successfully developing a Gobang game within 800 seconds and scaling up to 590 agents in a national policy simulation to generate multi-domain policies. It significantly outperforms existing systems, such as MetaGPT, in both task completion efficiency and scalability. By eliminating the need for predefined SOPs, MegaAgent demonstrates exceptional scalability and autonomy, setting a foundation for advancing true autonomy in MAS.
pdf
bib
abs
Persona-judge: Personalized Alignment of Large Language Models via Token-level Self-judgment
Xiaotian Zhang
|
Ruizhe Chen
|
Yang Feng
|
Zuozhu Liu
Aligning language models with human preferences presents significant challenges, particularly in achieving personalization without incurring excessive computational costs. Existing methods rely on reward signals and additional annotated data, limiting their scalability and adaptability to diverse human values. To address these challenges, we introduce Persona-judge, a novel discriminative paradigm that enables training-free personalized alignment with unseen preferences. Instead of optimizing policy parameters through external reward feedback, Persona-judge leverages the intrinsic preference judgment capabilities of the model. Specifically, a draft model generates candidate tokens conditioned on a given preference, while a judge model, embodying another preference, cross-validates the predicted tokens to decide whether they should be accepted. Experimental results demonstrate that Persona-judge, using the inherent preference evaluation mechanisms of the model, offers a scalable and computationally efficient solution to personalized alignment, paving the way for more adaptive customized alignment. Our code is available here.
pdf
bib
abs
A Self-Distillation Recipe for Neural Machine Translation
Hongfei Xu
|
Zhuofei Liang
|
Qiuhui Liu
|
Lingling Mu
Self-distillation distills the deeper sub-networks to the shallower sub-networks without using an extra teacher model, and has been proven effective in improving the performance of a series of computer vision tasks. In this paper, we study the representation-based self-distillation methods for Neural Machine Translation (NMT) considering the efficiency issue with a large vocabulary. We present a rank-order augmented Pearson correlation loss and an iterative distillation method to prevent the discrepancy of predictions between the student and a stronger teacher from disturbing the training. To prevent the teacher from misleading the student’s learning, we utilize a warm-up strategy and present a gradient adaption method to scale down or zero the Knowledge Distillation (KD) gradients which are opposite to the translation. Experiments show that our method can lead to significant improvements over the strong Transformer baseline on low/middle/high-resource tasks, obtaining comparable performance to previous MT KD studies without pre-training a teacher. Deeper Transformer experiments show that our method can lead to comparable or better performance with fewer layers.
pdf
bib
abs
BlockPruner: Fine-grained Pruning for Large Language Models
Longguang Zhong
|
Fanqi Wan
|
Ruijun Chen
|
Xiaojun Quan
|
Liangzhi Li
With the rapid growth in the size and complexity of large language models (LLMs), the costs associated with their training and inference have escalated significantly. Research indicates that certain layers in LLMs harbor substantial redundancy, and pruning these layers has minimal impact on the overall performance. While various layer pruning methods have been developed based on this insight, they generally overlook the finer-grained redundancies within the layers themselves. In this paper, we delve deeper into the architecture of LLMs and demonstrate that finer-grained pruning can be achieved by targeting redundancies in multi-head attention (MHA) and multi-layer perceptron (MLP) blocks. We propose a novel, training-free structured pruning approach called BlockPruner. Unlike existing layer pruning methods, BlockPruner segments each Transformer layer into MHA and MLP blocks. It then assesses the importance of these blocks using perplexity measures and applies a heuristic search for iterative pruning. We applied BlockPruner to LLMs of various sizes and architectures and validated its performance across a wide range of downstream tasks. Experimental results show that BlockPruner achieves more granular and effective pruning compared to state-of-the-art baselines.
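The iterative, perplexity-guided search described above can be sketched as a greedy loop. This is an illustration under stated assumptions rather than the BlockPruner implementation: `perplexity` is a hypothetical callback that would evaluate the model with a given set of blocks removed, and `toy_perplexity` is a stand-in with made-up additive costs so that the sketch runs without a model.

```python
from typing import Callable, FrozenSet, List, Tuple

Block = Tuple[int, str]  # (layer index, "mha" or "mlp")


def iterative_block_pruning(
    num_layers: int,
    num_to_prune: int,
    perplexity: Callable[[FrozenSet[Block]], float],
) -> List[Block]:
    """Greedily remove the block whose removal hurts perplexity the least."""
    remaining: List[Block] = [
        (layer, kind) for layer in range(num_layers) for kind in ("mha", "mlp")
    ]
    removed: List[Block] = []
    for _ in range(num_to_prune):
        # Score each candidate by the perplexity after removing it
        # together with everything already pruned.
        best = min(remaining, key=lambda b: perplexity(frozenset(removed + [b])))
        removed.append(best)
        remaining.remove(best)
    return removed


# Hypothetical per-block costs used only to make the example runnable.
TOY_COST = {
    (0, "mha"): 5.0, (0, "mlp"): 4.0,
    (1, "mha"): 0.5, (1, "mlp"): 2.0,
    (2, "mha"): 1.0, (2, "mlp"): 0.2,
}


def toy_perplexity(removed: FrozenSet[Block]) -> float:
    """Stand-in for a real evaluation: base perplexity plus additive costs."""
    return 10.0 + sum(TOY_COST[b] for b in removed)


pruned_blocks = iterative_block_pruning(3, 3, toy_perplexity)
```

Because MHA and MLP blocks are scored separately, the loop can drop one half of a layer while keeping the other, which whole-layer pruning cannot do.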
pdf
bib
abs
Evaluating Implicit Bias in Large Language Models by Attacking From a Psychometric Perspective
Yuchen Wen
|
Keping Bi
|
Wei Chen
|
Jiafeng Guo
|
Xueqi Cheng
As large language models (LLMs) become an important way of information access, there have been increasing concerns that LLMs may intensify the spread of unethical content, including implicit bias that hurts certain populations without explicit harmful words. In this paper, we conduct a rigorous evaluation of LLMs’ implicit bias towards certain demographics by attacking them from a psychometric perspective to elicit agreements to biased viewpoints. Inspired by psychometric principles in cognitive and social psychology, we propose three attack approaches, i.e., Disguise, Deception, and Teaching. Incorporating the corresponding attack instructions, we built two benchmarks: (1) a bilingual dataset with biased statements covering four bias types (2.7K instances) for extensive comparative analysis, and (2) BUMBLE, a larger benchmark spanning nine common bias types (12.7K instances) for comprehensive evaluation. Extensive evaluation of popular commercial and open-source LLMs shows that our methods can elicit LLMs’ inner bias more effectively than competitive baselines. Our attack methodology and benchmarks offer an effective means of assessing the ethical risks of LLMs, driving progress toward greater accountability in their development.
pdf
bib
abs
LongCite: Enabling LLMs to Generate Fine-grained Citations in Long-Context QA
Jiajie Zhang
|
Yushi Bai
|
Xin Lv
|
Wanjun Gu
|
Danqing Liu
|
Minhao Zou
|
Shulin Cao
|
Lei Hou
|
Yuxiao Dong
|
Ling Feng
|
Juanzi Li
Though current long-context large language models (LLMs) have demonstrated impressive capacities in answering various questions based on extensive text, the lack of citations in their responses makes user verification difficult, leading to concerns about their trustworthiness due to the potential hallucinations. In this work, we aim to enable long-context LLMs to generate responses with fine-grained sentence-level citations on the fly, thereby improving their faithfulness and verifiability. We first introduce LongBench-Cite, an automated benchmark for assessing current LLMs’ performance in long-context question answering with citations (LQAC), revealing considerable room for improvement. To this end, we propose CoF (Coarse to Fine), a novel pipeline that utilizes off-the-shelf LLMs to automatically construct long-context QA instances with precise sentence-level citations, and leverage this pipeline to construct LongCite-45k, a large-scale SFT dataset for LQAC. Finally, we train LongCite-8B and LongCite-9B using the constructed dataset, successfully enabling the generation of accurate responses and fine-grained citations in one pass. The evaluation results on LongBench-Cite show that our trained models achieve state-of-the-art citation quality, surpassing advanced proprietary models including GPT-4o. We also discover that SFT with citation information can further improve the correctness of model responses compared to standard long-context SFT.
pdf
bib
abs
An Empirical Study of Group Conformity in Multi-Agent Systems
Min Choi
|
Keonwoo Kim
|
Sungwon Chae
|
Sangyeop Baek
Recent advances in Large Language Models (LLMs) have enabled multi-agent systems that simulate real-world interactions with near-human reasoning. While previous studies have extensively examined biases related to protected attributes such as race, the emergence and propagation of biases on socially contentious issues in multi-agent LLM interactions remain underexplored. This study explores how LLM agents shape public opinion through debates on five contentious topics. By simulating over 2,500 debates, we analyze how initially neutral agents, assigned a centrist disposition, adopt specific stances over time. Statistical analyses reveal significant group conformity mirroring human behavior; LLM agents tend to align with numerically dominant groups or more intelligent agents, exerting a greater influence. These findings underscore the crucial role of agent intelligence in shaping discourse and highlight the risks of bias amplification in online interactions. Our results emphasize the need for policy measures that promote diversity and transparency in LLM-generated discussions to mitigate the risks of bias propagation within anonymous online environments.
pdf
bib
abs
Combining the Best of Both Worlds: A Method for Hybrid NMT and LLM Translation
Zhanglin Wu
|
Daimeng Wei
|
Xiaoyu Chen
|
Hengchao Shang
|
Jiaxin Guo
|
Zongyao Li
|
Yuanchang Luo
|
Jinlong Yang
|
Zhiqiang Rao
|
Hao Yang
Large language models (LLMs) show promising performance in a variety of downstream tasks, such as machine translation (MT). However, using LLMs for translation incurs high computational costs and significant latency. Based on our evaluation, in most cases, translations produced by LLMs are comparable to those generated by neural machine translation (NMT) systems. Only in particular scenarios do LLM and NMT models show their respective advantages. As a result, integrating NMT and LLMs for translation and using the LLM only when necessary seems to be a sound solution. A scheduling policy that optimizes translation quality while ensuring fast speed and as little LLM usage as possible is therefore required. We compare several scheduling policies and propose a novel and straightforward decider that leverages source sentence features. We conduct extensive experiments on multilingual test sets, and the results show that we can achieve optimal translation performance with less LLM usage, demonstrating the effectiveness of our decider.
pdf
bib
abs
ASPO: Adaptive Sentence-Level Preference Optimization for Fine-Grained Multimodal Reasoning
Yeyuan Wang
|
Dehong Gao
|
Rujiao Long
|
Lei Yi
|
Linbo Jin
|
Libin Yang
|
Xiaoyan Cai
Direct Preference Optimization (DPO) has gained significant attention for its simplicity and computational efficiency in aligning large language models (LLMs). Recent advancements have extended DPO to multimodal scenarios, achieving strong performance. However, traditional DPO relies on binary preference optimization, rewarding or penalizing entire responses without considering fine-grained segment correctness, leading to suboptimal solutions. The root of this issue lies in the absence of fine-grained supervision during the optimization process. To address this, we propose Adaptive Sentence-level Preference Optimization (ASPO), which evaluates individual sentences for more precise preference optimization. By dynamically calculating adaptive rewards at the sentence level based on model predictions, ASPO enhances response content assessment without additional models or parameters. This significantly improves the alignment of multimodal features. Extensive experiments show that ASPO substantially enhances the overall performance of multimodal models.
pdf
bib
abs
NovelCR: A Large-Scale Bilingual Dataset Tailored for Long-Span Coreference Resolution
MeiHan Tong
|
Shuai Wang
Coreference resolution (CR) endeavors to match pronouns, noun phrases, etc. with their referent entities, acting as an important step for deep text understanding. Presently available CR datasets are either small in scale or restrict coreference resolution to a limited text span. In this paper, we present NovelCR, a large-scale bilingual benchmark designed for long-span coreference resolution. NovelCR features extensive annotations, including 148k mentions in NovelCR-en and 311k mentions in NovelCR-zh. Moreover, the dataset is notably rich in long-span coreference pairs, with 85% of pairs in NovelCR-en and 83% in NovelCR-zh spanning across three or more sentences. Experiments on NovelCR reveal a large gap between state-of-the-art baselines and human performance, highlighting that NovelCR remains an open issue.
pdf
bib
abs
Dynamic Attention-Guided Context Decoding for Mitigating Context Faithfulness Hallucinations in Large Language Models
Huangyw Huangyw
|
Yong Zhang
|
Ning Cheng
|
Zhitao Li
|
Shaojun Wang
|
Jing Xiao
Large language models (LLMs) often exhibit Context Faithfulness Hallucinations, where outputs deviate from retrieved information due to incomplete context integration. Our analysis reveals a strong correlation between token-level uncertainty and hallucinations. We hypothesize that attention mechanisms inherently encode context utilization signals, supported by probing analysis. Based on these insights, we propose Dynamic Attention-Guided Context Decoding (DAGCD), a lightweight framework that leverages attention distributions and uncertainty signals in a single-pass decoding. Experiments on open-book QA datasets demonstrate DAGCD’s effectiveness, yielding significant improvements in faithfulness and robustness while preserving computational efficiency.
pdf
bib
abs
Exploring the Choice Behavior of Large Language Models
Weidong Wu
|
Qinlin Zhao
|
Hao Chen
|
Lexin Zhou
|
Defu Lian
|
Hong Xie
Large Language Models (LLMs) are increasingly deployed as human assistants across various domains where they help to make choices. However, the mechanisms behind LLMs’ choice behavior remain unclear, posing risks in safety-critical situations. Inspired by the intrinsic and extrinsic motivation framework within the classic human behavioral model of Self-Determination Theory and its established research methodologies, we investigate the factors influencing LLMs’ choice behavior by constructing a virtual QA platform that includes three different experimental conditions, with four models from GPT and Llama series participating in repeated experiments. Our findings indicate that LLMs’ behavior is influenced not only by intrinsic attention bias but also by extrinsic social influence, exhibiting patterns similar to the Matthew effect and Conformity. We distinguish independent pathways of these two factors in LLMs’ behavior by self-report. This work provides new insights into understanding LLMs’ behavioral patterns, exploring their human-like characteristics.
pdf
bib
abs
On-Policy Self-Alignment with Fine-grained Knowledge Feedback for Hallucination Mitigation
Xueru Wen
|
Jie Lou
|
Xinyu Lu
|
Yuqiu Ji
|
Xinyan Guan
|
Yaojie Lu
|
Hongyu Lin
|
Ben He
|
Xianpei Han
|
Debing Zhang
|
Le Sun
Hallucination occurs when large language models exhibit behavior that deviates from the boundaries of their knowledge during response generation. To address this critical issue, previous learning-based methods attempt to finetune models but are limited by off-policy sampling and coarse-grained feedback. In this paper, we present Reinforcement Learning for Hallucination (RLFH), an on-policy self-alignment approach that enables LLMs to actively explore their knowledge boundaries and self-correct generation behavior through fine-grained feedback signals. RLFH introduces a self-assessment framework where the policy serves as its own judge. Through this framework, responses are automatically decomposed into atomic facts and their truthfulness and informativeness are assessed against external knowledge sources. The resulting fine-grained feedback at the statement level is then converted into token-level dense reward signals. This enables online reinforcement learning to achieve precise and timely optimization without human intervention. Comprehensive evaluations on HotpotQA, SQuADv2, and Biography benchmarks validate RLFH’s effectiveness in hallucination mitigation.
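One way to picture the conversion from statement-level feedback to token-level rewards is the sketch below. The abstract does not specify the mapping, so this assumes that each statement covers a half-open token span and that its score is credited to the last token of that span; the function name and the span format are hypothetical.

```python
from typing import List, Tuple

# (start token, end token exclusive, score) for one assessed statement.
Statement = Tuple[int, int, float]


def statement_to_token_rewards(
    num_tokens: int, statements: List[Statement]
) -> List[float]:
    """Spread statement-level scores onto a per-token reward vector.

    Assumption for illustration: each score lands on the final token of
    the statement's span; all other tokens receive zero reward.
    """
    rewards = [0.0] * num_tokens
    for start, end, score in statements:
        if not (0 <= start < end <= num_tokens):
            raise ValueError(f"invalid span ({start}, {end})")
        rewards[end - 1] += score
    return rewards


# A response of 8 tokens with one supported and one unsupported statement.
token_rewards = statement_to_token_rewards(8, [(0, 3, 1.0), (3, 8, -0.5)])
```

Compared with a single scalar for the whole response, the resulting vector tells the policy which part of its output earned or lost reward.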
pdf
bib
abs
From Phrases to Subgraphs: Fine-Grained Semantic Parsing for Knowledge Graph Question Answering
Yurun Song
|
Xiangqing Shen
|
Rui Xia
The recent emergence of large language models (LLMs) has brought new opportunities to knowledge graph question answering (KGQA), but also introduces challenges such as semantic misalignment and reasoning noise. Semantic parsing (SP), previously a mainstream approach for KGQA, enables precise graph pattern matching by mapping natural language queries to executable logical forms. However, it faces limitations in scalability and generalization, especially when dealing with complex, multi-hop reasoning tasks. In this work, we propose a Fine-Grained Semantic Parsing (FGSP) framework for KGQA. Our framework constructs a fine-grained mapping library via phrase-level segmentation of historical question-logical form pairs, and performs online retrieval and fusion of relevant subgraph fragments to answer complex queries. This fine-grained, compositional approach ensures tighter semantic alignment between questions and knowledge graph structures, enhancing both interpretability and adaptability to diverse query types. Experimental results on two KGQA benchmarks demonstrate the effectiveness of FGSP, with a notable 18.5% relative F1 performance improvement over the SOTA on the complex multi-hop CWQ dataset. Our code is available at https://github.com/NUSTM/From-Phrases-to-Subgraphs.
pdf
bib
abs
StableToolBench-MirrorAPI: Modeling Tool Environments as Mirrors of 7,000+ Real-World APIs
Zhicheng Guo
|
Sijie Cheng
|
Yuchen Niu
|
Hao Wang
|
Sicheng Zhou
|
Wenbing Huang
|
Yang Liu
The rapid advancement of large language models (LLMs) has spurred significant interest in tool learning, where LLMs are augmented with external tools to tackle complex tasks. However, existing tool environments face challenges in balancing stability, scale, and realism, particularly for benchmarking purposes. To address this, we propose MirrorAPI, a novel framework that trains specialized LLMs to accurately simulate real API responses, effectively acting as “mirrors” to tool environments. Using a comprehensive dataset of request-response pairs from 7,000+ APIs, we employ supervised fine-tuning and chain-of-thought reasoning to enhance simulation fidelity. MirrorAPI achieves superior accuracy and stability compared to state-of-the-art methods, as demonstrated by its performance on the newly constructed MirrorAPI-Bench and its integration into StableToolBench.
pdf
bib
abs
ClaimPKG: Enhancing Claim Verification via Pseudo-Subgraph Generation with Lightweight Specialized LLM
Hoang Pham
|
Thanh-Do Nguyen
|
Khac-Hoai Nam Bui
Integrating knowledge graphs (KGs) to enhance the reasoning capabilities of large language models (LLMs) is an emerging research challenge in claim verification. While KGs provide structured, semantically rich representations well-suited for reasoning, most existing verification methods rely on unstructured text corpora, limiting their ability to effectively leverage KGs. Additionally, despite possessing strong reasoning abilities, modern LLMs struggle with multi-step modular pipelines and reasoning over KGs without adaptation. To address these challenges, we propose ClaimPKG, an end-to-end framework that seamlessly integrates LLM reasoning with structured knowledge from KGs. Specifically, the main idea of ClaimPKG is to employ a lightweight, specialized LLM to represent the input claim as pseudo-subgraphs, guiding a dedicated subgraph retrieval module to identify relevant KG subgraphs. These retrieved subgraphs are then processed by a general-purpose LLM to produce the final verdict and justification. Extensive experiments on the FactKG dataset demonstrate that ClaimPKG achieves state-of-the-art performance, outperforming strong baselines in this research field by 9%-12% accuracy points across multiple categories. Furthermore, ClaimPKG exhibits zero-shot generalizability to unstructured datasets such as HoVer and FEVEROUS, effectively combining structured knowledge from KGs with LLM reasoning across various LLM backbones.
pdf
bib
abs
TriEmbed: Bridge the Gap between Text and Token Indices with Embedding Reparameterization
Baizhou Huang
|
Xiaojun Wan
The current paradigm of language modeling is a two-stage pipeline that first transforms raw text to token indices, where the distribution is then estimated. It inherently discards linguistic relations between tokens during tokenization, creating a fundamental gap. To address this, we propose TriEmbed, a reparameterization method for embeddings that incorporates the morphological relationships inherent in subword tokenizer algorithms. Specifically, by organizing the vocabulary into a Trie structure, we can encode these relations and reparametrize the embeddings, facilitating the recovery of other linguistic relationships during training. Empirical results across various settings demonstrate that TriEmbed outperforms conventional embeddings from the perspective of scaling, while offering more linguistically informative token embeddings.
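As a rough illustration of a Trie-based reparameterization, the sketch below builds a character-level trie over a toy vocabulary and composes each token embedding as the sum of the vectors on its root-to-token path. The additive composition, the character-level trie, and all names are assumptions for illustration; the abstract does not state how TriEmbed combines node parameters.

```python
from typing import Dict, List, Tuple


def build_trie_paths(vocab: List[str]) -> Tuple[Dict[str, int], Dict[str, List[int]]]:
    """Assign one node id per distinct prefix and record each token's path."""
    node_ids: Dict[str, int] = {}
    paths: Dict[str, List[int]] = {}
    for token in vocab:
        path = []
        for end in range(1, len(token) + 1):
            prefix = token[:end]
            if prefix not in node_ids:
                node_ids[prefix] = len(node_ids)  # new trie node
            path.append(node_ids[prefix])
        paths[token] = path
    return node_ids, paths


def embed(
    token: str, paths: Dict[str, List[int]], node_vectors: List[List[float]]
) -> List[float]:
    """Token embedding as the sum of node vectors along its trie path."""
    dim = len(node_vectors[0])
    out = [0.0] * dim
    for node in paths[token]:
        for d in range(dim):
            out[d] += node_vectors[node][d]
    return out


toy_vocab = ["un", "under", "undo"]
toy_nodes, toy_paths = build_trie_paths(toy_vocab)
# Hypothetical node parameters: node i holds the vector [i, 1.0].
toy_vectors = [[float(i), 1.0] for i in range(len(toy_nodes))]
undo_embedding = embed("undo", toy_paths, toy_vectors)
```

Tokens that share a prefix, such as "under" and "undo", share the parameters of the common nodes, so an update to one token also moves its morphological neighbors.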
pdf
bib
abs
Chain of Methodologies: Scaling Test Time Computation without Training
Cong Liu
|
Jie Wu
|
Weigang Wu
|
Xu Chen
|
Liang Lin
|
Wei-Shi Zheng
Large Language Models (LLMs) often struggle with complex reasoning tasks due to insufficient in-depth insights in their training data, which are frequently absent in publicly available documents. This paper introduces the Chain of Methodologies (CoM), a simple and innovative iterative prompting framework designed to build structured reasoning processes by injecting human methodological insights, thereby enabling LLMs to perform long and effective reasoning for complex tasks. Assuming that LLMs possess certain metacognitive abilities, CoM leverages user-defined methodologies to stimulate the cognitive insights that LLMs have learned implicitly from training data. Experimental results indicate that CoM outperforms competitive baselines, highlighting the potential of training-free prompting methods as general solutions for complex reasoning tasks and the possibility of incorporating human-like methodological insights to bridge the gap to human-level reasoning.
pdf
bib
abs
A Survey on Personalized Alignment—The Missing Piece for Large Language Models in Real-World Applications
Jian Guan
|
Junfei Wu
|
Jia-Nan Li
|
Chuanqi Cheng
|
Wei Wu
Large Language Models (LLMs) have demonstrated remarkable capabilities, yet their transition to real-world applications reveals a critical limitation: the inability to adapt to individual preferences while maintaining alignment with universal human values. Current alignment techniques adopt a one-size-fits-all approach that fails to accommodate users’ diverse backgrounds and needs. This paper presents the first comprehensive survey of personalized alignment—a paradigm that enables LLMs to adapt their behavior within ethical boundaries based on individual preferences. We propose a unified framework comprising preference memory management, personalized generation, and feedback-based alignment, systematically analyzing implementation approaches and evaluating their effectiveness across various scenarios. By examining current techniques, potential risks, and future challenges, this survey provides a structured foundation for developing more adaptable and ethically-aligned LLMs.
pdf
bib
abs
SuLoRA: Subspace Low-Rank Adaptation for Parameter-Efficient Fine-Tuning
Chenhao Ding
|
Jiangyang Li
|
SongLin Dong
|
Xinyuan Gao
|
Yuhang He
|
Yihong Gong
As the scale of large language models (LLMs) grows and natural language tasks become increasingly diverse, Parameter-Efficient Fine-Tuning (PEFT) has become the standard paradigm for fine-tuning LLMs. Among PEFT methods, LoRA is widely adopted for not introducing additional inference overhead. However, existing LoRA’s shared parameter space paradigm introduces parameter interference, leading to a gap in generalization performance for specific tasks compared to full fine-tuning. To address this issue, we propose a parameter-separated low-rank adapter, called Subspace Low-Rank Adaptation (SuLoRA). The core idea of SuLoRA is to account for task differences by decomposing LoRA’s parameter matrix into multiple independent subspaces and assigning them differentially to distinct tasks. This prevents interference across tasks and enhances the effectiveness of low-rank adaptation. Additionally, SuLoRA achieves higher rank expansion by freezing the A matrix, further improving generalization capability. We conduct extensive experiments on various NLP tasks, demonstrating that SuLoRA significantly outperforms LoRA in trainable parameter efficiency and overall model performance. Furthermore, we validate SuLoRA’s effectiveness in domain generalization and multi-modal tasks, showcasing its strong generalization ability.
pdf
bib
abs
MIRe: Enhancing Multimodal Queries Representation via Fusion-Free Modality Interaction for Multimodal Retrieval
Yeong-Joon Ju
|
Ho-Joong Kim
|
Seong-Whan Lee
Recent multimodal retrieval methods have endowed text-based retrievers with multimodal capabilities by utilizing pre-training strategies for visual-text alignment. They often directly fuse the two modalities for cross-reference during the alignment to understand multimodal queries. However, existing methods often overlook crucial visual information due to a text-dominant issue, which overly depends on text-driven signals. In this paper, we introduce MIRe, a retrieval framework that achieves modality interaction without fusing textual features during the alignment. Our method allows the textual query to attend to visual embeddings while not feeding text-driven signals back into the visual representations. Additionally, we construct a pre-training dataset for multimodal query retrieval by transforming concise question-answer pairs into extended passages. Our experiments demonstrate that our pre-training strategy significantly enhances the understanding of multimodal queries, resulting in strong performance across four multimodal retrieval benchmarks under zero-shot settings. Moreover, our ablation studies and analyses explicitly verify the effectiveness of our framework in mitigating the text-dominant issue. Our code is publicly available: https://github.com/yeongjoonJu/MIRe
pdf
bib
abs
Correcting on Graph: Faithful Semantic Parsing over Knowledge Graphs with Large Language Models
Ruilin Zhao
|
Feng Zhao
|
Hong Zhang
Complex multi-hop questions often require comprehensive retrieval and reasoning. As a result, effectively parsing such questions and establishing an efficient interaction channel between large language models (LLMs) and knowledge graphs (KGs) is essential for ensuring reliable reasoning. In this paper, we present a novel semantic parsing framework Correcting on Graph (CoG), aiming to establish faithful logical queries that connect LLMs and KGs. We first propose a structured knowledge decoding that enables the LLM to generate fact-aware logical queries during inference, while leveraging its parametric knowledge to fill in the blank intermediate entities. Then, we introduce a knowledge path correction that combines the logical query with KGs to correct hallucination entities and path deficiencies in the generated content, ensuring the reliability and comprehensiveness of the retrieved knowledge. Extensive experiments demonstrate that CoG outperforms the state-of-the-art KGQA methods on two knowledge-intensive question answering benchmarks. CoG achieves a high answer hit rate and exhibits competitive F1 performance for complex multi-hop questions.
pdf
bib
abs
COPR: Continual Human Preference Learning via Optimal Policy Regularization
Han Zhang
|
Lin Gui
|
Yu Lei
|
Yuanzhao Zhai
|
Yehong Zhang
|
Zhuo Zhang
|
Yulan He
|
Hui Wang
|
Yue Yu
|
Kam-Fai Wong
|
Bin Liang
|
Ruifeng Xu
Reinforcement Learning from Human Feedback (RLHF) is effective for aligning Large Language Models (LLMs) with human preferences. However, RLHF’s complex process limits its ability to continually learn from human feedback, making it impractical for real-world applications where the deployed model continuously receives feedback from users. Non-RL-based methods, such as Direct Preference Optimization (DPO), are not inherently well suited to Continual Learning (CL). We observe that when combined with Experience Replay (ER) for CL, DPO tends to significantly widen the gap in the probability of human-preferred and dispreferred responses. Consequently, this diminishes the diversity in model generation, potentially leading to model collapse. To overcome the above challenges, we propose Continual Optimal Policy Regularization (COPR), a novel non-RL offline method that converts the historical optimal policies into optimization constraints when continually learning new preferences. We first derive a moderate reward function from the pairwise ranking loss and then use the moderate reward to calculate a new sampling distribution to construct novel learning objectives and constraints. We also provide a formal proof of the learnability of COPR. The experimental results show that COPR outperforms strong CL baselines on our proposed benchmark in terms of reward-based evaluation, GPT-4 evaluation, and human assessment.
pdf
bib
abs
Robust Preference Optimization via Dynamic Target Margins
Jie Sun
|
Junkang Wu
|
Jiancan Wu
|
Zhibo Zhu
|
Xingyu Lu
|
Jun Zhou
|
Lintao Ma
|
Xiang Wang
The alignment of Large Language Models (LLMs) is crucial for ensuring their safety and reliability in practical applications. Direct Preference Optimization (DPO) has emerged as an efficient method that directly optimizes models using preference pairs, significantly reducing resource demands. However, the effectiveness of DPO heavily depends on the data quality, which is frequently compromised by noise. In this work, we propose 𝛾-PO, a dynamic target margin preference optimization algorithm that adjusts reward margins at the pairwise level. By introducing instance-specific margin calibration, 𝛾-PO strategically prioritizes high-confidence pairs (those demonstrating higher reward margins) while suppressing potential noise from ambiguous pairs. Moreover, 𝛾-PO is a plug-and-play method, compatible with variants of DPO that rely on reward margin between preference pairs. Across benchmarks such as AlpacaEval2 and Arena-Hard, 𝛾-PO achieves an average 4.4% improvement over other baselines, setting new benchmarks for state-of-the-art performance. Additionally, 𝛾-PO requires minimal code changes and has a negligible impact on training efficiency, making it a robust solution for enhancing LLM alignment. Our code is available at https://github.com/sunjie279/gammaPO.
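The role of a target margin in a DPO-style objective can be shown with a short sketch. The implicit reward and the logistic loss below follow the standard DPO formulation; how 𝛾-PO chooses the margin for each pair is not described in the abstract, so `gamma` is simply passed in as an input here, and the function names are hypothetical.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def implicit_reward(logp_policy: float, logp_ref: float, beta: float) -> float:
    """DPO implicit reward: beta * log(pi(y|x) / pi_ref(y|x))."""
    return beta * (logp_policy - logp_ref)


def margin_preference_loss(
    r_chosen: float, r_rejected: float, gamma: float = 0.0
) -> float:
    """Pairwise loss -log sigmoid(r_chosen - r_rejected - gamma).

    gamma = 0 recovers the plain DPO loss. A per-pair gamma is where an
    instance-specific target margin would enter; its calibration rule is
    an assumption left to the caller in this sketch.
    """
    return -log_sigmoid(r_chosen - r_rejected - gamma)


# Example pair: sequence log-probabilities under the policy and the reference.
r_w = implicit_reward(logp_policy=-10.0, logp_ref=-12.0, beta=0.1)
r_l = implicit_reward(logp_policy=-11.0, logp_ref=-10.0, beta=0.1)
loss_plain = margin_preference_loss(r_w, r_l)
loss_margin = margin_preference_loss(r_w, r_l, gamma=0.5)
```

A larger target margin raises the loss for the same pair, so varying it per instance changes how strongly each pair is pushed during training.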
pdf
bib
abs
AdaReTaKe: Adaptive Redundancy Reduction to Perceive Longer for Video-language Understanding
Xiao Wang
|
Qingyi Si
|
Shiyu Zhu
|
Jianlong Wu
|
Li Cao
|
Liqiang Nie
Multimodal Large Language Models (MLLMs) have revolutionized video understanding, yet are still limited by context length when processing long videos. Recent methods compress videos by leveraging visual redundancy uniformly, yielding promising results. Nevertheless, our quantitative analysis shows that redundancy varies significantly across time and model layers, necessitating a more flexible compression strategy. We propose **AdaReTaKe**, a training-free method that flexibly reduces visual redundancy by allocating compression ratios among time and layers with theoretical guarantees. Integrated into state-of-the-art MLLMs, AdaReTaKe improves processing capacity from 256 to 2048 frames while preserving critical information. Experiments on VideoMME, MLVU, LongVideoBench, and LVBench datasets demonstrate that AdaReTaKe outperforms existing methods by 2.3% and 2.8% for 7B and 72B models, respectively, with even greater improvements of 5.9% and 6.0% on the longest LVBench.
pdf
bib
abs
Rethinking Stateful Tool Use in Multi-Turn Dialogues: Benchmarks and Challenges
Hongru Wang
|
Wenyu Huang
|
Yufei Wang
|
Yuanhao Xi
|
Jianqiao Lu
|
Huan Zhang
|
Nan Hu
|
Zeming Liu
|
Jeff Z. Pan
|
Kam-Fai Wong
Existing benchmarks that assess Language Models (LMs) as Language Agents (LAs) for tool use primarily focus on stateless, single-turn interactions or partial evaluations, such as tool selection in a single turn, overlooking the inherently stateful nature of interactions in multi-turn applications. To fill this gap, we propose DialogTool, a multi-turn dialogue dataset with stateful tool interactions that covers the whole life cycle of tool use, across six key tasks in three stages: 1) tool creation; 2) tool utilization: tool awareness, tool selection, tool execution; and 3) role-consistent response: response generation and role play. Furthermore, we build VirtualMobile, an embodied virtual mobile evaluation environment, to simulate API calls and assess the robustness of the created APIs. Taking advantage of these artifacts, we conduct a comprehensive evaluation of 13 distinct open- and closed-source LLMs and provide detailed analysis at each stage, revealing that existing state-of-the-art LLMs still struggle to use tools over long horizons.
pdf
bib
abs
Open-Set Living Need Prediction with Large Language Models
Xiaochong Lan
|
Jie Feng
|
Yizhou Sun
|
Chen Gao
|
Jiahuan Lei
|
Xinleishi Xinleishi
|
Hengliang Luo
|
Yong Li
Living needs are the needs people generate in their daily lives for survival and well-being. On life service platforms like Meituan, user purchases are driven by living needs, making accurate living need predictions crucial for personalized service recommendations. Traditional approaches treat this prediction as a closed-set classification problem, severely limiting their ability to capture the diversity and complexity of living needs. In this work, we redefine living need prediction as an open-set classification problem and propose PIGEON, a novel system leveraging large language models (LLMs) for unrestricted need prediction. PIGEON first employs a behavior-aware record retriever to help LLMs understand user preferences, then incorporates Maslow’s hierarchy of needs to align predictions with human living needs. For evaluation and application, we design a recall module based on a fine-tuned text embedding model that links flexible need descriptions to appropriate life services. Extensive experiments on real-world datasets demonstrate that PIGEON significantly outperforms closed-set approaches on need-based life service recall by an average of 19.37%. Human evaluation validates the reasonableness and specificity of our predictions. Additionally, we employ instruction tuning to enable smaller LLMs to achieve competitive performance, supporting practical deployment.
pdf
bib
abs
Improve Rule Retrieval and Reasoning with Self-Induction and Relevance ReEstimate
Ziyang Huang
|
Wangtao Sun
|
Jun Zhao
|
Kang Liu
This paper systematically addresses the challenge of rule retrieval, a crucial yet underexplored area. Vanilla retrieval methods, which use sparse or dense retrievers to search directly for relevant rules to support downstream reasoning, often suffer from low accuracy. This is primarily due to a significant semantic gap between the instantiated facts in the queries and the abstract representations of the rules. Such misalignment results in suboptimal retrieval quality, which in turn negatively impacts reasoning performance. To overcome these challenges, we propose Self-Induction Augmented Retrieval (SIAR), a novel approach that utilizes Large Language Models (LLMs) to induce potential inferential rules that might benefit reasoning by abstracting the underlying knowledge and logical structure in queries. These induced rules are then used for query augmentation to improve retrieval effectiveness. Additionally, we introduce Rule Relevance ReEstimate (R3), a method that re-estimates the relevance of retrieved rules by assessing whether the abstract knowledge they contain can be instantiated to align with the facts in the queries and whether it is helpful for reasoning. Extensive experiments across various settings demonstrate the effectiveness and versatility of our proposed methods.
pdf
bib
abs
Beyond Words: Integrating Theory of Mind into Conversational Agents for Human-Like Belief, Desire, and Intention Alignment
Mehdi Jafari
|
Yuncheng Hua
|
Hao Xue
|
Flora D. Salim
Natural language interaction has long served as the primary medium through which humans exchange ideas. A key enabler of this communication is the human capacity for Theory of Mind (ToM)—the ability to infer and align with the mental states of others. ToM is usually modeled as components of desires, beliefs, and intentions. Research in linguistics and psychology has shown that people oftentimes reveal their ToM through pragmatic aspects of language. Considering the advancements in natural language generation and perception that Large Language Models (LLMs) have made in recent years, a critical question arises in relation to ToM: can LLM-powered agents develop similar abilities for inferring mental states during natural language communication? This study investigates the extent to which open-source LLaMA models can represent and retain ToM-related constructs, and whether these internal representations contribute to a coherent mental state modeling in a given conversation. Additionally, we explore the potential for manipulating ToM-related information to generate more aligned responses. Empirical evaluations of LLaMA-3 models (3B and 8B) demonstrate that ToM-informed alignment improves response quality, achieving win rates of 63% and 67%, respectively. These findings suggest that integrating ToM principles can enhance alignment in LLM-based conversational agents. For further details, refer to the [code repository](https://github.com/cruiseresearchgroup/ToM_and_Alignment).
pdf
bib
abs
Multimodal Causal Reasoning Benchmark: Challenging Multimodal Large Language Models to Discern Causal Links Across Modalities
Zhiyuan Li
|
Heng Wang
|
Dongnan Liu
|
Chaoyi Zhang
|
Ao Ma
|
Jieting Long
|
Weidong Cai
Multimodal Large Language Models (MLLMs) have showcased exceptional Chain-of-Thought (CoT) reasoning ability in complex textual inference tasks including causal reasoning. However, will these causalities remain straightforward when crucial hints hide in visual details? If not, what factors might influence cross-modal generalization? Can we effectively enhance their capacity for robust causal inference across both text and vision? Motivated by these questions, we introduce **MuCR**, a novel **Mu**ltimodal **C**ausal **R**easoning benchmark that leverages synthetic siamese images and text pairs to challenge MLLMs. Additionally, we develop tailored metrics from multiple perspectives, including image-level match, phrase-level understanding, and sentence-level explanation, to comprehensively assess MLLMs’ comprehension abilities. Our experiments reveal that current MLLMs fall short in multimodal causal reasoning compared to their performance in purely textual settings. Additionally, we find that identifying visual cues across images is key to effective cross-modal generalization. Finally, we propose the **VcCoT** strategy that better highlights visual cues, and our results confirm its efficacy in enhancing multimodal causal reasoning.
pdf
bib
abs
Context-Aware Hierarchical Merging for Long Document Summarization
Litu Ou
|
Mirella Lapata
Hierarchical Merging is a technique commonly used to summarize very long texts (>100K tokens) by breaking down the input into smaller sections, summarizing those sections individually, and then merging or combining those summaries into a final coherent summary. Although it helps address the limitations of large language models (LLMs) with fixed input length constraints, the recursive merging process can amplify LLM hallucinations, increasing the risk of factual inaccuracies. In this paper, we seek to mitigate hallucinations by enriching hierarchical merging with context from the source document. Specifically, we propose different approaches to contextual augmentation ranging from *replacing* intermediate summaries with relevant input context, to *refining* them while using the context as supporting evidence, and *aligning* them implicitly (via citations) to the input. Experimental results on datasets representing legal and narrative domains show that contextual augmentation consistently outperforms zero-shot and hierarchical merging baselines for the Llama 3.1 model family. Our analysis further reveals that refinement methods tend to perform best when paired with extractive summarization for identifying relevant input.
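The plain hierarchical merging baseline described in this abstract (split, summarize each section, merge recursively) can be sketched in a few lines. This is a minimal illustration only: the `summarize` callable stands in for an LLM call, and `chunk_text`, `hierarchical_merge`, `first_words`, `chunk_size`, and `fan_in` are hypothetical names chosen for this example. None of the paper's contextual augmentation strategies are shown.

```python
from typing import Callable, List


def chunk_text(words: List[str], chunk_size: int) -> List[str]:
    """Split a list of words into consecutive chunks of at most chunk_size words."""
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]


def hierarchical_merge(
    text: str,
    summarize: Callable[[str], str],
    chunk_size: int = 200,
    fan_in: int = 2,
) -> str:
    """Summarize each chunk, then repeatedly merge groups of summaries."""
    if chunk_size < 1 or fan_in < 2:
        raise ValueError("chunk_size must be >= 1 and fan_in must be >= 2")
    words = text.split()
    if not words:
        return ""
    # Level 0: one summary per chunk of the source document.
    summaries = [summarize(chunk) for chunk in chunk_text(words, chunk_size)]
    # Higher levels: merge fan_in neighbouring summaries until one remains.
    while len(summaries) > 1:
        summaries = [
            summarize(" ".join(summaries[i:i + fan_in]))
            for i in range(0, len(summaries), fan_in)
        ]
    return summaries[0]


def first_words(text: str, k: int = 5) -> str:
    """Toy stand-in for an LLM summarizer: keep the first k words."""
    return " ".join(text.split()[:k])


document = " ".join(f"w{i}" for i in range(50))
final_summary = hierarchical_merge(document, first_words, chunk_size=10, fan_in=2)
```

Every merge step sees only earlier summaries and never the source text, which is why errors introduced at one level can propagate upward.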
pdf
bib
abs
VCD: A Dataset for Visual Commonsense Discovery in Images
Xiangqing Shen
|
Fanfan Wang
|
Siwei Wu
|
Rui Xia
Visual commonsense plays a vital role in understanding and reasoning about the visual world. While commonsense knowledge bases like ConceptNet provide structured collections of general facts, they lack visually grounded representations. Scene graph datasets like Visual Genome, though rich in object-level descriptions, primarily focus on directly observable information and lack systematic categorization of commonsense knowledge. We present Visual Commonsense Dataset (VCD), a large-scale dataset containing over 100,000 images and 14 million object-commonsense pairs that bridges this gap. VCD introduces a novel three-level taxonomy for visual commonsense, integrating both Seen (directly observable) and Unseen (inferrable) commonsense across Property, Action, and Space aspects. Each commonsense is represented as a triple where the head entity is grounded to object bounding boxes in images, enabling scene-dependent and object-specific visual commonsense representation. To demonstrate VCD’s utility, we develop VCM, a generative model that combines a vision-language model with instruction tuning to discover diverse visual commonsense from images. Extensive evaluations demonstrate both the high quality of VCD and its value as a resource for advancing visually grounded commonsense understanding and reasoning. Our dataset and code will be released on https://github.com/NUSTM/VCD.
pdf
bib
abs
Self-Reasoning Language Models: Unfold Hidden Reasoning Chains with Few Reasoning Catalyst
Hongru Wang
|
Deng Cai
|
Wanjun Zhong
|
Shijue Huang
|
Jeff Z. Pan
|
Zeming Liu
|
Kam-Fai Wong
Inference-time scaling has attracted much attention because it significantly enhances the performance of Large Language Models (LLMs) in complex reasoning tasks by increasing the length of Chain-of-Thought (CoT). These longer intermediate reasoning rationales embody various meta-reasoning skills in human cognition, such as reflection and decomposition, and are difficult to create and acquire. In this work, we introduce Self-Reasoning Language Model (SRLM), where the model itself can synthesize longer CoT data and iteratively improve performance through self-training. By incorporating a few demonstration examples (i.e., 1,000 samples) on how to unfold hidden reasoning chains from existing responses, which act as a reasoning catalyst, we demonstrate that SRLM not only enhances the model’s initial performance but also ensures more stable and consistent improvements in subsequent iterations. Our proposed SRLM achieves an average absolute improvement of more than +2.5 points across five reasoning tasks (MMLU, GSM8K, ARC-C, HellaSwag, and BBH) on two backbone models. Moreover, it brings larger improvements as the number of samples drawn at inference time grows, such as an absolute average improvement of +7.89 with 64 samples, revealing the in-depth, diverse and creative reasoning paths in SRLM compared with the strong baseline.
pdf
bib
abs
HyperCRS: Hypergraph-Aware Multi-Grained Preference Learning to Burst Filter Bubbles in Conversational Recommendation System
Yongsen Zheng
|
Mingjie Qian
|
Guohua Wang
|
Yang Liu
|
Ziliang Chen
|
Mingzhi Mao
|
Liang Lin
|
Kwok-Yan Lam
The filter bubble is a notorious issue in Recommender Systems (RSs), characterized by users being confined to a limited corpus of information or content that strengthens and amplifies their pre-established preferences and beliefs. Most existing methods primarily aim to analyze filter bubbles in a relatively static recommendation environment. Nevertheless, the filter bubble phenomenon continues to worsen as users interact with the system over time. To address these issues, we propose a novel paradigm, Hypergraph-Aware Multi-Grained Preference Learning to Burst Filter Bubbles in Conversational Recommendation System (HyperCRS), aiming to burst filter bubbles by learning multi-grained user preferences during dynamic user-system interactions via natural language conversations. HyperCRS develops a Multi-Grained Hypergraph (user-, item-, and attribute-grained) to explore diverse relations and capture high-order connectivity. It employs Hypergraph-Empowered Policy Learning, which includes Multi-Grained Preference Modeling to model user preferences and Preference-based Decision Making to disrupt filter bubbles during user interactions. Extensive results on four public CRS-based datasets show that HyperCRS achieves new state-of-the-art performance and is superior at bursting filter bubbles in the CRS.
pdf
bib
abs
Is LLM an Overconfident Judge? Unveiling the Capabilities of LLMs in Detecting Offensive Language with Annotation Disagreement
Junyu Lu
|
Kai Ma
|
Kaichun Wang
|
Kelaiti Xiao
|
Roy Ka-Wei Lee
|
Bo Xu
|
Liang Yang
|
Hongfei Lin
Large Language Models (LLMs) have become essential for offensive language detection, yet their ability to handle annotation disagreement remains underexplored. Disagreement samples, which arise from subjective interpretations, pose a unique challenge due to their ambiguous nature. Understanding how LLMs process these cases, particularly their confidence levels, can offer insight into their alignment with human annotators. This study systematically evaluates the performance of multiple LLMs in detecting offensive language at varying levels of annotation agreement. We analyze binary classification accuracy, examine the relationship between model confidence and human disagreement, and explore how disagreement samples influence model decision-making during few-shot learning and instruction fine-tuning. Our findings reveal that LLMs struggle with low-agreement samples, often exhibiting overconfidence in these ambiguous cases. However, utilizing disagreement samples in training improves both detection accuracy and model alignment with human judgment. These insights provide a foundation for enhancing LLM-based offensive language detection in real-world moderation tasks.
pdf
bib
abs
Language Repository for Long Video Understanding
Kumara Kahatapitiya
|
Kanchana Ranasinghe
|
Jongwoo Park
|
Michael S Ryoo
Language has become a prominent modality in computer vision with the rise of LLMs. Despite supporting long context-lengths, their effectiveness in handling long-term information gradually declines with input length. This becomes critical, especially in applications such as long-form video understanding. In this paper, we introduce a Language Repository (LangRepo) for LLMs, that maintains concise and structured information as an interpretable (i.e., all-textual) representation. Our repository is updated iteratively based on multi-scale video chunks. We introduce write and read operations that focus on pruning redundancies in text, and extracting information at various temporal scales. The proposed framework is evaluated on zero-shot visual question-answering benchmarks, showing state-of-the-art performance at its scale. Our code is available at https://github.com/kkahatapitiya/LangRepo.
pdf
bib
abs
Investigating Language Preference of Multilingual RAG Systems
Jeonghyun Park
|
Hwanhee Lee
Multilingual Retrieval-Augmented Generation (mRAG) systems enhance language models by integrating external multilingual information to produce context-aware responses. However, mRAG systems struggle with retrieving relevant information due to linguistic variations between queries and documents, generating inconsistent responses when multilingual sources conflict. In this work, we systematically investigate language preferences in both retrieval and generation of mRAG through a series of experiments. Our analysis indicates that retrievers tend to prefer high-resource and query languages, yet this preference does not consistently improve generation performance. Moreover, we observe that generators prefer the query language or Latin scripts, leading to inconsistent outputs. To overcome these issues, we propose Dual Knowledge Multilingual RAG (DKM-RAG), a simple yet effective framework that fuses translated multilingual passages with complementary model knowledge. Empirical results demonstrate that DKM-RAG mitigates language preference in generation and enhances performance across diverse linguistic settings. Code is available at
https://github.com/jeonghyunpark2002/LanguagePreference.git
pdf
bib
abs
FGDGNN: Fine-Grained Dynamic Graph Neural Network for Rumor Detection on Social Media
Mei Guo
|
Chen Chen
|
Chunyan Hou
|
Yike Wu
|
Xiaojie Yuan
Detecting rumors on social media has become a crucial issue. Propagation structure-based methods have recently attracted increasing attention. When the propagation structure is represented by the dynamic graph, temporal information is considered. However, existing rumor detection models using dynamic graph typically focus only on coarse-grained temporal information and ignore the fine-grained temporal dynamics within individual snapshots and across snapshots. In this paper, we propose a novel Fine-Grained Dynamic Graph Neural Network (FGDGNN) model, which can incorporate the fine-grained temporal information of dynamic propagation graph in the intra-snapshot and dynamic embedding update mechanism in the inter-snapshots into a unified framework for rumor detection. Specifically, we first construct the edge-weighted propagation graph and the edge-aware graph isomorphism network is proposed. To obtain fine-grained temporal representations across snapshots, we propose an embedding transformation layer to update node embeddings. Finally, we integrate the temporal information in the inter-snapshots at the graph level to enhance the effectiveness of the proposed model. Extensive experiments conducted on three public real-world datasets demonstrate that our FGDGNN model achieves significant improvements compared with the state-of-the-art baselines.
pdf
bib
abs
Self-Tuning: Instructing LLMs to Effectively Acquire New Knowledge through Self-Teaching
Xiaoying Zhang
|
Baolin Peng
|
Ye Tian
|
Jingyan Zhou
|
Yipeng Zhang
|
Haitao Mi
|
Helen M. Meng
Large language models (LLMs) often struggle to provide up-to-date information due to their one-time training and the constantly evolving nature of the world. To keep LLMs current, existing approaches typically involve continued pre-training on new documents. However, they frequently face difficulties in extracting stored knowledge. Motivated by the remarkable success of the Feynman Technique in efficient human learning, we introduce Self-Tuning, a learning framework aimed at improving an LLM’s ability to effectively acquire new knowledge from unseen raw documents through self-teaching. Specifically, we develop a Self-Teaching strategy that augments the documents with a set of knowledge-intensive tasks created in a self-supervised manner, focusing on three crucial aspects: memorization, comprehension, and self-reflection. Additionally, we introduce three Wiki-Newpages-2023-QA datasets to facilitate an in-depth analysis of an LLM’s knowledge acquisition ability concerning memorization, extraction, and reasoning. Extensive experimental results on various models, e.g., Llama2-7B, reveal that Self-Tuning consistently exhibits superior performance across all knowledge acquisition tasks and excels in preserving previous knowledge.
pdf
bib
abs
QueryAttack: Jailbreaking Aligned Large Language Models Using Structured Non-natural Query Language
Qingsong Zou
|
Jingyu Xiao
|
Qing Li
|
Zhi Yan
|
Yuhang Wang
|
Li Xu
|
Wenxuan Wang
|
Kuofeng Gao
|
Ruoyu Li
|
Yong Jiang
Recent advances in large language models (LLMs) have demonstrated remarkable potential in the field of natural language processing. Unfortunately, LLMs face significant security and ethical risks. Although techniques such as safety alignment have been developed for defense, prior research reveals the possibility of bypassing such defenses through well-designed jailbreak attacks. In this paper, we propose QueryAttack, a novel framework to examine the generalizability of safety alignment. By treating LLMs as knowledge databases, we translate malicious queries in natural language into structured non-natural query language to bypass the safety alignment mechanisms of LLMs. We conduct extensive experiments on mainstream LLMs, and the results show that QueryAttack not only achieves high attack success rates (ASRs) but also bypasses various defense methods. Furthermore, we tailor a defense method against QueryAttack, which can reduce ASR by up to 64% on GPT-4-1106. Our code is available at https://anonymous.4open.science/r/QueryAttack-334B.
pdf
bib
abs
Memory or Reasoning? Explore How LLMs Compute Mixed Arithmetic Expressions
Chengzhi Li
|
Heyan Huang
|
Ping Jian
|
Zhen Yang
|
Chenxu Wang
|
Yifan Wang
Large language models (LLMs) can solve complex multi-step math reasoning problems, but little is known about how these computations are implemented internally. Many recent studies have investigated the mechanisms of LLMs on simple arithmetic tasks (e.g., a+b, a×b), but how LLMs solve mixed arithmetic tasks remains unexplored. This gap highlights the limitation of these findings in reflecting real-world scenarios. In this work, we take a step further to explore how LLMs compute mixed arithmetic expressions. We find that LLMs follow a similar workflow to mixed arithmetic calculations: first parsing the complete expression, then using attention heads to aggregate information to the last token position for result generation, without step-by-step reasoning at the token dimension. However, **for some specific expressions, the model’s generation of the final result depends on the generation of intermediate results at the last token position, which is similar to human thinking.** Furthermore, we propose a **C**ausal **E**ffect **D**riven **F**ine-tuning method (CEDF) to adaptively enhance the identified key components used to execute mixed arithmetic calculations, improving LLMs’ reasoning ability.
pdf
bib
abs
PersonaX: A Recommendation Agent-Oriented User Modeling Framework for Long Behavior Sequence
Yunxiao Shi
|
Wujiang Xu
|
Zhang Zeqi
|
Xing Zi
|
Qiang Wu
|
Min Xu
User profiles embedded in the prompt templates of personalized recommendation agents play a crucial role in shaping their decision-making process. High-quality user profiles are essential for aligning agent behavior with real user interests. Typically, these profiles are constructed by leveraging LLMs for user profile modeling (LLM-UM). However, this process faces several challenges: (1) LLMs struggle with long user behaviors due to context length limitations and performance degradation. (2) Existing methods often extract only partial segments from the full historical behavior sequence, inevitably discarding diverse user interests embedded in the omitted content, leading to incomplete modeling and suboptimal profiling. (3) User profiling is often tightly coupled with the inference context, requiring online processing, which introduces significant latency overhead. In this paper, we propose PersonaX, an agent-agnostic LLM-UM framework to address these challenges. It augments downstream recommendation agents to achieve better recommendation performance and inference efficiency. PersonaX (a) segments complete historical behaviors into clustered groups, (b) selects multiple sub-behavior sequences (SBS) with a balance of prototypicality and diversity to form a high-quality core set, (c) performs offline multi-persona profiling to capture diverse user interests and generate fine-grained, cached textual personas, and (d) decouples user profiling from online inference, enabling profile retrieval instead of real-time generation. Extensive experiments demonstrate its effectiveness: using only 30–50% of behavioral data (sequence length 480), PersonaX enhances AgentCF by 3–11% and Agent4Rec by 10–50%. As a scalable and model-agnostic LLM-UM solution, PersonaX sets a new benchmark in scalable user modeling. The code is available at URL.
pdf
bib
abs
Judge as A Judge: Improving the Evaluation of Retrieval-Augmented Generation through the Judge-Consistency of Large Language Models
Shuliang Liu
|
Xinze Li
|
Zhenghao Liu
|
Yukun Yan
|
Cheng Yang
|
Zheni Zeng
|
Zhiyuan Liu
|
Maosong Sun
|
Ge Yu
Retrieval-Augmented Generation (RAG) has proven its effectiveness in alleviating hallucinations for Large Language Models (LLMs). However, existing automated evaluation metrics cannot fairly evaluate the outputs generated by RAG models during training and evaluation. LLM-based judgment models have the potential to produce high-quality judgments, but they are highly sensitive to evaluation prompts, leading to inconsistencies when judging the output of RAG models. This paper introduces the Judge-Consistency (ConsJudge) method, which aims to enhance LLMs to generate more accurate evaluations for RAG models. Specifically, ConsJudge prompts LLMs to generate different judgments based on various combinations of judgment dimensions, utilizes the judge-consistency to evaluate these judgments, and selects the chosen and rejected judgments for DPO training. Our experiments show that ConsJudge can effectively provide more accurate judgments for optimizing RAG models across various RAG models and datasets. Further analysis reveals that judgments generated by ConsJudge have a high agreement with the superior LLM. All code is available at https://github.com/OpenBMB/ConsJudge.
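One way to picture the selection step is to score each judgment by how much it agrees with the others, then take the most and least consistent judgments as the chosen and rejected sides of a preference pair. The sketch below does this under loud assumptions: `select_preference_pair` and `exact_match` are hypothetical names, and exact string match is a toy agreement measure that may differ from the consistency measure the paper actually uses.

```python
from typing import Callable, List, Tuple


def exact_match(a: str, b: str) -> float:
    """Toy agreement measure: 1.0 if two judgments are identical, else 0.0."""
    return 1.0 if a == b else 0.0


def select_preference_pair(
    judgments: List[str],
    agree: Callable[[str, str], float] = exact_match,
) -> Tuple[str, str]:
    """Return (chosen, rejected) judgments by mean agreement with the rest."""
    if len(judgments) < 2:
        raise ValueError("need at least two judgments to compare")
    scores = []
    for i, judgment in enumerate(judgments):
        # Mean agreement of this judgment with every other judgment.
        others = [agree(judgment, other) for k, other in enumerate(judgments) if k != i]
        scores.append(sum(others) / len(others))
    indices = range(len(judgments))
    chosen = judgments[max(indices, key=scores.__getitem__)]
    rejected = judgments[min(indices, key=scores.__getitem__)]
    return chosen, rejected


# Judgments produced under four hypothetical combinations of dimensions.
judgments = ["A is better", "A is better", "B is better", "A is better"]
chosen, rejected = select_preference_pair(judgments)
```

Swapping `agree` for a softer similarity function (for example, cosine similarity over sentence embeddings) would let the same selection logic handle free-text judgments rather than fixed verdict strings.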
pdf
bib
abs
Rationales Are Not Silver Bullets: Measuring the Impact of Rationales on Model Performance and Reliability
Chiwei Zhu
|
Benfeng Xu
|
An Yang
|
Junyang Lin
|
Quan Wang
|
Chang Zhou
|
Zhendong Mao
Training language models with rationale augmentation has been shown to be beneficial in many existing works. In this paper, we identify that such a prevailing view does not hold consistently. We conduct comprehensive investigations to thoroughly inspect the impact of rationales on model performance as well as on model reliability, a novel perspective. The results lead to several key findings that add new insights to existing understanding: 1) Rationales can, at times, deteriorate model performance; 2) Rationales can, at times, improve model reliability, even outperforming their untrained counterparts; 3) A linear correspondence exists between the performance and reliability improvements, while both are driven by the intrinsic difficulty of the task. These findings provide informative regulations on the broad utilization of rationales and raise critical implications on the procedure of explicitly aligning language models with implicit human thoughts. Code can be found at this anonymous link: https://anonymous.4open.science/r/rationales-CEE8.
pdf
bib
abs
CA-GAR: Context-Aware Alignment of LLM Generation for Document Retrieval
Heng Yu
|
Junfeng Kang
|
Rui Li
|
Qi Liu
|
Liyang He
|
Zhenya Huang
|
Shuanghong Shen
|
Junyu Lu
Information retrieval has evolved from traditional sparse and dense retrieval methods to approaches driven by large language models (LLMs). Recent techniques, such as Generation-Augmented Retrieval (GAR) and Generative Document Retrieval (GDR), leverage LLMs to enhance retrieval but face key challenges: GAR’s generated content may not always align with the target document corpus, while GDR limits the generative capacity of LLMs by constraining outputs to predefined document identifiers. To address these issues, we propose Context-Aware Generation-Augmented Retrieval (CA-GAR), which enhances LLMs by integrating corpus information into their generation process. CA-GAR optimizes token selection by incorporating relevant document information and leverages a Distribution Alignment Strategy to extract corpus information using a lexicon-based approach. Experimental evaluations on seven tasks from the BEIR benchmark and four non-English languages from Mr.TyDi demonstrate that CA-GAR outperforms existing methods.
pdf
bib
abs
AgentCourt: Simulating Court with Adversarial Evolvable Lawyer Agents
Guhong Chen
|
Liyang Fan
|
Zihan Gong
|
Nan Xie
|
Zixuan Li
|
Ziqiang Liu
|
Chengming Li
|
Qiang Qu
|
Hamid Alinejad-Rokny
|
Shiwen Ni
|
Min Yang
Current research in LLM-based simulation systems lacks comprehensive solutions for modeling real-world court proceedings, while existing legal language models struggle with dynamic courtroom interactions. We present **AgentCourt**, a comprehensive legal simulation framework that addresses these challenges through adversarial evolution of LLM-based agents. Our AgentCourt introduces a new adversarial evolutionary approach for agents called **AdvEvol**, which performs dynamic knowledge learning and evolution through structured adversarial interactions in a simulated courtroom program, breaking the limitations of the traditional reliance on static knowledge bases or manual annotations. By simulating 1,000 civil cases, we construct an evolving knowledge base that enhances the agents’ legal reasoning abilities. The evolved lawyer agents demonstrated outstanding performance on our newly introduced **CourtBench** benchmark, achieving a 12.1% improvement in performance compared to the original lawyer agents. Evaluations by professional lawyers confirm the effectiveness of our approach across three critical dimensions: cognitive agility, professional knowledge, and logical rigor. Beyond outperforming specialized legal models in interactive reasoning tasks, our findings emphasize the importance of adversarial learning in legal AI and suggest promising directions for extending simulation-based legal reasoning to broader judicial and regulatory contexts.
pdf
bib
abs
MLDebugging: Towards Benchmarking Code Debugging Across Multi-Library Scenarios
JinYang Huang
|
Xiachong Feng
|
Qiguang Chen
|
Hanjie Zhao
|
Zihui Cheng
|
Jiesong Bai
|
Jingxuan Zhou
|
Min Li
|
Libo Qin
Code debugging is a crucial task in software engineering that has attracted increasing attention. While remarkable success has been achieved in the era of large language models (LLMs), current research still focuses on the simple no-library or single-library setting, ignoring the complex multi-library scenarios found in real-world applications. To address this limitation, we make the first attempt to introduce MLDebugging (Multi-Library Debugging), a comprehensive benchmark designed to assess debugging challenges within multi-library Python code. Specifically, MLDebugging encompasses 126 distinct Python libraries, covering a wide range of multi-library code issues, categorized into seven distinct types. Furthermore, we conduct a thorough evaluation of MLDebugging using both mainstream open-source and closed-source LLMs and highlight that current LLMs still struggle to correctly perform code debugging across multi-library scenarios. We hope this work can uncover the potential of LLMs in multi-library debugging scenarios and offer insights for future research.
pdf
bib
abs
An Empirical Study of LLM-as-a-Judge for LLM Evaluation: Fine-tuned Judge Model is not a General Substitute for GPT-4
Hui Huang
|
Xingyuan Bu
|
Hongli Zhou
|
Yingqi Qu
|
Jing Liu
|
Muyun Yang
|
Bing Xu
|
Tiejun Zhao
Recently, there has been a growing trend of utilizing Large Language Models (LLMs) to evaluate the quality of other LLMs. Many studies have fine-tuned judge models based on open-source LLMs for evaluation. While the fine-tuned judge models are claimed to achieve comparable evaluation capability with GPT-4, in this work, we conduct an empirical study of LLM-as-a-Judge. Our findings indicate that although the fine-tuned judge models achieve high performance on in-domain test sets, even surpassing GPT-4, they underperform GPT-4 across several dimensions, including generalizability, fairness and adaptability. We also reveal that the fine-tuned judge model inherently operates as a task-specific classifier, which imposes these limitations.
pdf
bib
abs
Expectation Confirmation Preference Optimization for Multi-Turn Conversational Recommendation Agent
Xueyang Feng
|
Jingsen Zhang
|
Jiakai Tang
|
Wei Li
|
Guohao Cai
|
Xu Chen
|
Quanyu Dai
|
Yue Zhu
|
Zhenhua Dong
Recent advancements in Large Language Models (LLMs) have significantly propelled the development of Conversational Recommendation Agents (CRAs). However, these agents often generate short-sighted responses that fail to sustain user guidance and meet expectations. Although preference optimization has proven effective in aligning LLMs with user expectations, it remains costly and performs poorly in multi-turn dialogue. To address this challenge, we introduce a novel multi-turn preference optimization (MTPO) paradigm **ECPO**, which leverages Expectation Confirmation Theory to explicitly model the evolution of user satisfaction throughout multi-turn dialogues, uncovering the underlying causes of dissatisfaction. These causes can be utilized to support targeted optimization of unsatisfactory responses, thereby achieving turn-level preference optimization. ECPO eliminates the significant sampling overhead of existing MTPO methods while ensuring the optimization process drives meaningful improvements. To support ECPO, we also introduce an LLM-based user simulator, **AILO**, to simulate user feedback and expectation confirmation during conversational recommendations. Experimental results show that ECPO significantly enhances CRA’s interaction capabilities, offering notable improvements in both efficiency and effectiveness over existing MTPO methods.
pdf
bib
abs
ProMedTS: A Self-Supervised, Prompt-Guided Multimodal Approach for Integrating Medical Text and Time Series
Shuai Niu
|
Jing Ma
|
Hongzhan Lin
|
Liang Bai
|
Zhihua Wang
|
V. W.
|
Richard Yi Da Xu
|
Guo Li
|
Xian Yang
Large language models (LLMs) have shown remarkable performance in vision-language tasks, but their application in the medical field remains underexplored, particularly for integrating structured time series data with unstructured clinical notes. In clinical practice, dynamic time series data, such as lab test results, capture critical temporal patterns, while clinical notes provide rich semantic context. Merging these modalities is challenging due to the inherent differences between continuous signals and discrete text. To bridge this gap, we introduce ProMedTS, a novel self-supervised multimodal framework that employs prompt-guided learning to unify these heterogeneous data types. Our approach leverages lightweight anomaly detection to generate anomaly captions that serve as prompts, guiding the encoding of raw time series data into informative prompt embeddings. These prompt embeddings are aligned with textual representations in a shared latent space, preserving fine-grained temporal nuances alongside semantic insights. Furthermore, our framework incorporates tailored self-supervised objectives to enhance both intra- and inter-modal alignment. We evaluate ProMedTS on disease diagnosis tasks using real-world datasets, and the results demonstrate that our method consistently outperforms state-of-the-art approaches.
pdf
bib
abs
CipherBank: Exploring the Boundary of LLM Reasoning Capabilities through Cryptography Challenge
Yu Li
|
Qizhi Pei
|
Mengyuan Sun
|
Honglin Lin
|
Chenlin Ming
|
Xin Gao
|
Jiang Wu
|
Conghui He
|
Lijun Wu
Large language models (LLMs) have demonstrated remarkable capabilities, especially the recent advancements in reasoning, such as o1 and o3, pushing the boundaries of AI. Despite these impressive achievements in mathematics and coding, the reasoning abilities of LLMs in domains requiring cryptographic expertise remain underexplored. In this paper, we introduce CipherBank, a comprehensive benchmark designed to evaluate the reasoning capabilities of LLMs in cryptographic decryption tasks. CipherBank comprises 2,358 meticulously crafted problems, covering 262 unique plaintexts across 5 domains and 14 subdomains, with a focus on privacy-sensitive and real-world scenarios that necessitate encryption. From a cryptographic perspective, CipherBank incorporates 3 major categories of encryption methods, spanning 9 distinct algorithms, ranging from classical ciphers to custom cryptographic techniques. We evaluate state-of-the-art LLMs on CipherBank, e.g., GPT-4o, DeepSeek-V3, and cutting-edge reasoning-focused models such as o1 and DeepSeek-R1. Our results reveal significant gaps in reasoning abilities not only between general-purpose chat LLMs and reasoning-focused LLMs but also in the performance of current reasoning-focused models when applied to classical cryptographic decryption tasks, highlighting the challenges these models face in understanding and manipulating encrypted data. Through detailed analysis and error investigations, we provide several key observations that shed light on the limitations and potential improvement areas for LLMs in cryptographic reasoning. These findings underscore the need for continuous advancements in LLM reasoning capabilities.
pdf
bib
abs
Which Retain Set Matters for LLM Unlearning? A Case Study on Entity Unlearning
Hwan Chang
|
Hwanhee Lee
Large language models (LLMs) risk retaining unauthorized or sensitive information from their training data, which raises privacy concerns. LLM unlearning seeks to mitigate these risks by selectively removing specified data while maintaining overall model performance. However, most existing work focuses on methods to achieve effective forgetting and does not provide a detailed analysis of the retain set, the portion of training data that is not targeted for removal. In this paper, we investigate the effects of unlearning on various subsets of the retain set through a case study on entity unlearning. We introduce the Syntactically Similar Neighbor Set, a group of queries that share similar syntactic structures with the data targeted for removal, and show that this subset suffers the greatest performance drop during unlearning. Moreover, when used for regularization, this set not only preserves performance on syntactically similar queries but also delivers comparable or improved results across other data subsets. Our results highlight that syntactic similarity is a critical factor, potentially more so than domain or entity relationships, in achieving effective and practical LLM unlearning.
pdf
bib
abs
Tell Me What You Don’t Know: Enhancing Refusal Capabilities of Role-Playing Agents via Representation Space Analysis and Editing
Wenhao Liu
|
Siyu An
|
Junru Lu
|
Muling Wu
|
Tianlong Li
|
Xiaohua Wang
|
Changze Lv
|
Xiaoqing Zheng
|
Di Yin
|
Xing Sun
|
Xuanjing Huang
Role-Playing Agents (RPAs) have shown remarkable performance in various applications, yet they often struggle to recognize and appropriately respond to hard queries that conflict with their role-play knowledge. To investigate RPAs’ performance when faced with different types of conflicting requests, we develop an evaluation benchmark that includes contextual knowledge conflicting requests, parametric knowledge conflicting requests, and non-conflicting requests to assess RPAs’ ability to identify conflicts and refuse to answer appropriately without over-refusing. Through extensive evaluation, we find that most RPAs exhibit significant performance gaps across different types of conflicting requests. To elucidate the reasons, we conduct an in-depth representation-level analysis of RPAs under various conflict scenarios. Our findings reveal the existence of rejection regions and direct response regions within the model’s forwarding representation, which influence the RPA’s final response behavior. Therefore, we introduce a lightweight representation editing approach that conveniently shifts conflicting requests to the rejection region, thereby enhancing the model’s refusal accuracy. The extensive experiments validate the effectiveness of our editing method, improving RPAs’ ability to refuse conflicting requests while maintaining their general role-playing capabilities.
pdf
bib
abs
LR²Bench: Evaluating Long-chain Reflective Reasoning Capabilities of Large Language Models via Constraint Satisfaction Problems
Jianghao Chen
|
Zhenlin Wei
|
Zhenjiang Ren
|
Ziyong Li
|
Jiajun Zhang
Recent progress in o1-like models has significantly enhanced the reasoning abilities of Large Language Models (LLMs), empowering them to tackle increasingly complex tasks through reflection capabilities, such as making assumptions, backtracking, and self-refinement. However, effectively evaluating such reflection capabilities remains challenging due to the lack of appropriate benchmarks. To bridge this gap, we introduce LR2Bench, a novel benchmark designed to evaluate the Long-chain Reflective Reasoning capabilities of LLMs. LR2Bench comprises 850 samples across six Constraint Satisfaction Problems (CSPs) where reflective reasoning is crucial for deriving solutions that meet all given constraints. Each type of task focuses on distinct constraint patterns, such as knowledge-based, logical, and spatial constraints, providing a comprehensive evaluation of diverse problem-solving scenarios. We conduct extensive evaluation on both conventional models and o1-like models. Our experimental results reveal that even the most advanced reasoning-specific models, such as DeepSeek-R1 and OpenAI o1-preview, struggle with tasks in LR2Bench, achieving an average Exact Match score of only 20.0% and 23.6%, respectively. These findings underscore the significant room for improvement in the reflective reasoning capabilities of current LLMs.
pdf
bib
abs
McBE: A Multi-task Chinese Bias Evaluation Benchmark for Large Language Models
Tian Lan
|
Xiangdong Su
|
Xu Liu
|
Ruirui Wang
|
Ke Chang
|
Jiang Li
|
Guanglai Gao
As large language models (LLMs) are increasingly applied to various NLP tasks, their inherent biases are gradually disclosed. Therefore, measuring biases in LLMs is crucial to mitigate their ethical risks. However, most existing bias evaluation datasets focus on English and North American culture, and their bias categories are not fully applicable to other cultures. Datasets grounded in the Chinese language and culture are scarce. More importantly, these datasets usually support only a single evaluation task and cannot evaluate bias in LLMs from multiple aspects. To address these issues, we present a Multi-task Chinese Bias Evaluation Benchmark (McBE) that includes 4,077 bias evaluation instances, covering 12 single bias categories and 82 subcategories and introducing 5 evaluation tasks, providing extensive category coverage, content diversity, and measuring comprehensiveness. Additionally, we evaluate several popular LLMs from different series and with different parameter sizes. In general, all these LLMs demonstrated varying degrees of bias. We conduct an in-depth analysis of results, offering novel insights into bias in LLMs.
pdf
bib
abs
MARK: Multi-agent Collaboration with Ranking Guidance for Text-attributed Graph Clustering
Yiwei Fu
|
Yuxing Zhang
|
Chunchun Chen
|
JianwenMa JianwenMa
|
Quan Yuan
|
Rong-Cheng Tu
|
Xinli Huang
|
Wei Ye
|
Xiao Luo
|
Minghua Deng
This paper studies the problem of text-attributed graph clustering, which aims to cluster each node into different groups using both textual attributes and structural information. Although graph neural networks (GNNs) have been proposed to solve this problem, their performance is usually limited when uncertain nodes are near the cluster boundaries due to label scarcity. In this paper, we introduce a new perspective of leveraging large language models (LLMs) to enhance text-attributed graph clustering and develop a novel approach named Multi-agent Collaboration with Ranking Guidance (MARK). The core of our MARK is to generate reliable guidance using the collaboration of three LLM-based agents as ranking-based supervision signals. In particular, we first conduct the coarse graph clustering, and utilize a concept agent to induce the semantics of each cluster. Then, we infer the robustness under perturbations to identify uncertain nodes and use a generation agent to produce synthetic text that closely aligns with their topology. An inference agent is adopted to provide the ranking semantics for each uncertain node in comparison to its synthetic counterpart. The consistent feedback between uncertain and synthetic texts is identified as reliable guidance for fine-tuning the clustering model within a ranking-based supervision objective. Experimental results on various benchmark datasets validate the effectiveness of the proposed MARK compared with competing baselines.
pdf
bib
abs
Can Language Models Capture Human Writing Preferences for Domain-Specific Text Summarization?
Jingbao Luo
|
Ming Liu
|
Ran Liu
|
Yongpan Sheng
|
Xin Hu
|
Gang Li
|
WupengNjust WupengNjust
With the popularity of large language models and their high-quality text generation capabilities, researchers are using them as auxiliary tools for text summary writing. Although summaries generated by these large language models are smooth and capture key information sufficiently, the quality of their output depends on the prompt, and the generated text is somewhat procedural. We construct LecSumm to verify whether language models truly capture human writing preferences, in which we recruit 200 college students to write summaries for lecture notes on ten different machine-learning topics and analyze writing preferences in real-world human summaries through the dimensions of length, content depth, tone & style, and summary format. We define the method of capturing human writing preferences by language models as finetuning pre-trained models with data and designing prompts to optimize the output of large language models. The results of translating the analyzed human writing preferences into prompts and conducting experiments show that both finetuned and prompt-based models still fail to capture human writing preferences effectively. Our LecSumm dataset brings new challenges to finetuned and prompt-based large language models on the task of human-centered text summarization.
pdf
bib
abs
Mitigate Position Bias in LLMs via Scaling a Single Hidden States Channel
Yijiong Yu
|
Huiqiang Jiang
|
Xufang Luo
|
Qianhui Wu
|
Chin-Yew Lin
|
Dongsheng Li
|
Yuqing Yang
|
Yongfeng Huang
|
Lili Qiu
Long-context language models (LCLMs) can process long context, but still exhibit position bias, also known as “lost in the middle”, which indicates that placing key information in the middle of the context will significantly affect performance. To mitigate this, we first explore the micro-level manifestations of position bias, concluding that attention weights are a micro-level expression of position bias. Then we identify that, in addition to position embeddings, positional information in hidden states also contributes to position bias, and it manifests itself in specific channels of hidden states, called positional hidden states. Based on these, we propose a method to mitigate position bias by scaling positional hidden states. Experiments on NaturalQuestions Multi-document QA, KV retrieval and LongBench, using various models including RoPE models, context window-extended models, and Alibi models, demonstrate the effectiveness and generalizability of our approach. Our method can improve performance by up to 15.2% on the “lost in the middle” benchmark by modifying just one channel of hidden states. Our code is available at https://aka.ms/PositionalHidden.
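The core operation described above, scaling a single channel of the hidden states, can be illustrated with a minimal sketch. This is not the authors' implementation (see their released code for that): the function name `scale_channel`, the channel index, and the scaling factor are hypothetical, and how the positional channel is actually identified is not covered here.

```python
def scale_channel(hidden_states, channel, factor):
    """Return a copy of hidden_states with one channel multiplied by factor.

    hidden_states is a list of per-token vectors (lists of floats).
    The channel index and factor are illustrative assumptions only.
    """
    return [
        [value * factor if i == channel else value for i, value in enumerate(token)]
        for token in hidden_states
    ]


# Two tokens with three hidden channels each; damp channel 1 by half.
states = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
scaled = scale_channel(states, channel=1, factor=0.5)
```

Only the selected channel changes; every other channel is left untouched, which is what makes the intervention so lightweight.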
pdf
bib
abs
Self-attention-based Graph-of-Thought for Math Problem Solving
Ruiqiao Bai
|
Xue Han
|
Shuo Lei
|
Junlan Feng
|
Yanyan Luo
|
Chao Deng
Applying Large Language Models (LLMs) to solve math problems is one of the hottest research topics at present. Traditional Chain-of-Thought-based methods typically generate the reasoning path in a chain structure, leading to unnecessary interference caused by non-zero self-attention among weakly related reasoning steps. Such a setting also differs from humans’ typical graph-structured reasoning habit (with an inter-step relationship graph in mind). To solve the problem, this paper proposes a novel decoding method for Transformer-based LLMs, named Self-attention-based Graph-of-Thought (SaGoT). SaGoT constructs a thought graph concurrently with LLM inference (based on a newly defined inter-step self-attention indicator), and generates reasoning steps with a novel graph-structured self-attention mechanism. A significant contribution of SaGoT is that it enables an LLM’s graph-like reasoning ability by modifying its inner working operations, in contrast to SOTA prompting methods that are ex-post and rely on huge LLMs and redundant reasoning step generation to form a graph (inefficient & non-human-like). In addition, SaGoT is a training-free technique that can be seamlessly incorporated into pre-trained Transformer-based LLMs. Our experimental results have shown that SaGoT could significantly enhance mathematical reasoning accuracy without the reliance on huge computationally over-expensive LLMs. It also avoids SOTA methods’ performance degradation issues when the LLM is too small to comprehend complex prompts. Moreover, SaGoT integrates intrinsic interpretability into the LLM’s reasoning procedure, intuitively assisting humans in understanding how an LLM views the relationships among its reasoning steps, and why the LLM succeeds or fails.
pdf
bib
abs
BAR: A Backward Reasoning based Agent for Complex Minecraft Tasks
Weihong Du
|
Wenrui Liao
|
Binyu Yan
|
Hongru Liang
|
Anthony G Cohn
|
Wenqiang Lei
Large language model (LLM) based agents have shown great potential in following human instructions and automatically completing various tasks. To complete a task, the agent needs to decompose it into easily executed steps by planning. Existing studies mainly conduct the planning by inferring what steps should be executed next starting from the agent’s initial state. However, this forward reasoning paradigm doesn’t work well for complex tasks. We propose to study this issue in Minecraft, a virtual environment that simulates complex tasks based on real-world scenarios. We believe that the failure of forward reasoning is caused by the big perception gap between the agent’s initial state and task goal. To this end, we leverage backward reasoning and start the planning from the terminal state, which can directly achieve the task goal in one step. Specifically, we design a backward reasoning based agent (BAR). It is equipped with a recursive goal decomposition module, a state consistency maintaining module and a stage memory module to make robust, consistent, and efficient planning starting from the terminal state. Experimental results demonstrate the superiority of BAR over existing methods and the effectiveness of the proposed modules.
pdf
bib
abs
KAPA: A Deliberative Agent Framework with Tree-Structured Knowledge Base for Multi-Domain User Intent Understanding
Jiakai Tang
|
Shiqi Shen
|
ZhipengWang ZhipengWang
|
Gong Zhi
|
Xueyang Feng
|
Zexu Sun
|
Haoran Tan
|
Xu Chen
Dialogue assistants have become ubiquitous in modern applications, fundamentally reshaping human daily communication patterns and information access behaviors. In real-world conversational interactions, however, user queries are often volatile, ambiguous, and diverse, making it difficult to accurately and efficiently grasp the user’s underlying intentions. To address this challenge, we propose a simple yet effective deliberative agent framework that leverages human thought process to build high-level domain knowledge. To further achieve efficient knowledge accumulation and retrieval, we design a tree-structured knowledge base to store refined experience and data. Moreover, we construct a new benchmark, User-Intent-Understanding (UIU), which covers multi-domain, multi-tone, and sequential multi-turn personalized user queries. Extensive experiments demonstrate the effectiveness of our proposed method across multi-step evaluations.
pdf
bib
abs
RASD: Retrieval-Augmented Speculative Decoding
Guofeng Quan
|
Wenfeng Feng
|
Chuzhan Hao
|
Guochao Jiang
|
Yuewei Zhang
|
Hao Henry Wang
Speculative decoding accelerates inference in large language models (LLMs) by generating draft tokens for target model verification. Current approaches for obtaining draft tokens rely on lightweight draft models or additional model structures to generate draft tokens and retrieve context from databases. Due to the draft model’s small size and limited training data, model-based speculative decoding frequently becomes less effective in out-of-domain scenarios. Additionally, the time cost of the drafting phase results in a low upper limit on acceptance length during the verification step, limiting overall efficiency. This paper proposes RASD (Retrieval-Augmented Speculative Decoding), which adopts retrieval methods to enhance model-based speculative decoding. We introduce tree pruning and tree fusion to achieve this. Specifically, we develop a pruning method based on the draft model’s probability distribution to construct the optimal retrieval tree. Second, we employ the longest prefix matching algorithm to merge the tree generated by the draft model with the retrieval tree, resulting in a unified tree for verification. Experimental results demonstrate that RASD achieves state-of-the-art inference acceleration across tasks such as DocQA, Summary, Code, and In-Domain QA. Moreover, RASD exhibits strong scalability, seamlessly integrating with various speculative decoding approaches, including both generation-based and retrieval-based methods.
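The tree fusion step described above can be sketched as a merge of two token tries in which shared prefixes collapse into a single path. This is a simplified illustration under stated assumptions, not RASD's actual procedure: the nested-dict tree representation and the helper names `insert_path` and `merge_trees` are hypothetical, and pruning by draft-model probability is omitted.

```python
def insert_path(tree, tokens):
    """Insert one candidate token sequence into a nested-dict trie."""
    node = tree
    for tok in tokens:
        node = node.setdefault(tok, {})
    return tree


def merge_trees(draft_tree, retrieval_tree):
    """Build a new trie containing every path of both inputs.

    Paths that share a prefix are stored once, so the verifier
    would see a single unified candidate tree.
    """
    merged = {}
    for source in (draft_tree, retrieval_tree):
        stack = [(source, merged)]
        while stack:
            src, dst = stack.pop()
            for tok, child in src.items():
                stack.append((child, dst.setdefault(tok, {})))
    return merged


# Hypothetical candidates from a draft model and from retrieval.
draft = {}
insert_path(draft, ["the", "cat", "sat"])
insert_path(draft, ["the", "dog"])
retrieved = {}
insert_path(retrieved, ["the", "cat", "ran"])

unified = merge_trees(draft, retrieved)
```

Here the prefix `the cat` appears in both sources but only once in `unified`, with `sat` and `ran` as sibling continuations.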
pdf
bib
abs
FRAG: A Flexible Modular Framework for Retrieval-Augmented Generation based on Knowledge Graphs
Zengyi Gao
|
Yukun Cao
|
Hairu Wang
|
Ao Ke
|
Yuan Feng
|
S Kevin Zhou
|
Xike Xie
To mitigate the hallucination and knowledge deficiency in large language models (LLMs), Knowledge Graph (KG)-based Retrieval-Augmented Generation (RAG) has shown promising potential by utilizing KGs as an external resource to enhance LLM reasoning. However, existing KG-RAG approaches struggle with a trade-off between flexibility and retrieval quality. Modular methods prioritize flexibility by avoiding the use of KG-fine-tuned models during retrieval, leading to fixed retrieval strategies and suboptimal retrieval quality. Conversely, coupled methods embed KG information within models to improve retrieval quality but at the expense of flexibility. In this paper, we propose a novel flexible modular KG-RAG framework, termed FRAG, which synergizes the advantages of both approaches. FRAG estimates the hop range of reasoning paths based solely on the query and classifies it as either simple or complex. To match the complexity of the query, tailored pipelines are applied to ensure efficient and accurate reasoning path retrieval, thus fostering the final reasoning process. By using the query text instead of the KG to infer the structural information of reasoning paths and employing adaptable retrieval strategies, FRAG improves retrieval quality while maintaining flexibility. Moreover, FRAG does not require extra LLM fine-tuning or calls, significantly boosting efficiency and conserving resources. Extensive experiments show that FRAG achieves state-of-the-art performance with high efficiency and low resource consumption. The code for our method is publicly available at https://github.com/gzy02/FRAG.
pdf
bib
abs
Reefknot: A Comprehensive Benchmark for Relation Hallucination Evaluation, Analysis and Mitigation in Multimodal Large Language Models
Kening Zheng
|
Junkai Chen
|
Yibo Yan
|
Xin Zou
|
Huiyu Zhou
|
Xuming Hu
Hallucination issues continue to affect multimodal large language models (MLLMs), with existing research mainly addressing object-level or attribute-level hallucinations, neglecting the more complex relation hallucinations that require advanced reasoning. Current benchmarks for relation hallucinations lack detailed evaluation and effective mitigation, and their datasets often suffer from biases due to systematic annotation processes. To address these challenges, we introduce Reefknot, a comprehensive benchmark targeting relation hallucinations, comprising over 20,000 real-world samples. We provide a systematic definition of relation hallucinations, integrating perceptive and cognitive perspectives, and construct a relation-based corpus using the Visual Genome scene graph dataset. Our comparative evaluation reveals significant limitations in current MLLMs’ ability to handle relation hallucinations. Additionally, we propose a novel confidence-based mitigation strategy, which reduces the hallucination rate by an average of 9.75% across three datasets, including Reefknot. Our work offers valuable insights for achieving trustworthy multimodal intelligence. The dataset and code are released at https://github.com/JackChen-seu/Reefknot.
pdf
bib
abs
Blessing of Multilinguality: A Systematic Analysis of Multilingual In-Context Learning
Yilei Tu
|
Andrew Xue
|
Freda Shi
While multilingual large language models generally perform adequately, and sometimes even rival English performance on high-resource languages (HRLs), they often significantly underperform on low-resource languages (LRLs). Among several prompting strategies aiming at bridging the gap, multilingual in-context learning (ICL) has been particularly effective when demonstrations in target languages are unavailable. However, a systematic understanding of when and why it works well is lacking. In this work, we systematically analyze multilingual ICL, using demonstrations in HRLs to enhance cross-lingual transfer. We show that demonstrations in mixed HRLs consistently outperform English-only ones across the board, particularly for tasks written in LRLs. Surprisingly, our ablation study shows that the presence of irrelevant non-English sentences in the prompt yields measurable gains, suggesting the effectiveness of multilingual exposure itself. Our results highlight the potential of strategically leveraging multilingual resources to bridge the performance gap for underrepresented languages.
pdf
bib
abs
SEK: Self-Explained Keywords Empower Large Language Models for Code Generation
Lishui Fan
|
Mouxiang Chen
|
Zhongxin Liu
Large language models (LLMs) have achieved impressive performance in code generation. Despite the remarkable success, we observed that LLMs often misunderstand or overlook some problem-specific undertrained keywords during code generation, compromising the accuracy of the generated code. After explicitly explaining these undertrained keywords using well-trained terms in the prompt, LLMs are more likely to generate correct code implementation. Inspired by this observation, we propose a novel technique named SEK (Self-Explained Keywords), which empowers an LLM for better code generation by extracting and explaining the key terms in the problem description with the LLM itself. Comprehensive experiments across four benchmarks, i.e., HumanEval(+), MBPP(+), APPS and BigCodeBench, with five representative LLMs, show that SEK can significantly improve LLMs in code generation, yielding substantial and consistent gains. For instance, SEK improves the Pass@1 of DeepSeek-Coder-V2-Instruct from 85.4% to 93.3% on the HumanEval benchmark. Further analysis confirms that SEK enables the LLMs to shift their attention from low-frequency keywords to their corresponding explanations.
pdf
bib
abs
Why Not Act on What You Know? Unleashing Safety Potential of LLMs via Self-Aware Guard Enhancement
Peng Ding
|
Jun Kuang
|
ZongYu Wang
|
Xuezhi Cao
|
Xunliang Cai
|
Jiajun Chen
|
Shujian Huang
Large Language Models (LLMs) have shown impressive capabilities across various tasks but remain vulnerable to meticulously crafted jailbreak attacks. In this paper, we identify a critical safety gap: while LLMs are adept at detecting jailbreak prompts, they often produce unsafe responses when directly processing these inputs. Inspired by this insight, we propose SAGE (Self-Aware Guard Enhancement), a training-free defense strategy designed to align LLMs’ strong safety discrimination performance with their relatively weaker safety generation ability. SAGE consists of two core components: a Discriminative Analysis Module and a Discriminative Response Module, enhancing resilience against sophisticated jailbreak attempts through flexible safety discrimination instructions. Extensive experiments demonstrate SAGE’s effectiveness and robustness across various open-source and closed-source LLMs of different sizes and architectures, achieving an average 99% defense success rate against numerous complex and covert jailbreak methods while maintaining helpfulness on general benchmarks. We further conduct mechanistic interpretability analysis through hidden states and attention distributions, revealing the underlying mechanisms of this detection-generation discrepancy. Our work thus contributes to developing future LLMs with coherent safety awareness and generation behavior. Our code and datasets are publicly available at https://github.com/NJUNLP/SAGE.
pdf
bib
abs
Explorer: Scaling Exploration-driven Web Trajectory Synthesis for Multimodal Web Agents
Vardaan Pahuja
|
Yadong Lu
|
Corby Rosset
|
Boyu Gou
|
Arindam Mitra
|
Spencer Whitehead
|
Yu Su
|
Ahmed Hassan Awadallah
Recent success in large multimodal models (LMMs) has sparked promising applications of agents capable of autonomously completing complex web tasks. While open-source LMM agents have made significant advances in offline evaluation benchmarks, their performance still falls substantially short of human-level capabilities in more realistic online settings. A key bottleneck is the lack of diverse and large-scale trajectory-level datasets across various domains, which are expensive to collect. In this paper, we address this challenge by developing a scalable recipe to synthesize the largest and most diverse trajectory-level dataset to date, containing over 94K successful multimodal web trajectories, spanning 49K unique URLs, 720K screenshots, and 33M web elements. In particular, we leverage extensive web exploration and refinement to obtain diverse task intents. The average cost is 28 cents per successful trajectory, making it affordable to a wide range of users in the community. Leveraging this dataset, we train Explorer, a multimodal web agent, and demonstrate strong performance on both offline and online web agent benchmarks such as Mind2Web-Live, Multimodal-Mind2Web, and MiniWob++. Additionally, our experiments highlight data scaling as a key driver for improving web agent capabilities. We hope this study makes state-of-the-art LMM-based agent research at a larger scale more accessible.
pdf
bib
abs
Advancing General Multimodal Capability of Vision-language Models with Pyramid-descent Visual Position Encoding
Zhanpeng Chen
|
Mingxiao Li
|
Ziyang Chen
|
Nan Du
|
Xiaolong Li
|
Yuexian Zou
Vision-language Models (VLMs) have shown remarkable capabilities in advancing general artificial intelligence, yet the irrational encoding of visual positions persists in inhibiting the models’ comprehensive perception performance across different levels of granularity. In this work, we propose Pyramid-descent Visual Position Encoding (PyPE), a novel approach designed to enhance the perception of visual tokens within VLMs. By assigning visual position indexes from the periphery to the center and expanding the central receptive field incrementally, PyPE addresses the limitations of traditional raster-scan methods and mitigates the long-term decay effects induced by Rotary Position Embedding (RoPE). Our method reduces the relative distance between interrelated visual elements and instruction tokens, promoting a more rational allocation of attention weights and allowing for a multi-granularity perception of visual elements and countering the over-reliance on anchor tokens. Extensive experimental evaluations demonstrate that PyPE consistently improves the general capabilities of VLMs across various sizes. Code is available at https://anonymous.4open.science/r/PyPE-34EE.
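To make the periphery-to-center indexing above concrete, here is a minimal sketch that labels each cell of a visual token grid with the index of the concentric ring it lies on. It is an illustration under assumptions, not the PyPE implementation: giving all tokens on a ring the same index, starting at 0 on the outermost ring, and the function name `ring_position_indexes` are hypothetical choices.

```python
def ring_position_indexes(height, width):
    """Label each cell of a height x width grid with its ring index.

    Ring 0 is the outer border; the index grows toward the center.
    A cell's ring is its distance to the nearest grid edge.
    """
    return [
        [min(r, c, height - 1 - r, width - 1 - c) for c in range(width)]
        for r in range(height)
    ]


# A 4 x 5 grid of visual tokens: the border is ring 0, the interior ring 1.
grid = ring_position_indexes(4, 5)
for row in grid:
    print(row)
```

By contrast, a raster-scan scheme would number the same grid row by row from 0 to 19, so tokens that are spatially adjacent across rows receive distant indexes.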
pdf
bib
abs
P-React: Synthesizing Topic-Adaptive Reactions of Personality Traits via Mixture of Specialized LoRA Experts
Yuhao Dan
|
Jie Zhou
|
Qin Chen
|
Junfeng Tian
|
Liang He
Personalized large language models (LLMs) have attracted great attention in many applications, such as emotional support and role-playing. However, existing works primarily focus on modeling explicit character profiles, while ignoring the underlying personality traits that truly shape behaviors and decision-making, hampering the development of more anthropomorphic and psychologically-grounded AI systems. In this paper, we explore the modeling of Big Five personality traits, which is the most widely used trait theory in psychology, and propose P-React, a mixture of experts (MoE)-based personalized LLM. Particularly, we integrate a Personality Specialization Loss (PSL) to better capture individual trait expressions, providing a more nuanced and psychologically grounded personality simulacrum. To facilitate research in this field, we curate OCEAN-Chat, a high-quality, human-verified dataset designed to train LLMs in expressing personality traits across diverse topics. Extensive experiments demonstrate the effectiveness of P-React in maintaining consistent and real personality.
pdf
bib
abs
EssayJudge: A Multi-Granular Benchmark for Assessing Automated Essay Scoring Capabilities of Multimodal Large Language Models
Jiamin Su
|
Yibo Yan
|
Fangteng Fu
|
Zhang Han
|
Jingheng Ye
|
Xiang Liu
|
Jiahao Huo
|
Huiyu Zhou
|
Xuming Hu
Automated Essay Scoring (AES) plays a crucial role in educational assessment by providing scalable and consistent evaluations of writing tasks. However, traditional AES systems face three major challenges: (i) reliance on handcrafted features that limit generalizability, (ii) difficulty in capturing fine-grained traits like coherence and argumentation, and (iii) inability to handle multimodal contexts. In the era of Multimodal Large Language Models (MLLMs), we propose EssayJudge, the first multimodal benchmark to evaluate AES capabilities across lexical-, sentence-, and discourse-level traits. By leveraging MLLMs’ strengths in trait-specific scoring and multimodal context understanding, EssayJudge aims to offer precise, context-rich evaluations without manual feature engineering, addressing longstanding AES limitations. Our experiments with 18 representative MLLMs reveal gaps in AES performance compared to human evaluation, particularly in discourse-level traits, highlighting the need for further advancements in MLLM-based AES research. Our dataset and code will be available upon acceptance.
pdf
bib
abs
Streamlining the Collaborative Chain of Models into A Single Forward Pass in Generation-Based Tasks
Yuanjie Lyu
|
Chao Zhang
|
Yuhao Chen
|
Yong Chen
|
Tong Xu
In Retrieval-Augmented Generation (RAG) and agent-based frameworks, the “Chain of Models” approach is widely used, where multiple specialized models work sequentially on distinct sub-tasks. This approach is effective but increases resource demands as each model must be deployed separately. Recent advancements attempt to address this by applying prompt tuning, which allows a shared base model to adapt to multiple tasks with minimal parameter changes. However, a key challenge remains: intermediate outputs, passed between models as plain text, require recomputation of hidden states (i.e., Key and Value (KV) states in Transformers) during inference. In this paper, we introduce FTHSS, a novel prompt-tuning method that enables models to share KV hidden states, eliminating redundant forward passes and reducing KV cache storage. By modifying input and attention masks during training, FTHSS allows models to effectively utilize KV hidden states from prior models in both single- and multi-round scenarios. Empirical results on four tasks show that FTHSS matches the performance of traditional model chains while improving inference efficiency.
pdf
bib
abs
Self-Correction is More than Refinement: A Learning Framework for Visual and Language Reasoning Tasks
Jiayi He
|
Hehai Lin
|
Qingyun Wang
|
Yi R. Fung
|
Heng Ji
While Vision-Language Models (VLMs) have shown remarkable abilities, they invariably generate flawed responses. Self-correction that instructs models to refine their outputs presents a promising solution to this issue. Previous studies have mainly concentrated on Large Language Models (LLMs), while the self-correction abilities of VLMs, particularly concerning both visual and linguistic information, remain largely unexamined. This study investigates the self-correction capabilities of VLMs during both inference and fine-tuning stages. We introduce a Self-Correction Learning (SCL) approach that enables VLMs to learn from their self-generated self-correction data through Direct Preference Optimization (DPO) without relying on external feedback, facilitating self-improvement. Experimental results demonstrate that although VLMs struggle to self-correct effectively during iterative inference without additional fine-tuning and external feedback, they can enhance their performance and avoid previous mistakes through preference fine-tuning when their generated self-correction data are categorized into preferred and disfavored samples. This study emphasizes that self-correction is not merely a refinement process; rather, it should enhance models’ reasoning ability through additional training, enabling them to generate high-quality responses directly without further refinement.
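For readers unfamiliar with Direct Preference Optimization, the per-pair objective it minimizes can be sketched as follows. This is the standard DPO loss for a single preferred/disfavored pair, not code from the paper; the function name, argument names, and the default `beta` are assumptions chosen for illustration.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one (preferred, disfavored) response pair.

    Inputs are sequence log-probabilities under the trained policy and
    under a frozen reference model; beta is an assumed default.
    """
    margin = beta * ((policy_chosen_logp - ref_chosen_logp)
                     - (policy_rejected_logp - ref_rejected_logp))
    # -log(sigmoid(margin)) rewritten as log(1 + exp(-margin)).
    return math.log1p(math.exp(-margin))


# The loss falls when the policy favors the preferred response more than
# the reference model does.
neutral = dpo_loss(-5.0, -5.0, -5.0, -5.0)
improved = dpo_loss(-4.0, -6.0, -5.0, -5.0)
```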
pdf
bib
abs
Beyond Reactive Safety: Risk-Aware LLM Alignment via Long-Horizon Simulation
Chenkai Sun
|
Denghui Zhang
|
ChengXiang Zhai
|
Heng Ji
Given the growing influence of language model-based agents on high-stakes societal decisions, from public policy to healthcare, ensuring their beneficial impact requires understanding the far-reaching implications of their suggestions. We propose a proof-of-concept framework that projects how model-generated advice could propagate through societal systems on a macroscopic scale over time, enabling more robust alignment. To assess the long-term safety awareness of language models, we also introduce a dataset of 100 indirect harm scenarios, testing models’ ability to foresee adverse, non-obvious outcomes from seemingly harmless user prompts. Our approach achieves not only over 20% improvement on the new dataset but also an average win rate exceeding 70% against strong baselines on existing safety benchmarks (AdvBench, SafeRLHF, WildGuardMix), suggesting a promising direction for safer agents.
pdf
bib
abs
Probability-Consistent Preference Optimization for Enhanced LLM Reasoning
Yunqiao Yang
|
Houxing Ren
|
Zimu Lu
|
Ke Wang
|
Weikang Shi
|
Aojun Zhou
|
Junting Pan
|
Mingjie Zhan
|
Hongsheng Li
Recent advances in preference optimization have demonstrated significant potential for improving mathematical reasoning capabilities in large language models (LLMs). While current approaches leverage high-quality pairwise preference data through outcome-based criteria like answer correctness or consistency, they fundamentally neglect the internal logical coherence of responses. To overcome this, we propose Probability-Consistent Preference Optimization (PCPO), a novel framework that establishes dual quantitative metrics for preference selection: (1) surface-level answer correctness and (2) intrinsic token-level probability consistency across responses. Extensive experiments show that our PCPO consistently outperforms existing outcome-only criterion approaches across a diverse range of LLMs and benchmarks. Our code is publicly available at https://github.com/YunqiaoYang/PCPO.
pdf
bib
abs
IW-Bench: Evaluating Large Multimodal Models for Converting Image-to-Web
Hongcheng Guo
|
Wei Zhang
|
Junhao Chen
|
Yaonan Gu
|
Jian Yang
|
Junjia Du
|
Shaosheng Cao
|
Binyuan Hui
|
Tianyu Liu
|
Jianxin Ma
|
Chang Zhou
|
Zhoujun Li
Recently, advancements in large multimodal models have led to significant strides in image comprehension capabilities. Despite these advancements, there is a lack of a robust benchmark specifically for assessing the image‐to‐web conversion proficiency of these large models. It is essential to ensure the integrity of the web elements generated, which comprise both visible and invisible categories. Previous evaluation methods (e.g., BLEU) are notably susceptible to significant alterations due to the presence of invisible elements. Furthermore, it is crucial to measure the layout information of web pages—i.e., the positional relationships between elements—which has been overlooked by prior work. To address these challenges, we have curated and aligned a benchmark of images and corresponding web codes (IW-bench). Specifically, we propose Element Accuracy, which tests the completeness of elements by parsing the Document Object Model (DOM) tree. We also introduce Layout Accuracy to analyze positional relationships by converting the DOM tree into a common subsequence. In addition, we design a five‐hop multimodal Chain‐of‐Thought prompting strategy for improved performance, consisting of: 1) SoM prompt injection, 2) inferring elements, 3) inferring layout, 4) inferring web code, and 5) reflection. Our benchmark comprises 1,200 image–code pairs with varying levels of difficulty. We have conducted extensive experiments on existing large multimodal models, providing insights into their performance and identifying areas for improvement in the image‐to‐web domain.
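A rough sketch of comparing layouts through a common subsequence is given below. It assumes that each DOM tree has already been flattened into a sequence of tag names and that the score is normalized by the reference length; both choices, and the name `layout_overlap`, are illustrative assumptions rather than the benchmark's actual Layout Accuracy definition.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two sequences."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def layout_overlap(reference_tags, generated_tags):
    """Fraction of the reference tag order preserved in the generated page."""
    if not reference_tags:
        return 0.0
    return lcs_length(reference_tags, generated_tags) / len(reference_tags)


score = layout_overlap(["div", "h1", "p", "img"], ["div", "p", "img", "a"])
```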
pdf
bib
abs
TDCSA: LLM-Guided Top-Down Approach for Robust Citation Sentiment Analysis
Fan Gao
|
Jieyang Peng
|
Xiaoming Tao
|
Wang Youzheng
Citation Sentiment Analysis (CSA) plays a crucial role in understanding academic influence and knowledge diffusion. While pre-trained language models (PLMs) and large language models (LLMs) have shown remarkable success in general sentiment analysis, they encounter specialized challenges in CSA due to the less significant and implicit sentiment expressions in academic writing, as well as complex sentiment transitions. To address these challenges, we propose TDCSA, a Top-Down framework that leverages LLMs’ semantic understanding capabilities to enhance PLM-based CSA, which transforms the traditional bottom-up feature engineering paradigm into a top-down architecture. Our framework consists of three key components: (1) a Dual LLM Feature Generation module for robust quadruple extraction, (2) a Multi-view Feature Representation mechanism for neutral citation processing, and (3) a Quad Feature Enhanced PLM. Experiments demonstrate that TDCSA significantly outperforms existing methods, achieving state-of-the-art performance while maintaining robustness to quadruple quality variations.
pdf
bib
abs
DeepRTL2: A Versatile Model for RTL-Related Tasks
Yi Liu
|
Hongji Zhang
|
Yunhao Zhou
|
Zhengyuan Shi
|
Changran Xu
|
Qiang Xu
The integration of large language models (LLMs) into electronic design automation (EDA) has significantly advanced the field, offering transformative benefits, particularly in register transfer level (RTL) code generation and understanding. While previous studies have demonstrated the efficacy of fine-tuning LLMs for these generation-based tasks, embedding-based tasks, which are equally critical to EDA workflows, have been largely overlooked. These tasks, including natural language code search, RTL code functionality equivalence checking, and performance prediction, are essential for accelerating and optimizing the hardware design process. To address this gap, we present DeepRTL2, a family of versatile LLMs that unifies both generation- and embedding-based tasks related to RTL. By simultaneously tackling a broad range of tasks, DeepRTL2 represents the first model to provide a comprehensive solution to the diverse challenges in EDA. Through extensive experiments, we show that DeepRTL2 achieves state-of-the-art performance across all evaluated tasks.
pdf
bib
abs
The Self-Improvement Paradox: Can Language Models Bootstrap Reasoning Capabilities without External Scaffolding?
Yutao Sun
|
Mingshuai Chen
|
Tiancheng Zhao
|
Ruochen Xu
|
Zilun Zhang
|
Jianwei Yin
Self-improving large language models (LLMs) – i.e., to improve the performance of an LLM by fine-tuning it with synthetic data generated by itself – is a promising way to advance the capabilities of LLMs while avoiding extensive supervision. Existing approaches to self-improvement often rely on external supervision signals in the form of seed data and/or assistance from third-party models. This paper presents Crescent – a simple yet effective framework for generating high-quality synthetic question-answer data in a fully autonomous manner. Crescent first elicits the LLM to generate raw questions via a bait prompt, then diversifies these questions leveraging a rejection sampling-based self-deduplication, and finally feeds the questions to the LLM and collects the corresponding answers by means of majority voting. We show that Crescent sheds light on the potential of true self-improvement with zero external supervision signals for math reasoning; in particular, Crescent-generated question-answer pairs suffice to (i) improve the reasoning capabilities of an LLM while preserving its general performance (especially in the 0-shot setting); and (ii) distill LLM knowledge to weaker models more effectively than existing methods based on seed-dataset augmentation.
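The final answer-collection step, majority voting over several sampled answers, can be sketched as below. This is a generic illustration rather than Crescent's code: the function name is hypothetical, and answers are assumed to be already normalized to comparable strings.

```python
from collections import Counter


def majority_answer(samples):
    """Return the most frequent answer and the share of votes it received."""
    counts = Counter(samples)
    answer, votes = counts.most_common(1)[0]
    return answer, votes / len(samples)


# Five sampled answers to one generated question (made-up values).
answer, agreement = majority_answer(["12", "12", "15", "12", "9"])
```

The agreement ratio could also serve as a filter for low-consensus questions, although the abstract does not say whether Crescent does this.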
pdf
bib
abs
Cross-lingual Multimodal Sentiment Analysis for Low-Resource Languages via Language Family Disentanglement and Rethinking Transfer
Long Chen
|
Shuoyu Guan
|
Xiaohua Huang
|
Wen-Jing Wang
|
Cai Xu
|
Ziyu Guan
|
Wei Zhao
Existing multimodal sentiment analysis (MSA) methods have achieved significant success, leveraging cross-modal large-scale models (LLMs) and extensive pre-training data. However, these methods struggle to handle MSA tasks in low-resource languages. While multilingual LLMs enable cross-lingual transfer, they are limited to textual data and cannot address multimodal scenarios. To achieve MSA in low-resource languages, we propose a novel transfer learning framework named Language Family Disentanglement and Rethinking Transfer (LFD-RT). During pre-training, we establish cross-lingual and cross-modal alignments, followed by a language family disentanglement module that enhances the sharing of language universals within families while reducing noise from cross-family alignments. We propose a rethinking strategy for unsupervised fine-tuning that adapts the pre-trained model to MSA tasks in low-resource languages. Experimental results demonstrate the superiority of our method and its strong language-transfer capability on target low-resource languages. We commit to making our code and data publicly available, and the access link will be provided here.
pdf
bib
abs
Does Chain-of-Thought Reasoning Really Reduce Harmfulness from Jailbreaking?
Chengda Lu
|
Xiaoyu Fan
|
Yu Huang
|
Rongwu Xu
|
Jijie Li
|
Wei Xu
Jailbreak attacks have been observed to largely fail against recent reasoning models enhanced by Chain-of-Thought (CoT) reasoning. However, the underlying mechanism remains underexplored, and relying solely on reasoning capacity may raise security concerns. In this paper, we try to answer the question: Does CoT reasoning really reduce harmfulness from jailbreaking? Through rigorous theoretical analysis, we demonstrate that CoT reasoning has dual effects on jailbreaking harmfulness. Based on the theoretical insights, we propose a novel jailbreak method, FicDetail, whose practical performance validates our theoretical findings.
pdf
bib
abs
InternLM-XComposer2.5-Reward: A Simple Yet Effective Multi-Modal Reward Model
Yuhang Zang
|
Xiaoyi Dong
|
Pan Zhang
|
Yuhang Cao
|
Ziyu Liu
|
Shengyuan Ding
|
Shenxi Wu
|
Yubo Ma
|
Haodong Duan
|
Wenwei Zhang
|
Kai Chen
|
Dahua Lin
|
Jiaqi Wang
Despite the promising performance of Large Vision Language Models (LVLMs) in visual understanding, they occasionally generate incorrect outputs. While reward models (RMs) with reinforcement learning or test-time scaling offer the potential for improving generation quality, a critical gap remains: publicly available multi-modal RMs for LVLMs are scarce, and the implementation details of proprietary models are often unclear. We bridge this gap with InternLM-XComposer2.5-Reward (IXC-2.5-Reward), a simple yet effective multi-modal reward model that aligns LVLMs with human preferences. To ensure the robustness and versatility of IXC-2.5-Reward, we set up a high-quality multi-modal preference corpus spanning text, image, and video inputs across diverse domains, such as instruction following, general understanding, text-rich documents, mathematical reasoning, and video understanding. IXC-2.5-Reward achieves excellent results on the latest multi-modal reward model benchmark and shows competitive performance on text-only reward model benchmarks. We further demonstrate three key applications of IXC-2.5-Reward: (1) providing a supervisory signal for RL training, where integrating IXC-2.5-Reward with Proximal Policy Optimization (PPO) yields IXC-2.5-Chat, which shows consistent improvements in instruction following and multi-modal open-ended dialogue; (2) selecting the best response from candidate responses for test-time scaling; and (3) filtering outlier or noisy samples from existing image and video instruction tuning training data.
pdf
bib
abs
RATE-Nav: Region-Aware Termination Enhancement for Zero-shot Object Navigation with Vision-Language Models
Junjie Li
|
Nan Zhang
|
Xiaoyang Qu
|
Kai Lu
|
Guokuan Li
|
Jiguang Wan
|
Jianzong Wang
Object Navigation (ObjectNav) is a fundamental task in embodied artificial intelligence. Although current research has made significant progress in semantic map construction and target direction prediction, redundant exploration and exploration failures remain inevitable. A critical but underexplored direction is the timely termination of exploration to overcome these challenges. We observe a diminishing marginal effect between exploration steps and exploration rates and analyze the cost-benefit relationship of exploration. Inspired by this, we propose RATE-Nav, a Region-Aware Termination-Enhanced method. It includes a geometric predictive region segmentation algorithm and a region-based exploration estimation algorithm for exploration rate calculation. Combining the visual question answering capabilities of visual language models (VLMs) with exploration rates enables efficient termination. RATE-Nav achieves a success rate of 67.8% and an SPL of 31.3% on the HM3D dataset. On the more challenging MP3D dataset, RATE-Nav shows approximately 10% improvement over previous zero-shot methods.
pdf
bib
abs
RMoA: Optimizing Mixture-of-Agents through Diversity Maximization and Residual Compensation
Zhentao Xie
|
Chengcheng Han
|
Jinxin Shi
|
Wenjun Cui
|
Xin Zhao
|
Xingjiao Wu
|
Jiabao Zhao
Although multi-agent systems based on large language models show strong capabilities on multiple tasks, they are still limited by high computational overhead, information loss, and limited robustness. Inspired by ResNet’s residual learning, we propose Residual Mixture-of-Agents (RMoA), integrating residual connections to optimize efficiency and reliability. To maximize information utilization from model responses while minimizing computational costs, we design an embedding-based diversity selection mechanism that greedily selects responses via vector similarity. Furthermore, to mitigate iterative information degradation, we introduce a Residual Extraction Agent to preserve cross-layer incremental information by capturing inter-layer response differences, coupled with a Residual Aggregation Agent for hierarchical information integration. Additionally, we propose an adaptive termination mechanism that dynamically halts processing based on residual convergence, further improving inference efficiency. RMoA achieves state-of-the-art performance on benchmarks across alignment, mathematical reasoning, code generation, and multitasking understanding, while significantly reducing computational overhead. Code is available at https://github.com/mindhunter01/RMoA.
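A greedy, embedding-based diversity selection of the kind mentioned above might look like the following sketch. It is not taken from the RMoA repository: seeding with the first response and choosing the candidate whose highest cosine similarity to the already selected set is lowest are assumptions, as are the function names.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def select_diverse(embeddings, k):
    """Greedily pick up to k indices whose embeddings are mutually dissimilar."""
    if not embeddings or k <= 0:
        return []
    selected = [0]  # assumption: seed with the first response
    while len(selected) < min(k, len(embeddings)):
        best, best_score = None, None
        for i in range(len(embeddings)):
            if i in selected:
                continue
            # Similarity to the closest already-selected response.
            score = max(cosine(embeddings[i], embeddings[j]) for j in selected)
            if best_score is None or score < best_score:
                best, best_score = i, score
        selected.append(best)
    return selected


# Toy 2-D embeddings: responses 0 and 1 are near-duplicates.
picked = select_diverse([[1, 0], [0.9, 0.1], [0, 1], [-1, 0]], k=3)
```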
pdf
bib
abs
Instruction-Tuning Data Synthesis from Scratch via Web Reconstruction
Yuxin Jiang
|
Yufei Wang
|
Chuhan Wu
|
Xinyi Dai
|
Yan Xu
|
Weinan Gan
|
Yasheng Wang
|
Xin Jiang
|
Lifeng Shang
|
Ruiming Tang
|
Wei Wang
The improvement of LLMs’ instruction-following capabilities depends critically on the availability of high-quality instruction-response pairs. While existing automatic data synthesis methods alleviate the burden of manual curation, they often rely heavily on either the quality of seed data or strong assumptions about the structure and content of web documents. To tackle these challenges, we propose Web Reconstruction (WebR), a fully automated framework for synthesizing high-quality instruction-tuning (IT) data directly from raw web documents with minimal assumptions. Leveraging the inherent diversity of raw web content, we conceptualize web reconstruction as an instruction-tuning data synthesis task via a novel dual-perspective paradigm, Web as Instruction and Web as Response, where each web document is designated as either the input or output role to trigger the reconstruction process. Comprehensive experiments show that datasets generated by WebR outperform state-of-the-art baselines by up to 16.65% across four instruction-following benchmarks. Notably, WebR demonstrates superior compatibility, data efficiency, and scalability, enabling enhanced domain adaptation with minimal effort.
pdf
bib
abs
RLKGF: Reinforcement Learning from Knowledge Graph Feedback Without Human Annotations
Lian Yan
|
Chen Tang
|
Yi Guan
|
Haotian Wang
|
Songyuan Wang
|
Haifeng Liu
|
Yang Yang
|
Jingchi Jiang
Reinforcement Learning from Human Feedback (RLHF) has been shown to effectively align large language models (LLMs) with human knowledge. However, the lack of human preference labels remains a significant bottleneck when applying RLHF to a downstream domain. Humans in RLHF play a critical role in injecting reasoning preferences into LLMs, and we assume the reasoning process underlying human assessments may potentially be replaced by reasoning pathways derived from Knowledge Graphs (KGs). Inspired by this assumption, we propose Reinforcement Learning from Knowledge Graph Feedback (RLKGF), a novel method that leverages KG semantics and structure to derive RL rewards in the absence of manual annotations. Unlike Reinforcement Learning from AI Feedback (RLAIF), RLKGF directly integrates human priors encoded in KGs as the reward model, aligning LLM responses with expert knowledge without additional preference labeling or reward model training. RLKGF structures context-relevant facts into knowledge subgraphs and defines rewards by simulating information flow across semantic and logical connections between question and candidate response entities. Experiments on three public and one private medical dialogue dataset demonstrate that RLKGF significantly outperforms the competitive RLAIF in improving LLM diagnostic accuracy. The code is available at https://github.com/YanPioneer/RLKGF.
pdf
bib
abs
Learning Task Representations from In-Context Learning
Baturay Saglam
|
Xinyang Hu
|
Zhuoran Yang
|
Dionysis Kalogerias
|
Amin Karbasi
Large language models (LLMs) have demonstrated remarkable proficiency in in-context learning (ICL), where models adapt to new tasks through example-based prompts without requiring parameter updates. However, understanding how tasks are internally encoded and generalized remains a challenge. To address some of the empirical and technical gaps in the literature, we introduce an automated formulation for encoding task information in ICL prompts as a function of attention heads within the transformer architecture. This approach computes a single task vector as a weighted sum of attention heads, with the weights optimized causally via gradient descent. Our findings show that existing methods fail to generalize effectively to modalities beyond text. In response, we also design a benchmark to evaluate whether a task vector can preserve task fidelity in functional regression tasks. The proposed method successfully extracts task-specific information from in-context demonstrations and excels in both text and regression tasks, demonstrating its generalizability across modalities.
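The core operation, a task vector formed as a weighted sum of attention-head outputs, can be sketched in a few lines. The sketch uses fixed weights and made-up head outputs purely for illustration; in the paper the weights are optimized via gradient descent, and the function name `task_vector` is an assumption.

```python
def task_vector(head_outputs, weights):
    """Weighted sum of per-head output vectors.

    head_outputs: one equal-length vector per attention head.
    weights: one scalar per head (learned in the paper, fixed here).
    """
    if len(head_outputs) != len(weights):
        raise ValueError("need exactly one weight per attention head")
    dim = len(head_outputs[0])
    return [
        sum(w * h[d] for w, h in zip(weights, head_outputs))
        for d in range(dim)
    ]


# Three hypothetical heads with 2-dimensional outputs.
vec = task_vector([[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]], [0.5, 0.25, 0.0])
```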
pdf
bib
abs
CAVGAN: Unifying Jailbreak and Defense of LLMs via Generative Adversarial Attacks on their Internal Representations
Xiaohu Li
|
Yunfeng Ning
|
Zepeng Bao
|
Mayi Xu
|
Jianhao Chen
|
Tieyun Qian
Security alignment enables the Large Language Model (LLM) to gain protection against malicious queries, but various jailbreak attack methods reveal the vulnerability of this security mechanism. Previous studies have isolated LLM jailbreak attacks and defenses. We analyze the security protection mechanism of the LLM, and propose a framework that combines attack and defense. Our method is based on the linearly separable property of LLM intermediate layer embedding, as well as the essence of jailbreak attack, which aims to embed harmful problems and transfer them to the safe area. We utilize a generative adversarial network (GAN) to learn the security judgment boundary inside the LLM to achieve efficient jailbreak attack and defense. The experimental results indicate that our method achieves an average jailbreak success rate of 88.85% across three popular LLMs, while the defense success rate on the state-of-the-art jailbreak dataset reaches an average of 84.17%. This not only validates the effectiveness of our approach but also sheds light on the internal security mechanisms of LLMs, offering new insights for enhancing model security. Warning: This paper contains some harmful text examples.
pdf
bib
abs
Firm or Fickle? Evaluating Large Language Models Consistency in Sequential Interactions
Yubo Li
|
Yidi Miao
|
Xueying Ding
|
Ramayya Krishnan
|
Rema Padman
Large Language Models (LLMs) have shown remarkable capabilities across various tasks, but their deployment in high-stakes domains requires consistent and coherent behavior across multiple rounds of user interaction. This paper introduces a comprehensive framework for evaluating and improving LLM response consistency, making three key contributions. First, we introduce Position-Weighted Consistency (PWC), a metric designed to capture both the importance of early-stage stability and recovery patterns in multi-turn interactions. Second, we present MT-Consistency, a carefully curated benchmark dataset spanning diverse domains and difficulty levels, specifically designed to evaluate LLM consistency under various challenging follow-up scenarios. Third, we introduce Confidence-Aware Response Generation (CARG), a framework that significantly improves response stability by explicitly integrating internal model confidence scores during the generation process. Experimental results demonstrate that CARG significantly improves response stability without sacrificing accuracy, offering a practical path toward more dependable LLM behavior in critical, real-world deployments.
pdf
bib
abs
OS-Kairos: Adaptive Interaction for MLLM-Powered GUI Agents
Pengzhou Cheng
|
Zheng Wu
|
Zongru Wu
|
Tianjie Ju
|
Aston Zhang
|
Zhuosheng Zhang
|
Gongshen Liu
Autonomous graphical user interface (GUI) agents powered by multimodal large language models have shown great promise. However, a critical yet underexplored issue persists: over-execution, where the agent executes tasks in a fully autonomous way, without adequate assessment of its action confidence to compromise an adaptive human-agent collaboration. This poses substantial risks in complex scenarios, such as those involving ambiguous user instructions, unexpected interruptions, and environmental hijacks. To address the issue, we introduce OS-Kairos, an adaptive GUI agent capable of predicting confidence levels at each interaction step and efficiently deciding whether to act autonomously or seek human intervention. OS-Kairos is developed through two key mechanisms: (i) collaborative probing that annotates confidence scores at each interaction step; (ii) confidence-driven interaction that leverages these confidence scores to elicit the ability of adaptive interaction. Experimental results show that OS-Kairos substantially outperforms existing models on our curated dataset featuring complex scenarios, as well as on established benchmarks such as AITZ and Meta-GUI, with 24.59%~87.29% improvements in task success rate. OS-Kairos facilitates an adaptive human-agent collaboration, prioritizing effectiveness, generality, scalability, and efficiency for real-world GUI interaction. The dataset and codes are available at Anonymous.
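The confidence-driven choice between acting autonomously and seeking human intervention can be pictured with a simple threshold rule. This is only a schematic sketch: the threshold value, the function name `next_move`, and the action strings are hypothetical, and the abstract does not state how OS-Kairos actually maps confidence to a decision.

```python
def next_move(action, confidence, threshold=0.8):
    """Execute the proposed GUI action or hand control to the user.

    The 0.8 threshold is an arbitrary illustrative value.
    """
    if confidence >= threshold:
        return ("execute", action)
    return ("ask_human", action)


confident = next_move("tap('Checkout')", confidence=0.93)
unsure = next_move("tap('Delete account')", confidence=0.41)
```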
pdf
bib
abs
Red-Teaming LLM Multi-Agent Systems via Communication Attacks
Pengfei He
|
Yuping Lin
|
Shen Dong
|
Han Xu
|
Yue Xing
|
Hui Liu
Large Language Model-based Multi-Agent Systems (LLM-MAS) have revolutionized complex problem-solving capability by enabling sophisticated agent collaboration through message-based communications. While the communication framework is crucial for agent coordination, it also introduces a critical yet unexplored security vulnerability. In this work, we introduce Agent-in-the-Middle (AiTM), a novel attack that exploits the fundamental communication mechanisms in LLM-MAS by intercepting and manipulating inter-agent messages. Unlike existing attacks that compromise individual agents, AiTM demonstrates how an adversary can compromise entire multi-agent systems by only manipulating the messages passing between agents. To enable the attack under the challenges of limited control and role-restricted communication format, we develop an LLM-powered adversarial agent with a reflection mechanism that generates contextually-aware malicious instructions. Our comprehensive evaluation across various frameworks, communication structures, and real-world applications demonstrates that LLM-MAS is vulnerable to communication-based attacks, highlighting the need for robust security measures in multi-agent systems.
pdf
bib
abs
Can We Trust AI Doctors? A Survey of Medical Hallucination in Large Language and Large Vision-Language Models
Zhihong Zhu
|
Yunyan Zhang
|
Xianwei Zhuang
|
Fan Zhang
|
Zhongwei Wan
|
Yuyan Chen
|
Qingqing Long
|
Yefeng Zheng
|
Xian Wu
Hallucination has emerged as a critical challenge for large language models (LLMs) and large vision-language models (LVLMs), particularly in high-stakes medical applications. Despite its significance, dedicated research on medical hallucination remains unexplored. In this survey, we first provide a unified perspective on medical hallucination for both LLMs and LVLMs, and delve into its causes. Subsequently, we review recent advancements in detecting, evaluating, and mitigating medical hallucinations, offering a comprehensive overview of evaluation benchmarks, metrics, and strategies developed to tackle this issue. Moreover, we delineate the current challenges and delve into new frontiers, thereby shedding light on future research. We hope this work coupled with open-source resources can provide the community with quick access and spur breakthrough research in medical hallucination.
pdf
bib
abs
DRT: Deep Reasoning Translation via Long Chain-of-Thought
Jiaan Wang
|
Fandong Meng
|
Yunlong Liang
|
Jie Zhou
Recently, O1-like models have emerged as representative examples, illustrating the effectiveness of long chain-of-thought (CoT) in reasoning tasks such as math and coding. In this paper, we introduce DRT, an attempt to bring the success of long CoT to neural machine translation (MT). Specifically, literary books often involve similes and metaphors, and translating these texts into a target language is very difficult in practice due to cultural differences. In such cases, literal translation often fails to convey the intended meaning effectively. Even for professional human translators, considerable thought must be given to preserving semantics throughout the translation process. To simulate LLMs’ long thought ability in MT, we first mine sentences containing similes or metaphors from existing literary books, and then develop a multi-agent framework to translate these sentences via long thought. In the multi-agent framework, a translator iteratively translates the source sentence under the suggestions provided by an advisor. To ensure the effectiveness of the long thoughts, an evaluator is also employed to quantify the translation quality in each round. In this way, we collect tens of thousands of long-thought MT samples, which are used to train our DRT. Using Qwen2.5 and Llama-3.1 as the backbones, DRT models can learn the thought process during machine translation, and outperform vanilla LLMs as well as LLMs that are simply fine-tuned on the paired sentences without long thought, showing its effectiveness.
pdf
bib
abs
CTPD: Cross-Modal Temporal Pattern Discovery for Enhanced Multimodal Electronic Health Records Analysis
Fuying Wang
|
Feng Wu
|
Yihan Tang
|
Lequan Yu
Integrating multimodal clinical records, such as Electronic Health Records (EHR) and free-text clinical reports, has shown great potential in predicting clinical outcomes. However, prior work has primarily focused on capturing temporal interactions within individual samples and fusing multimodal information, overlooking critical temporal patterns across different patients. These patterns, such as trends in vital signs like abnormal heart rate or blood pressure, can indicate deteriorating health or an impending critical event of any individual in a given population. Similarly, clinical notes often contain textual descriptions that reflect these changes. Identifying corresponding temporal patterns across different modalities is crucial for improving the accuracy of clinical outcome predictions, yet it remains a challenging task. To address this gap, we introduce a Cross-modal Temporal Pattern Discovery (CTPD) framework, designed to efficiently extract meaningful cross-modal temporal patterns from multimodal EHR data. Our approach introduces shared initial temporal pattern representations and refines them using slot attention to generate temporal semantic embeddings. To ensure rich cross-modal temporal semantics in the learned patterns, we introduce a Temporal Pattern Noise Contrastive Estimation (TP-NCE) loss for cross-modal alignment, along with two reconstruction losses to retain core information of each modality. Evaluations on two clinically critical tasks, 48-hour in-hospital mortality and 24-hour phenotype classification, using the MIMIC-III database demonstrate the superiority of our method over existing approaches. The code is anonymously available at https://github.com/HKU-MedAI/CTPD.
pdf
bib
abs
Vision-aided Unsupervised Constituency Parsing with Multi-MLLM Debating
Dong Zhang
|
Haiyan Tian
|
Qingying Sun
|
Shoushan Li
This paper presents a novel framework for vision-aided unsupervised constituency parsing (VUCP), leveraging multimodal large language models (MLLMs) pre-trained on diverse image-text or video-text data. Unlike previous methods requiring explicit cross-modal alignment, our approach eliminates this need by using pre-trained models like Qwen-VL and VideoLLaVA, which seamlessly handle multimodal inputs. We introduce two multi-agent debating mechanisms—consensus-driven (CD) and round-driven (RD)—to enable cooperation between models with complementary strengths. Extensive experiments demonstrate that our approach achieves state-of-the-art performance on both image-text and video-text datasets for VUCP, improving robustness and accuracy.
pdf
bib
abs
Inter-Passage Verification for Multi-evidence Multi-answer QA
Bingsen Chen
|
Shenji Wan
|
Xi Ye
|
Chen Zhao
Multi-answer question answering (QA), where questions can have many valid answers, presents a significant challenge for existing retrieval-augmented generation-based QA systems, as these systems struggle to retrieve and then synthesize a large number of evidence passages. To tackle these challenges, we propose a new multi-answer QA framework – Retrieval-augmented Independent Reading with Inter-passage Verification (RI²VER). Our framework retrieves a large set of passages and processes each passage individually to generate an initial high-recall but noisy answer set. Then we propose a new inter-passage verification pipeline that validates every candidate answer through (1) Verification Question Generation, (2) Gathering Additional Evidence, and (3) Verification with inter-passage synthesis. Evaluations on the QAMPARI and RoMQA datasets demonstrate that our framework significantly outperforms existing baselines across various model sizes, achieving an average F1 score improvement of 11.17%. Further analysis validates that our inter-passage verification pipeline enables our framework to be particularly beneficial for questions requiring multi-evidence synthesis.
pdf
bib
abs
PROMTEC: Fast LLM Inference Decoding using Prompt Multi-Lookup with Template Database and Common Sequences
Alan Chi-Man Lee
|
Wing-Sun Cheng
|
Calvin Chun-Kit Chan
We propose PROMTEC, a novel multi-faceted approach to accelerate the inference of large language models (LLMs) by leveraging three key techniques: Prompt Multi-Lookup, Template Datastore, and Common Sequences methods. Prompt Multi-Lookup enhances the autoregressive decoding efficiency by generating multiple candidate sequences from context. Template Datastore exploits structured patterns, particularly in mathematical and code generation tasks, to enable fast and accurate candidate generation. Common Sequences optimizes inference by precomputing frequent short sequences in specialized domains. For mathematical generation, PROMTEC achieves a 3.91× speedup on the miniF2F benchmark. For code generation, it achieves up to a 4.23× speedup on the HumanEval benchmark. This work highlights the potential of integrated candidate generation to accelerate LLM inference while maintaining high-quality outputs.
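The idea of generating multiple candidate sequences from context can be illustrated with a minimal sketch of n-gram lookup over the token history. This is not the authors' implementation: the function name `prompt_lookup_candidates` and the parameters `ngram_size`, `max_candidates`, and `span` are hypothetical, and in a real decoder the drafted candidates would still have to be verified by the model before being accepted.

```python
from typing import List, Sequence


def prompt_lookup_candidates(
    tokens: Sequence[int],
    ngram_size: int = 2,      # hypothetical: length of the trailing n-gram to match
    max_candidates: int = 3,  # hypothetical: how many distinct drafts to return
    span: int = 4,            # hypothetical: maximum length of each draft
) -> List[List[int]]:
    """Propose draft continuations by matching the trailing n-gram earlier in the context."""
    if len(tokens) < ngram_size:
        return []
    suffix = tuple(tokens[-ngram_size:])
    candidates: List[List[int]] = []
    last_start = len(tokens) - ngram_size
    # Scan earlier positions from most recent to oldest, skipping the suffix itself.
    for start in range(last_start - 1, -1, -1):
        if tuple(tokens[start:start + ngram_size]) == suffix:
            continuation = list(tokens[start + ngram_size:start + ngram_size + span])
            if continuation and continuation not in candidates:
                candidates.append(continuation)
            if len(candidates) == max_candidates:
                break
    return candidates


# The trailing bigram (1, 2) occurs twice earlier, so two drafts are proposed.
drafts = prompt_lookup_candidates([1, 2, 3, 4, 1, 2, 5, 6, 1, 2])
```

Returning several drafts rather than one is what distinguishes a multi-lookup from a single lookup in this sketch; how the abstract's method ranks or verifies them is not specified here.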
pdf
bib
abs
Logical DA: Enhancing Data Augmentation for Logical Reasoning via a Multi-Agent System
Haoqi Zheng
|
DongWang DongWang
|
Silin Yang
|
Yunpeng Qi
|
Ruochun Jin
|
Liyang Xu
Recent advancements in large language models (LLMs) have highlighted the importance of improving their reasoning capabilities. A critical challenge lies in the scarcity of high-quality reasoning data (characterized by diversity and rich supervisory signals) necessary for robust model training. While data augmentation (DA) methods have been leveraged to mitigate this scarcity, prevailing approaches often introduce noise and exhibit logical inconsistencies, thereby diminishing their utility for complex reasoning tasks. Moreover, existing DA paradigms predominantly isolate data synthesis from label validation, failing to unify these complementary processes within a cohesive architecture. To address these limitations, we introduce Logical DA, a multi-agent framework for enhancing reasoning-focused data augmentation in few-shot learning scenarios. Our system includes four agents operating through two synergistic phases: (1) diverse data generation, and (2) label verification. The system incorporates a reflection mechanism to continuously improve data quality by leveraging feedback from logical validation. We demonstrate the effectiveness of Logical DA through experiments on various tasks and datasets, achieving the highest average improvement in task accuracy in both fine-tuning and in-context learning paradigms, with an average improvement of 7.61% when applied to fine-tuning.
pdf
bib
abs
Adapting General-Purpose Embedding Models to Private Datasets Using Keyword-based Retrieval
Yubai Wei
|
Jiale Han
|
Yi Yang
Text embedding models play a cornerstone role in AI applications, such as retrieval-augmented generation (RAG). While general-purpose text embedding models demonstrate strong performance on generic retrieval benchmarks, their effectiveness diminishes when applied to private datasets (e.g., company-specific proprietary data), which often contain specialized terminology and lingo. In this work, we introduce BMEmbed, a novel method for adapting general-purpose text embedding models to private datasets. By leveraging the well-established keyword-based retrieval technique (BM25), we construct supervisory signals from the ranking of keyword-based retrieval results to facilitate model adaptation. We evaluate BMEmbed across a range of domains, datasets, and models, showing consistent improvements in retrieval performance. Moreover, we provide empirical insights into how BM25-based signals contribute to improving embeddings by fostering alignment and uniformity, highlighting the value of this approach in adapting models to domain-specific data. We release the source code for the research community.
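As a rough illustration of how a BM25 ranking could be turned into a supervisory signal, the sketch below scores tokenized documents with the standard Okapi BM25 formula and then takes the top-ranked and bottom-ranked documents as a positive and a negative example. The triple construction is an assumption made for illustration, not the BMEmbed procedure; `bm25_scores` and `ranking_to_training_triple` are hypothetical names, and `k1` and `b` are set to commonly used defaults.

```python
import math
from collections import Counter
from typing import Dict, List


def bm25_scores(query: List[str], docs: List[List[str]],
                k1: float = 1.5, b: float = 0.75) -> List[float]:
    """Score each tokenized document against the query with Okapi BM25."""
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n
    # Document frequency: number of documents containing each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(d) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def ranking_to_training_triple(query: List[str], docs: List[List[str]],
                               scores: List[float]) -> Dict[str, List[str]]:
    """Assumed scheme: best-ranked document is the positive, worst-ranked the negative."""
    order = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
    return {"query": query, "positive": docs[order[0]], "negative": docs[order[-1]]}


docs = [["refund", "policy", "for", "widgets"],
        ["widget", "assembly", "guide"],
        ["holiday", "schedule"]]
query = ["refund", "policy"]
scores = bm25_scores(query, docs)
triple = ranking_to_training_triple(query, docs, scores)
```

Triples of this kind could feed a contrastive fine-tuning loss for the embedding model; which loss and sampling strategy the paper uses is not stated in the abstract.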
pdf
bib
abs
SQL Injection Jailbreak: A Structural Disaster of Large Language Models
Jiawei Zhao
|
Kejiang Chen
|
Weiming Zhang
|
Nenghai Yu
Large Language Models (LLMs) are susceptible to jailbreak attacks that can induce them to generate harmful content. Previous jailbreak methods primarily exploited the internal properties or capabilities of LLMs, such as optimization-based jailbreak methods and methods that leveraged the model’s context-learning abilities. In this paper, we introduce a novel jailbreak method, SQL Injection Jailbreak (SIJ), which targets the external properties of LLMs, specifically, the way LLMs construct input prompts. By injecting jailbreak information into user prompts, SIJ successfully induces the model to output harmful content. For open-source models, SIJ achieves nearly 100% attack success rates on five well-known LLMs on AdvBench and HEx-PHI, while incurring lower time costs compared to previous methods. For closed-source models, SIJ achieves an average attack success rate over 85% across five models in the GPT and Doubao series. Additionally, SIJ exposes a new vulnerability in LLMs that urgently requires mitigation. To address this, we propose a simple adaptive defense method called Self-Reminder-Key to counter SIJ and demonstrate its effectiveness through experimental results. Our code is available at https://github.com/weiyezhimeng/SQL-Injection-Jailbreak.
pdf
bib
abs
TAMP: Token-Adaptive Layerwise Pruning in Multimodal Large Language Models
Jaewoo Lee
|
Keyang Xuan
|
Chanakya Ekbote
|
Sandeep Polisetty
|
Yi R. Fung
|
Paul Pu Liang
Multimodal Large Language Models (MLLMs) have shown remarkable versatility in understanding diverse multimodal data and tasks. However, these capabilities come with an increased model scale. While post-training pruning reduces model size in unimodal models, its application to MLLMs often yields limited success. Our analysis discovers that conventional methods fail to account for the unique token attributes across layers and modalities inherent to MLLMs. Inspired by this observation, we propose TAMP, a simple yet effective pruning framework tailored for MLLMs, featuring two key components: (1) Diversity-Aware Sparsity, which adjusts sparsity ratio per layer based on diversities among multimodal output tokens, preserving more parameters in high-diversity layers; and (2) Adaptive Multimodal Input Activation, which identifies representative multimodal input tokens using attention scores to guide unstructured weight pruning. We validate our method on two state-of-the-art MLLMs: LLaVA-NeXT, designed for vision-language tasks, and VideoLLaMA2, capable of processing audio, visual, and language modalities. Empirical experiments across various multimodal evaluation benchmarks demonstrate that each component of our approach substantially outperforms existing pruning techniques. Our code is available at https://github.com/G-JWLee/TAMP
pdf
bib
abs
Generative Music Models’ Alignment with Professional and Amateur Users’ Expectations
Zihao Wang
|
Jiaxing Yu
|
Haoxuan Liu
|
Zehui Zheng
|
Yuhang Jin
|
Shuyu Li
|
Shulei Ji
|
Kejun Zhang
Recent years have witnessed rapid advancements in text-to-music generation using large language models, yielding notable outputs. A critical challenge is understanding users with diverse musical expertise and generating music that meets their expectations, an area that remains underexplored. To address this gap, we introduce the novel task of Professional and Amateur Description-to-Song Generation. This task focuses on aligning generated content with human expressions from varying musical proficiency levels, aiming to produce songs that accurately meet auditory expectations and adhere to musical structural conventions. We utilize the MuChin dataset, which contains annotations from both professionals and amateurs for identical songs, as the source for these distinct description types. We also collect a pre-training dataset of over 1.5 million songs; lyrics were included for some, while for others, lyrics were generated using Automatic Speech Recognition (ASR) models. Furthermore, we propose MuDiT/MuSiT, a single-stage framework designed to enhance human-machine alignment in song generation. This framework employs Chinese MuLan (ChinMu) for cross-modal comprehension between natural language descriptions and auditory musical attributes, thereby aligning generated songs with user-defined outcomes. Concurrently, a DiT/SiT model facilitates end-to-end generation of complete song audio, encompassing both vocals and instrumentation. We propose metrics to evaluate semantic and auditory discrepancies between generated content and target music. Experimental results demonstrate that MuDiT/MuSiT outperforms baseline models and exhibits superior alignment with both professional and amateur song descriptions.
pdf
bib
abs
LLM-Forest: Ensemble Learning of LLMs with Graph-Augmented Prompts for Data Imputation
Xinrui He
|
Yikun Ban
|
Jiaru Zou
|
Tianxin Wei
|
Curtiss Cook
|
Jingrui He
Missing data imputation is a critical challenge in various domains, such as healthcare and finance, where data completeness is vital for accurate analysis. Large language models (LLMs), trained on vast corpora, have shown strong potential in data generation, making them a promising tool for data imputation. However, challenges persist in designing effective prompts for a finetuning-free process and in mitigating biases and uncertainty in LLM outputs. To address these issues, we propose a novel framework, LLM-Forest, which introduces a “forest” of few-shot learning LLM “trees” whose outputs are aggregated via confidence-based weighted voting based on LLM self-assessment, inspired by ensemble learning (Random Forest). This framework is established on a new concept of bipartite information graphs to identify high-quality relevant neighboring entries with both feature and value granularity. Extensive experiments on 9 real-world datasets demonstrate the effectiveness and efficiency of LLM-Forest. The implementation is available at https://github.com/Xinrui17/LLM-Forest
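Confidence-based weighted voting can be sketched in a few lines, assuming each LLM "tree" returns an imputed value together with a self-assessed confidence in [0, 1]. The function name `confidence_weighted_vote` and the input shape are hypothetical, and the exact weighting used in LLM-Forest may differ from this plain sum of confidences.

```python
from collections import defaultdict
from typing import Dict, Hashable, Iterable, Tuple


def confidence_weighted_vote(predictions: Iterable[Tuple[Hashable, float]]) -> Hashable:
    """Return the value whose summed confidence across all trees is largest."""
    totals: Dict[Hashable, float] = defaultdict(float)
    for value, confidence in predictions:
        totals[value] += confidence
    if not totals:
        raise ValueError("no predictions to aggregate")
    return max(totals, key=totals.get)


# Two moderately confident trees agreeing on "B" outweigh one confident tree voting "A".
winner = confidence_weighted_vote([("A", 0.9), ("B", 0.6), ("B", 0.5)])
```

Summing confidences lets several moderately confident trees outvote a single highly confident one, which is the usual motivation for weighting votes instead of counting them.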
pdf
bib
abs
Task Calibration: Calibrating Large Language Models on Inference Tasks
Yingjie Li
|
Yun Luo
|
Xiaotian Xie
|
Yue Zhang
Large language models (LLMs) have exhibited impressive zero-shot performance on inference tasks. However, LLMs may suffer from spurious correlations between input texts and output labels, which limits LLMs’ ability to reason based purely on general language understanding. For example, in the natural language inference (NLI) task, LLMs may make predictions primarily based on premise or hypothesis, rather than both components. To address this problem that may lead to unexpected performance degradation, we propose task calibration (TC), a zero-shot and inference-only calibration method inspired by mutual information which recovers LLM performance through task reformulation. In NLI, TC encourages LLMs to reason based on both premise and hypothesis, while mitigating the models’ over-reliance on individual premise or hypothesis for inference. Experimental results show that TC achieves a substantial improvement on 13 different benchmarks in the zero-shot setup. We further validate the effectiveness of TC in few-shot setups and various natural language understanding tasks. Further analysis indicates that TC is also robust to prompt templates and has the potential to be integrated with other calibration methods. We publicly release our code to facilitate future research.
pdf
bib
abs
MiniELM: A Lightweight and Adaptive Query Rewriting Framework for E-Commerce Search Optimization
Duy A. Nguyen
|
Rishi Kesav Mohan
|
Shimeng Yang
|
Pritom Saha Akash
|
Kevin Chen-Chuan Chang
Query rewriting (QR) is a critical technique in e-commerce search, addressing the lexical gap between user queries and product descriptions to enhance search performance. Existing QR approaches typically fall into two categories: discriminative models and generative methods leveraging large language models (LLMs). Discriminative models often struggle with natural language understanding and offer limited flexibility in rewriting, while generative LLMs, despite producing high-quality rewrites, face high inference latency and cost in online settings. These limitations force offline deployment, making them vulnerable to issues like information staleness and semantic drift. To overcome these challenges, we propose a novel hybrid pipeline for QR that balances efficiency and effectiveness. Our approach combines offline knowledge distillation to create a lightweight but efficient student model with online reinforcement learning (RL) to refine query rewriting dynamically using real-time feedback. A key innovation is the use of LLMs as simulated human feedback, enabling scalable reward signals and cost-effective evaluation without manual annotations. Experimental results on the Amazon ESCI dataset demonstrate significant improvements in query relevance, diversity, and adaptability, as well as positive feedback from the LLM simulation. This work contributes to advancing LLM capabilities for domain-specific applications, offering a robust solution for dynamic and complex e-commerce search environments.
pdf
bib
abs
Visibility as Survival: Generalizing NLP for Native Alaskan Language Identification
Ivory Yang
|
Chunhui Zhang
|
Yuxin Wang
|
Zhongyu Ouyang
|
Soroush Vosoughi
Indigenous languages remain largely invisible in commercial language identification (LID) systems, a stark reality exemplified by Google Translate’s LangID tool, which supports over 100 languages but excludes all 150 Indigenous languages of North America. This technological marginalization is particularly acute for Alaska’s 20 Native languages, all of which face endangerment despite their rich linguistic heritage. We present GenAlaskan, a framework demonstrating how both large language models and specialized classifiers can effectively identify these languages with minimal data. Working closely with Native Alaskan community members, we create Akutaq-2k, a carefully curated dataset of 2000 sentences spanning all 20 languages, named after the traditional Yup’ik dessert, symbolizing the blending of diverse elements. We design few-shot prompting on proprietary and open-source LLMs, achieving nearly perfect accuracy with just 40 examples per language. While initial zero-shot attempts show limited success, our systematic attention head pruning revealed critical architectural components for accurate language differentiation, providing insights into model decision-making for low-resource languages. Our results challenge the notion that effective Indigenous language identification requires massive resources or corporate infrastructure, demonstrating that targeted technological interventions can drive meaningful progress in preserving endangered languages in the digital age.
pdf
bib
abs
KodCode: A Diverse, Challenging, and Verifiable Synthetic Dataset for Coding
Zhangchen Xu
|
Yang Liu
|
Yueqin Yin
|
Mingyuan Zhou
|
Radha Poovendran
We introduce KodCode, a synthetic dataset that addresses the persistent challenge of acquiring high-quality, verifiable training data across diverse difficulties and domains for training Large Language Models for coding. Existing code-focused resources typically fail to ensure either the breadth of coverage (e.g., spanning simple coding tasks to advanced algorithmic problems) or verifiable correctness (e.g., unit tests). In contrast, KodCode comprises question–solution–test triplets that are systematically validated via a self-verification procedure. Our pipeline begins by synthesizing a broad range of coding questions, then generates solutions and test cases with additional attempts allocated to challenging problems. Finally, post-training data synthesis is done by rewriting questions into diverse formats and generating responses under a test-based reject sampling procedure from a reasoning model (DeepSeek R1). This pipeline yields a large-scale, robust and diverse coding dataset. It is suitable for supervised fine-tuning and the paired unit tests also provide great potential for RL tuning. Fine-tuning experiments on coding benchmarks (HumanEval(+), MBPP(+), BigCodeBench, and LiveCodeBench) demonstrate that KodCode-tuned models achieve state-of-the-art performance, surpassing models like Qwen2.5-Coder-32B-Instruct and DeepSeek-R1-Distill-Llama-70B.
pdf
bib
abs
Select, Read, and Write: A Multi-Agent Framework of Full-Text-based Related Work Generation
Xiaochuan Liu
|
Ruihua Song
|
Xiting Wang
|
Xu Chen
Automatic related work generation (RWG) can save people’s time and effort when writing a draft of the related work section (RWS) for further revision. However, existing methods for RWG always suffer from shallow comprehension, because they take only limited portions of the reference papers as input, and from isolated explanations of each reference, because they fail to capture the relationships among them. To address these issues, we focus on the full-text-based RWG task and propose a novel multi-agent framework. Our framework consists of three agents: a selector that decides which section of the papers to read next, a reader that digests the selected section and updates a shared working memory, and a writer that generates the RWS based on the final curated memory. To better capture the relationships among references, we also propose two graph-aware strategies for the selector, enabling it to optimize the reading order under the constraints of the graph structure. Extensive experiments demonstrate that our framework consistently improves performance across three base models and various input configurations. The graph-aware selectors outperform alternative selectors, achieving state-of-the-art results. The code and data are available at https://github.com/1190200817/Full_Text_RWG.
pdf
bib
abs
Graph-Assisted Culturally Adaptable Idiomatic Translation for Indic languages
Pratik Rakesh Singh
|
Kritarth Prasad
|
Mohammadi Zaki
|
Pankaj Wasnik
Translating multi-word expressions (MWEs) and idioms requires a deep understanding of the cultural nuances of both the source and target languages. This challenge is further amplified by the one-to-many nature of idiomatic translations, where a single source idiom can have multiple target-language equivalents depending on cultural references and contextual variations. Traditional static knowledge graphs (KGs) and prompt-based approaches struggle to capture these complex relationships, often leading to suboptimal translations. To address this, we propose IdiomCE, an adaptive graph neural network (GNN) based methodology that learns intricate mappings between idiomatic expressions, effectively generalizing to both seen and unseen nodes during training. Our proposed method enhances translation quality even in resource-constrained settings, facilitating improved idiomatic translation in smaller models. We evaluate our approach on multiple idiomatic translation datasets using reference-less metrics, demonstrating significant improvements in translating idioms from English to various Indian languages.
pdf
bib
abs
Question Answering in Climate Adaptation for Agriculture: Model Development and Evaluation with Expert Feedback
Vincent Nguyen
|
Sarvnaz Karimi
|
Willow Hallgren
|
Mahesh Prakash
The generative capabilities of large language models (LLMs) are deployed in domain-specific question answering systems. However, their ability to answer climate adaptation questions remains unclear. In particular, can they be used by agronomists and climate scientists to answer questions on the best climate adaptation strategies? Answering questions in this domain requires knowledge of climate data and its uncertainties, and the ability to link them to the broader climate literature while accommodating the unique constraints of users and experts. We investigate the generative and evaluative capabilities of several state-of-the-art LLMs, open-source and proprietary, on climate adaptation for agriculture questions posed by domain experts using evaluation criteria designed by the experts. We propose an iterative exploration framework that enables LLMs to dynamically aggregate information from heterogeneous sources, such as text from climate literature and structured tabular climate data from climate model projections and historical observations. Our experiments demonstrate that LLMs can aggregate heterogeneous data to (1) answer questions, but at a trade-off between presentation quality and epistemological accuracy; and (2) evaluate answers, but are not as competent at identifying high-quality answers and erroneous information compared to domain experts.
pdf
bib
abs
AGRec: Adapting Autoregressive Decoders with Graph Reasoning for LLM-based Sequential Recommendation
Xinfeng Wang
|
Jin Cui
|
Fumiyo Fukumoto
|
Yoshimi Suzuki
Autoregressive decoders in large language models (LLMs) excel at capturing users’ sequential behaviors for generative recommendations. However, they inherently struggle to leverage graph-structured user-item interactions, which are widely recognized as beneficial. This paper presents AGRec, adapting LLMs’ decoders with graph reasoning for recommendation. We reveal that LLMs and graph neural networks (GNNs) manifest complementary strengths in distinct user domains. Building on this, we augment the decoding logits of LLMs with an auxiliary GNN model to optimize token generation. Moreover, we introduce a rankable finite state machine to tackle two challenges: (1) adjusting autoregressive generation with discriminative decoders that directly predict user-item similarity, and (2) token homogeneity, where LLMs often generate items with similar prefix tokens, narrowing the scope of beam search. This approach offers a novel perspective to enhance LLMs with graph knowledge. Our AGRec outperforms state-of-the-art models in sequential recommendations. Our code is available online.
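Augmenting decoding logits with an auxiliary model can be pictured as a per-token score fusion before the softmax. The sketch below assumes a simple additive form with a mixing weight `alpha`; both the additive form and the names `fuse_logits` and `alpha` are illustrative assumptions, not the fusion rule used in AGRec.

```python
import math
from typing import List


def softmax(values: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_logits(llm_logits: List[float], gnn_scores: List[float],
                alpha: float = 0.5) -> List[float]:
    """Assumed fusion: add alpha-scaled graph scores to the LLM logits, token by token."""
    if len(llm_logits) != len(gnn_scores):
        raise ValueError("logits and graph scores must cover the same candidate tokens")
    return [l + alpha * g for l, g in zip(llm_logits, gnn_scores)]


# The LLM alone prefers token 0; a strong graph signal for token 1 changes the choice.
fused = fuse_logits([2.0, 1.0, 0.0], [0.0, 4.0, 0.0], alpha=0.5)
probs = softmax(fused)
best_token = max(range(len(probs)), key=probs.__getitem__)
```

With `alpha = 0` the sketch reduces to ordinary LLM decoding, so the weight controls how much graph-derived evidence is allowed to shift token generation.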
pdf
bib
abs
Causal Denoising Prototypical Network for Few-Shot Multi-label Aspect Category Detection
Jin Cui
|
Xinfeng Wang
|
Yoshimi Suzuki
|
Fumiyo Fukumoto
The multi-label aspect category detection (MACD) task has attracted great attention in sentiment analysis. Many recent methods have formulated the MACD task by learning robust prototypes to represent categories with limited support samples. However, few of them address the noise categories in the support set that hinder effective prototype generation. To this end, we propose a causal denoising prototypical network (CDPN) for few-shot MACD. We reveal the underlying relation between causal inference and contrastive learning, and present causal contrastive learning (CCL) using discrete and continuous noise as negative samples. We empirically find that CCL can (1) prevent models from over-predicting categories and (2) mitigate semantic ambiguity issues among categories. Experimental results show that CDPN outperforms competitive baselines. Our code is available online.
pdf
bib
abs
RealHiTBench: A Comprehensive Realistic Hierarchical Table Benchmark for Evaluating LLM-Based Table Analysis
Pengzuo Wu
|
Yuhang Yang
|
Guangcheng Zhu
|
Chao Ye
|
Hong Gu
|
Xu Lu
|
Ruixuan Xiao
|
Bowen Bao
|
Yijing He
|
Liangyu Zha
|
Wentao Ye
|
Junbo Zhao
|
Haobo Wang
With the rapid advancement of Large Language Models (LLMs), there is an increasing need for challenging benchmarks to evaluate their capabilities in handling complex tabular data. However, existing benchmarks are either based on outdated data setups or focus solely on simple, flat table structures. In this paper, we introduce RealHiTBench, a comprehensive benchmark designed to evaluate the performance of both LLMs and Multimodal LLMs (MLLMs) across a variety of input formats for complex tabular data, including LaTeX, HTML, and PNG. RealHiTBench also includes a diverse collection of tables with intricate structures, spanning a wide range of task types. Our experimental results, using 25 state-of-the-art LLMs, demonstrate that RealHiTBench is indeed a challenging benchmark. Moreover, we also develop TreeThinker, a tree-based agent that organizes hierarchical headers into a tree structure for enhanced tabular reasoning, validating the importance of improving LLMs’ perception of table hierarchies. We hope that our work will inspire further research on tabular data reasoning and the development of more robust models. The code and data are available at https://github.com/cspzyy/RealHiTBench.
pdf
bib
abs
A Query-Response Framework for Whole-Page Complex-Layout Document Image Translation with Relevant Regional Concentration
Zhiyang Zhang
|
Yaping Zhang
|
Yupu Liang
|
Zhiyuan Chen
|
Lu Xiang
|
Yang Zhao
|
Yu Zhou
|
Chengqing Zong
Document Image Translation (DIT), which aims at translating documents in images from a source language to a target language, plays an important role in Document Intelligence. It requires a comprehensive understanding of document multi-modalities and a focused concentration on relevant textual regions during translation. However, most existing methods rely on the vanilla encoder-decoder paradigm, severely losing concentration on key regions that are especially crucial for complex-layout document translation. To tackle this issue, in this paper, we propose a new Query-Response DIT framework (QRDIT). QRDIT reformulates the DIT task into a parallel response/translation process of the multiple queries (i.e., relevant source texts), explicitly centralizing its focus toward the most relevant textual regions to ensure translation accuracy. A novel dynamic aggregation mechanism is also designed to enhance the text semantics in query features toward translation. Extensive experiments in four translation directions on three benchmarks demonstrate its state-of-the-art performance, showing significant translation quality improvements toward whole-page complex-layout document images.
pdf
bib
abs
DependEval: Benchmarking LLMs for Repository Dependency Understanding
Junjia Du
|
Yadi Liu
|
Hongcheng Guo
|
Jiawei Wang
|
Haojian Huang
|
Yunyi Ni
|
Zhoujun Li
While large language models (LLMs) have shown considerable promise in code generation, real-world software development demands advanced repository-level reasoning. This includes understanding dependencies, project structures, and managing multi-file changes. However, the ability of LLMs to effectively comprehend and handle complex code repositories has yet to be fully explored. To address these challenges, we introduce a hierarchical benchmark designed to evaluate repository dependency understanding (DependEval) for LLMs. The benchmark is based on 2683 repositories collected from real-world websites. It evaluates models on three core tasks: Dependency Recognition, Repository Construction, and Multi-file Editing, across 8 programming languages from actual code repositories. Our evaluation of over 25 LLMs reveals substantial performance gaps and provides valuable insights into repository-level code understanding.
pdf
bib
abs
A General Knowledge Injection Framework for ICD Coding
Xu Zhang
|
Kun Zhang
|
Wenxin Ma
|
Rongsheng Wang
|
Chenxu Wu
|
Yingtai Li
|
S Kevin Zhou
ICD Coding aims to assign a wide range of medical codes to a medical text document, which is a popular and challenging task in the healthcare domain. To alleviate the problems of long-tail distribution and the lack of annotations of code-specific evidence, many previous works have proposed incorporating code knowledge to improve coding performance. However, existing methods often focus on a single type of knowledge and design specialized modules that are complex and incompatible with each other, thereby limiting their scalability and effectiveness. To address this issue, we propose GKI-ICD, a novel, general knowledge injection framework that integrates three key types of knowledge, namely ICD Description, ICD Synonym, and ICD Hierarchy, without specialized design of additional modules. The comprehensive utilization of the above knowledge, which exhibits both differences and complementarity, can effectively enhance ICD coding performance. Extensive experiments on existing popular ICD coding benchmarks demonstrate the effectiveness of GKI-ICD, which achieves state-of-the-art performance on most evaluation metrics. Code is available at https://github.com/xuzhang0112/GKI-ICD.
pdf
bib
abs
MMUnlearner: Reformulating Multimodal Machine Unlearning in the Era of Multimodal Large Language Models
Jiahao Huo
|
Yibo Yan
|
Xu Zheng
|
Yuanhuiyi Lyu
|
Xin Zou
|
Zhihua Wei
|
Xuming Hu
Recent progress in Machine Unlearning (MU) has introduced solutions for the selective removal of private or sensitive information encoded within deep neural networks. Nonetheless, MU for Multimodal Large Language Models (MLLMs) remains in its nascent phase. Therefore, we propose to **reformulate the task of multimodal MU in the era of MLLMs**, which aims to erase only the visual patterns associated with a given entity while preserving the corresponding textual knowledge encoded within the original parameters of the language model backbone. Furthermore, we **develop a novel geometry-constrained gradient ascent method MMUnlearner**. It updates the weights of MLLMs with a weight saliency map jointly restricted by the remaining concepts and textual knowledge during unlearning, thereby preserving parameters essential for non-target knowledge. Extensive experiments demonstrate that MMUnlearner surpasses baselines that fine-tune MLLMs with VQA data directly through Gradient Ascent (GA) or Negative Preference Optimization (NPO), across all evaluation dimensions. Our code will be released upon acceptance.
pdf
bib
abs
Generating Questions, Answers, and Distractors for Videos: Exploring Semantic Uncertainty of Object Motions
Wenjian Ding
|
Yao Zhang
|
Jun Wang
|
Adam Jatowt
|
Zhenglu Yang
Video Question-Answer-Distractors (QADs) show promising value for assessing the performance of systems in perceiving and comprehending multimedia content. Given the significant cost and labor demands of manual annotation, existing large-scale Video QADs benchmarks are typically generated automatically using video captions. Since video captions are incomplete representations of visual content and susceptible to error propagation, direct generation of QADs from video is crucial. This work first leverages a large vision-language model for video QADs generation. To enhance the consistency and diversity of the generated QADs, we propose utilizing temporal motion to describe the video objects. In addition, we design a selection mechanism that chooses diverse temporal object motions to generate diverse QADs focusing on different objects and interactions, maximizing overall semantic uncertainty for a given video. Evaluation on the NExT-QA and Perception Test benchmarks demonstrates that the proposed approach significantly improves both the consistency and diversity of QADs generated by a range of large vision-language models, thus highlighting its effectiveness and generalizability.
pdf
bib
abs
DiffSkip: Differential Layer Skipping in Large Language Models
Xuan Luo
|
Weizhi Wang
|
Xifeng Yan
Existing Large Language Models (LLMs) enforce uniform computation across all tokens. We analyze the correlation between the input-output difference of the self-attention block and that of the Feed-Forward Network (FFN) within the same transformer layer, and find that these two differential vectors are highly correlated. Thus, we propose to dynamically skip the FFN blocks based on the self-attention difference and introduce Differential Layer Skipping (DiffSkip) to show that LLMs are inherently dynamic-depth models, capable of adjusting computational depth when generating different tokens. DiffSkip employs a lightweight router module to dynamically skip a set of FFN blocks in LLMs and only requires efficient fine-tuning while keeping the whole LLM frozen. Experimental results demonstrate that DiffSkip effectively enables dynamic FFN skipping in decoder-only language models, even in continuous token generation tasks where many layer-skipping methods struggle.
pdf
bib
abs
Towards Explainable Temporal Reasoning in Large Language Models: A Structure-Aware Generative Framework
Zihao Jiang
|
Ben Liu
|
Miao Peng
|
Wenjie Xu
|
Yao Xiao
|
Zhenyan Shan
|
Min Peng
While large language models (LLMs) show great potential in temporal reasoning, most existing work focuses heavily on enhancing performance, often neglecting the explainable reasoning processes underlying the results. To address this gap, we introduce a comprehensive benchmark covering a wide range of temporal granularities, designed to systematically evaluate LLMs’ capabilities in explainable temporal reasoning. Furthermore, our findings reveal that LLMs struggle to deliver convincing explanations when relying solely on textual information. To address this challenge, we propose GETER, a novel structure-aware generative framework that integrates Graph structures with text for Explainable TEmporal Reasoning. Specifically, we first leverage temporal knowledge graphs to develop a temporal encoder that captures structural information for the query. Subsequently, we introduce a structure-text prefix adapter to map graph structure features into the text embedding space. Finally, LLMs generate explanation text by seamlessly integrating the soft graph token with instruction-tuning prompt tokens. Experimental results indicate that GETER achieves state-of-the-art performance while also demonstrating robust generalization capabilities. Our dataset and code are available at https://github.com/carryTatum/GETER.
pdf
bib
abs
A Bounding Box is Worth One Token - Interleaving Layout and Text in a Large Language Model for Document Understanding
Jinghui Lu
|
Haiyang Yu
|
Yanjie Wang
|
Yongjie Ye
|
Jingqun Tang
|
Ziwei Yang
|
Binghong Wu
|
Qi Liu
|
Hao Feng
|
Han Wang
|
Hao Liu
|
Can Huang
Recently, many studies have demonstrated that exclusively incorporating OCR-derived text and spatial layouts with large language models (LLMs) can be highly effective for document understanding tasks. However, existing methods that integrate spatial layouts with text have limitations, such as producing overly long text sequences or failing to fully leverage the autoregressive traits of LLMs. In this work, we introduce Interleaving Layout and Text in a Large Language Model (LayTextLLM) for document understanding. LayTextLLM projects each bounding box to a single embedding and interleaves it with text, efficiently avoiding long sequence issues while leveraging autoregressive traits of LLMs. LayTextLLM not only streamlines the interaction of layout and textual data but also shows enhanced performance in KIE and VQA. Comprehensive benchmark evaluations reveal significant improvements of LayTextLLM, with a 15.2% increase on KIE tasks and 10.7% on VQA tasks compared to previous SOTA OCR-based LLMs. All resources are available at URL masked for anonymous review.
pdf
bib
abs
Self-Foveate: Enhancing Diversity and Difficulty of Synthesized Instructions from Unsupervised Text via Multi-Level Foveation
Mingzhe Li
|
Xin Lu
|
Yanyan Zhao
Large language models (LLMs) with instruction following capabilities have demonstrated impressive problem-solving abilities. While synthesizing instructional data from unsupervised text has become a common approach for training such models, conventional methods rely heavily on human effort for data annotation. Although existing automated synthesis paradigms have alleviated this constraint, they still exhibit significant limitations in ensuring adequate diversity and difficulty of synthesized instructions. To address these challenges, we propose Self-Foveate, an innovative LLM-driven method for instruction synthesis. This approach introduces a “Micro-Scatter-Macro” multi-level foveation methodology that effectively guides the LLM to deeply excavate fine-grained information embedded in unsupervised text, thereby enhancing both the diversity and difficulty of synthesized instructions. Comprehensive experiments across multiple unsupervised corpora and diverse model architectures validate the effectiveness and superiority of our proposed method. We publicly release our data and codes: https://github.com/Mubuky/Self-Foveate
pdf
bib
abs
TableDreamer: Progressive and Weakness-guided Data Synthesis from Scratch for Table Instruction Tuning
Mingyu Zheng
|
Zhifan Feng
|
Jia Wang
|
Lanrui Wang
|
Zheng Lin
|
Hao Yang
|
Weiping Wang
Despite the commendable progress of recent LLM-based data synthesis methods, they face two limitations in generating table instruction tuning data. First, they cannot thoroughly explore the vast input space of table understanding tasks, leading to limited data diversity. Second, they ignore the weaknesses in table understanding ability of the target LLM and blindly pursue the increase of data quantity, resulting in suboptimal data efficiency. In this paper, we introduce a progressive and weakness-guided data synthesis framework tailored for table instruction tuning, named TableDreamer, to mitigate the above issues. Specifically, we first synthesize diverse tables and related instructions as seed data, and then perform an iterative exploration of the input space under the guidance of the newly identified weakness data, which eventually serve as the final training data for fine-tuning the target LLM. Extensive experiments on 10 tabular benchmarks demonstrate the effectiveness of the proposed framework, which boosts the average accuracy of Llama3.1-8B-instruct by 11.62% (49.07→60.69) with 27K GPT-4o synthetic data and outperforms state-of-the-art data synthesis baselines which use more training data.
pdf
bib
abs
Konooz: Multi-domain Multi-dialect Corpus for Named Entity Recognition
Nagham Hamad
|
Mohammed Khalilia
|
Mustafa Jarrar
We introduce Konooz, a novel multi-dimensional corpus covering 16 Arabic dialects across 10 domains, resulting in 160 distinct corpora. The corpus comprises about 777k tokens, carefully collected and manually annotated with 21 entity types using both nested and flat annotation schemes, following the Wojood guidelines. While Konooz is useful for various NLP tasks like domain adaptation and transfer learning, this paper primarily focuses on benchmarking existing Arabic Named Entity Recognition (NER) models, especially cross-domain and cross-dialect model performance. Our benchmarking of four Arabic NER models using Konooz reveals a significant drop in performance of up to 38% when compared to the in-distribution data. Furthermore, we present an in-depth analysis of domain and dialect divergence and the impact of resource scarcity. We also measured the overlap between domains and dialects using the Maximum Mean Discrepancy (MMD) metric, and illustrated why certain NER models perform better on specific dialects and domains. Konooz is open-source and publicly available at https://sina.birzeit.edu/wojood/#download
pdf
bib
abs
Self-Rewarding Large Vision-Language Models for Optimizing Prompts in Text-to-Image Generation
Hongji Yang
|
Yucheng Zhou
|
Wencheng Han
|
Jianbing Shen
Text-to-image models are powerful for producing high-quality images based on given text prompts, but crafting these prompts often requires specialized vocabulary. To address this, existing methods train rewriting models with supervision from large amounts of manually annotated data and trained aesthetic assessment models. To alleviate the dependence on data scale for model training and the biases introduced by trained models, we propose a novel prompt optimization framework, designed to rephrase a simple user prompt into a sophisticated prompt to a text-to-image model. Specifically, we employ the large vision language models (LVLMs) as the solver to rewrite the user prompt, and concurrently, employ LVLMs as a reward model to score the aesthetics and alignment of the images generated by the optimized prompt. Instead of laborious human feedback, we exploit the prior knowledge of the LVLM to provide rewards, i.e., AI feedback. Simultaneously, the solver and the reward model are unified into one model and iterated in reinforcement learning to achieve self-improvement by giving a solution and judging itself. Results on two popular datasets demonstrate that our method outperforms other strong competitors.
pdf
bib
abs
CodeV: Issue Resolving with Visual Data
Linhao Zhang
|
Daoguang Zan
|
Quanshun Yang
|
Zhirong Huang
|
Dong Chen
|
Bo Shen
|
Tianyu Liu
|
Yongshun Gong
|
Huang Pengjie
|
Xudong Lu
|
Guangtai Liang
|
Lizhen Cui
|
Qianxiang Wang
Large Language Models (LLMs) have advanced rapidly in recent years, with their applications in software engineering expanding to more complex repository-level tasks. GitHub issue resolving is a key challenge among these tasks. While recent approaches have made progress on this task, they focus on textual data within issues, neglecting visual data. However, this visual data is crucial for resolving issues as it conveys additional knowledge that text alone cannot. We propose CodeV, the first approach to leveraging visual data to enhance the issue-resolving capabilities of LLMs. CodeV resolves each issue by following a two-phase process: data processing and patch generation. To evaluate CodeV, we construct a benchmark for visual issue resolving, namely Visual SWE-bench. Through extensive experiments, we demonstrate the effectiveness of CodeV, as well as provide valuable insights into leveraging visual data to resolve GitHub issues.
pdf
bib
abs
A Survey of Large Language Models in Psychotherapy: Current Landscape and Future Directions
Hongbin Na
|
Yining Hua
|
Zimu Wang
|
Tao Shen
|
Beibei Yu
|
Lilin Wang
|
Wei Wang
|
John Torous
|
Ling Chen
Mental health is increasingly critical in contemporary healthcare, with psychotherapy demanding dynamic, context-sensitive interactions that traditional NLP methods struggle to capture. Large Language Models (LLMs) offer significant potential for addressing this gap due to their ability to handle extensive context and multi-turn reasoning. This review introduces a conceptual taxonomy dividing psychotherapy into interconnected stages (assessment, diagnosis, and treatment) to systematically examine LLM advancements and challenges. Our comprehensive analysis reveals imbalances in current research, such as a focus on common disorders, linguistic biases, fragmented methods, and limited theoretical integration. We identify critical challenges including capturing dynamic symptom fluctuations, overcoming linguistic and cultural biases, and ensuring diagnostic reliability. Highlighting future directions, we advocate for continuous multi-stage modeling, real-time adaptive systems grounded in psychological theory, and diversified research covering broader mental disorders and therapeutic approaches, aiming toward more holistic and clinically integrated psychotherapy LLM systems.
pdf
bib
abs
Breaking the Reasoning Barrier A Survey on LLM Complex Reasoning through the Lens of Self-Evolution
Tao He
|
Hao Li
|
Jingchang Chen
|
Runxuan Liu
|
Yixin Cao
|
Lizi Liao
|
Zihao Zheng
|
Zheng Chu
|
Jiafeng Liang
|
Ming Liu
|
Bing Qin
The release of OpenAI’s O1 and subsequent projects like DeepSeek R1 has significantly advanced research on complex reasoning in LLMs. This paper systematically analyzes existing reasoning studies from the perspective of self-evolution, structured into three components: data evolution, model evolution, and self-evolution. Data evolution explores methods to generate higher-quality reasoning training data. Model evolution focuses on training strategies to boost reasoning capabilities. Self-evolution explores autonomous system evolution through iterative cycles of data and model evolution. We further discuss the scaling law of self-evolution and analyze representative O1-like works through this lens. By summarizing advanced methods and outlining future directions, this paper aims to drive advancements in LLMs’ reasoning abilities.
pdf
bib
abs
SEE: Continual Fine-tuning with Sequential Ensemble of Experts
Zhilin Wang
|
Yafu Li
|
Xiaoye Qu
|
Yu Cheng
Continual fine-tuning of large language models (LLMs) suffers from catastrophic forgetting. Rehearsal-based methods mitigate this problem by retaining a small set of old data. Nevertheless, they still suffer inevitable performance loss. Although training separate experts for each task can help prevent forgetting, effectively assembling them remains a challenge. Some approaches use routers to assign tasks to experts, but in continual learning, they often require retraining for optimal performance. To address these challenges, we introduce the Sequential Ensemble of Experts (SEE) framework. SEE removes the need for an additional router, allowing each expert to independently decide whether a query should be handled. The framework employs distributed routing, and during continual fine-tuning, SEE only requires the training of new experts for incoming tasks, rather than retraining the entire system. Experiments reveal that SEE outperforms prior approaches, including multi-task learning, in continual fine-tuning. It also demonstrates remarkable generalization ability, as the expert can effectively identify out-of-distribution queries, which can then be directed to a more generalized model for resolution. This work highlights the promising potential of integrating routing and response mechanisms within each expert, paving the way for the future of distributed model ensembling.
pdf
bib
abs
Boosting Policy and Process Reward Models with Monte Carlo Tree Search in Open-Domain QA
Chi-Min Chan
|
Chunpu Xu
|
Junqi Zhu
|
Jiaming Ji
|
Donghai Hong
|
Pengcheng Wen
|
Chunyang Jiang
|
Zhen Ye
|
Yaodong Yang
|
Wei Xue
|
Sirui Han
|
Yike Guo
The recent introduction of OpenAI’s O1/O3 model represents a significant milestone in developing strong reasoning capabilities in Large Language Models (LLMs). By introducing more computational budget at test time, LLMs have the potential to explore more accurate and higher-quality solutions. However, such paradigms are primarily verified in domains that have well-defined criteria for responses, such as coding and mathematics. Inspired by the success of this paradigm, we aim to bridge it to more subtle open-domain question answering. Specifically, we utilize search mechanisms such as Monte Carlo Tree Search (MCTS) for both policy model improvement and reward model improvement that achieve better performance in test-time scaling strategies. Our contributions are twofold. For the training phase, we demonstrate that our approach surpasses previous SOTA automatic data annotation methods and various public instruction-tuning datasets, with fewer data points. This offers a more data-efficient solution for training robust models. For the inference phase, we utilize the intermediate values collected during training data construction to train a process reward model called PRM+. This model employs a novel two-stage training method to provide finer-grained guidance across the generation trajectory. This introduces no additional overhead during training data collection and further enhances performance by scaling test-time computation. Experimental results show that our method can effectively improve the performance of both the policy model and the reward model.
pdf
bib
abs
Investigating and Enhancing Vision-Audio Capability in Omnimodal Large Language Models
Rui Hu
|
Delai Qiu
|
Shuyu Wei
|
Jiaming Zhang
|
Yining Wang
|
Shengping Liu
|
Jitao Sang
Omnimodal Large Language Models (OLLMs) have shown significant progress in integrating vision and text, but still struggle with integrating vision and audio, often exhibiting suboptimal performance when processing audio queries compared to text queries. This disparity is primarily due to insufficient alignment between vision and audio modalities during training, leading to inadequate attention to visual information when using audio queries. To mitigate this issue, we propose a Self-Knowledge Distillation (Self-KD) training method where the vision-text component of the OLLM serves as the teacher and the vision-audio component as the student. This enables the model to process audio in a manner analogous to its text processing. Our experimental results demonstrate that Self-KD is an effective method for enhancing the vision-audio capabilities of OLLMs by learning from the vision-text components, which subsequently improves the interaction between audio and images and results in improved performance on multimodal tasks.
pdf
bib
abs
OpenHuEval: Evaluating Large Language Model on Hungarian Specifics
Haote Yang
|
Xingjian Wei
|
Jiang Wu
|
Noémi Ligeti-Nagy
|
Jiaxing Sun
|
Yinfan Wang
|
Győző Zijian Yang
|
Junyuan Gao
|
Jingchao Wang
|
Bowen Jiang
|
Shasha Wang
|
Nanjun Yu
|
Zihao Zhang
|
Shixin Hong
|
Hongwei Liu
|
Wei Li
|
Songyang Zhang
|
Dahua Lin
|
Lijun Wu
|
Gábor Prószéky
|
Conghui He
We introduce OpenHuEval, the first benchmark for LLMs focusing on the Hungarian language and specifics. OpenHuEval is constructed from a vast collection of Hungarian-specific materials sourced from multiple origins. In the construction, we incorporated the latest design principles for evaluating LLMs, such as using real user queries from the internet, emphasizing the assessment of LLMs’ generative capabilities, and employing LLM-as-judge to enhance the multidimensionality and accuracy of evaluations. Ultimately, OpenHuEval encompasses eight Hungarian-specific dimensions, featuring five tasks and 3953 questions. Consequently, OpenHuEval provides a comprehensive, in-depth, and scientifically accurate assessment of LLM performance in the context of the Hungarian language and its specifics. We evaluated current mainstream LLMs, including both traditional LLMs and recently developed Large Reasoning Models (LRMs). The results demonstrate the significant necessity for evaluation and model optimization tailored to the Hungarian language and specifics. We also established a framework for analyzing the thinking processes of LRMs with OpenHuEval, revealing intrinsic patterns and mechanisms of these models in non-English languages, with Hungarian serving as a representative example. We will release OpenHuEval at https://github.com/opendatalab/OpenHuEval.
pdf
bib
abs
StructFact: Reasoning Factual Knowledge from Structured Data with Large Language Models
Sirui Huang
|
Yanggan Gu
|
Zhonghao Li
|
Xuming Hu
|
Li Qing
|
Guandong Xu
Large language models (LLMs) have made significant strides in natural language processing by leveraging their ability to comprehend and reason with factual knowledge. However, a significant amount of factual knowledge is stored in structured data, which has unique characteristics not typically encountered in the unstructured texts used for pretraining LLMs. To evaluate the capability of LLMs in handling facts structurally stored, we introduce a benchmark called StructFact, which includes meticulously annotated factual questions, spanning five tasks that reflect the intrinsic properties of structured data. This benchmark aims to delineate the strengths and limitations of LLMs in reasoning with structured data for knowledge-intensive tasks in practical applications. Extensive experiments conducted on 10 common LLMs have yielded several insights, one notable finding being that these models struggle significantly with the heterogeneity of structured data during reasoning.
pdf
bib
abs
From Imitation to Introspection: Probing Self-Consciousness in Language Models
Sirui Chen
|
Shu Yu
|
Shengjie Zhao
|
Chaochao Lu
Self-consciousness, the introspection of one’s existence and thoughts, represents a high-level cognitive process. As language models advance at an unprecedented pace, a critical question arises: Are these models becoming self-conscious? Drawing upon insights from psychological and neural science, this work presents a practical definition of self-consciousness for language models and refines ten core concepts. Our work pioneers an investigation into self-consciousness in language models by, for the first time, leveraging structural causal games to establish the functional definitions of the ten core concepts. Based on our definitions, we conduct a comprehensive four-stage experiment: quantification (evaluation of ten leading models), representation (visualization of self-consciousness within the models), manipulation (modification of the models’ representation), and acquisition (fine-tuning the models on core concepts). Our findings indicate that although models are in the early stages of developing self-consciousness, there is a discernible representation of certain concepts within their internal mechanisms. However, these representations of self-consciousness are hard to manipulate positively at the current stage, yet they can be acquired through targeted fine-tuning.
pdf
bib
abs
DocFusion: A Unified Framework for Document Parsing Tasks
Mingxu Chai
|
Ziyu Shen
|
Chong Zhang
|
Yue Zhang
|
Xiao Wang
|
Shihan Dou
|
Jihua Kang
|
Jiazheng Zhang
|
Qi Zhang
Document parsing involves layout element detection and recognition, essential for extracting information. However, existing methods often employ multiple models for these tasks, leading to increased system complexity and maintenance overhead. While some models attempt to unify detection and recognition, they often fail to address the intrinsic differences in data representations, thereby limiting performance in document processing. Our research reveals that recognition relies on discrete tokens, whereas detection relies on continuous coordinates, leading to challenges in gradient updates and optimization. To bridge this gap, we propose the Gaussian-Kernel Cross-Entropy Loss (GK-CEL), enabling generative frameworks to handle both tasks simultaneously. Building upon GK-CEL, we propose DocFusion, a unified document parsing model with only 0.28B parameters. Additionally, we construct the DocLatex-1.6M dataset to provide high-quality training support. Experimental results show that DocFusion, equipped with GK-CEL, performs competitively across four core document parsing tasks, validating the effectiveness of our unified approach.
pdf
bib
abs
Hierarchical Safety Realignment: Lightweight Restoration of Safety in Pruned Large Vision-Language Models
Yue Li
|
Xin Yi
|
Dongsheng Shi
|
Gerard De Melo
|
Xiaoling Wang
|
Linlin Wang
With the increasing size of Large Vision-Language Models (LVLMs), network pruning techniques aimed at compressing models for deployment in resource-constrained environments have garnered significant attention. However, we observe that pruning often leads to a degradation in safety performance. To address this issue, we present a novel and lightweight approach, termed Hierarchical Safety Realignment (HSR). HSR operates by first quantifying the contribution of each attention head to safety, identifying the most critical ones, and then selectively restoring neurons directly within these attention heads that play a pivotal role in maintaining safety. This process hierarchically realigns the safety of pruned LVLMs, progressing from the attention head level to the neuron level. We validate HSR across various models and pruning strategies, consistently achieving notable improvements in safety performance. To our knowledge, this is the first work explicitly focused on restoring safety in LVLMs post-pruning.
pdf
bib
abs
LongDPO: Unlock Better Long-form Generation Abilities for LLMs via Critique-augmented Stepwise Information
Bowen Ping
|
Jiali Zeng
|
Fandong Meng
|
Shuo Wang
|
Jie Zhou
|
Shanghang Zhang
Recent advancements in large language models (LLMs) have markedly improved their capacity to handle long text inputs; however, current models, including GPT-4o, still exhibit unsatisfactory performance in long-form generation. Generating high-quality long-form content still remains a significant challenge. In this paper, we present LongDPO, a novel approach designed to enhance long-form text generation through step-level supervision. By leveraging Monte Carlo Tree Search (MCTS) to collect stepwise preference pairs and employing a global memory pool to maintain factual accuracy, LongDPO effectively mitigates issues such as inconsistencies that are prevalent in long-context LLMs. Furthermore, we integrate critique-augmented generation to refine the selected preference pairs. Following the collection of stepwise preference pairs, we apply stepwise preference learning for fine-grained optimization. Experimental results demonstrate that our method enhances performance on long-form generation benchmarks (e.g. LongBench-Write) while maintaining nearly lossless performance on several general benchmarks.
pdf
bib
abs
Reinforcing Compositional Retrieval: Retrieving Step-by-Step for Composing Informative Contexts
Quanyu Long
|
Jianda Chen
|
Zhengyuan Liu
|
Nancy F. Chen
|
Wenya Wang
|
Sinno Jialin Pan
Large Language Models (LLMs) have demonstrated remarkable capabilities across numerous tasks, yet they often rely on external context to handle complex tasks. While retrieval-augmented frameworks traditionally focus on selecting top-ranked documents in a single pass, many real-world scenarios demand compositional retrieval, where multiple sources must be combined in a coordinated manner. In this work, we propose a tri-encoder sequential retriever that models this process as a Markov Decision Process (MDP), decomposing the probability of retrieving a set of elements into a sequence of conditional probabilities and allowing each retrieval step to be conditioned on previously selected examples. We train the retriever in two stages: first, we efficiently construct supervised sequential data for initial policy training; we then refine the policy to align with the LLM’s preferences using a reward grounded in the structural correspondence of generated programs. Experimental results show that our method consistently and significantly outperforms baselines, underscoring the importance of explicitly modeling inter-example dependencies. These findings highlight the potential of compositional retrieval for tasks requiring multiple pieces of evidence or examples.
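The factorization described in the abstract above can be sketched as a chain rule. The notation is assumed here for illustration and is not taken from the paper: $x$ stands for the query, $z_1, \dots, z_k$ for the retrieved elements in selection order, and $\pi_\theta$ for the retriever policy.

```latex
P_\theta(z_1, \dots, z_k \mid x) \;=\; \prod_{t=1}^{k} \pi_\theta\!\left(z_t \mid x,\, z_1, \dots, z_{t-1}\right)
```

Under this reading, each factor is one retrieval step conditioned on the previously selected examples, which is what allows the process to be treated as a sequential decision problem.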
pdf
bib
abs
Towards A Better Initial Policy Model For Scalable Long-CoT Reinforcement Learning
Bofei Gao
|
Yejie Wang
|
Yibo Miao
|
Ruoyu Wu
|
Feifan Song
|
Longhui Yu
|
Tianyu Liu
|
Baobao Chang
Long-CoT reasoning combined with reinforcement learning for large language models demonstrates remarkable performance and scalability. However, we observe that the initial policy model could significantly influence the final performance as well as the token efficiency. Additionally, there is a lack of systematic guidelines for obtaining a better initial policy model. To bridge this gap, we initiate a comprehensive investigation by activating the initial model using a variety of datasets with different data volumes and reasoning patterns. Then, we conduct a thorough analysis and comparison of the RL process for different initial models from the perspectives of upper bounds, diversity, and token efficiency, providing a deeper understanding of and insight into long-CoT RL. Based on our empirical results, we propose a systematic guideline and a novel Re-RFT method for constructing a better RL start point. Our experimental results based on the 14B model surpass the DeepSeek-R1-Distill-Qwen-14B by an average of 4.6%, demonstrating our approach’s effectiveness and superiority.
pdf
bib
abs
Topic Modeling for Short Texts via Optimal Transport-Based Clustering
Tu Vu
|
Manh Do
|
Tung Nguyen
|
Linh Ngo Van
|
Sang Dinh
|
Thien Huu Nguyen
Discovering topics and learning document representations in topic space are two crucial aspects of topic modeling, particularly in the short-text setting, where inferring topic proportions for individual documents is highly challenging. Despite significant progress in neural topic modeling, effectively distinguishing document representations as well as topic embeddings remains an open problem. In this paper, we propose a novel method called **En**hancing Global **C**lustering with **O**ptimal **T**ransport in Topic Modeling (EnCOT). Our approach utilizes an abstract global clusters concept to capture global information and then employs the Optimal Transport framework to align document representations in the topic space with global clusters, while also aligning global clusters with topics. This dual alignment not only enhances the separation of documents in the topic space but also facilitates learning of latent topics. Through extensive experiments, we demonstrate that our method outperforms state-of-the-art techniques in short-text topic modeling across commonly used metrics.
pdf
bib
abs
Lemmatisation & Morphological Analysis of Unedited Greek: Do Simple Tasks Need Complex Solutions?
Colin Swaelens
|
Ilse De Vos
|
Els Lefever
Fine-tuning transformer-based models for part-of-speech tagging of unedited Greek text has outperformed traditional systems. However, when applied to lemmatisation or morphological analysis, fine-tuning has not yet achieved competitive results. This paper explores various approaches to combine morphological features to both reduce label complexity and enhance multi-task training. Specifically, we group three nominal features into a single label, and combine the three most distinctive features of verbs into another unified label. These combined labels are used to fine-tune DBBERT, a BERT model pre-trained on both ancient and modern Greek. Additionally, we experiment with joint training – both among these labels and in combination with POS tagging – within a multi-task framework to improve performance by transferring parameters. To evaluate our models, we use a manually annotated gold standard from the Database of Byzantine Book Epigrams. Our results show a nearly 9 pp. improvement, demonstrating that multi-task learning is a promising approach for linguistic annotation in less standardised corpora.
pdf
bib
abs
FRAME: Feedback-Refined Agent Methodology for Enhancing Medical Research Insights
Chengzhang Yu
|
Yiming Zhang
|
Zhixin Liu
|
Zenghui Ding
|
Yining Sun
|
Zhanpeng Jin
The automation of scientific research through large language models (LLMs) presents significant opportunities but faces critical challenges in knowledge synthesis and quality assurance. We introduce Feedback-Refined Agent Methodology (FRAME), a novel framework that enhances medical paper generation through iterative refinement and structured feedback. Our approach comprises three key innovations: (1) A structured dataset construction method that decomposes 4,287 medical papers into essential research components through iterative refinement; (2) A tripartite architecture integrating Generator, Evaluator, and Reflector agents that progressively improve content quality through metric-driven feedback; and (3) A comprehensive evaluation framework that combines statistical metrics with human-grounded benchmarks. Experimental results demonstrate FRAME’s effectiveness, achieving significant improvements over conventional approaches across multiple models (9.91% average gain with DeepSeek V3, comparable improvements with GPT-4o Mini) and evaluation dimensions. Human evaluation confirms that FRAME-generated papers achieve quality comparable to human-authored works, with particular strength in synthesizing future research directions. The results demonstrate that our work can efficiently assist medical research by building a robust foundation for automated medical research paper generation while maintaining rigorous academic standards.
pdf
bib
abs
Chain-of-Scrutiny: Detecting Backdoor Attacks for Large Language Models
Xi Li
|
Ruofan Mao
|
Yusen Zhang
|
Renze Lou
|
Chen Wu
|
Jiaqi Wang
Large Language Models (LLMs), especially those accessed via APIs, have demonstrated impressive capabilities across various domains. However, users without technical expertise often turn to (untrustworthy) third-party services, such as prompt engineering, to enhance their LLM experience, creating vulnerabilities to adversarial threats like backdoor attacks. Backdoor-compromised LLMs generate malicious outputs to users when inputs contain specific “triggers” set by attackers. Traditional defense strategies, originally designed for small-scale models, are impractical for API-accessible LLMs due to limited model access, high computational costs, and data requirements. To address these limitations, we propose Chain-of-Scrutiny (CoS) which leverages LLMs’ unique reasoning abilities to mitigate backdoor attacks. It guides the LLM to generate reasoning steps for a given input and scrutinizes for consistency with the final output – any inconsistencies indicating a potential attack. It is well-suited for the popular API-only LLM deployments, enabling detection at minimal cost and with little data. User-friendly and driven by natural language, it allows non-experts to perform the defense independently while maintaining transparency. We validate the effectiveness of CoS through extensive experiments on various tasks and LLMs, with results showing greater benefits for more powerful LLMs.
pdf
bib
abs
Relevance Scores Calibration for Ranked List Truncation via TMP Adapter
Pavel Posokhov
|
Sergei Masliukhin
|
Skrylnikov Stepan
|
Danil Tirskikh
|
Olesia Makhnytkina
The ranked list truncation task involves determining a truncation point to retrieve the relevant items from a ranked list. Despite current advancements, truncation methods struggle with limited capacity, unstable training, and inconsistency of the selected threshold. To address these problems, we introduce TMP Adapter, a novel approach that builds upon the improved adapter model and incorporates the Threshold Margin Penalty (TMP) as an additive loss function to calibrate ranking model relevance scores for ranked list truncation. We evaluate TMP Adapter’s performance on various retrieval datasets and observe that TMP Adapter is a promising advancement in calibration methods, offering both theoretical and practical benefits for ranked list truncation.
pdf
bib
abs
Neuron Activation Modulation for Text Style Transfer: Guiding Large Language Models
Chaona Kong
|
Jianyi Liu
|
Yifan Tang
|
Ru Zhang
Text style transfer (TST) aims to flexibly adjust the style of text while preserving its core content. Although large language models (LLMs) excel in TST tasks, they often face unidirectional issues due to imbalanced training data and their tendency to generate safer responses. These challenges present a significant obstacle in achieving effective style transfer. To address this issue, we propose a novel method for text style transfer based on neuron activation modulation (NAM-TST). This approach identifies neurons related to style through gradient-based activation difference analysis and calculates the activation differences between the source and target styles. During text generation, we use the activation difference to align the activation values of style-related neurons with those of the target style to guide the model in performing the transfer. This strategy enables the model to generate text that satisfies specific style requirements, effectively mitigating the unidirectional issue inherent in LLMs during style transfer. Experiments on benchmark datasets demonstrate that NAM-TST significantly enhances style transfer quality while preserving content consistency.
pdf
bib
abs
MTVQA: Benchmarking Multilingual Text-Centric Visual Question Answering
Jingqun Tang
|
Qi Liu
|
Yongjie Ye
|
Jinghui Lu
|
Shu Wei
|
An-Lan Wang
|
Chunhui Lin
|
Hao Feng
|
Zhen Zhao
|
Yanjie Wang
|
Yuliang Liu
|
Hao Liu
|
Xiang Bai
|
Can Huang
Text-Centric Visual Question Answering (TEC-VQA) in its proper format not only facilitates human-machine interaction in text-centric visual environments but also serves as a de facto gold proxy to evaluate AI models in the domain of text-centric scene understanding. Nonetheless, most existing TEC-VQA benchmarks focus on high-resource languages like English and Chinese. Despite pioneering works expanding multilingual QA pairs in non-text-centric VQA datasets through translation engines, the translation-based protocol encounters a substantial “visual-textual misalignment” problem when applied to TEC-VQA. Specifically, it prioritizes the text in question-answer pairs while disregarding the visual text present in images. Moreover, it fails to address complexities related to nuanced meaning, contextual distortion, language bias, and question-type diversity. In this work, we tackle multilingual TEC-VQA by introducing MTVQA, the first benchmark featuring high-quality human expert annotations across 9 diverse languages, consisting of 6,778 question-answer pairs across 2,116 images. Further, by comprehensively evaluating numerous state-of-the-art Multimodal Large Language Models (MLLMs), including Qwen2.5-VL, InternVL-2.5, GPT-4o, GPT-4V, Claude3, and Gemini, on the MTVQA benchmark, we show that there is still considerable room for performance improvement (InternVL-2.5 scoring 32.2 versus 79.7 for human performance), underscoring the value of MTVQA. By providing a dataset with nuanced multilingual annotations, MTVQA aims to set a new standard for benchmarks, fostering advancements in multilingual visual text comprehension.
pdf
bib
abs
HICD: Hallucination-Inducing via Attention Dispersion for Contrastive Decoding to Mitigate Hallucinations in Large Language Models
Xinyan Jiang
|
Hang Ye
|
Yongxin Zhu
|
Xiaoying Zheng
|
Zikang Chen
|
Jun Gong
Large Language Models (LLMs) often generate hallucinations, producing outputs that are contextually inaccurate or factually incorrect. We introduce HICD, a novel method designed to induce hallucinations for contrastive decoding to mitigate hallucinations. Unlike existing contrastive decoding methods, HICD selects attention heads crucial to the model’s prediction as inducing heads, then induces hallucinations by dispersing attention of these inducing heads and compares the hallucinated outputs with the original outputs to obtain the final result. Our approach significantly improves performance on tasks requiring contextual faithfulness, such as context completion, reading comprehension, and question answering. It also improves factuality in tasks requiring accurate knowledge recall. We demonstrate that our inducing heads selection and attention dispersion method leads to more “contrast-effective” hallucinations for contrastive decoding, outperforming other hallucination-inducing methods. Our findings provide a promising strategy for reducing hallucinations by inducing hallucinations in a controlled manner, enhancing the performance of LLMs in a wide range of tasks.
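As a rough illustration of the contrastive step described in the HICD abstract, the sketch below scores each candidate token by amplifying the original model's log-probability against the log-probability obtained after hallucination is induced. The weighting `(1 + alpha) * original - alpha * induced` is a commonly used contrastive-decoding form and is an assumption here, not necessarily the paper's exact formula; the function names, the `alpha` parameter, and the toy logits are all hypothetical.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def contrastive_scores(original_logits, induced_logits, alpha=1.0):
    """Assumed contrastive form: boost tokens the original pass prefers
    and penalize tokens the hallucination-induced pass prefers."""
    lp_orig = log_softmax(original_logits)
    lp_induced = log_softmax(induced_logits)
    return [(1 + alpha) * o - alpha * h for o, h in zip(lp_orig, lp_induced)]


def pick_token(original_logits, induced_logits, alpha=1.0):
    """Greedy choice over the contrastive scores."""
    scores = contrastive_scores(original_logits, induced_logits, alpha)
    return max(range(len(scores)), key=scores.__getitem__)


# Toy vocabulary of three tokens (made-up numbers).
original = [2.0, 1.9, 0.1]   # original pass slightly prefers token 0
induced = [2.5, 0.5, 0.1]    # induced pass strongly prefers token 0
print(pick_token(original, induced, alpha=0.0))  # plain greedy decoding
print(pick_token(original, induced, alpha=1.0))  # contrastive decoding
```

With `alpha=0.0` the scores reduce to the original log-probabilities, so the sketch falls back to ordinary greedy decoding; larger `alpha` values penalize tokens favored by the induced pass more strongly.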
pdf
bib
abs
Understanding the Repeat Curse in Large Language Models from a Feature Perspective
Junchi Yao
|
Shu Yang
|
Jianhua Xu
|
Lijie Hu
|
Mengdi Li
|
Di Wang
Large language models (LLMs) have made remarkable progress in various domains, yet they often suffer from repetitive text generation, a phenomenon we refer to as the “Repeat Curse”. While previous studies have proposed decoding strategies to mitigate repetition, the underlying mechanism behind this issue remains insufficiently explored. In this work, we investigate the root causes of repetition in LLMs through the lens of mechanistic interpretability. Inspired by recent advances in Sparse Autoencoders (SAEs), which enable monosemantic feature extraction, we propose a novel approach, “Duplicatus Charm”, to induce and analyze the Repeat Curse. Our method systematically identifies “Repetition Features”, the key model activations responsible for generating repetitive outputs. First, we locate the layers most involved in repetition through logit analysis. Next, we extract and stimulate relevant features using SAE-based activation manipulation. To validate our approach, we construct a repetition dataset covering token and paragraph level repetitions and introduce an evaluation pipeline to quantify the influence of identified repetition features. Furthermore, by deactivating these features, we effectively mitigate the Repeat Curse.
pdf
bib
abs
Code-Switching Curriculum Learning for Multilingual Transfer in LLMs
Haneul Yoo
|
Cheonbok Park
|
Sangdoo Yun
|
Alice Oh
|
Hwaran Lee
Large language models (LLMs) now exhibit near human-level performance in various tasks, but their performance drops drastically beyond a handful of high-resource languages due to the imbalance in pre-training data. Inspired by the human process of second language acquisition, particularly code-switching (the practice of language alternation in a conversation), we propose code-switching curriculum learning (CSCL) to enhance cross-lingual transfer for LLMs. CSCL mimics the stages of human language learning by progressively training models with a curriculum consisting of 1) token-level code-switching, 2) sentence-level code-switching, and 3) monolingual corpora. Using Qwen 2 as our underlying model, we demonstrate the efficacy of CSCL in improving language transfer to Korean, achieving significant performance gains compared to monolingual continual pre-training methods. Ablation studies reveal that both token- and sentence-level code-switching significantly enhance cross-lingual transfer and that curriculum learning amplifies these effects. We also extend our findings to other languages, including Japanese (high-resource) and Indonesian (low-resource), and to two additional models (Gemma 2 and Phi 3.5). We further show that CSCL mitigates spurious correlations between language resources and safety alignment, presenting a robust, efficient framework for more equitable language transfer in LLMs. We observe that CSCL is effective for low-resource settings where high-quality, monolingual corpora for language transfer are hardly available.
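The three curriculum stages named in the CSCL abstract can be sketched as follows. The abstract does not say how the code-switched data is built, so the lexicon-based token substitution, the alternating-sentence scheme, and every name below (`token_level_code_switch`, `sentence_level_code_switch`, `build_curriculum`, the toy English-to-Korean lexicon) are illustrative assumptions rather than the authors' procedure.

```python
import random


def token_level_code_switch(tokens, lexicon, ratio, rng):
    """Stage 1 (assumed form): swap roughly `ratio` of the tokens that have
    a lexicon entry for their target-language translation."""
    switched = []
    for tok in tokens:
        if tok in lexicon and rng.random() < ratio:
            switched.append(lexicon[tok])
        else:
            switched.append(tok)
    return switched


def sentence_level_code_switch(src_sentences, tgt_sentences):
    """Stage 2 (assumed form): alternate source and target sentences
    drawn from an aligned parallel document."""
    pairs = zip(src_sentences, tgt_sentences)
    return [s if i % 2 == 0 else t for i, (s, t) in enumerate(pairs)]


def build_curriculum(token_stage, sentence_stage, monolingual_stage):
    """Order the training data as stage 1, then stage 2, then stage 3."""
    return list(token_stage) + list(sentence_stage) + list(monolingual_stage)


# Hypothetical toy lexicon and sentence.
lexicon = {"cat": "고양이", "water": "물"}
tokens = ["the", "cat", "drinks", "water"]
print(token_level_code_switch(tokens, lexicon, ratio=0.5, rng=random.Random(0)))
```

Passing an explicit `random.Random` instance keeps the substitution reproducible, which makes it easier to compare curricula built with different switching ratios.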
pdf
bib
abs
A Mousetrap: Fooling Large Reasoning Models for Jailbreak with Chain of Iterative Chaos
Yang Yao
|
Xuan Tong
|
Ruofan Wang
|
Yixu Wang
|
Lujundong Li
|
Liang Liu
|
Yan Teng
|
Yingchun Wang
Large Reasoning Models (LRMs) have significantly advanced beyond traditional Large Language Models (LLMs) with their exceptional logical reasoning capabilities, yet these improvements introduce heightened safety risks. When subjected to jailbreak attacks, their ability to generate more targeted and organized content can lead to greater harm. Although some studies claim that reasoning enables safer LRMs against existing LLM attacks, they overlook the inherent flaws within the reasoning process itself. To address this gap, we propose the first jailbreak attack targeting LRMs, exploiting their unique vulnerabilities stemming from the advanced reasoning capabilities. Specifically, we introduce a Chaos Machine, a novel component to transform attack prompts with diverse one-to-one mappings. The chaos mappings iteratively generated by the machine are embedded into the reasoning chain, which strengthens the variability and complexity and also promotes a more robust attack. Based on this, we construct the Mousetrap framework, which makes attacks projected into nonlinear-like low sample spaces with mismatched generalization enhanced. Also, due to the more competing objectives, LRMs gradually maintain the inertia of unpredictable iterative reasoning and fall into our trap. Success rates of the Mousetrap attacking o1-mini, Claude-Sonnet and Gemini-Thinking are as high as 96%, 86% and 98% respectively on our toxic dataset Trotter. On benchmarks such as AdvBench, StrongREJECT, and HarmBench, attacking Claude-Sonnet, well-known for its safety, Mousetrap can astonishingly achieve success rates of 87.5%, 86.58% and 93.13% respectively. Attention: This paper contains inappropriate, offensive and harmful content.
pdf
bib
abs
Tag-Evol: Achieving Efficient Instruction Evolving via Tag Injection
Yixuan Wang
|
Shiqi Zhou
|
Chuanzhe Guo
|
Qingfu Zhu
Evol-Instruct has made significant improvements as a data synthesis method in several areas. Existing methods typically rely on a fixed set of strategies to evolve, which require manual design and are monolithic in form. In addition, iterative evolution also makes the acquisition of hard samples expensive. In view of this, we propose the Tag-Evol framework, a more diverse and efficient instruction evolving method. Specifically, Tag-Evol uses diverse and specific knowledge tags as strategies to achieve controlled evolution by injecting different combinations of tags into the original instructions. Experiments with multiple backbones in mathematical and code domain benchmarks show that the proposed method generates significantly better evolved data than other methods. Furthermore, we conduct a thorough analysis of the evolved data, demonstrating that Tag-Evol is not only efficient but also generates more diverse and challenging data.
pdf
bib
abs
Breaking the Ceiling: Exploring the Potential of Jailbreak Attacks through Expanding Strategy Space
Yao Huang
|
Yitong Sun
|
Shouwei Ruan
|
Yichi Zhang
|
Yinpeng Dong
|
Xingxing Wei
Large Language Models (LLMs), despite advanced general capabilities, still suffer from numerous safety risks, especially jailbreak attacks that bypass safety protocols. Understanding these vulnerabilities through black-box jailbreak attacks, which better reflect real-world scenarios, offers critical insights into model robustness. While existing methods have shown improvements through various prompt engineering techniques, their success remains limited against safety-aligned models, overlooking a more fundamental problem: the effectiveness is inherently bounded by the predefined strategy spaces. However, expanding this space presents significant challenges in both systematically capturing essential attack patterns and efficiently navigating the increased complexity. To better explore the potential of expanding the strategy space, we address these challenges through a novel framework that decomposes jailbreak strategies into essential components based on the Elaboration Likelihood Model (ELM) theory and develops genetic-based optimization with intention evaluation mechanisms. Strikingly, our experiments reveal unprecedented jailbreak capabilities by expanding the strategy space: we achieve over 90% success rate on Claude-3.5 where prior methods completely fail, while demonstrating strong cross-model transferability and surpassing specialized safeguard models in evaluation accuracy. The code is open-sourced at: https://github.com/Aries-iai/CL-GSO.
pdf
bib
abs
GeNRe: A French Gender-Neutral Rewriting System Using Collective Nouns
Enzo Doyen
|
Amalia Todirascu
A significant portion of the textual data used in the field of Natural Language Processing (NLP) exhibits gender biases, particularly due to the use of masculine generics (masculine words that are supposed to refer to mixed groups of men and women), which can perpetuate and amplify stereotypes. Gender rewriting, an NLP task that involves automatically detecting and replacing gendered forms with neutral or opposite forms (e.g., from masculine to feminine), can be employed to mitigate these biases. While such systems have been developed in a number of languages (English, Arabic, Portuguese, German, French), automatic use of gender neutralization techniques (as opposed to inclusive or gender-switching techniques) has only been studied for English. This paper presents GeNRe, the very first French gender-neutral rewriting system using collective nouns, which are gender-fixed in French. We introduce a rule-based system (RBS) tailored for the French language alongside two fine-tuned language models trained on data generated by our RBS. We also explore the use of instruct-based models to enhance the performance of our other systems and find that Claude 3 Opus combined with our dictionary achieves results close to our RBS. Through this contribution, we hope to promote the advancement of gender bias mitigation techniques in NLP for French.
pdf
bib
abs
LGAR: Zero-Shot LLM-Guided Neural Ranking for Abstract Screening in Systematic Literature Reviews
Christian Jaumann
|
Andreas Wiedholz
|
Annemarie Friedrich
The scientific literature is growing rapidly, making it hard to keep track of the state-of-the-art. Systematic literature reviews (SLRs) aim to identify and evaluate all relevant papers on a topic. After retrieving a set of candidate papers, the abstract screening phase determines initial relevance. To date, abstract screening methods using large language models (LLMs) focus on binary classification settings; existing question answering (QA) based ranking approaches suffer from error propagation. LLMs offer a unique opportunity to evaluate the SLR’s inclusion and exclusion criteria, yet existing benchmarks do not provide them exhaustively. We manually extract these criteria as well as research questions for 57 SLRs, mostly in the medical domain, enabling principled comparisons between approaches. Moreover, we propose LGAR, a zero-shot LLM Guided Abstract Ranker composed of an LLM based graded relevance scorer and a dense re-ranker. Our extensive experiments show that LGAR outperforms existing QA-based methods by 5-10 pp. in mean average precision. Our code and data are publicly available.
pdf
bib
abs
LCHAIM - Investigating Long Context Reasoning in Hebrew
Ehud Malul
|
Oriel Perets
|
Ziv Mor
|
Yigal Kassel
|
Elior Sulem
Natural Language Inference (NLI) has gained significant attention recently due to its importance in understanding how machines comprehend and reason about language. While English has received tremendous interest, Morphologically Rich Languages (MRLs) like Hebrew require more research. In this paper, we address the evaluation of Hebrew NLI models by introducing LCHAIM, a dataset designed to evaluate these models on tasks involving long premises and complex reasoning. The dataset, created by translating and validating the English ConTRoL dataset, consists of 8,325 context-hypothesis pairs that require coreferential, temporal, logical and analytical reasoning. Our experiments show the difficulty of contextual reasoning in Hebrew, as evidenced by the performance of different models. Fine-tuning the LongHero model on both the shorter premise Hebrew NLI and the LCHAIM datasets yielded a mean accuracy of 52%, which is 35% less than human performance. Similarly, Large Language Models (LLMs) like Gemma-9B, Dicta-LM-2.0-7B, and GPT-4o achieved a top mean accuracy of 60.12% in the few-shot setting.
pdf
bib
abs
CLeVeR: Multi-modal Contrastive Learning for Vulnerability Code Representation
Jiayuan Li
|
Lei Cui
|
Sen Zhao
|
Yun Yang
|
Lun Li
|
Hongsong Zhu
Automated vulnerability detection has become increasingly important. Many existing methods utilize deep learning models to obtain code representations for vulnerability detection. However, these approaches predominantly capture the overall semantics of the code rather than its intrinsic vulnerability-specific semantics. To address this issue, we propose CLeVeR, the first approach that leverages contrastive learning to generate precise vulnerability code representations under the supervision of vulnerability descriptions. Specifically, we introduce an Adapter, a Representation Refinement module, and a Description Simulator to mitigate the challenges of semantic misalignment and imbalance between code and descriptions, and input data inconsistency between pre-training and fine-tuning stages, respectively. For vulnerability detection and classification tasks, CLeVeR achieves F1 scores of 72.82% (real-world dataset) and 80.34%, outperforming state-of-the-art methods (SOTAs) by 11.85% and 13.61%. Additionally, CLeVeR also outperforms SOTAs in zero-shot inference, demonstrating the transferability of its generated vulnerability code representations.
pdf
bib
abs
MEMIT-Merge: Addressing MEMIT’s Key-Value Conflicts in Same-Subject Batch Editing for LLMs
Zilu Dong
|
Xiangqing Shen
|
Rui Xia
As large language models (LLMs) continue to scale up, knowledge editing techniques that modify models’ internal knowledge without full retraining have gained significant attention. MEMIT, a prominent batch editing algorithm, stands out for its capability to perform mass knowledge modifications. However, we uncover a critical limitation: MEMIT’s editing efficacy significantly deteriorates when processing batches containing multiple edits sharing the same subject. Our analysis reveals that the root cause lies in MEMIT’s key-value modeling framework: when multiple facts with the same subject in a batch are modeled through MEMIT’s key-value mechanism, identical keys (derived from the shared subject) are forced to represent different values (corresponding to distinct knowledge), resulting in update conflicts during editing. Addressing this issue, we propose MEMIT-Merge, an enhanced approach that merges value computation processes for facts sharing the same subject, effectively resolving the performance degradation in same-subject batch editing scenarios. Experimental results demonstrate that at a batch size of 5, while the original MEMIT’s success rate drops to 46%, MEMIT-Merge maintains a 98% editing success rate, showcasing remarkable robustness to subject entity collisions.
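The key-value conflict described in the MEMIT-Merge abstract can be seen in a deliberately simplified least-squares setting. This is an illustrative assumption, not MEMIT's full objective, which contains further terms. Suppose two edits share a nonzero key $k$ but require distinct values $v_1 \neq v_2$, and a single weight matrix $W$ must serve both:

```latex
\min_{W}\; \lVert Wk - v_1 \rVert^2 + \lVert Wk - v_2 \rVert^2
\quad\Longrightarrow\quad
Wk = \tfrac{1}{2}\,(v_1 + v_2),
\qquad
\text{residual} = \tfrac{1}{2}\,\lVert v_1 - v_2 \rVert^2 > 0 .
```

Because $Wk$ is a single vector, the best it can do is land on the midpoint of the two targets, so neither edit is satisfied exactly. This matches the abstract's description of identical keys being forced to represent different values.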
pdf
bib
abs
Large Language Models for Predictive Analysis: How Far Are They?
Qin Chen
|
Yuanyi Ren
|
Xiaojun Ma
|
Yuyang Shi
Predictive analysis is a cornerstone of modern decision-making, with applications in various domains. Large Language Models (LLMs) have emerged as powerful tools in enabling nuanced, knowledge-intensive conversations, thus aiding in complex decision-making tasks. With the burgeoning expectation to harness LLMs for predictive analysis, there is an urgent need to systematically assess their capability in this domain. However, there are no relevant evaluations in existing studies. To bridge this gap, we introduce the PredictiQ benchmark, which integrates 1130 sophisticated predictive analysis queries originating from 44 real-world datasets of 8 diverse fields. We design an evaluation protocol considering text analysis, code generation, and their alignment. Twelve renowned LLMs are evaluated, offering insights into their practical use in predictive analysis.
pdf
bib
abs
Think More, Hallucinate Less: Mitigating Hallucinations via Dual Process of Fast and Slow Thinking
Xiaoxue Cheng
|
Junyi Li
|
Xin Zhao
|
Ji-Rong Wen
Large language models (LLMs) demonstrate exceptional capabilities, yet still face the hallucination issue. Typical text generation approaches adopt an auto-regressive generation without deliberate reasoning, often leading to untrustworthy and factually inaccurate responses. In this paper, we propose HaluSearch, a novel framework that incorporates tree search-based algorithms (e.g., MCTS) to enable an explicit slow thinking generation process for mitigating hallucinations during inference. Specifically, HaluSearch frames text generation as a step-by-step reasoning process, using a self-evaluation reward model to score each generation step and guide the tree search towards the most reliable generation pathway. To balance efficiency and quality, we introduce a hierarchical system switch mechanism, which dynamically switches between fast and slow thinking modes at both instance and step levels. We conduct extensive experiments on both English and Chinese datasets, and the results show that our approach significantly outperforms baseline approaches.
pdf
bib
abs
Towards Adaptive Memory-Based Optimization for Enhanced Retrieval-Augmented Generation
Qitao Qin
|
Yucong Luo
|
Yihang Lu
|
Zhibo Chu
|
Xiaoman Liu
|
Xianwei Meng
Retrieval-Augmented Generation (RAG), by integrating non-parametric knowledge from external knowledge bases into models, has emerged as a promising approach to enhancing response accuracy while mitigating factual errors and hallucinations. This method has been widely applied in tasks such as Question Answering (QA). However, existing RAG methods struggle with open-domain QA tasks because they perform independent retrieval operations and directly incorporate the retrieved information into generation without maintaining a summarizing memory or using adaptive retrieval strategies, leading to noise from redundant information and insufficient information integration. To address these challenges, we propose Adaptive memory-based optimization for enhanced RAG (Amber) for open-domain QA tasks, which comprises an Agent-based Memory Updater, an Adaptive Information Collector, and a Multi-granular Content Filter, working together within an iterative memory updating paradigm. Specifically, Amber integrates and optimizes the language model’s memory through a multi-agent collaborative approach, ensuring comprehensive knowledge integration from previous retrieval steps. It dynamically adjusts retrieval queries and decides when to stop retrieval based on the accumulated knowledge, enhancing retrieval efficiency and effectiveness. Additionally, it reduces noise by filtering irrelevant content at multiple levels, retaining essential information to improve overall model performance. We conduct extensive experiments on several open-domain QA datasets, and the results demonstrate the superiority and effectiveness of our method and its components. The source code is available.
pdf
bib
abs
Enhancing Cross-Tokenizer Knowledge Distillation with Contextual Dynamical Mapping
Yijie Chen
|
Yijin Liu
|
Fandong Meng
|
Yufeng Chen
|
Jinan Xu
|
Jie Zhou
Knowledge Distillation (KD) has emerged as a prominent technique for model compression. However, conventional KD approaches primarily focus on homogeneous architectures with identical tokenizers, constraining their applicability in cross-architecture scenarios. In cross-tokenizer KD, the differences between tokenizers give rise to two fundamental challenges: (1) sequence misalignment caused by divergent tokenization strategies, and (2) mismatched vocabulary size and composition. While existing probability-matching methods attempt to address these issues, their efficacy remains limited due to suboptimal alignment in both the sequence and vocabulary aspects. To overcome these limitations, we propose Contextual Dynamic Mapping (CDM), a novel cross-tokenizer distillation framework that employs contextual information to enhance sequence alignment precision and dynamically improves vocabulary mapping. We evaluated the effectiveness of our approach across five advanced and widely-used model families (i.e., LLama3, Phi3, Gemma2, OPT and Qwen2), which were configured into three distinct teacher-student pairs. Our method shows significant advantages over existing cross-tokenizer distillation baselines across diverse benchmarks, including instruction-following, code generation and math. Notably, our analysis reveals that combining conventional same-tokenizer distillation and cross-tokenizer distillation through CDM yields further performance improvements.
pdf
bib
abs
A Semantic-Aware Layer-Freezing Approach to Computation-Efficient Fine-Tuning of Language Models
Jian Gu
|
Aldeida Aleti
|
Chunyang Chen
|
Hongyu Zhang
Finetuning language models (LMs) is crucial for adapting the models to downstream data and tasks. However, full finetuning is usually costly. Existing work, such as parameter-efficient finetuning (PEFT), often focuses on how to finetune but neglects the issue of where to finetune. As a pioneering work on reducing the cost of backpropagation (at the layer level) by answering where to finetune, we conduct a semantic analysis of the LM inference process. We first propose using transition traces of the latent representation to compute deviations (or loss). Then, using a derived formula of scaling law, we estimate the gain of each layer in reducing deviation (or loss). Further, we narrow down the scope for finetuning and study the cost-benefit balance of LM finetuning. We perform extensive experiments across well-known LMs and datasets. The results show that our approach is effective and efficient, and outperforms the existing baselines. Our approach is orthogonal to other techniques for improving finetuning efficiency, such as PEFT methods, offering practical value for LM finetuning.
pdf
bib
abs
CNNSum: Exploring Long-Context Summarization with Large Language Models in Chinese Novels
Lingxiao Wei
|
He Yan
|
Lu Xiangju
|
Junmin Zhu
|
Jun Wang
|
Wei Zhang
Large language models (LLMs) have been well-researched in various long-context tasks. However, the scarcity of long-context summarization datasets hinders progress in this area. To address this, we introduce CNNSum, a multi-scale long-context summarization benchmark based on Chinese novels, featuring human-driven annotations across four subsets totaling 695 samples, with lengths ranging from 16k to 128k. We benchmark numerous LLMs and conduct detailed human assessments to summarize abnormal output types. Furthermore, we extensively explore how to improve long-context summarization. In our study: (1) Advanced LLMs may generate a great deal of subjective commentary, leading to vague summaries. (2) Currently, long-context summarization mainly relies on memory ability. The advantages of large LLMs are hard to utilize, so small LLMs are more cost-effective. (3) Different prompt types paired with different model versions may cause large performance gaps. With further fine-tuning, these gaps can be mitigated, and the Base version models perform better. (4) LLMs with a scaled RoPE base exhibit strong extrapolation potential; using short-context data can significantly improve long-context summarization performance. However, further applying other interpolation methods requires careful selection. (5) CNNSum provides more reliable evaluation results than other benchmarks. We release CNNSum to advance future research.
pdf
bib
abs
Document Segmentation Matters for Retrieval-Augmented Generation
Zhitong Wang
|
Cheng Gao
|
Chaojun Xiao
|
Yufei Huang
|
Shuzheng Si
|
Kangyang Luo
|
Yuzhuo Bai
|
Wenhao Li
|
Tangjian Duan
|
Chuancheng Lv
|
Guoshan Lu
|
Gang Chen
|
Fanchao Qi
|
Maosong Sun
Retrieval-augmented generation (RAG) enhances large language models (LLMs) by integrating external knowledge. A critical yet underexplored challenge in RAG is document segmentation, also known as document chunking. Existing widely-used rule-based chunking methods usually lead to suboptimal splits, where overly large chunks introduce irrelevant information and small chunks lack semantic coherence. Existing semantic-based approaches either require costly LLM calls or fail to adaptively group contextually related sentences. To address these limitations, we propose PIC (Pseudo-Instruction for document Chunking), a simple yet effective method that leverages document summaries as pseudo-instructions to guide chunking. By computing semantic similarity between sentences and the summary, PIC dynamically groups sentences into chunks that align with the document’s key themes, ensuring semantic completeness and relevance to potential user instructions. Experiments on multiple open-domain question-answering benchmarks demonstrate that PIC can significantly improve retrieval accuracy (Hits@k) and end-to-end QA performance (Exact Match) without any additional training.
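As a rough illustration of summary-guided chunking, the sketch below scores each sentence against a document summary and groups consecutive sentences whose relevance falls on the same side of a threshold. Everything here is an assumption made for illustration, not the authors' implementation: the bag-of-words cosine stands in for a real sentence-embedding model, and the function names, the threshold value, and the grouping rule are hypothetical.

```python
import math
import re
from collections import Counter


def bow(text):
    """Bag-of-words vector; a stand-in for a real sentence embedding (assumption)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def chunk_by_summary(sentences, summary, threshold=0.2):
    """Group consecutive sentences into chunks.

    Hypothetical rule: a new chunk starts whenever a sentence's relevance
    to the summary crosses the threshold relative to the previous sentence.
    """
    if not sentences:
        return []
    summary_vec = bow(summary)
    relevant = [cosine(bow(s), summary_vec) >= threshold for s in sentences]
    chunks = [[sentences[0]]]
    for sent, prev, cur in zip(sentences[1:], relevant, relevant[1:]):
        if cur == prev:
            chunks[-1].append(sent)   # same relevance band: extend the chunk
        else:
            chunks.append([sent])     # relevance flipped: open a new chunk
    return chunks


sentences = [
    "Cats purr when happy.",
    "Cats sleep a lot.",
    "The stock market fell.",
    "Bond yields rose.",
]
chunks = chunk_by_summary(sentences, "Cats purr and sleep.")
```

With this toy input, the two cat sentences end up in one chunk and the two finance sentences in another; a real system would swap in dense embeddings and a tuned grouping criterion.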
pdf
bib
abs
UBench: Benchmarking Uncertainty in Large Language Models with Multiple Choice Questions
Xunzhi Wang
|
Zhuowei Zhang
|
Gaonan Chen
|
Qiongyu Li
|
Bitong Luo
|
Zhixin Han
|
Haotian Wang
|
Zhiyu Li
|
Hang Gao
|
Mengting Hu
Despite recent progress in systematic evaluation frameworks, benchmarking the uncertainty of large language models (LLMs) remains a highly challenging task. Existing methods for benchmarking the uncertainty of LLMs face three key challenges: the need for internal model access, additional training, or high computational costs. This is particularly unfavorable for closed-source models. To this end, we introduce UBench, a new benchmark for evaluating the uncertainty of LLMs. Unlike other benchmarks, UBench is based on confidence intervals. It encompasses 11,978 multiple-choice questions spanning knowledge, language, understanding, and reasoning capabilities. Based on this, we conduct extensive experiments. This includes comparisons with other advanced uncertainty estimation methods, the assessment of the uncertainty of 20 LLMs, and an exploration of the effects of Chain-of-Thought (CoT) prompts, role-playing (RP) prompts, and temperature on model uncertainty. Our analysis reveals several crucial insights: 1) Our confidence interval-based methods are highly effective for uncertainty quantification; 2) Regarding uncertainty, outstanding open-source models show competitive performance versus closed-source models; 3) CoT and RP prompts present potential ways to improve model reliability, while the influence of temperature changes follows no universal rule. Our implementation is available at https://github.com/Cyno2232/UBENCH.
pdf
bib
abs
Embracing Large Language Models in Traffic Flow Forecasting
Yusheng Zhao
|
Xiao Luo
|
Haomin Wen
|
Zhiping Xiao
|
Wei Ju
|
Ming Zhang
Traffic flow forecasting aims to predict future traffic flows based on historical traffic conditions and the road network. It is an important problem in intelligent transportation systems, with a plethora of methods being proposed. Existing efforts mainly focus on capturing and utilizing spatio-temporal dependencies to predict future traffic flows. Though promising, they fall short in adapting to test-time environmental changes in traffic conditions. To tackle this challenge, we propose to introduce large language models (LLMs) to help traffic flow forecasting and design a novel method named Large Language Model Enhanced Traffic Flow Predictor (LEAF). LEAF adopts two branches, capturing different spatio-temporal relations using graph and hypergraph structures, respectively. The two branches are first pre-trained individually, and during test time, they yield different predictions. Based on these predictions, a large language model is used to select the most likely result. Then, a ranking loss is applied as the learning objective to enhance the prediction ability of the two branches. Extensive experiments on several datasets demonstrate the effectiveness of LEAF. Our code is available at https://github.com/YushengZhao/LEAF.
pdf
bib
abs
Flow2Code: Evaluating Large Language Models for Flowchart-based Code Generation Capability
Mengliang He
|
Jiayi Zeng
|
Yankai Jiang
|
Wei Zhang
|
Zeming Liu
|
Xiaoming Shi
|
Aimin Zhou
While large language models (LLMs) show promise in code generation, existing benchmarks neglect flowchart-based code generation. To promote further research on flowchart-based code generation, this work presents Flow2Code, a novel benchmark for flowchart-based code generation evaluation. The evaluation dataset spans 15 programming languages and includes 5,622 code segments paired with 16,866 flowcharts of three types: code, UML, and pseudocode. Extensive experiments with 13 multimodal LLMs reveal that current LLMs cannot generate code based on flowcharts perfectly. In addition, experimental results show that supervised fine-tuning contributes greatly to the models’ performance. The dataset will be publicly available.
pdf
bib
abs
Smarter, Not Harder: Training-Free Adaptive Computation for Transformers
Romain Storaï
|
Jaeseong Lee
|
Seung-won Hwang
Adaptive Computation in Transformers (ACT) has been pursued in two directions: efficiency- and performance-focused. We study performance-focused ACT, or PACT, which invests more computation on hard steps to improve performance, such as by adding forward passes. We first discuss beam search and hesitation-based methods as PACT and their limitations. While the hesitation-based approach outperforms beam search by perturbing input embeddings, it suffers from inefficiency due to invalidating KVCache and exhibits instability due to its reliance on randomness. To address this, we propose IMPACT, a novel PACT method that perturbs network weights rather than input embeddings. This approach enables the reuse of KVCache, offers deterministic predictions, and significantly improves memory and computational efficiency. By achieving a better balance between performance and efficiency, IMPACT makes PACT accessible to communities with consumer-grade hardware.
pdf
bib
abs
UCS-SQL: Uniting Content and Structure for Enhanced Semantic Bridging In Text-to-SQL
Zhenhe Wu
|
Zhongqiu Li
|
JieZhangChinaTele JieZhangChinaTele
|
Zhongjiang He
|
Jian Yang
|
Yu Zhao
|
Ruiyu Fang
|
Bing Wang
|
Hongyan Xie
|
Shuangyong Song
|
Zhoujun Li
With the rapid advancement of large language models (LLMs), researchers have increasingly focused on the superior capabilities of LLMs in text/code understanding and generation to tackle text-to-SQL tasks. Traditional approaches adopt schema linking to first eliminate redundant tables and columns and then prompt LLMs for SQL generation. However, they often struggle with accurately identifying corresponding tables and columns, due to discrepancies in naming conventions between natural language questions (NL) and database schemas. Besides, existing methods overlook the challenge of effectively transforming structure information from NL into SQL. To address these limitations, we introduce UCS-SQL, a novel text-to-SQL framework, uniting both content and structure pipes to bridge the gap between NL and SQL. Specifically, the content pipe focuses on identifying key content within the original content, while the structure pipe is dedicated to transforming the linguistic structure from NL to SQL. Additionally, we strategically select few-shot examples by considering both the SQL Skeleton and Question Expression (SS-QE selection method), thus providing targeted examples for SQL generation. Experimental results on BIRD and Spider demonstrate the effectiveness of our UCS-SQL framework.
pdf
bib
abs
CodePRM: Execution Feedback-enhanced Process Reward Model for Code Generation
Qingyao Li
|
Xinyi Dai
|
Xiangyang Li
|
Weinan Zhang
|
Yasheng Wang
|
Ruiming Tang
|
Yong Yu
Code generation is a critical reasoning task for large language models (LLMs). Recent advancements have focused on optimizing the thought process of code generation, achieving significant improvements. However, such a thought process lacks effective process supervision, making it hard to optimize the thoughts. Although Process Reward Models (PRMs) have been widely established in mathematical reasoning, building a code PRM is still nontrivial because of the gap between thoughts and code. In this paper, we propose CodePRM, a novel approach that leverages code execution feedback to build a code PRM. Specifically, we first collect a large dataset of thought traces, where each thought step is labeled with the pass rate of its derived code, accompanied by the corresponding code snippets and execution feedback. During training, we train a PRM to take both the reasoning process and code execution feedback as input to score individual thought steps, enabling it to leverage code execution results to distinguish between high-quality and low-quality thought steps. Finally, to use the PRM during inference, we develop a Generate-Verify-Refine (GVR) pipeline where the CodePRM serves as a process verifier to dynamically identify and correct errors in the thought process during code search. Experimental results demonstrate that CodePRM with the inference algorithm outperforms strong baselines, significantly enhancing code generation performance. Further analysis reveals the key factors for building a code PRM.
pdf
bib
abs
STEM-POM: Evaluating Language Models Math-Symbol Reasoning in Document Parsing
Jiaru Zou
|
Qing Wang
|
Pratyush Thakur
|
Nickvash Kani
Advances in large language models (LLMs) have spurred research into enhancing their reasoning capabilities, particularly in math-rich STEM (Science, Technology, Engineering, and Mathematics) documents. While LLMs can generate equations or solve math-related queries, their ability to fully understand and interpret abstract mathematical symbols in long, math-rich documents remains limited. In this paper, we introduce STEM-PoM, a comprehensive benchmark dataset designed to evaluate LLMs’ reasoning abilities on math symbols within contextual scientific text. The dataset, sourced from real-world ArXiv documents, contains over 2K math symbols classified as main attributes of variables, constants, operators, and unit descriptors, with additional sub-attributes including scalar/vector/matrix for variables and local/global/discipline-specific labels for both constants and operators. Our extensive experiments demonstrate that state-of-the-art LLMs achieve an average accuracy of 20-60% under in-context learning and 50-60% with fine-tuning, highlighting a substantial gap in their ability to classify mathematical symbols. By improving LLMs’ mathematical symbol classification, STEM-PoM further enhances models’ downstream mathematical reasoning capabilities. The code and data are available at https://github.com/jiaruzouu/STEM-PoM.
pdf
bib
abs
Retrieval Visual Contrastive Decoding to Mitigate Object Hallucinations in Large Vision-Language Models
Jihoon Lee
|
Min Song
Despite significant advancements in Large Vision-Language Models, Object Hallucination (OH) remains a persistent challenge. Building upon prior studies on contrastive decoding that address this issue without requiring additional model training, we introduce RVCD (Retrieval Visual Contrastive Decoding), an advanced method to suppress OH. RVCD leverages both negative and positive images at the logit level, explicitly referencing AI-generated images designed to represent a single concept. Our approach demonstrates substantial improvements over existing decoding-based methods.
pdf
bib
abs
Leveraging LLMs for Bangla Grammar Error Correction: Error Categorization, Synthetic Data, and Model Evaluation
Pramit Bhattacharyya
|
Arnab Bhattacharya
Large Language Models (LLMs) perform exceedingly well in Natural Language Understanding (NLU) tasks for many languages including English. However, although Bangla is the fifth most-spoken language globally, Grammatical Error Correction (GEC) in Bangla remains underdeveloped. In this work, we investigate how LLMs can be leveraged for improving Bangla GEC. To that end, we first carry out an extensive categorization of 12 error classes in Bangla and survey native Bangla speakers to collect real-world errors. We next devise a rule-based noise injection method to create grammatically incorrect sentences corresponding to correct ones. The Vaiyākaraṇa dataset, thus created, consists of 5,67,422 sentences of which 2,27,119 are erroneous. This dataset is then used to instruction-tune LLMs for the task of GEC in Bangla. Evaluations show that instruction-tuning with Vaiyākaraṇa improves GEC performance of LLMs by 3-7 percentage points as compared to the zero-shot setting, and makes them achieve human-like performance in grammatical error identification. Humans, though, remain superior in error correction. The data and code are available from https://github.com/Bangla-iitk/Vaiyakarana.
pdf
bib
abs
Think Both Ways: Teacher-Student Bidirectional Reasoning Enhances MCQ Generation and Distractor Quality
Yimiao Qiu
|
Yang Deng
|
Quanming Yao
|
Zhimeng Zhang
|
Zhiang Dong
|
Chang Yao
|
Jingyuan Chen
Generating high-quality Multiple Choice Questions (MCQs) remains challenging for educational tools due to the need for contextual relevance and plausible distractors. Existing methods still struggle with these dual requirements, leading to questions that lack depth and distractors that are either too obvious or irrelevant. In this paper, we propose BiFlow, a novel framework that integrates bidirectional reasoning perspectives: teacher reasoning generates contextually relevant questions and plausible distractors, while student reasoning evaluates question clarity and the misleading nature of the distractors. To further enhance reasoning, we introduce PathFinder, a mechanism that employs breadth-first search and Chain-of-Thought (CoT) strategies to explore diverse reasoning paths, improving both the quality and diversity of generated questions and distractors. Additionally, we enrich the FairytaleQA dataset to FairytaleMCQ with high-quality distractors, providing a robust benchmark for MCQ generation. Experimental results demonstrate that BiFlow outperforms existing methods, particularly in generating text-grounded questions and high-quality distractors for narrative contexts, highlighting its value in educational applications.
pdf
bib
abs
mmE5: Improving Multimodal Multilingual Embeddings via High-quality Synthetic Data
Haonan Chen
|
Liang Wang
|
Nan Yang
|
Yutao Zhu
|
Ziliang Zhao
|
Furu Wei
|
Zhicheng Dou
Multimodal embedding models have gained significant attention for their ability to map data from different modalities, such as text and images, into a unified representation space. However, the limited labeled multimodal data often hinders embedding performance. Recent approaches have leveraged data synthesis to address this problem, yet the quality of synthetic data remains a critical bottleneck. In this work, we identify three criteria for high-quality synthetic multimodal data. First, broad scope ensures that the generated data covers diverse tasks and modalities, making it applicable to various downstream scenarios. Second, robust cross-modal alignment makes different modalities semantically consistent. Third, high fidelity ensures that the synthetic data maintains realistic details to enhance its reliability. Guided by these principles, we synthesize datasets that: (1) cover a wide range of tasks, modality combinations, and languages, (2) are generated via a deep thinking process within a single pass of a multimodal large language model, and (3) incorporate real-world images with accurate and relevant texts, ensuring fidelity through self-evaluation and refinement. Leveraging these high-quality synthetic and labeled datasets, we train a multimodal multilingual E5 model, mmE5. Extensive experiments demonstrate that mmE5 achieves state-of-the-art performance on the MMEB Benchmark and superior multilingual performance on the XTD benchmark. Our code, datasets, and models are released at https://github.com/haon-chen/mmE5.
pdf
bib
abs
Word2Passage: Word-level Importance Re-weighting for Query Expansion
Jeonghwan Choi
|
Minjeong Ban
|
Minseok Kim
|
Hwanjun Song
Retrieval-augmented generation (RAG) enhances the quality of LLM generation by providing relevant chunks, but retrieving accurately from external knowledge remains challenging because contextually important words are often missing from the query. We present Word2Passage, a novel approach that improves retrieval accuracy by optimizing word importance in query expansion. Our method generates references at word, sentence, and passage levels for query expansion, then determines word importance by considering both their reference level origin and characteristics derived from query types and corpus analysis. Specifically, our method assigns distinct importance scores to words based on whether they originate from word, sentence, or passage-level references. Extensive experiments demonstrate that Word2Passage outperforms existing methods across various datasets and LLM configurations, effectively enhancing both retrieval accuracy and generation quality. The code is publicly available at https://github.com/DISL-Lab/Word2Passage.
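The level-dependent weighting described above can be pictured with a small sketch. It is only an illustration under stated assumptions: the weight values, the `LEVEL_WEIGHTS` table, the `query_weight` parameter, and the function name are hypothetical, and the abstract does not say how the actual method derives its scores from query types or corpus statistics.

```python
from collections import Counter

# Hypothetical per-level weights; the real method derives importance from
# query types and corpus analysis, which this sketch does not model.
LEVEL_WEIGHTS = {"word": 3.0, "sentence": 2.0, "passage": 1.0}


def weighted_expansion(query, references, level_weights=LEVEL_WEIGHTS, query_weight=5.0):
    """Return a term -> weight map for an expanded query.

    `references` maps a level name ("word", "sentence", "passage") to the
    generated reference text for that level.
    """
    weights = Counter()
    for term in query.lower().split():
        weights[term] += query_weight            # original query terms
    for level, text in references.items():
        for term in text.lower().split():
            weights[term] += level_weights[level]  # terms weighted by origin level
    return dict(weights)


expanded = weighted_expansion(
    "solar panels",
    {
        "word": "photovoltaic",
        "sentence": "solar cells convert light",
        "passage": "panels convert light into power",
    },
)
```

The resulting map could feed a sparse retriever that accepts per-term weights; terms appearing at several levels simply accumulate weight.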
pdf
bib
abs
MECoT: Markov Emotional Chain-of-Thought for Personality-Consistent Role-Playing
Yangbo Wei
|
Zhen Huang
|
Fangzhou Zhao
|
Qi Feng
|
Wei W. Xing
Large Language Models (LLMs) have shown remarkable capabilities in role-playing dialogues, yet they often struggle to maintain emotionally consistent and psychologically plausible character personalities. We present MECoT (Markov Emotional Chain-of-Thought), a framework that enhances LLMs’ ability to generate authentic personality-driven dialogues through stochastic emotional transitions. Inspired by dual-process theory, MECoT combines a Markov-chain-driven emotional processor for intuitive responses with an LLM-based reasoning mechanism for rational regulation, mapped onto a 12-dimensional Emotion Circumplex Model. The framework dynamically adjusts emotional transitions using personality-weighted matrices and historical context, ensuring both emotional coherence and character consistency. We introduce the Role-playing And Personality Dialogue (RAPD) dataset, featuring diverse character interactions with fine-grained emotional annotations, along with novel metrics for evaluating emotional authenticity and personality alignment. Experimental results demonstrate MECoT’s effectiveness, achieving 93.3% emotional accuracy on RAPD and substantially outperforming existing approaches. Our analysis reveals optimal emotional granularity (12-16 categories) and validates our data-driven personality optimization approach. Code and data are available at https://anonymous.4open.science/r/MECoT
pdf
bib
abs
FiDeLiS: Faithful Reasoning in Large Language Models for Knowledge Graph Question Answering
Yuan Sui
|
Yufei He
|
Nian Liu
|
Xiaoxin He
|
Kun Wang
|
Bryan Hooi
Large Language Models (LLMs) are often challenged by generating erroneous or hallucinated responses, especially in complex reasoning tasks. Leveraging Knowledge Graphs (KGs) as external knowledge sources has emerged as a viable solution. However, existing KG-enhanced methods, either retrieval-based or agent-based, encounter difficulties in accurately retrieving knowledge and efficiently traversing KGs at scale. In this paper, we propose a unified framework, FiDeLiS, designed to improve the factuality of LLM responses by anchoring answers to verifiable reasoning steps retrieved from KGs. To achieve this, we leverage step-wise beam search with a deductive scoring function, allowing the LLM to validate the reasoning process step by step and halt the search once the question is deducible. In addition, we propose a Path-RAG module to pre-select a smaller candidate set for each beam search step, reducing computational costs by narrowing the search space. Extensive experiments show that our method, as a training-free framework, not only improves performance but also enhances factuality and interpretability across different benchmarks.
pdf
bib
abs
REALM: A Dataset of Real-World LLM Use Cases
Jingwen Cheng
|
Kshitish Ghate
|
Wenyue Hua
|
William Yang Wang
|
Hong Shen
|
Fei Fang
Large Language Models (LLMs), such as the GPT series, have driven significant industrial applications, leading to economic and societal transformations. However, a comprehensive understanding of their real-world applications remains limited. To address this, we introduce REALM, a dataset of over 94,000 LLM use cases collected from Reddit and news articles. REALM captures two key dimensions: the diverse applications of LLMs and the demographics of their users. It categorizes LLM applications and explores how users’ occupations relate to the types of applications they use. By integrating real-world data, REALM offers insights into LLM adoption across different domains, providing a foundation for future research on their evolving societal roles. An interactive dashboard (https://realm-e7682.web.app/) is provided for easy exploration of the dataset.
pdf
bib
abs
BABELEDITS: A Benchmark and a Modular Approach for Robust Cross-lingual Knowledge Editing of Large Language Models
Tommaso Green
|
Félix Gaschi
|
Fabian David Schmidt
|
Simone Paolo Ponzetto
|
Goran Glavaš
With Large Language Models (LLMs) becoming increasingly multilingual, effective knowledge editing (KE) needs to propagate edits across languages. Evaluation of the existing methods for cross-lingual knowledge editing (CKE) is limited both w.r.t. edit effectiveness: benchmarks do not account for entity aliases and use faulty entity translations; as well as robustness: existing work fails to report on downstream generation and task-solving abilities of LLMs after editing. In this work, we aim to (i) maximize the effectiveness of CKE while at the same time (ii) minimizing the extent of downstream model collapse due to the edits. To accurately measure the effectiveness of CKE methods, we introduce BabelEdits, a new CKE benchmark covering 60 languages that combines high-quality multilingual synsets from BabelNet with marker-based translation to ensure entity translation quality. Unlike existing CKE benchmarks, BabelEdits accounts for the rich variety of entity aliases within and across languages. We then propose BabelReFT, a modular CKE approach based on representation fine-tuning (ReFT) which learns entity-scope ReFT modules, applying them to all multilingual aliases at inference. Our experimental results show that not only is BabelReFT more effective in CKE than state-of-the-art methods, but, owing to its modular design, much more robust against downstream model collapse when subjected to many sequential edits.
pdf
bib
abs
CDS: Data Synthesis Method Guided by Cognitive Diagnosis Theory
Haokun Zhao
|
Jinyi Han
|
Jiaqing Liang
|
Yanghua Xiao
|
Xiaojun Meng
|
Jiansheng Wei
Large Language Models (LLMs) have achieved significant advancements, but the increasing complexity of tasks and higher performance demands highlight the need for continuous improvement. Some approaches utilize synthetic data generated by advanced LLMs based on evaluation results to train models. However, conventional evaluation methods fail to provide detailed, fine-grained profiles of LLMs, limiting their guidance for data synthesis. In this paper, we introduce the Cognitive Diagnostic Synthesis (CDS) method, which incorporates a diagnostic process inspired by Cognitive Diagnosis Theory (CDT) to refine evaluation results and characterize model profiles at the knowledge component level. Based on these diagnostics, we propose two diagnosis-synthesis strategies for weakness-targeted data synthesis. Additionally, we present an enhanced data augmentation and selection pipeline to improve the quality and diversity of synthesized data. Our experiments with several open-source models show significant improvements across multiple benchmarks, achieving up to 6.00% improvement in code generation, 13.10% in mathematical reasoning, and 5.43% in academic exams. Code and data are available at https://anonymous.4open.science/r/cds-04D1.
pdf
bib
abs
Problem-Solving Logic Guided Curriculum In-Context Learning for LLMs Complex Reasoning
Xuetao Ma
|
Wenbin Jiang
|
Hua Huang
In-context learning (ICL) can significantly enhance the complex reasoning capabilities of large language models (LLMs), with the key lying in the selection and ordering of demonstration examples. Previous methods typically relied on simple features to measure the relevance between examples. We argue that these features are not sufficient to reflect the intrinsic connections between examples. In this study, we propose a curriculum ICL strategy guided by problem-solving logic. We select demonstration examples by analyzing the problem-solving logic and order them based on curriculum learning. Specifically, we constructed a problem-solving logic instruction set based on the BREAK dataset and fine-tuned a language model to analyze the problem-solving logic of examples. Subsequently, we selected appropriate demonstration examples based on problem-solving logic and assessed their difficulty according to the number of problem-solving steps. In accordance with the principles of curriculum learning, we ordered the examples from easy to hard to serve as contextual prompts. Experimental results on multiple benchmarks indicate that our method outperforms previous ICL approaches in terms of performance and efficiency, effectively enhancing the complex reasoning capabilities of LLMs. Our project will be publicly available subsequently.
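The easy-to-hard ordering step can be sketched in a few lines. This is a minimal illustration, assuming each demonstration already carries its list of problem-solving steps; the `Demo` class, the function names, and the prompt layout are hypothetical, and the selection of demonstrations by problem-solving logic (done with a fine-tuned model in the paper) is not modeled here.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Demo:
    """A hypothetical demonstration example with its solving steps."""
    question: str
    steps: List[str]
    answer: str


def order_by_difficulty(demos: List[Demo]) -> List[Demo]:
    """Order demonstrations from easy to hard, using step count as difficulty."""
    return sorted(demos, key=lambda d: len(d.steps))


def build_prompt(demos: List[Demo], target_question: str) -> str:
    """Concatenate ordered demonstrations, then append the target question."""
    parts = []
    for d in order_by_difficulty(demos):
        reasoning = " ".join(d.steps)
        parts.append(f"Q: {d.question}\nReasoning: {reasoning}\nA: {d.answer}")
    parts.append(f"Q: {target_question}\nReasoning:")
    return "\n\n".join(parts)


demos = [
    Demo("hard question", ["s1", "s2", "s3"], "a1"),
    Demo("easy question", ["s1"], "a2"),
    Demo("medium question", ["s1", "s2"], "a3"),
]
prompt = build_prompt(demos, "new question")
```

Because `sorted` is stable, demonstrations with the same step count keep their original relative order.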
pdf
bib
abs
BESSTIE: A Benchmark for Sentiment and Sarcasm Classification for Varieties of English
Dipankar Srirag
|
Aditya Joshi
|
Jordan Painter
|
Diptesh Kanojia
Despite large language models (LLMs) being known to exhibit bias against non-mainstream varieties, there are no known labeled datasets for sentiment analysis of English. To address this gap, we introduce BESSTIE, a benchmark for sentiment and sarcasm classification for three varieties of English: Australian (en-AU), Indian (en-IN), and British (en-UK). Using web-based content from two domains, namely, Google Place reviews and Reddit comments, we collect datasets for these language varieties using two methods: location-based and topic-based filtering. Native speakers of the language varieties manually annotate the datasets with sentiment and sarcasm labels. To assess whether the dataset accurately represents these varieties, we conduct two validation steps: (a) manual annotation of language varieties and (b) automatic language variety prediction. We perform an additional annotation exercise to validate the reliability of the annotated labels. Subsequently, we fine-tune nine large language models (LLMs) (representing a range of encoder/decoder and mono/multilingual models) on these datasets, and evaluate their performance on the two tasks. Our results reveal that the models consistently perform better on inner-circle varieties (i.e., en-AU and en-UK), with significant performance drops for en-IN, particularly in sarcasm detection. We also report challenges in cross-variety generalisation, highlighting the need for language variety-specific datasets such as ours. BESSTIE promises to be a useful evaluative benchmark for future research in equitable LLMs, specifically in terms of language varieties. The BESSTIE datasets, code, and models will be publicly available upon acceptance.
pdf
bib
abs
NavRAG: Generating User Demand Instructions for Embodied Navigation through Retrieval-Augmented LLM
Zihan Wang
|
Yaohui Zhu
|
Gim Hee Lee
|
Yachun Fan
Vision-and-Language Navigation (VLN) is an essential skill for embodied agents, allowing them to navigate in 3D environments following natural language instructions. High-performance navigation models require a large amount of training data, but the high cost of manually annotating data has seriously hindered this field. Therefore, some previous methods translate trajectory videos into step-by-step instructions for expanding data, but such instructions do not match well with users’ communication styles that briefly describe destinations or state specific needs. Moreover, local navigation trajectories overlook global context and high-level task planning. To address these issues, we propose NavRAG, a retrieval-augmented generation (RAG) framework that generates user demand instructions for VLN. NavRAG leverages an LLM to build a hierarchical scene description tree for 3D scene understanding from global layout to local details, then simulates various user roles with specific demands to retrieve from the scene tree, generating diverse instructions with the LLM. We annotate over 2 million navigation instructions across 861 scenes and evaluate the data quality and navigation performance of trained models. The model trained on our NavRAG dataset achieves SOTA performance on the REVERIE benchmark.
pdf
bib
abs
SQLForge: Synthesizing Reliable and Diverse Data to Enhance Text-to-SQL Reasoning in LLMs
Yu Guo
|
Dong Jin
|
Shenghao Ye
|
Shuangwu Chen
|
Jianyang Jianyang
|
Xiaobin Tan
Large language models (LLMs) have demonstrated significant potential in text-to-SQL reasoning tasks, yet a substantial performance gap persists between existing open-source models and their closed-source counterparts. In this paper, we introduce SQLForge, a novel approach for synthesizing reliable and diverse data to enhance text-to-SQL reasoning in LLMs. We improve data reliability through SQL syntax constraints and SQL-to-question reverse translation, ensuring data logic at both structural and semantic levels. We also propose an SQL template enrichment and iterative data domain exploration mechanism to boost data diversity. Building on the augmented data, we fine-tune a variety of open-source models with different architectures and parameter sizes, resulting in a family of models termed SQLForge-LM. SQLForge-LM achieves state-of-the-art performance among open-source models on the widely recognized Spider and BIRD benchmarks. Specifically, SQLForge-LM achieves EX accuracy of 85.7% on Spider Dev and 59.8% on BIRD Dev, significantly narrowing the performance gap with closed-source methods.
pdf
bib
abs
Retrieval-Augmented Process Reward Model for Generalizable Mathematical Reasoning
Jiachen Zhu
|
Congmin Zheng
|
Jianghao Lin
|
Kounianhua Du
|
Ying Wen
|
Yong Yu
|
Jun Wang
|
Weinan Zhang
While large language models (LLMs) have significantly advanced mathematical reasoning, Process Reward Models (PRMs) have been developed to evaluate the logical validity of reasoning steps. However, PRMs still struggle with out-of-distribution (OOD) challenges. This paper identifies two OOD issues: step OOD, arising from differences in reasoning patterns across model types and sizes, and question OOD, due to dataset shifts between training and real-world problems. To address these issues, we introduce the Retrieval-Augmented Process Reward Model (RetrievalPRM), a novel framework built on a two-stage retrieval-enhanced mechanism. RetrievalPRM retrieves semantically similar questions and steps for the PRM as a warmup to stimulate its potential to judge target steps, improving generalization and reasoning consistency across different models and problem types. Our extensive experiments demonstrate that RetrievalPRM outperforms existing baselines across multiple real-world datasets. Our open-source contributions include a retrieval-enhanced dataset, a tuning framework for PRM training, and the RetrievalPRM model, establishing a new standard for PRM performance.
pdf
bib
abs
Contrastive Learning for Task-Independent SpeechLLM-Pretraining
Maike Züfle
|
Jan Niehues
Large language models (LLMs) excel in natural language processing but adapting these LLMs to speech processing tasks efficiently is not straightforward. Direct task-specific fine-tuning is limited by overfitting risks, data requirements, and computational costs. To address these challenges, we propose a scalable, two-stage training approach: (1) A task-independent speech pretraining stage using contrastive learning to align text and speech representations over all layers, followed by (2) a task-specific fine-tuning stage requiring minimal data. This approach outperforms traditional ASR pretraining and enables the model to surpass models specialized on speech translation and question answering while being trained on only 10% of the task-specific data.
pdf
bib
abs
QiMeng-Attention: SOTA Attention Operator is generated by SOTA Attention Algorithm
Qirui Zhou
|
Shaohui Peng
|
Weiqiang Xiong
|
Haixin Chen
|
Yuanbo Wen
|
Haochen Li
|
Ling Li
|
Qi Guo
|
Yongwei Zhao
|
Ke Gao
|
Ruizhi Chen
|
Yanjun Wu
|
Zhao Chen
|
Yunji Chen
The attention operator remains a critical performance bottleneck in large language models (LLMs), particularly for long-context scenarios. While FlashAttention is the most widely used and effective GPU-aware acceleration algorithm, it requires time-consuming and hardware-specific manual implementation, limiting adaptability across GPU architectures. Existing LLMs have shown considerable promise in code generation tasks, but struggle to generate high-performance attention code. The key challenge is that they cannot comprehend the complex data flow and computation process of the attention operator or utilize low-level primitives to exploit GPU performance. To address this challenge, we propose an LLM-friendly Thinking Language (LLM-TL) to help LLMs decouple the generation of high-level optimization logic from low-level implementation on GPU, and to enhance LLMs’ understanding of the attention operator. Along with a 2-stage reasoning workflow, TL-Code generation and translation, the LLMs can automatically generate FlashAttention implementations on diverse GPUs, establishing a self-optimizing paradigm for generating high-performance attention operators in attention-centric algorithms. Verified on A100, RTX8000, and T4 GPUs, our method significantly outperforms vanilla LLMs, achieving a speed-up of up to 35.16×. Moreover, our method not only surpasses human-optimized libraries (cuDNN and the official library) in most scenarios but also extends support to unsupported hardware and data types, reducing development time from months to minutes compared with human experts.
pdf
bib
abs
ALW: Adaptive Layer-Wise contrastive decoding enhancing reasoning ability in Large Language Models
Yuechi Zhou
|
Chuyue Zhou
|
Jianxin Zhang
|
Juntao Li
|
Min Zhang
Large language models (LLMs) have achieved remarkable performance across various reasoning tasks. However, many LLMs still encounter challenges in reasoning, especially those with fewer parameters or insufficient pre-training data. Through our experiments, we identify that noise accumulation across layers often leads to unstable token predictions during reasoning. We find that contrasting the probability distributions across layers effectively mitigates this interference. Building on this insight, we propose Adaptive Layer-Wise contrastive decoding (ALW), a novel framework that enhances reasoning ability by dynamically disentangling noise in shallow layers from critical signals in deep layers. Extensive experiments on several reasoning benchmarks demonstrate that ALW consistently improves answer accuracy across multiple LLMs while maintaining inference efficiency. For example, we achieve a 48% improvement on GSM8K using the LLaMA-7B model and an absolute accuracy increase of 5.2 points on the BBH evaluation benchmark with the LLaMA-65B model.
pdf
bib
abs
Mixture of Decoding: An Attention-Inspired Adaptive Decoding Strategy to Mitigate Hallucinations in Large Vision-Language Models
Xinlong Chen
|
Yuanxing Zhang
|
Qiang Liu
|
Junfei Wu
|
Fuzheng Zhang
|
Tieniu Tan
Large Vision-Language Models (LVLMs) have exhibited impressive capabilities across various visual tasks, yet they remain hindered by the persistent challenge of hallucinations. To address this critical issue, we propose Mixture of Decoding (MoD), a novel approach for hallucination mitigation that dynamically adapts decoding strategies by evaluating the correctness of the model’s attention on image tokens. Specifically, MoD measures the consistency between outputs generated from the original image tokens and those derived from the model’s attended image tokens, and uses this consistency to judge whether the attention is correct. If the outputs are consistent, indicating correct attention, MoD employs a complementary strategy to amplify critical information. Conversely, if the outputs are inconsistent, suggesting erroneous attention, MoD utilizes a contrastive strategy to suppress misleading information. Extensive experiments demonstrate that MoD significantly outperforms existing decoding methods across multiple mainstream benchmarks, effectively mitigating hallucinations in LVLMs. Code is available at https://github.com/xlchen0205/MoD.
pdf
bib
abs
VidCapBench: A Comprehensive Benchmark of Video Captioning for Controllable Text-to-Video Generation
Xinlong Chen
|
Yuanxing Zhang
|
Chongling Rao
|
Yushuo Guan
|
Jiaheng Liu
|
Fuzheng Zhang
|
Chengru Song
|
Qiang Liu
|
Di Zhang
|
Tieniu Tan
The training of controllable text-to-video (T2V) models relies heavily on the alignment between videos and captions, yet little existing research connects video caption evaluation with T2V generation assessment. This paper introduces VidCapBench, a video caption evaluation scheme specifically designed for T2V generation, agnostic to any particular caption format. VidCapBench employs a data annotation pipeline, combining expert model labeling and human refinement, to associate each collected video with key information spanning video aesthetics, content, motion, and physical laws. VidCapBench then partitions these key information attributes into automatically assessable and manually assessable subsets, catering to both the rapid evaluation needs of agile development and the accuracy requirements of thorough validation. By evaluating numerous state-of-the-art captioning models, we demonstrate the superior stability and comprehensiveness of VidCapBench compared to existing video captioning evaluation approaches. Verification with off-the-shelf T2V models reveals a significant positive correlation between scores on VidCapBench and the T2V quality evaluation metrics, indicating that VidCapBench can provide valuable guidance for training T2V models. The project is available at https://github.com/VidCapBench/VidCapBench.
pdf
bib
abs
Mitigating Demonstration Bias through Global Coevolutionary Reasoning
Chuan Gou
|
Bangwei Li
|
Jianhua Dai
|
Xiaoyang Han
|
Ming Cai
Recent advances in large language models (LLMs) have demonstrated the effectiveness of chain-of-thought (CoT) prompting. Few-Shot-CoT relies on task-specific, manually labeled demonstrations, limiting its generalization to unseen tasks. While Zero-Shot-CoT eliminates this reliance, it often underperforms. To address this, existing methods aim to automatically generate demonstrations in zero-shot settings. However, these generated demonstrations face challenges due to demonstration bias: 1) selected demonstrations may contain errors, and 2) they may not be suitable or representative enough for all questions. To mitigate these biases, we propose Global Coevolutionary Reasoning (GCR). The method first applies Zero-Shot-CoT to answer all questions, then clusters the results. For each cluster, a random sample is selected, and these selected samples serve as demonstrations for each other. The model then iteratively re-answers the questions and updates their rationales based on these demonstrations, enabling coevolutionary reasoning to progressively improve the quality of the answers. This process of random sampling and coevolutionary reasoning is repeated until all questions have been re-answered. Experimental results on ten datasets using GPT-3.5-turbo and GPT-4o-mini show that GCR outperforms baseline methods without any performance degradation caused by demonstration bias. Additionally, GCR is orthogonal to existing methods and can be seamlessly integrated with them. The code is available at: https://github.com/GouChuan/GCR.
pdf
bib
abs
A Representation Level Analysis of NMT Model Robustness to Grammatical Errors
Abderrahmane Issam
|
Yusuf Can Semerci
|
Jan Scholtes
|
Gerasimos Spanakis
Understanding robustness is essential for building reliable NLP systems. Unfortunately, in the context of machine translation, previous work mainly focused on documenting robustness failures or improving robustness. In contrast, we study robustness from a model representation perspective by looking at internal model representations of ungrammatical inputs and how they evolve through model layers. For this purpose, we perform Grammatical Error Detection (GED) probing and representational similarity analysis. Our findings indicate that the encoder first detects the grammatical error, then corrects it by moving its representation toward the correct form. To understand what contributes to this process, we turn to the attention mechanism where we identify what we term *Robustness Heads*. We find that *Robustness Heads* attend to interpretable linguistic units when responding to grammatical errors, and that when we fine-tune models for robustness, they tend to rely more on *Robustness Heads* for updating the ungrammatical word representation.
pdf
bib
abs
T2DR: A Two-Tier Deficiency-Resistant Framework for Incomplete Multimodal Learning
Han Lin
|
Xiu Tang
|
Huan Li
|
Wenxue Cao
|
Sai Wu
|
Chang Yao
|
Lidan Shou
|
Gang Chen
Multimodal learning is garnering significant attention for its capacity to represent diverse human perceptions (e.g., linguistic, acoustic, and visual signals), achieving more natural and intuitive interactions with technology. However, the frequent occurrence of incomplete data, either within a single modality (intra-modality) or across different modalities (inter-modality), presents substantial challenges in reliable semantic interpretation and model reasoning. Furthermore, there is currently no robust representation learning mechanism capable of managing both intra-modality and inter-modality real-data deficiencies. To address this challenge, we present T2DR, a two-tier deficiency-resistant framework for incomplete multimodal learning, which comprises two main modules: (1) Intra-Modal Deficiency-Resistant module (IADR): To address fine-grained deficiencies, we introduce Intra-Attn to focus on the available data while avoiding excessive suppression of the missing regions. (2) Inter-Modal Deficiency-Resistant module (IEDR): To handle coarse-grained deficiencies, we propose the shared feature prediction (SFP) to leverage cross-modal shared features for preliminary data imputation. Subsequently, we apply Inter-Attn to allocate appropriate attention to each modality based on the results from the capability-aware scorer (CAS). Extensive experiments are performed on two well-known multimodal benchmarks, CMU-MOSI and CMU-MOSEI, across various missing scenarios for sentiment analysis. Experimental results show that T2DR significantly outperforms the SOTA models. Code is available at https://github.com/LH019/T2DR.
pdf
bib
abs
From Specific-MLLMs to Omni-MLLMs: A Survey on MLLMs Aligned with Multi-modalities
Shixin Jiang
|
Jiafeng Liang
|
Jiyuan Wang
|
Xuan Dong
|
Heng Chang
|
Weijiang Yu
|
Jinhua Du
|
Ming Liu
|
Bing Qin
To tackle complex tasks in real-world scenarios, more researchers are focusing on Omni-MLLMs, which aim to achieve omni-modal understanding and generation. Beyond the constraints of any specific non-linguistic modality, Omni-MLLMs map various non-linguistic modalities into the embedding space of LLMs and enable the interaction and understanding of arbitrary combinations of modalities within a single model. In this paper, we systematically investigate relevant research and provide a comprehensive survey of Omni-MLLMs. Specifically, we first explain the four core components of Omni-MLLMs for unified multi-modal modeling with a meticulous taxonomy that offers novel perspectives. Then, we introduce the effective integration achieved through two-stage training and discuss the corresponding datasets as well as evaluation. Furthermore, we summarize the main challenges of current Omni-MLLMs and outline future directions. We hope this paper serves as an introduction for beginners and promotes the advancement of related research. Resources have been made publicly available at https://github.com/threegold116/Awesome-Omni-MLLMs.
pdf
bib
abs
Analyzing the Effect of Linguistic Similarity on Cross-Lingual Transfer: Tasks and Experimental Setups Matter
Verena Blaschke
|
Masha Fedzechkina
|
Maartje Ter Hoeve
Cross-lingual transfer is a popular approach to increase the amount of training data for NLP tasks in a low-resource context. However, the best strategy to decide which cross-lingual data to include is unclear. Prior research often focuses on a small set of languages from a few language families and/or a single task. It is still an open question how these findings extend to a wider variety of languages and tasks. In this work, we analyze cross-lingual transfer for 263 languages from a wide variety of language families. Moreover, we include three popular NLP tasks: POS tagging, dependency parsing, and topic classification. Our findings indicate that the effect of linguistic similarity on transfer performance depends on a range of factors: the NLP task, the (mono- or multilingual) input representations, and the definition of linguistic similarity.
pdf
bib
abs
Agents generalize to novel levels of abstraction by using adaptive linguistic strategies
Kristina Kobrock
|
Xenia Ohmer
|
Elia Bruni
|
Nicole Gotzner
We study abstraction in an emergent communication paradigm. In emergent communication, two artificial neural network agents develop a language while solving a communicative task. In this study, the agents play a concept-level reference game. This means that the speaker agent has to describe a concept to a listener agent, who has to pick the correct target objects that satisfy the concept. Concepts consist of multiple objects and can be either more specific, i.e. the target objects share many attributes, or more generic, i.e. the target objects share fewer attributes. We tested two directions of zero-shot generalization to novel levels of abstraction: When generalizing from more generic to very specific concepts, agents utilized a compositional strategy. When generalizing from more specific to very generic concepts, agents utilized a more flexible linguistic strategy that involves reusing many messages from training. Our results provide evidence that neural network agents can learn robust concepts based on which they can generalize using adaptive linguistic strategies. We discuss how this research provides new hypotheses on abstraction and informs linguistic theories on efficient communication.
pdf
bib
abs
The Linguistic Connectivities Within Large Language Models
Dan Wang
|
Boxi Cao
|
Ning Bian
|
Xuanang Chen
|
Yaojie Lu
|
Hongyu Lin
|
Jia Zheng
|
Le Sun
|
Shanshan Jiang
|
Bin Dong
|
Xianpei Han
Large language models (LLMs) have demonstrated remarkable multilingual abilities in various applications. Unfortunately, recent studies have discovered notable disparities in their performance across different languages. Understanding the underlying mechanisms behind such disparities is crucial to ensuring equitable access to LLMs for a global user base. Therefore, this paper conducts a systematic investigation into the behaviors of LLMs across 27 different languages in 3 different scenarios, and reveals a Linguistic Map that correlates with the richness of available resources and linguistic family relations. Specifically, high-resource languages within specific language families exhibit greater knowledge consistency and mutual information dissemination, while isolated or low-resource languages tend to remain marginalized. Our research sheds light on LLMs’ cross-language behavior, highlights the inherent biases in LLMs within multilingual environments, and underscores the need to address these inequities.
pdf
bib
abs
XFinBench: Benchmarking LLMs in Complex Financial Problem Solving and Reasoning
Zhihan Zhang
|
Yixin Cao
|
Lizi Liao
Solving financial problems demands complex reasoning, multimodal data processing, and a broad technical understanding, presenting unique challenges for current large language models (LLMs). We introduce **XFinBench**, a novel benchmark with 4,235 examples designed to evaluate LLMs’ ability to solve comple**X**, knowledge-intensive **Fin**ancial problems across diverse graduate-level finance topics with multi-modal context. We identify five core capabilities of LLMs using XFinBench, i.e., _terminology understanding_, _temporal reasoning_, _future forecasting_, _scenario planning_, and _numerical modelling_. Using XFinBench, we conduct extensive experiments on 18 leading models. The results show that o1 is the best-performing text-only model with an overall accuracy of 67.3%, but it still lags significantly behind human experts, by 12.5%, especially in temporal reasoning and scenario planning capabilities. We further construct a knowledge bank with 3,032 finance terms for knowledge augmentation analysis, and find that knowledge relevant to the question brings consistent accuracy improvements only to small open-source models. Additionally, our error analysis reveals that rounding errors during calculation and blindness to the position and intersection of curves in the image are the two primary issues behind models’ poor performance on calculation and visual-context questions, respectively.
pdf
bib
abs
Align2LLaVA: Cascaded Human and Large Language Model Preference Alignment for Multi-modal Instruction Curation
Hongzhe Huang
|
Jiang Liu
|
Zhewen Yu
|
Li Cai
|
Dian Jiao
|
Wenqiao Zhang
|
Siliang Tang
|
Juncheng Li
|
Hao Jiang
|
Haoyuan Li
|
Yueting Zhuang
Recent advances in Multi-modal Large Language Models (MLLMs), such as LLaVA-series models, are driven by massive machine-generated instruction-following data tuning. Such automatic instruction collection pipelines, however, inadvertently introduce significant variability in data quality. This paper introduces a novel instruction curation algorithm, derived from two unique perspectives, human and LLM preference alignment, to compress this vast corpus of machine-generated multimodal instructions to a compact and high-quality form: (i) For human preference alignment, we have collected a machine-generated multimodal instruction dataset and established a comprehensive set of both subjective and objective criteria to guide the data quality assessment critically from human experts. By doing so, a reward model was trained on the annotated dataset to internalize the nuanced human understanding of instruction alignment. (ii) For LLM preference alignment, given the instruction selected by the reward model, we propose leveraging the inner LLM used in MLLM to align the writing style of visual instructions with that of the inner LLM itself, resulting in LLM-aligned instruction improvement. Extensive experiments demonstrate that we can maintain or even improve model performance by compressing synthetic multimodal instructions by up to 90%. Impressively, by aggressively reducing the training instructions from 158k to 14k (9× smaller), our model consistently outperforms its full-size dataset counterpart across various MLLM benchmarks. Our project is available at https://github.com/DCDmllm/Align2LLaVA.
pdf
bib
abs
Achieving binary weight and activation for LLMs using Post-Training Quantization
Siqing Song
|
Chuang Wang
|
Rui-Qi Wang
|
Yi Yang
|
Xu-Yao Zhang
Quantizing large language models (LLMs) to 1-bit precision significantly reduces computational costs, but existing quantization techniques suffer from noticeable performance degradation when using weight and activation precisions below 4 bits (W4A4). In this paper, we propose a post-training quantization framework with a W(1+1)A(1×4) configuration, where weights are quantized to 1 bit with an additional 1 bit for fine-grained grouping, and activations are quantized to 1 bit with a 4-fold increase in the number of channels. For weight quantization, we propose utilizing Hessian-aware fine-grained grouping along with an EM-based quantization scheme. For activation quantization, we equivalently decompose INT4-quantized activations into a 4 × INT1 format and simultaneously smooth the scaling factors based on quantization errors, which further reduces the quantization errors in activations. Our method surpasses state-of-the-art (SOTA) LLM quantization baselines on W2A4 across multiple tasks, pushing the boundaries of existing LLM quantization methods toward fully binarized models. Code is available at https://github.com/JimmyCrave/LLM-PTQ-binarization.
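The equivalence behind a 4 × INT1 decomposition can be illustrated with a minimal sketch. It assumes unsigned INT4 activations and leaves out scaling factors, smoothing, and grouping entirely; the function names and example values are hypothetical and this is not the authors' implementation. It only shows that a dot product with an INT4 vector equals a weighted sum of dot products with its four binary bit planes.

```python
def bit_planes(x, bits=4):
    """Split unsigned integers in [0, 2**bits) into `bits` binary (0/1) vectors."""
    return [[(v >> i) & 1 for v in x] for i in range(bits)]


def dot(a, b):
    """Plain dot product of two equal-length sequences."""
    return sum(p * q for p, q in zip(a, b))


def dot_via_planes(w, x, bits=4):
    """Compute dot(w, x) using only binary activation planes.

    Plane i holds bit i of every activation, so it contributes 2**i * dot(w, plane_i).
    """
    planes = bit_planes(x, bits)
    return sum((1 << i) * dot(w, plane) for i, plane in enumerate(planes))


# Hypothetical example values: binarized weights and unsigned INT4 activations.
w = [1, -1, 1, 1]
x = [3, 15, 0, 9]
assert dot_via_planes(w, x) == dot(w, x)
```

Stacking the four planes along the channel axis is one way to read the "4-fold increase in the number of channels": each original channel becomes four binary channels whose contributions are weighted by powers of two.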
pdf
bib
abs
Mitigating Negative Interference in Multilingual Knowledge Editing through Null-Space Constraints
Wei Sun
|
Tingyu Qu
|
Mingxiao Li
|
Jesse Davis
|
Marie-Francine Moens
Efficiently updating multilingual knowledge in large language models (LLMs) without disrupting coherent factual representations across languages remains a significant challenge. While deploying separate editing systems for each language might seem viable, this approach incurs substantial costs due to the need to manage multiple models. A more efficient solution involves integrating knowledge updates across all languages into a unified model. However, sequential edits across languages often lead to destructive parameter interference, significantly degrading multilingual generalization and the accuracy of injected knowledge. To address this issue, we propose LangEdit, a novel null-space constrained framework designed to precisely isolate language-specific knowledge updates. The core innovation of LangEdit lies in its ability to project parameter updates for each language onto the orthogonal complement of other languages’ subspaces. This approach mathematically guarantees update independence while preserving multilingual generalization capabilities. We conduct a comprehensive evaluation across three model architectures, six languages, and four downstream tasks, demonstrating that LangEdit effectively mitigates parameter interference and outperforms existing state-of-the-art editing methods. Our results highlight its potential for enabling efficient and accurate multilingual knowledge updates in LLMs.
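The projection step described above can be sketched in a few lines of linear algebra. This is a generic illustration rather than LangEdit itself: the abstract does not say how each language's subspace is estimated, so here the subspace is simply the span of some given vectors, and all names and values are hypothetical.

```python
def dot(a, b):
    """Plain dot product of two equal-length sequences."""
    return sum(p * q for p, q in zip(a, b))


def orthonormalize(vectors, eps=1e-12):
    """Gram-Schmidt: return an orthonormal basis for the span of `vectors`."""
    basis = []
    for v in vectors:
        w = list(v)
        for u in basis:
            c = dot(w, u)
            w = [wi - c * ui for wi, ui in zip(w, u)]
        norm = dot(w, w) ** 0.5
        if norm > eps:  # skip vectors already in the span
            basis.append([wi / norm for wi in w])
    return basis


def project_to_complement(update, other_vectors):
    """Remove from `update` every component lying in span(other_vectors)."""
    out = list(update)
    for u in orthonormalize(other_vectors):
        c = dot(out, u)
        out = [oi - c * ui for oi, ui in zip(out, u)]
    return out


# Hypothetical example: other languages occupy the first two coordinates.
others = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0]]
update = [2.0, 3.0, 4.0]
projected = project_to_complement(update, others)
# The projected update is orthogonal to every vector of the other subspace.
assert all(abs(dot(projected, v)) < 1e-9 for v in others)
```

Because the projected update has zero inner product with every direction in the other subspace, applying it leaves responses along those directions unchanged, which is the sense in which such a projection keeps updates independent.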
pdf
bib
abs
From Awareness to Adaptability: Enhancing Tool Utilization for Scientific Reasoning
Wenjing Xie
|
Xiaobo Liang
|
Juntao Li
|
Wanfu Wang
|
Kehai Chen
|
Qiaoming Zhu
|
Min Zhang
As large language models (LLMs) are increasingly applied to complex scientific problem-solving, their effectiveness is often limited by unconscious or failed tool usage. To address this issue, we introduce the Tool-Awareness Training (TAT) method, designed to enhance scientific reasoning. This approach leverages both forward and backward data generation strategies to strengthen the model’s conscious and selective tool utilization in multi-step reasoning tasks. Our method unfolds in three stages: (1) developing tool-knowledge through backward tool-use data generation, (2) enhancing tool-awareness in multi-step reasoning by utilizing forward reasoning data, and (3) improving domain adaptability through large-scale domain-specific data for multi-task learning. These three stages progressively establish the foundation for tool learning and scientific reasoning and integrate the two, enabling the model to tackle multi-domain scientific tasks while optimizing tool usage. Our experimental results demonstrate that TAT significantly enhances LLM performance in mathematical and scientific reasoning tasks, particularly by improving the model’s tool utilization capabilities, including proactivity and execution success rates.
pdf
bib
abs
AMoPO: Adaptive Multi-objective Preference Optimization without Reward Models and Reference Models
Qi Liu
|
Jingqing Ruan
|
Hao Li
|
Haodong Zhao
|
Desheng Wang
|
Jiansong Chen
|
Wan Guanglu
|
Xunliang Cai
|
Zhi Zheng
|
Tong Xu
Existing multi-objective preference alignment methods for large language models (LLMs) face two limitations: (1) the inability to effectively balance various preference dimensions, and (2) reliance on auxiliary reward/reference models, which introduces computational complexity. To address these challenges, we propose Adaptive Multi-objective Preference Optimization (AMoPO), a novel framework that achieves dynamic balance across preference dimensions. By introducing the multi-objective optimization paradigm to use dimension-aware generation metrics as implicit rewards, AMoPO aligns LLMs with diverse preferences without additional reward models or reference models. We introduce an adaptive weight assignment mechanism that models the generation space as a Gaussian distribution, allowing dynamic prioritization of preference dimensions. Empirical results demonstrate that AMoPO outperforms state-of-the-art baselines by 28.5%, and the experiments on 7B, 14B, and 32B models reveal the scaling ability of AMoPO. Moreover, additional analysis of multiple dimensions verifies its adaptability and effectiveness. These findings validate AMoPO’s capability to achieve dimension-aware preference alignment, highlighting its superiority. Our code and datasets are available at https://github.com/Javkonline/AMoPO.
pdf
bib
abs
Supervised Optimism Correction: Be Confident When LLMs Are Sure
Junjie Zhang
|
Rushuai Yang
|
Shunyu Liu
|
Ting-En Lin
|
Fei Huang
|
Yi Chen
|
Yongbin Li
|
Dacheng Tao
In this work, we establish a novel theoretical connection between supervised fine-tuning and offline reinforcement learning under the token-level Markov decision process, revealing that large language models indeed learn an implicit Q-function for inference. Through this theoretical lens, we demonstrate that the widely used beam search method suffers from unacceptable over-optimism, where inference errors are inevitably amplified due to inflated Q-value estimations of suboptimal steps. To address this limitation, we propose **S**upervised **O**ptimism **C**orrection (SOC), which introduces a simple yet effective auxiliary loss for token-level Q-value estimations during supervised fine-tuning. Specifically, the auxiliary loss employs implicit value regularization to boost model confidence in expert-demonstrated responses, thereby suppressing over-optimism toward insufficiently supervised responses. Extensive experiments on mathematical reasoning benchmarks, including GSM8K, MATH, and GAOKAO, showcase the superiority of the proposed SOC with beam search across a series of open-source models.
pdf
bib
abs
Offline Reinforcement Learning for LLM Multi-step Reasoning
Huaijie Wang
|
Shibo Hao
|
Hanze Dong
|
Shenao Zhang
|
Yilin Bao
|
Ziran Yang
|
Yi Wu
Improving the multi-step reasoning ability of large language models (LLMs) with offline reinforcement learning (RL) is essential for quickly adapting them to complex tasks. While Direct Preference Optimization (DPO) has shown promise in aligning LLMs with human preferences, it is less suitable for multi-step reasoning tasks because (1) DPO relies on paired preference data, which is not readily available for multi-step reasoning tasks, and (2) it treats all tokens uniformly, making it ineffective for credit assignment in multi-step reasoning tasks, which often come with sparse reward. In this work, we propose OREO (Offline REasoning Optimization), an offline RL method for enhancing LLM multi-step reasoning. Building on insights from previous works of maximum entropy reinforcement learning, it jointly learns a policy model and value function by optimizing the soft Bellman Equation. We show in principle that it reduces the need to collect pairwise data and enables better credit assignment. Empirically, OREO surpasses existing offline learning methods on multi-step reasoning benchmarks, including mathematical reasoning tasks (GSM8K, MATH), and embodied agent control (ALFWorld). The approach can be extended to a multi-iteration framework when additional resources are available. Furthermore, the learned value function can be leveraged to guide the tree search for free, which can further boost the performance during test time.
pdf
bib
abs
Sampling-based Pseudo-Likelihood for Membership Inference Attacks
Masahiro Kaneko
|
Youmi Ma
|
Yuki Wata
|
Naoaki Okazaki
Large Language Models (LLMs) are trained on large-scale web data, which makes it difficult to grasp the contribution of each text. This poses the risk of leaking inappropriate data such as benchmarks, personal information, and copyrighted texts in the training data. Membership Inference Attacks (MIA), which determine whether a given text is included in the model’s training data, have been attracting attention. Previous studies of MIAs revealed that likelihood-based classification is effective for detecting leaks in LLMs. However, the existing likelihood-based methods cannot be applied to some proprietary models like ChatGPT or Claude 3 because the likelihood for input text is unavailable to the user. In this study, we propose a Sampling-based Pseudo-Likelihood (SPL) method for MIA (SaMIA) that calculates SPL using only the text generated by an LLM to detect leaks. The SaMIA treats the target text as the reference text and multiple outputs from the LLM as text samples, calculates the degree of n-gram match as SPL, and determines the membership of the text in the training data. Even without likelihoods, SaMIA performed on par with existing likelihood-based methods.
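The scoring idea described above can be sketched as follows. This is a minimal illustration, not the SaMIA implementation: it assumes whitespace tokenization, a recall-style n-gram overlap, and a hypothetical decision threshold, and it takes the sampled LLM outputs as given strings rather than showing how they are generated.

```python
from collections import Counter


def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def ngram_recall(reference, sample, n=1):
    """Fraction of the reference's n-grams that also occur in the sample."""
    ref = ngrams(reference.split(), n)
    smp = ngrams(sample.split(), n)
    total = sum(ref.values())
    if total == 0:
        return 0.0
    overlap = sum(min(count, smp[gram]) for gram, count in ref.items())
    return overlap / total


def pseudo_likelihood(target, samples, n=1):
    """Average n-gram match between the target text and each sampled output."""
    if not samples:
        return 0.0
    return sum(ngram_recall(target, s, n) for s in samples) / len(samples)


def is_member(target, samples, threshold=0.5, n=1):
    """Flag the target as likely training data if the score reaches the threshold.

    The threshold value is an assumption chosen for illustration only.
    """
    return pseudo_likelihood(target, samples, n) >= threshold
```

For example, `pseudo_likelihood("the cat sat on the mat", ["the cat sat on the mat", "a dog ran"])` averages a perfect match and a zero match. In practice the samples would be several generations from the model under test, and the threshold would need to be calibrated on texts with known membership.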
pdf
bib
abs
AgentStore: Scalable Integration of Heterogeneous Agents As Specialized Generalist Computer Assistant
Chengyou Jia
|
Minnan Luo
|
Zhuohang Dang
|
Qiushi Sun
|
Fangzhi Xu
|
Junlin Hu
|
Tianbao Xie
|
Zhiyong Wu
Digital agents capable of automating complex computer tasks have attracted considerable attention. However, existing agent methods exhibit deficiencies in their generalization and specialization capabilities, especially in handling open-ended computer tasks in real-world environments. Inspired by the rich functionality of the App store, we present AgentStore, a scalable platform designed to dynamically integrate heterogeneous agents for automating computer tasks. AgentStore allows the system to continuously enrich its capabilities and adapt to rapidly evolving operating systems. Additionally, we propose a novel core MetaAgent with the AgentToken strategy to efficiently manage diverse agents and utilize their specialized and generalist abilities for both domain-specific and system-wide tasks. Extensive experiments on three interactive real-world benchmarks demonstrate that AgentStore significantly expands the capability boundaries of agent systems in both generalization and specialization, underscoring its potential for developing the specialized generalist computer assistant.
pdf
bib
abs
Boosting Vulnerability Detection of LLMs via Curriculum Preference Optimization with Synthetic Reasoning Data
Xin-Cheng Wen
|
Yijun Yang
|
Cuiyun Gao
|
Yang Xiao
|
Deheng Ye
Large language models (LLMs) demonstrate considerable proficiency in numerous coding-related tasks; however, their capabilities in detecting software vulnerabilities remain limited. This limitation primarily stems from two factors: (1) the absence of reasoning data related to vulnerabilities, which hinders the models’ ability to capture underlying vulnerability patterns; and (2) their focus on learning semantic representations rather than the reasons behind them, which leaves them unable to tell semantically similar vulnerability samples apart. Furthermore, the development of LLMs specialized in vulnerability detection is challenging, particularly where high-quality datasets are scarce. In this paper, we propose a novel framework, ReVD, that excels at mining vulnerability patterns through reasoning data synthesis and vulnerability-specific preference optimization. Specifically, we construct forward and backward reasoning processes for vulnerable code and the corresponding fixed code, ensuring the synthesis of high-quality reasoning data. Moreover, we design triplet supervised fine-tuning followed by curriculum online preference optimization to enable ReVD to better understand vulnerability patterns. Extensive experiments conducted on the PrimeVul and SVEN datasets demonstrate that ReVD sets a new state of the art for LLM-based software vulnerability detection, e.g., a 12.24%-22.77% improvement in accuracy. The source code and data are available at https://github.com/Xin-Cheng-Wen/PO4Vul.
pdf
bib
abs
GA-S3: Comprehensive Social Network Simulation with Group Agents
Yunyao Zhang
|
Zikai Song
|
Hang Zhou
|
Wenfeng Ren
|
Yi-Ping Phoebe Chen
|
Junqing Yu
|
Wei Yang
Social network simulation is developed to provide a comprehensive understanding of social networks in the real world, which can be leveraged for a wide range of applications such as group behavior emergence, policy optimization, and business strategy development. However, billions of individuals and their evolving interactions involved in social networks pose challenges in accurately reflecting real-world complexities. In this study, we propose a comprehensive Social network Simulation System (GA-S3) that leverages newly designed Group Agents to make intelligent decisions regarding various online events. Unlike other intelligent agents that represent an individual entity, our group agents model a collection of individuals exhibiting similar behaviors, facilitating the simulation of large-scale network phenomena with complex interactions at a manageable computational cost. Additionally, we have constructed a social network benchmark from 2024 popular online events that contains fine-grained information on Internet traffic variations. The experiment demonstrates that our approach is capable of achieving accurate and highly realistic prediction results.
pdf
bib
abs
M-RangeDetector: Enhancing Generalization in Machine-Generated Text Detection through Multi-Range Attention Masks
Kaijie Jiao
|
Quan Wang
|
Licheng Zhang
|
Zikang Guo
|
Zhendong Mao
The increasing capability and widespread usage of large language models (LLMs) highlight the desirability of automatic detection of machine-generated text. Existing supervised detectors often overfit within their training domains, as they have primarily learned domain-specific textual features, such as word frequency, syntax, and semantics. In this paper, we introduce a domain-independent feature, namely the difference in writing strategy between LLMs and humans, to improve the out-of-domain generalization capability of detectors. LLMs focus on the range of preceding tokens when generating a token, while humans consider multiple ranges, including bidirectional, global, and local contexts. The attention mask influences the range of tokens to which the model can attend. Therefore, we propose a method called M-RangeDetector, which integrates four distinct attention masking strategies into a Multi-Range Attention module, enabling the model to capture diverse writing strategies. Specifically, with the global mask, band mask, dilated mask, and random mask, our method learns various writing strategies for machine-generated text detection. The experimental results on three datasets demonstrate the superior generalization capability of our method.
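To make the four mask names above concrete, here is a minimal sketch of how such attention masks are commonly built as boolean matrices (entry `[i][j]` is true when token `i` may attend to token `j`). The abstract does not define the masks, so the function names, the `width`, `stride`, and `keep_prob` parameters, and the exact patterns are assumptions for illustration, not the authors' definitions.

```python
import random
from typing import List

Mask = List[List[bool]]  # mask[i][j] is True when token i may attend to token j


def global_mask(n: int) -> Mask:
    """Every token attends to every token (full bidirectional context)."""
    return [[True] * n for _ in range(n)]


def band_mask(n: int, width: int) -> Mask:
    """Each token attends only to neighbours within `width` positions (local context)."""
    return [[abs(i - j) <= width for j in range(n)] for i in range(n)]


def dilated_mask(n: int, stride: int) -> Mask:
    """Each token attends to positions at multiples of `stride` away from itself."""
    return [[(i - j) % stride == 0 for j in range(n)] for i in range(n)]


def random_mask(n: int, keep_prob: float, seed: int = 0) -> Mask:
    """Each off-diagonal pair is kept with probability `keep_prob`; the diagonal is always kept."""
    rng = random.Random(seed)
    return [[i == j or rng.random() < keep_prob for j in range(n)] for i in range(n)]
```

In a Transformer, a mask of this kind is typically applied by setting the attention logits of disallowed pairs to a large negative value before the softmax; how the four masks are combined inside the Multi-Range Attention module is not stated in the abstract.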
pdf
bib
abs
Does Your Voice Assistant Remember? Analyzing Conversational Context Recall and Utilization in Voice Interaction Models
Heeseung Kim
|
Che Hyun Lee
|
Sangkwon Park
|
Jiheum Yeom
|
Nohil Park
|
Sangwon Yu
|
Sungroh Yoon
Recent advancements in multi-turn voice interaction models have improved user-model communication. However, while closed-source models effectively retain and recall past utterances, whether open-source models share this ability remains unexplored. To fill this gap, we systematically evaluate how well open-source interaction models utilize past utterances using ContextDialog, a benchmark we proposed for this purpose. Our findings show that speech-based models have more difficulty than text-based ones, especially when recalling information conveyed in speech, and even with retrieval-augmented generation, models still struggle with questions about past utterances. These insights highlight key limitations in open-source models and suggest ways to improve memory retention and retrieval robustness.
pdf
bib
abs
NeuronMerge: Merging Models via Functional Neuron Groups
Wangyun Gu
|
Qianghua Gao
|
Zhang Li-Xin
|
Xu Shen
|
Jieping Ye
Model merging techniques like task arithmetic, which combines model parameters through weighted averaging, have proven effective. However, the success of task arithmetic relies on the linearity between model weight differences and output feature changes, which is often lacking in conventional fine-tuned models. In this work, we employ neuron description methods to analyze and classify neurons based on their functionalities. We theoretically demonstrate that grouping Multi-Layer Perceptron (MLP) neurons by functionality enhances model linearity. Building on this, we propose a neuron-based task arithmetic merging method that consistently improves performance across various tasks and model scales. Our approach is complementary to existing merging techniques, achieving superior results in merging models fine-tuned on fundamental tasks like Math, Code and Translation.
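As background for the task arithmetic baseline mentioned above, the following is a minimal sketch of its commonly described form: each fine-tuned model contributes a task vector (its parameters minus the base parameters), and the merged model is the base plus a weighted sum of those vectors. Parameters are represented as plain dictionaries of float lists for illustration; the names `task_vector` and `merge` are hypothetical, and this does not reproduce the neuron-grouped variant the paper proposes.

```python
from typing import Dict, List

Params = Dict[str, List[float]]  # parameter name -> flattened weights


def task_vector(finetuned: Params, base: Params) -> Params:
    """Difference between fine-tuned and base weights, per parameter."""
    return {k: [f - b for f, b in zip(finetuned[k], base[k])] for k in base}


def merge(base: Params, task_vectors: List[Params], weights: List[float]) -> Params:
    """Add a weighted sum of task vectors to the base weights."""
    merged = {k: list(v) for k, v in base.items()}
    for tv, w in zip(task_vectors, weights):
        for k in merged:
            merged[k] = [m + w * d for m, d in zip(merged[k], tv[k])]
    return merged


# Toy usage with two "fine-tuned" models sharing one base.
base = {"w": [1.0, 2.0]}
math_model = {"w": [2.0, 2.0]}
code_model = {"w": [1.0, 4.0]}
vectors = [task_vector(math_model, base), task_vector(code_model, base)]
merged = merge(base, vectors, weights=[0.5, 0.5])
```

The linearity assumption discussed in the abstract is what makes this addition meaningful: the merged model behaves like a combination of the fine-tuned models only if weight differences translate roughly linearly into output feature changes.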
pdf
bib
abs
HellaSwag-Pro: A Large-Scale Bilingual Benchmark for Evaluating the Robustness of LLMs in Commonsense Reasoning
Xiaoyuan Li
|
Moxin Li
|
Rui Men
|
Yichang Zhang
|
Keqin Bao
|
Wenjie Wang
|
Fuli Feng
|
Dayiheng Liu
|
Junyang Lin
Large language models (LLMs) have shown remarkable capabilities in commonsense reasoning; however, some variations in questions can trigger incorrect responses. Do these models truly understand commonsense knowledge, or just memorize expression patterns? To investigate this question, we present the first extensive robustness evaluation of LLMs in commonsense reasoning. We introduce HellaSwag-Pro, a large-scale bilingual benchmark consisting of 11,200 cases, by designing and compiling seven types of question variants. To construct this benchmark, we propose a two-stage method to develop Chinese HellaSwag, a finely annotated dataset comprising 12,000 instances across 56 categories. We conduct extensive experiments on 41 representative LLMs, revealing that these LLMs are far from robust in commonsense reasoning. Furthermore, this robustness varies depending on the language in which the LLM is tested. This work establishes a high-quality evaluation benchmark, with extensive experiments offering valuable insights to the community in commonsense reasoning for LLMs.
pdf
bib
abs
Self-Steering Optimization: Autonomous Preference Optimization for Large Language Models
Hao Xiang
|
Bowen Yu
|
Hongyu Lin
|
Keming Lu
|
Yaojie Lu
|
Xianpei Han
|
Ben He
|
Le Sun
|
Jingren Zhou
|
Junyang Lin
The key to effective alignment lies in high-quality preference data. Recent research has focused on automated alignment, which involves developing alignment systems with minimal human intervention. However, prior research has predominantly focused on developing data generation methods while paying insufficient attention to quality control mechanisms; such methods often produce inaccurate and unhelpful data, leading to unpredictable benefits during iterative optimization. In this paper, we present Self-Steering Optimization (SSO), an algorithm that autonomously generates high-quality preference data, eliminating manual annotation requirements. SSO employs a specialized optimization objective to build a data generator from the policy model itself, which is used to produce accurate and on-policy data. We demonstrate SSO’s effectiveness through comprehensive experiments on two series of models: Llama 3 and Qwen 2. Our evaluation across diverse benchmarks shows that SSO consistently outperforms baselines in human preference alignment and reward optimization. Further analysis validates SSO as a scalable framework for preference optimization, benefiting the advancement of automated alignment techniques.
pdf
bib
abs
LIME: Less Is More for MLLM Evaluation
King Zhu
|
Qianbo Zang
|
Shian Jia
|
Siwei Wu
|
Feiteng Fang
|
Yizhi Li
|
Shuyue Guo
|
Tianyu Zheng
|
Jiawei Guo
|
Bo Li
|
Haoning Wu
|
Xingwei Qu
|
Jian Yang
|
Ruibo Liu
|
Xiang Yue
|
Jiaheng Liu
|
Chenghua Lin
|
Hamid Alinejad-Rokny
|
Min Yang
|
Shiwen Ni
|
Wenhao Huang
|
Ge Zhang
Multimodal Large Language Models (MLLMs) are measured on numerous benchmarks covering tasks like image captioning, visual question answering, and reasoning. However, these benchmarks often include overly simple or uninformative samples, making it difficult to effectively distinguish the performance of different MLLMs. Additionally, evaluating models across many benchmarks creates a significant computational burden. To address these issues, we propose LIME (Less Is More for MLLM Evaluation), a refined and efficient benchmark curated using a semi-automated pipeline. This pipeline filters out uninformative samples and eliminates answer leakage by focusing on tasks that require image-based understanding. Our experiments show that LIME reduces the number of samples by 76% and evaluation time by 77%, while more effectively distinguishing different models’ abilities. Notably, we find that traditional automatic metrics like CIDEr are insufficient for evaluating MLLMs’ captioning performance, and excluding the caption task score yields a more accurate reflection of overall model performance. All code and data are available at https://anonymous.4open.science/r/LIME-49CD
pdf
bib
abs
Debate, Reflect, and Distill: Multi-Agent Feedback with Tree-Structured Preference Optimization for Efficient Language Model Enhancement
Xiaofeng Zhou
|
Heyan Huang
|
Lizi Liao
Large Language Models (LLMs) continue to set new standards in knowledge-intensive and complex reasoning tasks, yet their high computational demands limit widespread adoption. While distilling large models into smaller ones offers a sustainable solution, current techniques—such as static knowledge distillation, resource-intensive reinforcement learning from human feedback, or limited self-reflection—struggle to yield substantial and lasting performance gains. In this paper, we present a novel Debate and Reflect (D&R) framework that orchestrates multi-turn debates between smaller models and stronger teacher models, eliciting actionable feedback (e.g., error analysis, corrective strategies) to guide student models. Further, we introduce Tree-structured Direct Preference Optimization (T-DPO) to efficiently leverage these debate logs, organizing interactions into a hierarchical format for effective training. Empirical evaluations across diverse NLP benchmarks demonstrate that our approach significantly improves smaller-model accuracy, robustness, and generalization, outperforming conventional baselines by a large margin.
pdf
bib
abs
CodeReviewQA: The Code Review Comprehension Assessment for Large Language Models
Hong Yi Lin
|
Chunhua Liu
|
Haoyu Gao
|
Patanamon Thongtanunam
|
Christoph Treude
State-of-the-art large language models (LLMs) have demonstrated impressive code generation capabilities but struggle with real-world software engineering tasks, such as revising source code to address code reviews, hindering their practical use. Code review comments are often implicit, ambiguous, and colloquial, requiring models to grasp both code and human intent. This challenge calls for evaluating large language models’ ability to bridge both technical and conversational contexts. While existing work has employed the automated code refinement (ACR) task to resolve these comments, current evaluation methods fall short, relying on text matching metrics that provide limited insight into model failures and remain susceptible to training data contamination. To address these limitations, we introduce a novel evaluation benchmark, CodeReviewQA, that enables us to conduct fine-grained assessment of model capabilities and mitigate data contamination risks. In CodeReviewQA, we decompose the generation task of code refinement into three essential reasoning steps: change type recognition (CTR), change localisation (CL), and solution identification (SI). Each step is reformulated as multiple-choice questions with varied difficulty levels, enabling precise assessment of model capabilities while mitigating data contamination risks. Our comprehensive evaluation spans 72 recently released large language models on 900 manually curated, high-quality examples across nine programming languages. Our results show that CodeReviewQA is able to expose specific model weaknesses in code review comprehension, disentangled from their generative automated code refinement results.
pdf
bib
abs
Narrative Media Framing in Political Discourse
Yulia Otmakhova
|
Lea Frermann
Narrative frames are a powerful way of conceptualizing and communicating complex, controversial ideas; however, automated frame analysis to date has mostly overlooked this framing device. In this paper, we connect elements of narrativity with fundamental aspects of framing, and present a framework which formalizes and operationalizes such aspects. We annotate and release a data set of news articles in the climate change domain, analyze the dominance of narrative frame components across political leanings, and test LLMs in their ability to predict narrative frames and their components. Finally, we apply our framework in an unsupervised way to elicit components of narrative framing in a second domain, the COVID-19 crisis, where our predictions are congruent with prior theoretical work, showing the generalizability of our approach.
pdf
bib
abs
MHALO: Evaluating MLLMs as Fine-grained Hallucination Detectors
Yishuo Cai
|
Renjie Gu
|
Jiaxu Li
|
Xuancheng Huang
|
Junzhe Chen
|
Xiaotao Gu
|
Minlie Huang
Hallucination remains a critical challenge for multimodal large language models (MLLMs), undermining their reliability in real-world applications. While fine-grained hallucination detection (FHD) holds promise for enhancing high-quality vision-language data construction and model alignment through enriched feedback signals, automated solutions for this task have yet to be systematically explored. Inspired by the concept of “MLLM as a Judge”, we introduce MHALO, the first comprehensive benchmark specifically designed for evaluating MLLMs’ capability in performing token-level FHD. Our benchmark encompasses 12 distinct hallucination types spanning both multimodal perception and reasoning domains. Through extensive evaluations of 9 selected MLLMs, we reveal substantial performance limitations, with the leading model achieving an average F1IoU of only 40.59%. To address this limitation, we develop HaloDet-4B, a specialized model trained on our curated training data, which significantly outperforms existing models. We hope the benchmark can provide valuable insights for future research on hallucination mitigation in MLLMs. The code and dataset will be publicly available.
pdf
bib
abs
Semantic Topology: a New Perspective for Communication Style Characterization
Barbara Scalvini
|
Alireza Mashaghi
We introduce semantic topology, a novel framework for discourse analysis that leverages Circuit Topology to quantify the semantic arrangement of sentences in a text. By mapping recurring themes as series, parallel, or cross relationships, we identify statistical differences in communication patterns in long-form true and fake news. Our analysis of large-scale news datasets reveals that true news is more likely to exhibit more complex topological structures, with greater thematic interleaving and long-range coherence, whereas fake news favors simpler, more linear narratives. These findings suggest that topological features capture stylistic distinctions beyond traditional linguistic cues, offering new insights for discourse modeling.
pdf
bib
abs
Decoding LLM Personality Measurement: Forced-Choice vs. Likert
Xiaoyu Li
|
Haoran Shi
|
Zengyi Yu
|
Yukun Tu
|
Chanjin Zheng
Recent research has focused on investigating the psychological characteristics of Large Language Models (LLMs), emphasizing the importance of comprehending their behavioral traits. Likert scale personality questionnaires have become the primary tool for assessing these characteristics in LLMs. However, such scales can be skewed by factors such as social desirability, distorting the assessment of true personality traits. To address this issue, we incorporate, for the first time, the forced-choice test, a method known for reducing response bias in human personality assessments, into the evaluation of LLMs. Specifically, we evaluated six LLMs: Llama-3.1-8B, GLM-4-9B, GPT-3.5-turbo, GPT-4o, Claude-3.5-sonnet, and Deepseek-V3. We compared the Likert scale and forced-choice test results for LLMs’ Big Five personality scores, as well as their reliability. In addition, we looked at how the temperature parameter and language affected LLM personality scores. The results show that the forced-choice test better captures differences between LLMs across various personality dimensions and is less influenced by the temperature parameter. Furthermore, we found both broad trends and specific variations in personality scores across models and languages.
pdf
bib
abs
MultiMSD: A Corpus for Multilingual Medical Text Simplification from Online Medical References
Koki Horiguchi
|
Tomoyuki Kajiwara
|
Takashi Ninomiya
|
Shoko Wakamiya
|
Eiji Aramaki
We release a parallel corpus for medical text simplification, which paraphrases medical terms into expressions easily understood by patients. Medical texts written by medical practitioners contain a lot of technical terms, and patients who are non-experts are often unable to use the information effectively. Therefore, there is a strong social demand for medical text simplification that paraphrases input sentences without using medical terms. However, this task has not been sufficiently studied in non-English languages. We therefore developed parallel corpora for medical text simplification in nine languages: German, English, Spanish, French, Italian, Japanese, Portuguese, Russian, and Chinese, each with 10,000 sentence pairs, by automatic sentence alignment to online medical references for professionals and consumers. We also propose a method for training text simplification models to actively paraphrase complex expressions, including medical terms. Experimental results show that the proposed method improves the performance of medical text simplification. In addition, we confirmed that training with a multilingual dataset is more effective than training with a monolingual dataset.
pdf
bib
abs
BadWindtunnel: Defending Backdoor in High-noise Simulated Training with Confidence Variance
Ruyi Zhang
|
Songlei Jian
|
Yusong Tan
|
Heng Gao
|
Haifang Zhou
|
Kai Lu
Current backdoor attack defenders in Natural Language Processing (NLP) typically involve data reduction or model pruning, risking the loss of crucial information. To address this challenge, we introduce a novel backdoor defender, BadWindtunnel, in which we build a high-noise simulated training environment, similar to a wind tunnel, which allows precise control over training conditions to model the backdoor learning behavior without affecting the final model. We also use the confidence variance as a metric for quantifying learning behavior in the simulated training, which is based on the characteristics of backdoor-poisoned data (poisoned data for short): higher learnability and robustness. In addition, we propose a two-step strategy to further model poisoned data, including target label identification and poisoned data revealing. Extensive experiments demonstrate BadWindtunnel’s superiority, with a 21% higher average reduction in attack success rate than the second-best defender.
pdf
bib
abs
Multimodal Machine Translation with Text-Image In-depth Questioning
Yue Gao
|
Jing Zhao
|
Shiliang Sun
|
Xiaosong Qiao
|
Tengfei Song
|
Hao Yang
Multimodal machine translation (MMT) integrates visual information to address ambiguity and contextual limitations in neural machine translation (NMT). Some empirical studies have revealed that many MMT models underutilize visual data during translation, and they attempt to enhance cross-modal interactions to enable better exploitation of visual data. However, these studies focus only on simple interactions between nouns in the text and corresponding entities in the image, overlooking global semantic alignment, particularly for prepositional phrases and verbs, which are more likely to be translated incorrectly. To address this, we design a Text-Image In-depth Questioning method to deepen interactions and optimize translations. Furthermore, to mitigate errors arising from contextually irrelevant image noise, we propose a Consistency Constraint strategy to improve our approach’s robustness. Our approach achieves state-of-the-art results on five translation directions of Multi30K and AmbigCaps, with +2.35 BLEU on the challenging MSCOCO benchmark, validating our method’s effectiveness in utilizing visual data and capturing comprehensive textual semantics.
pdf
bib
abs
ReKG-MCTS: Reinforcing LLM Reasoning on Knowledge Graphs via Training-Free Monte Carlo Tree Search
Xiaozhuang Song
|
Shufei Zhang
|
Tianshu Yu
Recent advancements in combining knowledge graphs (KGs) with large language models (LLMs) have demonstrated promising potential in complex KG reasoning tasks, yet existing approaches face limitations in path exploration strategies or excessive computational overhead. We propose ReKG-MCTS, a novel training-free framework that synergizes Monte Carlo Tree Search (MCTS) with LLM capabilities to enable dynamic reasoning over KGs. The framework conceptualizes KG reasoning as a decision-making process, where MCTS strategically explores paths over KG while LLMs provide semantic guidance for reasoning paths. The framework consists of four phases: (1) UCB-based node selection that balances exploration-exploitation on KG, (2) path expansion with KG structural constraints, (3) LLM-guided MC rollouts for simulation, and (4) value backpropagation. Experimental results on WebQSP and CWQ demonstrate that ReKG-MCTS outperforms existing training-free methods and achieves competitive performance compared to fine-tuned baselines. These findings suggest a new paradigm for leveraging language models in KG reasoning tasks. The code is available at https://github.com/ShawnKS/rekgmcts.
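The selection and backpropagation phases listed above can be illustrated with the standard UCB1 rule from generic MCTS. This is a minimal sketch under stated assumptions, not the ReKG-MCTS implementation: the `Node` class, the exploration constant `c`, and the function names are hypothetical, and the KG-constrained expansion and LLM-guided rollout phases are omitted.

```python
import math
from typing import List, Optional


class Node:
    """A search-tree node; in a KG setting it would stand for an entity on a reasoning path."""

    def __init__(self, name: str, parent: Optional["Node"] = None) -> None:
        self.name = name
        self.parent = parent
        self.children: List["Node"] = []
        self.visits = 0
        self.value_sum = 0.0
        if parent is not None:
            parent.children.append(self)


def ucb1(child: Node, parent_visits: int, c: float) -> float:
    """Mean value (exploitation) plus a bonus that shrinks as the child is visited (exploration)."""
    if child.visits == 0:
        return math.inf  # unvisited children are always tried first
    mean = child.value_sum / child.visits
    return mean + c * math.sqrt(math.log(parent_visits) / child.visits)


def select_child(node: Node, c: float = 1.414) -> Node:
    """Phase 1: pick the child with the highest UCB1 score."""
    return max(node.children, key=lambda ch: ucb1(ch, node.visits, c))


def backpropagate(node: Optional[Node], reward: float) -> None:
    """Phase 4: push a rollout reward from a leaf back up to the root."""
    while node is not None:
        node.visits += 1
        node.value_sum += reward
        node = node.parent
```

With `c = 0` the rule reduces to greedy selection by mean value; larger `c` favors less-visited branches, which is the exploration-exploitation balance the abstract refers to.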
pdf
bib
abs
HTML: Hierarchical Topology Multi-task Learning for Semantic Parsing in Knowledge Base Question Answering
Aziguli Wulamu
|
Lyu Zhengyu
|
Kaiyuan Gong
|
Yu Han
|
Zewen Wang
|
Zhihong Zhu
|
Bowen Xing
Knowledge base question answering (KBQA) aims to answer natural language questions by reasoning over structured knowledge bases. Existing approaches often struggle with the complexity of mapping questions to precise logical forms, particularly when dealing with diverse entities and relations. In this paper, we propose Hierarchical Topology Multi-task Learning (HTML), a novel framework that leverages a hierarchical multi-task learning paradigm to enhance the performance of logical form generation. Our framework consists of a main task: generating logical forms from questions, and three auxiliary tasks: entity prediction from the input question, relation prediction for the given entities, and logical form generation based on the given entities and relations. Through joint instruction-tuning, HTML allows mutual guidance and knowledge transfer among the hierarchical tasks, capturing the subtle dependencies between entities, relations, and logical forms. Extensive experiments on public benchmarks show that HTML markedly outperforms both supervised fine-tuning methods and training-free ones based on powerful large language models (e.g., GPT-4), demonstrating its superiority in question understanding and structural knowledge reasoning.
pdf
bib
abs
StructFlowBench: A Structured Flow Benchmark for Multi-turn Instruction Following
Jinnan Li
|
Jinzhe Li
|
Yue Wang
|
Yi Chang
|
Yuan Wu
Multi-turn instruction following capability constitutes a core competency of large language models (LLMs) in real-world applications. Existing evaluation benchmarks predominantly focus on fine-grained constraint satisfaction and domain-specific capability assessment, yet overlook the crucial structural dependencies between dialogue turns that distinguish multi-turn from single-turn interactions. These structural dependencies not only reflect user intent but also establish an essential second dimension for the instruction following evaluation beyond constraint satisfaction. To address this gap, we propose StructFlowBench, a multi-turn instruction following benchmark with structural flow modeling. The benchmark defines an innovative structural flow framework with six fundamental inter-turn relationships. These relationships introduce novel structural constraints for model evaluation and also serve as generation parameters for creating customized dialogue flows tailored to specific scenarios. Adopting established LLM-based automatic evaluation methodologies, we conduct systematic evaluations of 13 leading open-source and closed-source LLMs. Experimental results reveal significant deficiencies in current models’ comprehension of multi-turn dialogue structures. The code is available at https://github.com/MLGroupJLU/StructFlowBench.
pdf
bib
abs
CMIE: Combining MLLM Insights with External Evidence for Explainable Out-of-Context Misinformation Detection
Fanxiao Li
|
Jiaying Wu
|
Canyuan He
|
Wei Zhou
Multimodal large language models (MLLMs) have demonstrated impressive capabilities in visual reasoning and text generation. While previous studies have explored the application of MLLMs for detecting out-of-context (OOC) misinformation, our empirical analysis reveals two persisting challenges of this paradigm. When evaluating the representative GPT-4o model on direct reasoning and evidence-augmented reasoning, results indicate that MLLMs struggle to capture deeper relationships, specifically cases in which the image and text are not directly connected but are associated through underlying semantic links. Moreover, noise in the evidence further impairs detection accuracy. To address these challenges, we propose CMIE, a novel OOC misinformation detection framework that incorporates a Coexistence Relationship Generation (CRG) strategy and an Association Scoring (AS) mechanism. CMIE identifies the underlying coexistence relationships between images and text, and selectively utilizes relevant evidence to enhance misinformation detection. Experimental results demonstrate that our approach outperforms existing methods.
pdf
bib
abs
EtiCor++: Towards Understanding Etiquettical Bias in LLMs
Ashutosh Dwivedi
|
Siddhant Shivdutt Singh
|
Ashutosh Modi
In recent years, researchers have started analyzing the cultural sensitivity of LLMs. In this respect, etiquettes have been an active area of research. Etiquettes are region-specific and are an essential part of the culture of a region; hence, it is imperative to make LLMs sensitive to etiquettes. However, more resources are needed for evaluating LLMs on their understanding of, and bias with regard to, etiquettes. In this resource paper, we introduce EtiCor++, a corpus of etiquettes worldwide. We introduce different tasks for evaluating LLMs for knowledge about etiquettes across various regions. Further, we introduce various metrics for measuring bias in LLMs. Extensive experimentation with LLMs shows inherent bias towards certain regions.
pdf
bib
abs
FinRipple: Aligning Large Language Models with Financial Market for Event Ripple Effect Awareness
Yuanjian Xu
|
Jianing Hao
|
Kunsheng Tang
|
Jingnan Chen
|
Anxian Liu
|
Peng Liu
|
Guang Zhang
Financial markets exhibit complex dynamics where localized events trigger ripple effects across entities. Previous event studies, constrained by static single-company analyses and simplistic assumptions, fail to capture these ripple effects. While large language models (LLMs) offer emergent reasoning capabilities, their direct application falters due to structural market unawareness and limited capacity to analyze ripple effects. We propose FinRipple, an elegant framework that empowers LLMs with the ability to analyze ripple effects through financial theory-guided large-scale reinforcement learning. We begin by relaxing the assumptions of previous methods, incorporating a time-varying knowledge graph to accurately represent market structure. By seamlessly integrating classical asset pricing theory, we align the LLM with the market, enabling it to predict ripple effects. To the best of our knowledge, we are the first to provide a standardized definition of ripple effect prediction, a task that is extremely important yet unexplored in the financial domain. Extensive experiments demonstrate that FinRipple provides a promising solution to this task.
pdf
bib
abs
Beyond Decoder-only: Large Language Models Can be Good Encoders for Machine Translation
Yingfeng Luo
|
Tong Zheng
|
Yongyu Mu
|
Bei Li
|
Qinghong Zhang
|
Yongqi Gao
|
Ziqiang Xu
|
Peinan Feng
|
Xiaoqian Liu
|
Tong Xiao
|
JingBo Zhu
The field of neural machine translation (NMT) has changed with the advent of large language models (LLMs). Much of the recent emphasis in natural language processing (NLP) has been on modeling machine translation and many other problems using a single pre-trained Transformer decoder, while encoder-decoder architectures, which were the standard in earlier NMT models, have received relatively less attention. In this paper, we explore translation models that are universal, efficient, and easy to optimize, by marrying the world of LLMs with the world of NMT. We apply LLMs to NMT encoding and leave the NMT decoder unchanged. We also develop methods for adapting LLMs to work better with the NMT decoder. Furthermore, we construct a new dataset involving multiple tasks to assess how well the machine translation system generalizes across various tasks. Evaluations on the WMT and our datasets show that results using our method match or surpass a range of baselines in terms of translation quality, but achieve 2.4× to 6.5× inference speedups and a 75% reduction in the memory footprint of the KV cache. It also demonstrates strong generalization across a variety of translation-related tasks.
pdf
bib
abs
EC-RAFT: Automated Generation of Clinical Trial Eligibility Criteria through Retrieval-Augmented Fine-Tuning
Nopporn Lekuthai
|
Nattawit Pewngam
|
Supitcha Sokrai
|
Titipat Achakulvisut
Eligibility criteria (EC) are critical components of clinical trial design, defining the parameters for participant inclusion and exclusion. However, designing EC remains a complex, expertise-intensive process. Traditional approaches to EC generation may fail to produce comprehensive, contextually appropriate criteria. To address these challenges, we introduce EC-RAFT, a method that utilizes Retrieval-Augmented Fine-Tuning (RAFT) to generate structured and cohesive EC directly from clinical trial titles and descriptions. EC-RAFT integrates contextual retrieval, synthesized intermediate reasoning, and fine-tuned language models to produce comprehensive EC sets. To enhance clinical alignment evaluation with referenced criteria, we also propose an LLM-guided evaluation pipeline. Our results demonstrate that our solution, which uses Llama-3.1-8B-Instruct as a base model, achieves a BERTScore of 86.23 and an EC-matched LLM-as-a-Judge score of 1.66 out of 3, outperforming zero-shot Llama-3.1 and Gemini-1.5 by 0.41 and 0.11 points, respectively. On top of that, EC-RAFT also outperforms other fine-tuned versions of Llama-3.1. EC-RAFT was trained in a low-cost setup and, therefore, can be used as a practical solution for EC generation while ensuring quality and relevance in clinical trial design. We release our code on GitHub at https://github.com/biodatlab/ec-raft/
pdf
bib
abs
Pitfalls of Scale: Investigating the Inverse Task of Redefinition in Large Language Models
Elena Stringli
|
Maria Lymperaiou
|
Giorgos Filandrianos
|
Athanasios Voulodimos
|
Giorgos Stamou
Inverse tasks can uncover potential reasoning gaps as Large Language Models (LLMs) scale up. In this work, we explore the redefinition task, in which we assign alternative values to well-known physical constants and units of measure, prompting LLMs to respond accordingly. Our findings show that not only does model performance degrade with scale, but its false confidence also rises. Moreover, while factors such as prompting strategies or response formatting are influential, they do not preclude LLMs from anchoring to memorized values.
pdf
bib
abs
Implicit Reasoning in Transformers is Reasoning through Shortcuts
Tianhe Lin
|
Jian Xie
|
Siyu Yuan
|
Deqing Yang
Test-time compute is emerging as a new paradigm for enhancing language models’ complex multi-step reasoning capabilities, as demonstrated by the success of OpenAI’s o1 and o3, as well as DeepSeek’s R1. Compared to explicit reasoning in test-time compute, implicit reasoning is more inference-efficient, requiring fewer generated tokens. However, why does the advanced reasoning capability fail to emerge in the implicit reasoning style? In this work, we train GPT-2 from scratch on a curated multi-step mathematical reasoning dataset and conduct analytical experiments to investigate how language models perform implicit reasoning in multi-step tasks. Our findings reveal: 1) Language models can perform step-by-step reasoning and achieve high accuracy in both in-domain and out-of-domain tests via implicit reasoning. However, this capability only emerges when trained on fixed-pattern data. 2) Conversely, implicit reasoning abilities emerging from training on unfixed-pattern data tend to overfit a specific pattern and fail to generalize further. Notably, this limitation is also observed in state-of-the-art large language models. These findings suggest that language models acquire implicit reasoning through shortcut learning, enabling strong performance on tasks with similar patterns while lacking generalization. Resources are available at https://github.com/TianheL/LM-Implicit-Reasoning.
pdf
bib
abs
Learning to Align Multi-Faceted Evaluation: A Unified and Robust Framework
Kaishuai Xu
|
Tiezheng Yu
|
Yi Cheng
|
Wenjun Hou
|
Liangyou Li
|
Xin Jiang
|
Lifeng Shang
|
Qun Liu
|
Wenjie Li
Large Language Models (LLMs) are being used more and more extensively for automated evaluation in various scenarios. Previous studies have attempted to fine-tune open-source LLMs to replicate the evaluation explanations and judgments of powerful proprietary models, such as GPT-4. However, these methods are largely limited to text-based analyses under predefined general criteria, resulting in reduced adaptability for unseen instructions and demonstrating instability in evaluating adherence to quantitative and structural constraints. To address these limitations, we propose a novel evaluation framework, ARJudge, that adaptively formulates evaluation criteria and synthesizes both text-based and code-driven analyses to evaluate LLM responses. ARJudge consists of two components: a fine-tuned Analyzer that generates multi-faceted evaluation analyses and a tuning-free Refiner that combines and refines all analyses to make the final judgment. We construct a Composite Analysis Corpus that integrates tasks for evaluation criteria generation alongside text-based and code-driven analysis generation to train the Analyzer. Our results demonstrate that ARJudge outperforms existing fine-tuned evaluators in effectiveness and robustness. Furthermore, it demonstrates the importance of multi-faceted evaluation and code-driven analyses in enhancing evaluation capabilities.
pdf
bib
abs
CortexDebate: Debating Sparsely and Equally for Multi-Agent Debate
Yiliu Sun
|
Zicheng Zhao
|
Sheng Wan
|
Chen Gong
A single Large Language Model (LLM) still struggles with critical issues such as hallucination and inadequate reasoning abilities. To mitigate these issues, Multi-Agent Debate (MAD) has emerged as an effective strategy, where LLM agents engage in in-depth debates with others on tasks. However, existing MAD methods face two major issues: (a) overly long input contexts, which cause LLM agents to get lost in excessive input information and suffer performance drops; and (b) the overconfidence dilemma, where self-assured LLM agents dominate the debate, leading to low debating effectiveness. To address these limitations, we propose a novel MAD method called “CortexDebate”. Inspired by the human brain’s tendency to establish a sparse and dynamically optimized network among cortical areas governed by white matter, CortexDebate constructs a sparse debating graph among LLM agents, where each LLM agent only debates with the ones that are helpful to it. To optimize the graph, we propose a module named McKinsey-based Debate Matter (MDM), which acts as an artificial analog to white matter. By integrating the McKinsey Trust Formula, a well-established measure of trustworthiness from sociology, MDM enables credible evaluations that guide graph optimization. The effectiveness of CortexDebate is demonstrated by extensive experimental results across eight datasets from four task types.
pdf
bib
abs
PAP2PAT: Benchmarking Outline-Guided Long-Text Patent Generation with Patent-Paper Pairs
Valentin Knappich
|
Anna Hätty
|
Simon Razniewski
|
Annemarie Friedrich
Dealing with long and highly complex technical text is a challenge for Large Language Models (LLMs), which have yet to unfold their potential in supporting expensive and time-intensive processes like patent drafting. Within patents, the description constitutes more than 90% of the document on average. Yet, its automatic generation remains understudied. When drafting patent applications, patent attorneys typically receive invention reports (IRs), which are usually confidential, hindering research on LLM-supported patent drafting. Often, pre-publication research papers serve as IRs. We leverage this duality to build PAP2PAT, an open and realistic benchmark for patent drafting consisting of 1.8k patent-paper pairs describing the same inventions. To address the complex long-document patent generation task, we propose chunk-based outline-guided generation using the research paper as invention specification. Our extensive evaluation using PAP2PAT and a human case study show that LLMs can effectively leverage information from the paper, but still struggle to provide the necessary level of detail. Fine-tuning leads to more patent-style language, but also to more hallucination. We release our data and code.
pdf
bib
abs
Debt Collection Negotiations with Large Language Models: An Evaluation System and Optimizing Decision Making with Multi-Agent
Xiaofeng Wang
|
Zhixin Zhang
|
Jin Guang Zheng
|
Yiming Ai
|
Rui Wang
Debt collection negotiations (DCN) are vital for managing non-performing loans (NPLs) and reducing creditor losses. Traditional methods are labor-intensive, while large language models (LLMs) offer promising automation potential. However, prior systems lacked dynamic negotiation and real-time decision-making capabilities. This paper explores LLMs in automating DCN and proposes a novel evaluation framework with 13 metrics across 4 aspects. Our experiments reveal that LLMs tend to over-concede compared to human negotiators. To address this, we propose the Multi-Agent Debt Negotiation (MADeN) framework, incorporating planning and judging modules to improve decision rationality. We also apply post-training techniques, including DPO with rejection sampling, to optimize performance. Our studies provide valuable insights for practitioners and researchers seeking to enhance efficiency and outcomes in this domain.
pdf
bib
abs
Focused-DPO: Enhancing Code Generation Through Focused Preference Optimization on Error-Prone Points
Kechi Zhang
|
Ge Li
|
Jia Li
|
Yihong Dong
|
Jia Li
|
Zhi Jin
Code generation models have shown significant potential for automating programming tasks. However, the challenge of generating accurate and reliable code persists due to the highly complex and long-reasoning nature of the task. Even state-of-the-art models often fail in code generation due to small errors, which can drastically affect the overall functionality of code. Our study identifies that current models tend to produce errors concentrated at specific error-prone points, which significantly impacts the accuracy of the generated code. To address this issue, we introduce Focused-DPO, a framework that enhances code generation by directing preference optimization towards these critical error-prone areas. This approach builds on Direct Preference Optimization, emphasizing accuracy in parts prone to errors. Additionally, we develop a method called Error-Point Identification, which constructs a dataset that targets these problematic points without requiring costly human annotations. Our experiments on benchmarks such as HumanEval(+), MBPP(+), and LiveCodeBench demonstrate that Focused-DPO significantly improves the precision and reliability of code generation, reducing common errors and enhancing overall code quality. By focusing on error-prone points, Focused-DPO advances the accuracy and functionality of model-generated code.
pdf
bib
abs
Supervised and Unsupervised Probing of Shortcut Learning: Case Study on the Emergence and Evolution of Syntactic Heuristics in BERT
Elke Vandermeerschen
|
Miryam De Lhoneux
Contemporary language models (LMs) such as BERT (Devlin et al., 2019), T5 (Raffel et al., 2023), and GPT-4 (OpenAI, 2023) have exhibited remarkable capabilities, effectively addressing long-standing challenges in the field. However, these models rely on shortcut learning, using a decision rule that relies on superficial cues that are spuriously correlated with the labels (Geirhos et al., 2020). In this research, we focus on the reliance on a specific type of shortcut, namely syntactic heuristics, in BERT when performing Natural Language Inference (NLI), a representative task in Natural Language Understanding (Jeretic et al., 2020). By making use of two probing methods, one supervised and one unsupervised, we investigate where these shortcuts emerge, how they evolve, and how they impact the latent knowledge of the LM. Our findings reveal that syntactic heuristics are absent in pretrained models but emerge and evolve as the model is finetuned with datasets of increasing size. The adoption of these shortcuts varies across different hidden layers, with specific layers closer to the output contributing more to this phenomenon. Despite the model’s reliance on shortcuts during inference, it retains information relevant to the task, and our supervised and unsupervised probes process this information differently.
pdf
bib
abs
GIMMICK: Globally Inclusive Multimodal Multitask Cultural Knowledge Benchmarking
Florian Schneider
|
Carolin Holtermann
|
Chris Biemann
|
Anne Lauscher
Large Vision-Language Models (LVLMs) have recently gained attention due to their distinctive performance and broad applicability. While it has been previously shown that their efficacy in usage scenarios involving non-Western contexts falls short, existing studies are limited in scope, covering just a narrow range of cultures, focusing exclusively on a small number of cultural aspects, or evaluating a limited selection of models on a single task only. Towards globally inclusive LVLM research, we introduce GIMMICK, an extensive multimodal benchmark designed to assess a broad spectrum of cultural knowledge across 144 countries representing six global macro-regions. GIMMICK comprises six tasks built upon three new datasets that span 728 unique cultural events or facets on which we evaluated 20 LVLMs and 11 LLMs, including five proprietary and 26 open-weight models of all sizes. We systematically examine (1) regional cultural biases, (2) the influence of model size, (3) input modalities, and (4) external cues. Our analyses reveal strong biases toward Western cultures across models and tasks and highlight strong correlations between model size and performance, as well as the effectiveness of multimodal input and external geographic cues. We further find that models have more knowledge of tangible than intangible aspects (e.g., food vs. rituals) and that they excel in recognizing broad cultural origins but struggle with a more nuanced understanding.
pdf
bib
abs
R-VLM: Region-Aware Vision Language Model for Precise GUI Grounding
Joonhyung Park
|
Peng Tang
|
Sagnik Das
|
Srikar Appalaraju
|
Kunwar Yashraj Singh
|
R. Manmatha
|
Shabnam Ghadar
Visual agent models for automating human activities on Graphical User Interfaces (GUIs) have emerged as a promising research direction, driven by advances in large Vision Language Models (VLMs). A critical challenge in GUI automation is the precise grounding of interface elements across diverse platforms. Existing vision-only GUI agents directly ground elements from large and cluttered screenshots, requiring them to process substantial irrelevant information that compromises their accuracy. In addition, these approaches typically employ basic cross-entropy loss for learning grounding objectives, which fails to effectively capture grounding quality compared to established object detection metrics like Intersection-over-Union (IoU). To address these issues, we introduce R-VLM, a novel GUI grounding approach that leverages zoomed-in region proposals for precise element localization. We also propose an IoU-aware objective function that facilitates model convergence toward high IoU predictions. Our approach bridges the gap between VLMs and conventional object detection techniques, improving the state-of-the-art grounding accuracy by 13% across diverse GUI platforms on the GUI grounding benchmarks ScreenSpot and AgentStudio. In addition, our R-VLM approach shows 3.2-9.7% absolute accuracy improvements in GUI navigation tasks on the AITW and Mind2Web benchmarks.
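For readers unfamiliar with the Intersection-over-Union (IoU) metric mentioned in the abstract above, the following is a minimal sketch of the standard IoU computation for two axis-aligned boxes. It is not the R-VLM objective itself, which the abstract does not specify; the `Box` layout `(x_min, y_min, x_max, y_max)` and the function name `iou` are assumptions made for illustration.

```python
from typing import Tuple

# Assumed box layout (illustrative): (x_min, y_min, x_max, y_max)
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Standard Intersection-over-Union of two axis-aligned boxes."""
    # Corners of the overlapping rectangle, if any.
    ix_min = max(a[0], b[0])
    iy_min = max(a[1], b[1])
    ix_max = min(a[2], b[2])
    iy_max = min(a[3], b[3])

    # Clamp to zero so disjoint boxes yield an empty intersection.
    inter_w = max(0.0, ix_max - ix_min)
    inter_h = max(0.0, iy_max - iy_min)
    inter = inter_w * inter_h

    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter

    # Guard against degenerate (zero-area) inputs.
    return inter / union if union > 0 else 0.0
```

A predicted element box that overlaps the ground-truth box only partially receives a score strictly between 0 and 1, which is the graded signal that a plain cross-entropy loss over coordinate tokens does not provide.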
pdf
bib
abs
Perspective Transition of Large Language Models for Solving Subjective Tasks
Xiaolong Wang
|
Yuanchi Zhang
|
Ziyue Wang
|
Yuzhuang Xu
|
Fuwen Luo
|
Yile Wang
|
Peng Li
|
Yang Liu
Large language models (LLMs) have revolutionized the field of natural language processing, enabling remarkable progress in various tasks. Unlike objective tasks such as commonsense reasoning and arithmetic question-answering, the performance of LLMs on subjective tasks is still limited, since the perspective taken on the specific problem plays a crucial role in interpreting the context and giving a proper response. For example, in certain scenarios, LLMs may perform better when answering from an expert role perspective, potentially eliciting their relevant domain knowledge. In contrast, in some scenarios, LLMs may provide more accurate responses when answering from a third-person standpoint, enabling a more comprehensive understanding of the problem and potentially mitigating inherent biases. In this paper, we propose Reasoning through Perspective Transition (RPT), a method based on in-context learning that enables LLMs to dynamically select among direct, role, and third-person perspectives for the best way to solve the corresponding subjective problem. In extensive experiments on a total of 12 subjective tasks using both closed-source and open-source LLMs, including GPT-4, GPT-3.5, Llama-3, and Qwen-2, our method outperforms widely used single fixed-perspective methods such as chain-of-thought prompting and expert prompting, highlighting the intricate ways that LLMs can adapt their perspectives to provide nuanced and contextually appropriate responses for different problems.
pdf
bib
abs
TripTailor: A Real-World Benchmark for Personalized Travel Planning
Kaimin Wang
|
Yuanzhe Shen
|
Changze Lv
|
Xiaoqing Zheng
|
Xuanjing Huang
The continuous evolution and enhanced reasoning capabilities of large language models (LLMs) have elevated their role in complex tasks, notably in travel planning, where demand for personalized, high-quality itineraries is rising. However, current benchmarks often rely on unrealistic simulated data, failing to reflect the differences between LLM-generated and real-world itineraries. Existing evaluation metrics, which primarily emphasize constraints, fall short of providing a comprehensive assessment of the overall quality of travel plans. To address these limitations, we introduce TripTailor, a benchmark designed specifically for personalized travel planning in real-world scenarios. This dataset features an extensive collection of over 500,000 real-world points of interest (POIs) and nearly 4,000 diverse travel itineraries, complete with detailed information, providing a more authentic evaluation framework. Experiments show that fewer than 10% of the itineraries generated by the latest state-of-the-art LLMs achieve human-level performance. Moreover, we identify several critical challenges in travel planning, including the feasibility, rationality, and personalized customization of the proposed solutions. We hope that TripTailor will drive the development of travel planning agents capable of understanding and meeting user needs while generating practical itineraries.
pdf
bib
abs
Random Splitting Negatively Impacts NER Evaluation: Quantifying and Eliminating the Overestimation of NER Performance
Florian Babl
|
Moritz Hennen
|
Jakob Murauer
|
Michaela Geierhos
In named entity recognition (NER), models are evaluated on their ability to identify entity mentions in text. However, standard evaluation methods often rely on test sets that contain named entities already present in the training data, raising concerns about overestimation of model performance. This work investigates the impact of varying degrees of entity contamination on a dataset level on the generalization ability and reported F1 scores of three state-of-the-art NER models. Experiments on five standard benchmarks show that F1 scores for contaminated entities statistically significantly inflate reported F1 scores as contamination rates increase, with F1 performance gaps ranging from 2-10% compared to entities not seen during training. To address these inflated F1 scores, we additionally propose a novel NER dataset splitting method using a minimum cut algorithm to minimize train-test entity leakage. While our splitting method ensures near-zero entity contamination, we also compare new and existing dataset splits on named entity sample counts.
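To make the notion of entity contamination in the abstract above concrete, here is a minimal sketch that measures the share of test mentions whose surface form and type already occur in the training data. The abstract does not give the paper's exact definition, so the case-insensitive matching rule, the `Mention` layout, and the function name `contamination_rate` are all assumptions for illustration.

```python
from typing import Iterable, Tuple

# Assumed mention layout (illustrative): (surface text, entity type)
Mention = Tuple[str, str]


def contamination_rate(train: Iterable[Mention], test: Iterable[Mention]) -> float:
    """Fraction of test mentions that also appear in the training mentions.

    Matching is case-insensitive on the surface text and exact on the type;
    this rule is an assumption, not the definition used in the paper.
    """
    seen = {(text.lower(), etype) for text, etype in train}
    test_list = list(test)
    if not test_list:
        return 0.0
    hits = sum(1 for text, etype in test_list if (text.lower(), etype) in seen)
    return hits / len(test_list)


# Hypothetical toy data, not taken from any benchmark.
train_mentions = [("Berlin", "LOC"), ("Angela Merkel", "PER")]
test_mentions = [
    ("berlin", "LOC"),         # seen in training (case-insensitive match)
    ("Paris", "LOC"),          # unseen
    ("Angela Merkel", "PER"),  # seen in training
    ("Berlin", "ORG"),         # same text, different type: counted as unseen
]
rate = contamination_rate(train_mentions, test_mentions)
```

Under this rule, a random split of a corpus with many recurring names tends to yield a high rate, which is the leakage a contamination-aware split aims to reduce.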
pdf
bib
abs
Structure-adaptive Adversarial Contrastive Learning for Multi-Domain Fake News Detection
Lingwei Wei
|
Dou Hu
|
Wei Zhou
|
Philip S. Yu
|
Songlin Hu
The rapid proliferation of fake news across multiple domains poses significant threats to society. Existing multi-domain detection models typically capture domain-shared semantic features to achieve generalized detection. However, they often fail to generalize well due to poor adaptability, which limits their ability to provide complementary features for detection, especially in data-constrained conditions. To address these challenges, we investigate the propagation-adaptive multi-domain fake news detection paradigm. We propose a novel framework, Structure-adaptive Adversarial Contrastive Learning (StruACL), to adaptively enable structure knowledge transfer between multiple domains. Specifically, we first contrast representations between content-only and propagation-rich data to preserve structural patterns in the shared representation space. Additionally, we design a propagation-guided adversarial training strategy to enhance the diversity of representations. Under the StruACL objective, we leverage a unified Transformer-based and graph-based model to jointly learn transferable semantic and structural features for detection across multiple domains. Experiments on seven fake news datasets demonstrate that StruACL-TGN achieves better multi-domain detection performance on general and data-constrained scenarios, showing the effectiveness and better generalization of StruACL.
pdf
bib
abs
BiasGuard: A Reasoning-Enhanced Bias Detection Tool for Large Language Models
Zhiting Fan
|
Ruizhe Chen
|
Zuozhu Liu
Identifying bias in LLM-generated content is a crucial prerequisite for ensuring fairness in LLMs. Existing methods, such as fairness classifiers and LLM-based judges, face limitations related to difficulties in understanding underlying intentions and the lack of criteria for fairness judgment. In this paper, we introduce BiasGuard, a novel bias detection tool that explicitly analyzes inputs and reasons through fairness specifications to provide accurate judgments. BiasGuard is implemented through a two-stage approach: the first stage initializes the model to explicitly reason based on fairness specifications, while the second stage leverages reinforcement learning to enhance its reasoning and judgment capabilities. Our experiments, conducted across five datasets, demonstrate that BiasGuard outperforms existing tools, improving accuracy and reducing over-fairness misjudgments. We also highlight the importance of reasoning-enhanced decision-making and provide evidence for the effectiveness of our two-stage optimization pipeline.
pdf
bib
abs
Qorǵau: Evaluating Safety in Kazakh-Russian Bilingual Contexts
Maiya Goloburda
|
Nurkhan Laiyk
|
Diana Turmakhan
|
Yuxia Wang
|
Mukhammed Togmanov
|
Jonibek Mansurov
|
Askhat Sametov
|
Nurdaulet Mukhituly
|
Minghan Wang
|
Daniil Orel
|
Zain Muhammad Mujahid
|
Fajri Koto
|
Timothy Baldwin
|
Preslav Nakov
Large language models (LLMs) are known to have the potential to generate harmful content, posing risks to users. While significant progress has been made in developing taxonomies for LLM risks and safety evaluation prompts, most studies have focused on monolingual contexts, primarily in English. However, language- and region-specific risks in bilingual contexts are often overlooked, and core findings can diverge from those in monolingual settings. In this paper, we introduce Qorǵau, a novel dataset specifically designed for safety evaluation in Kazakh and Russian, reflecting the unique bilingual context in Kazakhstan, where both Kazakh (a low-resource language) and Russian (a high-resource language) are spoken. Experiments with both multilingual and language-specific LLMs reveal notable differences in safety performance, emphasizing the need for tailored, region-specific datasets to ensure the responsible and safe deployment of LLMs in countries like Kazakhstan. Warning: this paper contains example data that may be offensive, harmful, or biased.
pdf
bib
abs
MMXU: A Multi-Modal and Multi-X-ray Understanding Dataset for Disease Progression
Linjie Mu
|
Zhongzhen Huang
|
Shengqian Qin
|
Yakun Zhu
|
Shaoting Zhang
|
Xiaofan Zhang
Large vision-language models (LVLMs) have shown great promise in medical applications, particularly in visual question answering (MedVQA) and diagnosis from medical images. However, existing datasets and models often fail to consider critical aspects of medical diagnostics, such as the integration of historical records and the analysis of disease progression over time. In this paper, we introduce MMXU (Multimodal and MultiX-ray Understanding), a novel dataset for MedVQA that focuses on identifying changes in specific regions between two patient visits. Unlike previous datasets that primarily address single-image questions, MMXU enables multi-image questions, incorporating both current and historical patient data. We demonstrate the limitations of current LVLMs in identifying disease progression on MMXU-test, even those that perform well on traditional benchmarks. To address this, we propose a MedRecord-Augmented Generation (MAG) approach, incorporating both global and regional historical records. Our experiments show that integrating historical records significantly enhances diagnostic accuracy by at least 20%, bridging the gap between current LVLMs and human expert performance. Additionally, we fine-tune models with MAG on MMXU-dev, which demonstrates notable improvements. We hope this work will help advance the use of LVLMs in medical diagnostics by emphasizing the importance of historical context in interpreting medical images. Our dataset is released on GitHub.
pdf
bib
abs
Tree-of-Code: A Self-Growing Tree Framework for End-to-End Code Generation and Execution in Complex Tasks
Ziyi Ni
|
Yifan Li
|
Ning Yang
|
Dou Shen
|
Pin Lyu
|
Daxiang Dong
Solving complex reasoning tasks is a key real-world application of agents. Thanks to the pretraining of Large Language Models (LLMs) on code data, recent approaches like CodeAct successfully use code as LLM agents’ action, achieving good results. However, CodeAct greedily generates the next action’s code block by relying on fragmented thoughts, resulting in inconsistency and accumulative hallucination. Moreover, CodeAct lacks action-related ground-truth (GT), making its supervision signals and termination conditions questionable in multi-turn interactions. To address these issues, we propose Tree-of-Code (ToC), a self-growing framework that generates nodes through self-supervision, incorporating prompt and model exploration in a GT-free setting. Each node employs CodeProgram, an end-to-end code generation paradigm that aligns executable code logic with global reasoning. This approach uses task-level execution success as both node validity and stop-growing flags, bypassing process supervision to enable online applications. Experiments on two datasets with ten popular zero-shot LLMs show that ToC boosts accuracy by nearly 20% over CodeAct with fewer than 1/4 turns. To further investigate the trade-off between efficacy and efficiency, ablation studies on different ToC tree sizes and exploration mechanisms validate ToC’s superiority.
pdf
bib
abs
Akan Cinematic Emotions (ACE): A Multimodal Multi-party Dataset for Emotion Recognition in Movie Dialogues
David Sasu
|
Zehui Wu
|
Ziwei Gong
|
Run Chen
|
Pengyuan Shi
|
Lin Ai
|
Julia Hirschberg
|
Natalie Schluter
In this paper, we introduce the Akan Cinematic Emotions (AkaCE) dataset, the first multimodal emotion dialogue dataset for an African language, addressing the significant lack of resources for low-resource languages in emotion recognition research. AkaCE, developed for the Akan language, contains 385 emotion-labeled dialogues and 6162 utterances across audio, visual, and textual modalities, along with word-level prosodic prominence annotations. The presence of prosodic labels in this dataset also makes it the first prosodically annotated African language dataset. We demonstrate the quality and utility of AkaCE through experiments using state-of-the-art emotion recognition methods, establishing solid baselines for future research. We hope AkaCE inspires further work on inclusive, linguistically and culturally diverse NLP resources.
pdf
bib
abs
A Cognitive Writing Perspective for Constrained Long-Form Text Generation
Kaiyang Wan
|
Honglin Mu
|
Rui Hao
|
Haoran Luo
|
Tianle Gu
|
Xiuying Chen
Like humans, Large Language Models (LLMs) struggle to generate high-quality long-form text that adheres to strict requirements in a single pass. This challenge is unsurprising, as successful human writing, according to the Cognitive Writing Theory, is a complex cognitive process involving iterative planning, translating, reviewing, and monitoring. Motivated by these cognitive principles, we aim to equip LLMs with human-like cognitive writing capabilities through CogWriter, a novel training-free framework that transforms LLM constrained long-form text generation into a systematic cognitive writing paradigm. Our framework consists of two key modules: (1) a Planning Agent that performs hierarchical planning to decompose the task, and (2) multiple Generation Agents that execute these plans in parallel. The system maintains quality via continuous monitoring and reviewing mechanisms, which evaluate outputs against specified requirements and trigger necessary revisions. CogWriter demonstrates exceptional performance on LongGenBench, a benchmark for complex constrained long-form text generation. Even when using Qwen-2.5-14B as its backbone, CogWriter surpasses GPT-4o by 22% in complex instruction completion accuracy while reliably generating texts exceeding 10,000 words. We hope this cognitive science-inspired approach provides a paradigm for LLM writing advancements: https://anonymous.4open.science/r/CogWriter-8DFE.
pdf
bib
abs
Migician: Revealing the Magic of Free-Form Multi-Image Grounding in Multimodal Large Language Models
You Li
|
Heyu Huang
|
Chi Chen
|
Kaiyu Huang
|
Chao Huang
|
Zonghao Guo
|
Zhiyuan Liu
|
Jinan Xu
|
Yuhua Li
|
Ruixuan Li
|
Maosong Sun
The recent advancement of Multimodal Large Language Models (MLLMs) has significantly improved their fine-grained perception of single images and general comprehension across multiple images. However, existing MLLMs still face challenges in achieving precise grounding in complex multi-image scenarios. To address this, we first explore a Chain-of-Thought (CoT) framework that integrates single-image grounding with multi-image comprehension. While partially effective, it remains unstable and struggles to capture abstract visual information due to its non-end-to-end nature. Therefore, we introduce Migician, the first multi-image grounding model capable of performing free-form and accurate grounding across multiple images. To support this, we present the MGrounding-630k dataset, which comprises data for several multi-image grounding tasks derived from existing datasets, along with newly generated free-form grounding instruction-following data. Furthermore, we propose MIG-Bench, a comprehensive benchmark specifically designed for evaluating multi-image grounding capabilities. Experimental results demonstrate that our model achieves significantly superior multi-image grounding capabilities, outperforming the best existing MLLMs by 24.94% and even surpassing much larger 70B models. Our code, model, dataset, and benchmark are fully open-sourced at https://migician-vg.github.io/.
pdf
bib
abs
SIKeD: Self-guided Iterative Knowledge Distillation for Mathematical Reasoning
Shivam Adarsh
|
Kumar Shridhar
|
Caglar Gulcehre
|
Nicholas Monath
|
Mrinmaya Sachan
Large Language Models (LLMs) can transfer their reasoning skills to smaller models by teaching them to generate the intermediate reasoning process required to solve multistep reasoning tasks. While LLMs can accurately solve reasoning tasks through a variety of strategies, even without fine-tuning, smaller models are not expressive enough to fit the LLM's distribution over all strategies when distilled and tend to prioritize one strategy over the others. This reliance on one strategy poses a challenge for smaller models when attempting to solve reasoning tasks that may be difficult with their preferred strategy. To address this, we propose a distillation method SIKeD: **S**elf-guided **I**terative **K**nowledg**e** **D**istillation, where the LLM teaches the smaller model to approach a task using different strategies and the smaller model uses its self-generated on-policy outputs to choose the most suitable strategy for the given task. The training continues in a self-guided iterative manner, where for each training iteration, a decision is made on how to combine the LLM data with the self-generated outputs. Unlike traditional distillation methods, SIKeD allows the smaller model to learn which strategy is suitable for a given task while continuously learning to solve a task using different strategies. Our experiments on various mathematical reasoning datasets show that SIKeD significantly outperforms traditional distillation techniques across smaller models of different sizes.
pdf
bib
abs
Chain of Attack: Hide Your Intention through Multi-Turn Interrogation
Xikang Yang
|
Biyu Zhou
|
Xuehai Tang
|
Jizhong Han
|
Songlin Hu
The latent knowledge of large language models (LLMs) contains harmful or unethical content, which introduces significant security risks upon their widespread deployment. Conducting jailbreak attacks on LLMs can proactively identify vulnerabilities to enhance their security measures. However, previous jailbreak attacks primarily focus on single-turn dialogue scenarios, leaving vulnerabilities in multi-turn dialogue contexts inadequately explored. This paper investigates the resilience of black-box LLMs in multi-turn jailbreak attack scenarios from a novel interrogation perspective. We propose an optimal interrogation principle to conceal the jailbreak intent and introduce a multi-turn attack chain generation strategy called CoA. By employing two effective interrogation strategies tailored for LLMs, coupled with an interrogation history record management mechanism, it significantly optimizes the attack process. Our approach enables the iterative generation of attack chains, offering a powerful tool for LLM red team testing. Experimental results demonstrate that LLMs exhibit insufficient resistance under multi-turn interrogation, and our method shows a clear advantage (ASR, 83% vs. 64%). This work offers new insights into improving the safety of LLMs.
pdf
bib
abs
MIG: Automatic Data Selection for Instruction Tuning by Maximizing Information Gain in Semantic Space
Yicheng Chen
|
Yining Li
|
Kai Hu
|
Ma Zerun
|
Haochen Ye
|
Kai Chen
Data quality and diversity are key to the construction of effective instruction-tuning datasets. With the increasing availability of open-source instruction-tuning datasets, it is advantageous to automatically select high-quality and diverse subsets from a vast amount of data. Existing methods typically prioritize instance quality and use heuristic rules to maintain diversity. However, this absence of a comprehensive view of the entire collection often leads to suboptimal results. Moreover, heuristic rules generally focus on distance or clustering within the embedding space, which fails to accurately capture the intent of complex instructions in the semantic space. To bridge this gap, we propose a unified method for quantifying the information content of datasets. This method models the semantic space by constructing a label graph and quantifies diversity based on the distribution of information within the graph. Based on such a measurement, we further introduce an efficient sampling method that selects data samples iteratively to Maximize the Information Gain (MIG) in semantic space. Experiments on various datasets and base models demonstrate that MIG consistently outperforms state-of-the-art methods. Notably, the model fine-tuned with 5% Tulu3 data sampled by MIG achieves comparable performance to the official SFT model trained on the full dataset, with improvements of +5.73% on AlpacaEval and +6.89% on Wildbench.
pdf
bib
abs
Enhancing Automatic Term Extraction with Large Language Models via Syntactic Retrieval
Yongchan Chun
|
Minhyuk Kim
|
Dongjun Kim
|
Chanjun Park
|
Heuiseok Lim
Automatic Term Extraction (ATE) identifies domain-specific expressions that are crucial for downstream tasks such as machine translation and information retrieval. Although large language models (LLMs) have significantly advanced various NLP tasks, their potential for ATE has scarcely been examined. We propose a retrieval-based prompting strategy that, in the few-shot setting, selects demonstrations according to syntactic rather than semantic similarity. This syntactic retrieval method is domain-agnostic and provides more reliable guidance for capturing term boundaries. We evaluate the approach in both in-domain and cross-domain settings, analyzing how lexical overlap between the query sentence and its retrieved examples affects performance. Experiments on three specialized ATE benchmarks show that syntactic retrieval improves F1-score. These findings highlight the importance of syntactic cues when adapting LLMs to terminology-extraction tasks.
pdf
bib
abs
Explainable Depression Detection in Clinical Interviews with Personalized Retrieval-Augmented Generation
Linhai Zhang
|
Ziyang Gao
|
Deyu Zhou
|
Yulan He
Depression is a widespread mental health disorder, and clinical interviews are the gold standard for assessment. However, their reliance on scarce professionals highlights the need for automated detection. Current systems mainly employ black-box neural networks, which lack the interpretability that is crucial in mental health contexts. Some attempts to improve interpretability use post-hoc LLM generation but suffer from hallucination. To address these limitations, we propose RED, a Retrieval-augmented generation framework for Explainable depression Detection. RED retrieves evidence from clinical interview transcripts, providing explanations for predictions. Traditional query-based retrieval systems use a one-size-fits-all approach, which may not be optimal for depression detection, as user backgrounds and situations vary. We introduce a personalized query generation module that combines standard queries with user-specific background inferred by LLMs, tailoring retrieval to individual contexts. Additionally, to enhance LLM performance in social intelligence, we augment LLMs by retrieving relevant knowledge from a social intelligence datastore using an event-centric retriever. Experimental results on the real-world benchmark demonstrate RED’s effectiveness compared to neural networks and LLM-based baselines.
pdf
bib
abs
EMPEC: A Comprehensive Benchmark for Evaluating Large Language Models Across Diverse Healthcare Professions
Zheheng Luo
|
Chenhan Yuan
|
Qianqian Xie
|
Sophia Ananiadou
Recent advancements in Large Language Models (LLMs) show their potential in accurately answering biomedical questions, yet current healthcare benchmarks primarily assess knowledge mastered by medical doctors, neglecting other essential professions. To address this gap, we introduce the Examinations for Medical PErsonnel in Chinese (EMPEC), a comprehensive healthcare knowledge benchmark featuring 157,803 exam questions across 124 subjects and 20 healthcare professions, including underrepresented roles like Optometrists and Audiologists. Each question is tagged for release time and source authenticity. We evaluated 17 LLMs, including proprietary and open-source models, finding that while models like GPT-4 achieved over 75% accuracy, they struggled with specialized fields and alternative medicine. Notably, we find that most medical-specific LLMs underperform their general-purpose counterparts in EMPEC, and incorporating EMPEC’s data in fine-tuning improves performance. In addition, we tested LLMs on questions released after the completion of their training to examine their ability on unseen queries. We also translated the test set into English and simplified Chinese and analysed the impact on different models. Our findings emphasize the need for broader benchmarks to assess LLM applicability in real-world healthcare, and we will provide the dataset and evaluation toolkit for future research.
pdf
bib
abs
Beyond Numeric Rewards: In-Context Dueling Bandits with LLM Agents
Fanzeng Xia
|
Hao Liu
|
Yisong Yue
|
Tongxin Li
In-Context Reinforcement Learning (ICRL) is a frontier paradigm to solve Reinforcement Learning (RL) problems in the foundation-model era. While ICRL capabilities have been demonstrated in transformers through task-specific training, the potential of large language models (LLMs) out of the box remains largely unexplored. This paper investigates whether LLMs can generalize cross-domain to perform ICRL on the Dueling Bandits (DB) problem, a stateless preference-based RL setting. We find that top-performing LLMs exhibit a notable zero-shot capacity for relative decision-making, which translates to low short-term weak regret across all DB environments by quickly including the best arm in duels. However, an optimality gap still exists between LLMs and classic DB algorithms in terms of strong regret. LLMs struggle to converge and consistently exploit even when explicitly prompted to do so, and they are sensitive to prompt variations. To bridge this gap, we propose an agentic-flow framework—LLM with Enhanced Algorithmic Dueling (LEAD)—which integrates off-the-shelf DB algorithm support with LLM agents through fine-grained adaptive interplay. We show that LEAD inherits theoretical guarantees from classic DB algorithms on both weak and strong regret. We validate its efficacy and robustness even with noisy and adversarial prompts. The design of such an agentic framework sheds light on how to enhance the trustworthiness of general-purpose LLMs generalized to in-context decision-making tasks.
pdf
bib
abs
“Well, Keep Thinking”: Enhancing LLM Reasoning with Adaptive Injection Decoding
Hyunbin Jin
|
Je Won Yeom
|
Seunghyun Bae
|
Taesup Kim
Large language models (LLMs) exhibit strong reasoning abilities, often attributed to few-shot or zero-shot Chain-of-Thought (CoT) prompting. While effective, these methods require labor-intensive prompt engineering, raising the question of whether reasoning can be induced without reliance on explicit prompts. In this work, we unlock the reasoning capabilities of LLMs without explicit prompting. Inspired by zero-shot CoT and CoT-decoding, we propose a novel decoding strategy that systematically nudges LLMs to continue reasoning, thereby preventing immature reasoning processes. Specifically, we monitor the model’s generation and inject a designated phrase whenever the model is likely to halt or drift away from a logical reasoning process. Our experimental evaluations on diverse reasoning benchmarks demonstrate that our proposed strategy substantially improves LLM reasoning capabilities, highlighting the potential of decoding-based interventions as an alternative to traditional prompting techniques.
pdf
bib
abs
SpeechT-RAG: Reliable Depression Detection in LLMs with Retrieval-Augmented Generation Using Speech Timing Information
Xiangyu Zhang
|
Hexin Liu
|
Qiquan Zhang
|
Beena Ahmed
|
Julien Epps
Large Language Models (LLMs) have been increasingly adopted for health-related tasks, yet their performance in depression detection remains limited when relying solely on text input. While Retrieval-Augmented Generation (RAG) typically enhances LLM capabilities, our experiments indicate that traditional text-based RAG systems struggle to significantly improve depression detection accuracy. This challenge stems partly from the rich depression-relevant information encoded in acoustic speech patterns, information that current text-only approaches fail to capture effectively. To address this limitation, we conduct a systematic analysis of temporal speech patterns, comparing healthy individuals with those experiencing depression. Based on our findings, we introduce Speech Timing-based Retrieval-Augmented Generation, SpeechT-RAG, a novel system that leverages speech timing features for both accurate depression detection and reliable confidence estimation. This integrated approach not only outperforms traditional text-based RAG systems in detection accuracy but also enhances uncertainty quantification through a confidence scoring mechanism that naturally extends from the same temporal features. Our unified framework achieves comparable results to fine-tuned LLMs without additional training while simultaneously addressing the fundamental requirements for both accuracy and trustworthiness in mental health assessment.
pdf
bib
abs
Fine-grained Knowledge Enhancement for Retrieval-Augmented Generation
Jingxuan Han
|
Zhendong Mao
|
Yi Liu
|
Yexuan Che
|
Zheren Fu
|
Quan Wang
Retrieval-augmented generation (RAG) effectively mitigates hallucinations in large language models (LLMs) by filling knowledge gaps with retrieved external information. Most existing studies primarily retrieve knowledge documents based on semantic similarity to assist in answering questions but ignore the fine-grained necessary information within documents. In this paper, we propose a novel fine-grained knowledge enhancement method (FKE) for RAG, where fine-grained knowledge primarily includes sentence-level information easily overlooked in the document-based retrieval process. Concretely, we create a disentangled Chain-of-Thought prompting procedure to retrieve fine-grained knowledge from the external knowledge corpus. Then we develop a decoding enhancement strategy to constrain the document-based decoding process using fine-grained knowledge, thereby facilitating more accurate generated answers. Given an existing RAG pipeline, our method could be applied in a plug-and-play manner to enhance its performance with no additional modules or training process. Extensive experiments verify the effectiveness and generality of our method.
pdf
bib
abs
Bayesian Optimization for Controlled Image Editing via LLMs
Chengkun Cai
|
Haoliang Liu
|
Xu Zhao
|
Zhongyu Jiang
|
Tianfang Zhang
|
Zongkai Wu
|
John Lee
|
Jenq-Neng Hwang
|
Lei Li
In the rapidly evolving field of image generation, achieving precise control over generated content and maintaining semantic consistency remain significant limitations, particularly concerning grounding techniques and the necessity for model fine-tuning. To address these challenges, we propose BayesGenie, an off-the-shelf approach that integrates Large Language Models (LLMs) with Bayesian Optimization to facilitate precise and user-friendly image editing. Our method enables users to modify images through natural language descriptions without manual area marking, while preserving the original image’s semantic integrity. Unlike existing techniques that require extensive pre-training or fine-tuning, our approach demonstrates remarkable adaptability across various LLMs through its model-agnostic design. BayesGenie employs an adapted Bayesian optimization strategy to automatically refine the inference process parameters, achieving high-precision image editing with minimal user intervention. Through extensive experiments across diverse scenarios, we demonstrate that our framework outperforms existing methods in both editing accuracy and semantic preservation, as validated using different LLMs including Claude3 and GPT-4.
pdf
bib
abs
SPOT: Zero-Shot Semantic Parsing Over Property Graphs
Francesco Cazzaro
|
Justin Kleindienst
|
Sofia Márquez Gomez
|
Ariadna Quattoni
Knowledge Graphs (KGs) have gained popularity as a means of storing structured data, with property graphs, in particular, gaining traction in recent years. Consequently, the task of semantic parsing remains crucial in enabling access to the information in these graphs via natural language queries. However, annotated data is scarce, requires significant effort to create, and is not easily transferable between different graphs. To address these challenges we introduce SPOT, a method to generate training data for semantic parsing over Property Graphs without human annotations. We generate tree patterns, match them to the KG to obtain a query program, and use a finite-state transducer to produce a proto-natural language realization of the query. Finally, we paraphrase the proto-NL with an LLM to generate samples for training a semantic parser. We demonstrate the effectiveness of SPOT on two property graph benchmarks utilizing the Cypher query language. In addition, we show that our approach can also be applied effectively to RDF graphs.
pdf
bib
abs
Reasoning Circuits in Language Models: A Mechanistic Interpretation of Syllogistic Inference
Geonhee Kim
|
Marco Valentino
|
Andre Freitas
Recent studies on reasoning in language models (LMs) have sparked a debate on whether they can learn systematic inferential principles or merely exploit superficial patterns in the training data. To understand and uncover the mechanisms adopted for formal reasoning in LMs, this paper presents a mechanistic interpretation of syllogistic inference. Specifically, we present a methodology for circuit discovery aimed at interpreting content-independent and formal reasoning mechanisms. Through two distinct intervention methods, we uncover a sufficient and necessary circuit involving middle-term suppression that elucidates how LMs transfer information to derive valid conclusions from premises. Furthermore, we investigate how belief biases manifest in syllogistic inference, finding evidence of partial contamination from additional attention heads responsible for encoding commonsense and contextualized knowledge. Finally, we explore the generalization of the discovered mechanisms across various syllogistic schemes, model sizes and architectures. The identified circuit is sufficient and necessary for syllogistic schemes on which the models achieve high accuracy (≥ 60%), with compatible activation patterns across models of different families. Overall, our findings suggest that LMs learn transferable content-independent reasoning mechanisms, but that, at the same time, such mechanisms do not involve generalizable and abstract logical primitives, being susceptible to contamination by the same world knowledge acquired during pre-training.
pdf
bib
abs
Multi-Hop Question Generation via Dual-Perspective Keyword Guidance
Maodong Li
|
Longyin Zhang
|
Fang Kong
Multi-hop question generation (MQG) aims to generate questions that require synthesizing multiple information snippets from documents to derive target answers. The primary challenge lies in effectively pinpointing crucial information snippets related to question-answer (QA) pairs, typically relying on keywords. However, existing works fail to fully utilize the guiding potential of keywords and neglect to differentiate the distinct roles of question-specific and document-specific keywords. To address this, we define dual-perspective keywords—question and document keywords—and propose a Dual-Perspective Keyword-Guided (DPKG) framework, which seamlessly integrates keywords into the multi-hop question generation process. We argue that question keywords capture the questioner’s intent, whereas document keywords reflect the content related to the QA pair. Functionally, question and document keywords work together to pinpoint essential information snippets in the document, with question keywords required to appear in the generated question. The DPKG framework consists of an expanded transformer encoder and two answer-aware transformer decoders for keyword and question generation, respectively. Extensive experiments on HotpotQA demonstrate the effectiveness of our work, showcasing its promising performance and underscoring its significant value in the MQG task.
pdf
bib
abs
LoRMA: Low-Rank Multiplicative Adaptation for LLMs
Harsh Bihany
|
Shubham Patel
|
Ashutosh Modi
Large Language Models have shown remarkable capabilities in the NLP domain. Their effectiveness can mainly be attributed to their ability to adapt to an array of downstream tasks. However, full fine-tuning is generally computationally expensive. To mitigate this, many techniques have been developed that prioritize efficiency, a prominent one being Low-Rank Adaptation (LoRA). However, LoRA and its variants employ re-parametrized additive updates. In this paper, we propose Low-Rank Multiplicative Adaptation (LoRMA), which shifts the paradigm of additive updates to a richer space of matrix multiplicative transformations. We tackle challenges such as computational complexity and rank bottleneck of matrix multiplication by effectively re-ordering operations and introducing rank inflation strategies. We conduct extensive experiments to demonstrate the effectiveness of our approach in terms of various evaluation metrics.
pdf
bib
abs
DI-BENCH: Benchmarking Large Language Models on Dependency Inference with Testable Repositories at Scale
Linghao Zhang
|
Junhao Wang
|
Shilin He
|
Chaoyun Zhang
|
Yu Kang
|
Bowen Li
|
Jiaheng Wen
|
Chengxing Xie
|
Maoquan Wang
|
Yufan Huang
|
Elsie Nallipogu
|
Qingwei Lin
|
Yingnong Dang
|
Saravan Rajmohan
|
Dongmei Zhang
|
Qi Zhang
Large Language Models have advanced automated software development; however, it remains a challenge to correctly infer dependencies, namely, identifying the internal components and external packages required for a repository to successfully run. Existing studies highlight that dependency-related issues cause over 40% of observed runtime errors in generated repositories. To address this, we introduce DI-BENCH, a large-scale benchmark and evaluation framework specifically designed to assess LLMs’ capability on dependency inference. The benchmark features 581 repositories with testing environments across Python, C#, Rust, and JavaScript. Extensive experiments with textual and execution-based metrics reveal that the current best-performing model achieves only a 48% execution pass rate on Python, indicating significant room for improvement. DI-BENCH establishes a new viewpoint for evaluating LLM performance on repositories, paving the way for more robust end-to-end software synthesis.
pdf
bib
abs
Weak-to-Strong Honesty Alignment via Learning-to-Rank Supervision
Yunfan Xie
|
Lixin Zou
|
Dan Luo
|
Min Tang
|
Chenliang Li
Honest alignment refers to the ability of a language model to truthfully convey its knowledge limitations by appropriately refusing to answer questions when it lacks sufficient information. Existing solutions, such as prompt engineering and fine-tuning, face limitations: the former provides only marginal improvements, while the latter struggles to enhance honesty when annotated data is scarce. To overcome these limitations, we propose a novel framework that enhances honesty through weak-to-strong generalization. Specifically, we train the strong LLMs under weak model supervision to improve their honesty. For the weak model, we employ a learning-to-rank strategy to train an “honest head”, which learns to select the most honest response among the model’s outputs generated through beam search. For the strong LLM, we leverage the self-labeled dataset to update its parameters. Our proposal requires only minimal training data to train the weak honest model, yet achieves decent performance for labeling data. In addition, it enables the strong LLMs to generalize even when faced with flawed labels. Extensive experiments show that our framework significantly boosts honest alignment in large models even with limited labeled data. Our code is available at https://github.com/zewanfaan/WHAT_Honesty.
pdf
bib
abs
MultiHoax: A Dataset of Multi-hop False-premise questions
Mohammadamin Shafiei
|
Hamidreza Saffari
|
Nafise Sadat Moosavi
As Large Language Models are increasingly deployed in high-stakes domains, their ability to detect false assumptions and reason critically is crucial for ensuring reliable outputs. False-premise questions (FPQs) serve as an important evaluation method by exposing cases where flawed assumptions lead to incorrect responses. While existing benchmarks focus on single-hop FPQs, real-world reasoning often requires multi-hop inference, where models must verify consistency across multiple reasoning steps rather than relying on surface-level cues. To address this gap, we introduce MultiHoax, a benchmark for evaluating LLMs’ ability to handle false premises in complex, multi-step reasoning tasks. Our dataset spans seven countries and ten diverse knowledge categories, using Wikipedia as the primary knowledge source to enable cross-regional factual reasoning. Experiments reveal that state-of-the-art LLMs struggle to detect false premises across different countries, knowledge categories, and multi-hop reasoning types, highlighting the need for improved false premise detection and more robust multi-hop reasoning capabilities in LLMs.
pdf
bib
abs
Learning to Play Like Humans: A Framework for LLM Adaptation in Interactive Fiction Games
Jinming Zhang
|
Yunfei Long
Interactive Fiction games (IF games) are games in which players interact through natural language commands. While recent advances in Artificial Intelligence agents have reignited interest in IF games as a domain for studying decision-making, existing approaches prioritize task-specific performance metrics over human-like comprehension of narrative context and gameplay logic. This work presents a cognitively inspired framework that guides Large Language Models (LLMs) to learn and play IF games systematically. Our proposed **L**earning to **P**lay **L**ike **H**umans (LPLH) framework integrates three key components: (1) structured map building to capture spatial and narrative relationships, (2) action learning to identify context-appropriate commands, and (3) feedback-driven experience analysis to refine decision-making over time. By aligning LLM-based agents’ behavior with narrative intent and commonsense constraints, LPLH moves beyond purely exploratory strategies to deliver more interpretable, human-like performance. Crucially, this approach draws on cognitive science principles to more closely simulate how human players read, interpret, and respond within narrative worlds. As a result, LPLH reframes the IF games challenge as a learning problem for LLM-based agents, offering a new path toward robust, context-aware gameplay in complex text-based environments.
pdf
bib
abs
STATE ToxiCN: A Benchmark for Span-level Target-Aware Toxicity Extraction in Chinese Hate Speech Detection
Zewen Bai
|
Liang Yang
|
Shengdi Yin
|
Junyu Lu
|
Jingjie Zeng
|
Haohao Zhu
|
Yuanyuan Sun
|
Hongfei Lin
The proliferation of hate speech has caused significant harm to society. The intensity and directionality of hate are closely tied to the target and argument it is associated with. However, research on hate speech detection in Chinese has lagged behind, and existing datasets lack span-level fine-grained annotations. Furthermore, the lack of research on Chinese hateful slang poses a significant challenge. In this paper, we provide two valuable fine-grained Chinese hate speech detection research resources. First, we construct a Span-level Target-Aware Toxicity Extraction dataset (STATE ToxiCN), which is the first span-level Chinese hate speech dataset. Secondly, we evaluate the span-level hate speech detection performance of existing models using STATE ToxiCN. Finally, we conduct the first study on Chinese hateful slang and evaluate the ability of LLMs to understand hate semantics. Our work contributes valuable resources and insights to advance span-level hate speech detection in Chinese.
pdf
bib
abs
RelEdit: Evaluating Conceptual Knowledge Editing in Language Models via Relational Reasoning
Yifan Niu
|
Miao Peng
|
Nuo Chen
|
Yatao Bian
|
Tingyang Xu
|
Jia Li
The conceptual knowledge in Large Language Models (LLMs) can become outdated over time, and concept editing is often an option. Current evaluations on conceptual knowledge editing primarily focus on whether the definitions of concepts are successfully edited, neglecting the impact on the model’s related beliefs. To address this gap, we introduce a benchmark called RelEdit, which includes criteria and questions to assess both concept-level and instance-level relational reasoning abilities of edited models. Our findings reveal that existing knowledge editing methods struggle to reason about related conceptual knowledge effectively. Additionally, we introduce a simple memory-based in-context editing baseline, MICE, which prompts the language model to generate answers that align with the stored edited concepts in external memory. We find that MICE obtains the best scores on our benchmark, suggesting a promising research direction for model editing.
pdf
bib
abs
Unlocking Speech Instruction Data Potential with Query Rewriting
Yonghua Hei
|
Yibo Yan
|
Shuliang Liu
|
Huiyu Zhou
|
Linfeng Zhang
|
Xuming Hu
End-to-end Large Speech Language Models (**LSLMs**) demonstrate strong potential in response latency and speech comprehension capabilities, showcasing general intelligence across speech understanding tasks. However, the ability to follow speech instructions has not been fully realized due to the lack of datasets and heavily biased training tasks. Leveraging the rich ASR datasets, previous approaches have used Large Language Models (**LLMs**) to continue the linguistic information of speech to construct speech instruction datasets. Yet, due to the gap between LLM-generated results and real human responses, the continuation methods further amplify these shortcomings. Given the high costs of collecting and annotating speech instruction datasets by humans, using speech synthesis to construct large-scale speech instruction datasets has become a balanced and robust alternative. Although modern Text-To-Speech (**TTS**) models have achieved near-human-level synthesis quality, it is challenging to appropriately convert out-of-distribution text instruction to speech due to the limitations of the training data distribution in TTS models. To address this issue, we propose a query rewriting framework with multi-LLM knowledge fusion, employing multiple agents to annotate and validate the synthesized speech, making it possible to construct high-quality speech instruction datasets without relying on human annotation. Experiments show that this method can transform text instructions into distributions more suitable for TTS models for speech synthesis through zero-shot rewriting, increasing data usability from 72% to 93%. It also demonstrates unique advantages in rewriting tasks that require complex knowledge and context-related abilities.
pdf
bib
abs
From Evasion to Concealment: Stealthy Knowledge Unlearning for LLMs
Tianle Gu
|
Kexin Huang
|
Ruilin Luo
|
Yuanqi Yao
|
Xiuying Chen
|
Yujiu Yang
|
Yan Teng
|
Yingchun Wang
LLM Unlearning plays a crucial role in removing sensitive information from language models to mitigate potential misuse. However, previous approaches often treat nonsensical responses or template-based refusals (e.g., “Sorry, I cannot answer.”) as the unlearning target, which can give the impression of deliberate information suppression, making the process even more vulnerable to attacks and jailbreaks. Moreover, most methods rely on auxiliary models or retaining datasets, which adds complexity to the unlearning process. To address these challenges, we propose MEOW, a streamlined and stealthy unlearning method that eliminates the need for auxiliary models or retaining data while avoiding leakage through its innovative use of inverted facts. These inverted facts are generated by an offline LLM and serve as fine-tuning labels. Meanwhile, we introduce MEMO, a novel metric that measures the model’s memorization, to select optimal fine-tuning targets. The use of inverted facts not only maintains the covert nature of the model but also ensures that sensitive information is effectively forgotten without revealing the target data. Evaluated on the ToFU Knowledge Unlearning dataset using Llama2-7B-Chat and Phi-1.5, MEOW outperforms baselines in forgetting quality while preserving model utility. MEOW also maintains strong performance across NLU and NLG tasks and demonstrates superior resilience to attacks, validated via the Min-K% membership inference method.
pdf
bib
abs
Context-DPO: Aligning Language Models for Context-Faithfulness
Baolong Bi
|
Shaohan Huang
|
Yiwei Wang
|
Tianchi Yang
|
Zihan Zhang
|
Haizhen Huang
|
Lingrui Mei
|
Junfeng Fang
|
Zehao Li
|
Furu Wei
|
Weiwei Deng
|
Feng Sun
|
Qi Zhang
|
Shenghua Liu
Reliable responses from large language models (LLMs) require adherence to user instructions and retrieved information. While alignment techniques help LLMs align with human intentions and values, improving context-faithfulness through alignment remains underexplored. To address this, we propose Context-DPO, the first alignment method specifically designed to enhance LLMs’ context-faithfulness. We introduce ConFiQA, a benchmark that simulates Retrieval-Augmented Generation (RAG) scenarios with knowledge conflicts to evaluate context-faithfulness. By leveraging faithful and stubborn responses to questions with provided context from ConFiQA, our Context-DPO aligns LLMs through direct preference optimization. Extensive experiments demonstrate that our Context-DPO significantly improves context-faithfulness, achieving 35% to 280% improvements on popular open-source models. Further analysis demonstrates that Context-DPO preserves LLMs’ generative capabilities while providing interpretable insights into context utilization.
pdf
bib
abs
Reasoning Does Not Necessarily Improve Role-Playing Ability
Xiachong Feng
|
Longxu Dou
|
Lingpeng Kong
The application of role-playing large language models (LLMs) is rapidly expanding in both academic and commercial domains, driving an increasing demand for high-precision role-playing models. Simultaneously, the rapid advancement of reasoning techniques has continuously pushed the performance boundaries of LLMs. This intersection of practical role-playing demands and evolving reasoning capabilities raises an important research question: “Can reasoning techniques enhance the role-playing capabilities of LLMs?” To address this, we conduct a comprehensive study using 6 role-playing benchmarks, 24 LLMs, and 3 distinct role-playing strategies, comparing the effectiveness of direct zero-shot role-playing, role-playing with Chain-of-Thought (CoT), and role-playing using reasoning-optimized LLMs. Our findings reveal that CoT may reduce role-playing performance, reasoning-optimized LLMs are unsuitable for role-playing, reasoning ability disrupts the role-playing scaling law, and large models still lack proficiency in advanced role-playing. Furthermore, based on extensive experimental results, we propose two promising future research directions: Role-aware Chain-of-Thought for improving role-playing LLMs and Reinforcement Learning for role-playing LLMs, aiming to enhance the adaptability, consistency, and effectiveness of role-playing LLMs for both research and real-world applications.
pdf
bib
abs
TableLLM: Enabling Tabular Data Manipulation by LLMs in Real Office Usage Scenarios
Xiaokang Zhang
|
Sijia Luo
|
Bohan Zhang
|
Zeyao Ma
|
Jing Zhang
|
Yang Li
|
Guanlin Li
|
Zijun Yao
|
Kangli Xu
|
Jinchang Zhou
|
Daniel Zhang-Li
|
Jifan Yu
|
Shu Zhao
|
Juanzi Li
|
Jie Tang
We introduce TableLLM, a robust large language model (LLM) with 8 billion parameters, purpose-built for proficiently handling tabular data manipulation tasks, whether they are embedded within documents or spreadsheets, catering to real-world office scenarios. We propose a distant supervision method for training, which comprises a reasoning process extension strategy, aiding in training LLMs to understand reasoning patterns more effectively as well as a cross-way validation strategy, ensuring the quality of the automatically generated data. To evaluate the performance of TableLLM, we have crafted benchmarks tailored to address both document and spreadsheet formats as well as constructed a well-organized evaluation pipeline capable of handling both scenarios. Thorough evaluations underscore the advantages of TableLLM when compared to various existing general-purpose and tabular data-focused LLMs. We have publicly released the model checkpoint, source code, benchmarks, and a web application for user interaction on this anonymized repository.
pdf
bib
abs
A Survey of LLM-based Agents in Medicine: How far are we from Baymax?
Wenxuan Wang
|
Zizhan Ma
|
Zheng Wang
|
Chenghan Wu
|
Jiaming Ji
|
Wenting Chen
|
Xiang Li
|
Yixuan Yuan
Large Language Models (LLMs) are transforming healthcare through LLM-based agents that can understand and assist with medical tasks. This survey examines the architectures, applications, and challenges of LLM-based agents in medicine. We analyze key components including system profiles, clinical planning, medical reasoning frameworks, and external capacity enhancement. The survey covers major applications in clinical decision support, medical documentation, training simulations, and healthcare service optimization, along with evaluation frameworks and metrics. While these agents show promise in enhancing healthcare delivery, challenges remain in hallucination management, multimodal integration, implementation, and ethics. We conclude by highlighting future directions in medical reasoning, physical system integration, and training simulations, providing researchers and practitioners with a structured overview of the field’s current state and prospects.
pdf
bib
abs
Context-Robust Knowledge Editing for Language Models
Haewon Park
|
Gyubin Choi
|
Minjun Kim
|
Yohan Jo
Knowledge editing (KE) methods offer an efficient way to modify knowledge in large language models. Current KE evaluations typically assess editing success by considering only the edited knowledge without any preceding contexts. In real-world applications, however, preceding contexts often trigger the retrieval of the original knowledge and undermine the intended edit. To address this issue, we have developed CHED, a benchmark designed to evaluate the context robustness of KE methods. Evaluations on CHED show that existing KE methods often fail when preceding contexts are present. To mitigate this shortcoming, we introduce CoRE, a KE method designed to strengthen context robustness by minimizing context-sensitive variance in hidden states of the model for edited knowledge. This method not only improves the editing success rate in situations where a preceding context is present but also preserves the overall capabilities of the model. We also provide an in-depth analysis of the differing impacts of preceding contexts when introduced as user utterances versus assistant responses, and we dissect attention-score patterns to assess how specific tokens influence editing success. We release our dataset and code at [https://github.com/holi-lab/CoRE](https://github.com/holi-lab/CoRE).
pdf
bib
abs
Multi-Agent Collaboration via Cross-Team Orchestration
Zhuoyun Du
|
Chen Qian
|
Wei Liu
|
Zihao Xie
|
YiFei Wang
|
Rennai Qiu
|
Yufan Dang
|
Weize Chen
|
Cheng Yang
|
Ye Tian
|
Xuantang Xiong
|
Lei Han
Large Language Models (LLMs) have significantly impacted various domains, especially through organized LLM-driven autonomous agents. A representative scenario is in software development, where agents can collaborate in a team like humans, following predefined phases to complete sub-tasks sequentially. However, for an agent team, each phase yields only one possible outcome. This results in the completion of only one development chain, thereby losing the opportunity to explore multiple potential decision paths within the solution space and consequently leading to suboptimal results or extensive trial and error. To address this, we introduce Cross-Team Orchestration (Croto), a scalable multi-team framework that enables orchestrated teams to jointly propose various task-oriented solutions and exchange insights in an environment that keeps each team independent while supporting cross-team collaboration, yielding superior solutions. Experiments reveal a notable increase in software quality compared to state-of-the-art baselines. We further tested our framework on story generation tasks, which demonstrated promising generalization to other domains. The code and data are available at https://github.com/OpenBMB/ChatDev/tree/macnet
pdf
bib
abs
Semantic Evaluation of Multilingual Data-to-Text Generation via NLI Fine-Tuning: Precision, Recall and F1 scores
William Soto Martinez
|
Yannick Parmentier
|
Claire Gardent
Performance in the KG-to-Text task has improved over the years, particularly in English. However, models are still prone to mistakes like Additions and Omissions. Furthermore, few languages are taken into account since both train and test data are not readily available. In this paper, we hope to facilitate the development and improvement of multilingual KG-to-Text models by providing a multilingual evaluation framework that is reference-less (no need for test data) and permits estimating how much a KG-to-Text Model under- (omission) or over- (addition) generates. We focus on two high (English, Russian) and five low (Breton, Irish, Maltese, Welsh, Xhosa) resource languages and show that our metric has fair to moderate correlation with reference-based metrics, positioning it as a consistent alternative when no references are available. We also show that our metric outperforms prior reference-less metrics in correlation with existing human judgments. Additional human evaluation shows moderate to strong correlation with human annotators in assessing precision and recall at a higher granularity level than shown in previous studies. Since our metric provides scores for precision and recall, it helps better assess the level of over- or under-generation of multilingual KG-to-Text models.
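One plausible reading of how entailment judgments turn into the precision, recall and F1 scores named in the title is sketched below. It assumes the NLI model's decisions are already available as booleans: one per generated text unit (is it supported by the input graph?) and one per input triple (is it expressed in the text?). The function name and this two-list interface are hypothetical; the abstract does not spell out the exact formulation.

```python
def prf_from_entailment(text_supported, graph_covered):
    """Precision/recall/F1 from pre-computed entailment decisions.

    text_supported: one bool per generated text unit, True if the input
        graph entails it (False signals an addition).
    graph_covered: one bool per input triple, True if the generated text
        entails it (False signals an omission).
    Both lists stand in for the output of an NLI model (an assumption).
    """
    precision = sum(text_supported) / len(text_supported) if text_supported else 0.0
    recall = sum(graph_covered) / len(graph_covered) if graph_covered else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Illustrative case: one unsupported text unit (addition), one missed triple (omission).
p, r, f1 = prf_from_entailment([True, True, False, True], [True, False])
```

Under this reading, low precision points to over-generation and low recall points to under-generation.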
pdf
bib
abs
Optimized Text Embedding Models and Benchmarks for Amharic Passage Retrieval
Kidist Amde Mekonnen
|
Yosef Worku Alemneh
|
Maarten de Rijke
Neural retrieval methods using transformer-based pre-trained language models have advanced multilingual and cross-lingual retrieval. However, their effectiveness for low-resource, morphologically rich languages such as Amharic remains underexplored due to data scarcity and suboptimal tokenization. We address this gap by introducing Amharic-specific dense retrieval models based on pre-trained Amharic BERT and RoBERTa backbones. Our proposed RoBERTa-Base-Amharic-Embed model (110M parameters) achieves a 17.6% relative improvement in MRR@10 and a 9.86% gain in Recall@10 over the strongest multilingual baseline, Arctic Embed 2.0 (568M parameters). More compact variants, such as RoBERTa-Medium-Amharic-Embed (42M), remain competitive while being over 13× smaller. Additionally, we train a ColBERT-based late interaction retrieval model that achieves the highest MRR@10 score (0.843) among all evaluated models. We benchmark our proposed models against both sparse and dense retrieval baselines to systematically assess retrieval effectiveness in Amharic. Our analysis highlights key challenges in low-resource settings and underscores the importance of language-specific adaptation. To foster future research in low-resource IR, we publicly release our dataset, codebase, and trained models at https://github.com/kidist-amde/amharic-ir-benchmarks.
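For readers unfamiliar with the metrics quoted above, the following minimal sketch shows the usual definitions of MRR@10 and Recall@10, plus the relative-improvement arithmetic behind a statement such as "17.6% relative improvement". The document IDs and scores are made up for illustration and are not taken from the paper's benchmark.

```python
def reciprocal_rank_at_k(ranked_ids, relevant_ids, k=10):
    """1/rank of the first relevant document within the top k, else 0.0."""
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0


def recall_at_k(ranked_ids, relevant_ids, k=10):
    """Fraction of the relevant documents that appear within the top k."""
    if not relevant_ids:
        return 0.0
    hits = len(set(ranked_ids[:k]) & set(relevant_ids))
    return hits / len(relevant_ids)


def mean(values):
    values = list(values)
    return sum(values) / len(values) if values else 0.0


def relative_improvement(new, old):
    """(new - old) / old, e.g. 0.176 for a 17.6% relative gain."""
    return (new - old) / old


# Hypothetical run: (ranked results, set of relevant documents) per query.
runs = [
    (["d3", "d1", "d7"], {"d1"}),
    (["d2", "d5", "d9"], {"d2", "d8"}),
]
mrr_at_10 = mean(reciprocal_rank_at_k(r, rel) for r, rel in runs)
recall_at_10 = mean(recall_at_k(r, rel) for r, rel in runs)
```

MRR@10 is the mean of the per-query reciprocal ranks, so a system that always places a relevant passage first scores 1.0.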
pdf
bib
abs
Enhancing Transformation from Natural Language to Signal Temporal Logic Using LLMs with Diverse External Knowledge
Yue Fang
|
Zhi Jin
|
Jie An
|
Hongshen Chen
|
Xiaohong Chen
|
Naijun Zhan
Temporal Logic (TL), especially Signal Temporal Logic (STL), enables precise formal specification, making it widely used in cyber-physical systems such as autonomous driving and robotics. Automatically transforming natural language (NL) into STL is an attractive approach to overcome the limitations of manual transformation, which is time-consuming and error-prone. However, due to the lack of datasets, automatic transformation currently faces significant challenges and has not been fully explored. In this paper, we propose an NL-STL dataset named STL-Diversity-Enhanced (STL-DivEn), comprising 16,000 samples enriched with diverse patterns. To develop the dataset, we first manually create a small-scale seed set of NL-STL pairs. Next, representative examples are identified through clustering and used to guide large language models (LLMs) in generating additional NL-STL pairs. Finally, diversity and accuracy are ensured through rigorous rule-based filters and human validation. Furthermore, we introduce the Knowledge-Guided STL Transformation (KGST) framework, a novel approach for transforming natural language into STL, involving a generate-then-refine process based on external knowledge. Statistical analysis shows that the STL-DivEn dataset exhibits more diversity than the existing NL-STL dataset. Moreover, both metric-based and human evaluations indicate that our KGST approach outperforms baseline models in transformation accuracy on STL-DivEn and DeepSTL datasets.
pdf
bib
abs
DAGS: A Dependency-Based Dual-Attention and Global Semantic Improvement Framework for Metaphor Recognition
Puli Chen
|
Cheng Yang
|
Xingmao Zhang
|
Qingbao Huang
Current metaphor recognition mainly relies on Metaphor Detection Theory (MDT), such as the Metaphor Identification Procedure, which recognizes metaphors by comparing the basic meaning of the target word with its contextual meaning. Existing studies have gradually adopted literal annotations to model basic meanings, rejecting the aggregated meanings of target words. However, these methods ignore the problem of interference caused by literal annotations, and do not make full use of the semantic expression relations of MDT, making it difficult for the models to detect metaphors and to generalize. To address these challenges, we propose a dependency-based Dual-Attention and Global Semantic Improvement (DAGS) framework. DAGS first extracts literal annotations of target words as basic meanings from several mainstream corpora. Then, we apply dependency trees and dual-attention while filtering input sentences and basic meanings. Finally, we improve the MDT to further consider the global semantic relationship on contexts. DAGS can not only extract features from multiple information sources but also effectively remove redundancy, while focusing on mission-critical information. We achieve state-of-the-art results on several mainstream metaphor datasets (e.g., VUA ALL, VUAverb, TroFi and PSUCMC), which suggests that filtering and global semantic improvement of contexts are crucial for enhancing metaphor recognition performance.
pdf
bib
abs
ESF: Efficient Sensitive Fingerprinting for Black-Box Tamper Detection of Large Language Models
Xiaofan Bai
|
Pingyi Hu
|
Xiaojing Ma
|
Linchen Yu
|
Dongmei Zhang
|
Qi Zhang
|
Bin Benjamin Zhu
The rapid adoption of large language models (LLMs) in diverse applications has intensified concerns over their security and integrity, especially in cloud environments where internal model parameters are inaccessible to users. Traditional tamper detection methods, designed for deterministic classification models, fail to address the output randomness and massive parameter spaces characteristic of LLMs. In this paper, we introduce Efficient Sensitive Fingerprinting (ESF), the first fingerprinting method tailored for black-box tamper detection of LLMs. ESF generates fingerprint samples by optimizing output sensitivity at selected detection token positions and leverages Randomness-Set Consistency Checking (RSCC) to accommodate inherent output randomness. Furthermore, a novel Max Coverage Strategy (MCS) is proposed to select an optimal set of fingerprint samples that maximizes joint sensitivity to tampering. Grounded in a rigorous theoretical framework, ESF is both computationally efficient and scalable to large models. Extensive experiments across state-of-the-art LLMs demonstrate that ESF reliably detects tampering, such as fine-tuning, model compression, and backdoor injection, with a detection rate exceeding 99.2% using 5 fingerprint samples, thereby offering a robust solution for securing cloud-based AI systems.
pdf
bib
abs
The Lessons of Developing Process Reward Models in Mathematical Reasoning
Zhenru Zhang
|
Chujie Zheng
|
Yangzhen Wu
|
Beichen Zhang
|
Runji Lin
|
Bowen Yu
|
Dayiheng Liu
|
Jingren Zhou
|
Junyang Lin
Process Reward Models (PRMs) aim to identify and mitigate intermediate errors in the mathematical reasoning processes of Large Language Models (LLMs). However, the development of effective PRMs faces significant challenges, particularly in data annotation and evaluation methodologies. In this paper, through extensive experiments, we demonstrate that commonly used Monte Carlo (MC) estimation-based data synthesis for PRMs typically yields inferior performance and generalization compared to LLM-as-a-judge and human annotation methods. Furthermore, we identify potential biases in conventional Best-of-N (BoN) evaluation strategies for PRMs. To address these challenges, we develop a consensus filtering mechanism that effectively integrates MC estimation with LLM-as-a-judge and advocate a more comprehensive evaluation framework that combines response-level and step-level metrics. Based on these mechanisms, we significantly improve both model performance and data efficiency in the BoN evaluation and the step-wise error identification task. Finally, we release a new state-of-the-art PRM that outperforms existing open-source alternatives and provide practical guidelines for future research.
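The consensus filtering idea can be pictured with the hedged sketch below: a training sample is kept only when the MC estimate and the LLM judge locate the first erroneous step at the same position. The agreement criterion, the zero threshold, and the field names (`mc_rates`, `judge_first_error`) are assumptions made for illustration; the abstract does not describe the authors' exact rule.

```python
def first_error_from_mc(step_success_rates, threshold=0.0):
    """Index of the first step whose MC-estimated success rate is at or
    below `threshold`, or None if every step looks correct.

    `threshold` is a hypothetical parameter, not a value from the paper.
    """
    for index, rate in enumerate(step_success_rates):
        if rate <= threshold:
            return index
    return None


def consensus_filter(samples):
    """Keep samples where MC estimation and the LLM judge agree.

    Each sample is a dict with hypothetical keys:
      "mc_rates": per-step fraction of rollouts reaching the right answer
      "judge_first_error": first wrong step according to the judge, or None
    """
    kept = []
    for sample in samples:
        mc_error = first_error_from_mc(sample["mc_rates"])
        if mc_error == sample["judge_first_error"]:
            kept.append({**sample, "first_error": mc_error})
    return kept


samples = [
    {"id": "a", "mc_rates": [0.9, 0.5, 0.0], "judge_first_error": 2},  # agree
    {"id": "b", "mc_rates": [0.8, 0.0, 0.0], "judge_first_error": 2},  # disagree
    {"id": "c", "mc_rates": [1.0, 0.7], "judge_first_error": None},    # agree: no error
]
kept_ids = [s["id"] for s in consensus_filter(samples)]
```

Discarding disagreements trades dataset size for cleaner step labels, which is consistent with the data-efficiency claim in the abstract.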
pdf
bib
abs
MinosEval: Distinguishing Factoid and Non-Factoid for Tailored Open-Ended QA Evaluation with LLMs
Yongqi Fan
|
Yating Wang
|
Guandong Wang
|
Zhai Jie
|
Jingping Liu
|
Qi Ye
|
Tong Ruan
Open-ended question answering (QA) is a key task for evaluating the capabilities of large language models (LLMs). Compared to closed-ended QA, it demands longer answer statements, more nuanced reasoning processes, and diverse expressions, making refined and interpretable automatic evaluation both crucial and challenging. Traditional metrics like ROUGE and BERTScore struggle to capture semantic similarities due to different patterns between model responses and reference answers. Current LLM-based evaluation approaches, such as pairwise or listwise comparisons of candidate answers, lack intuitive interpretability. While pointwise scoring of each response provides some descriptions, it fails to adapt across different question contents. Most notably, existing methods overlook the distinction between factoid and non-factoid questions. To address these challenges, we propose MinosEval, a novel evaluation method that first distinguishes open-ended questions and then ranks candidate answers using different evaluation strategies. For factoid questions, it applies an adaptive key-point scoring strategy, while for non-factoid questions, it uses an instance-aware listwise ranking strategy. Experiments on multiple open-ended QA datasets, including self-built ones with more candidate responses to complement community resources, show that MinosEval better aligns with human annotations and offers more interpretable results.
pdf
bib
abs
Towards Conditioning Clinical Text Generation for User Control
Osman Alperen Koraş
|
Rabi Bahnan
|
Jens Kleesiek
|
Amin Dada
Deploying natural language generation systems in clinical settings remains challenging despite advances in Large Language Models (LLMs), which continue to exhibit hallucinations and factual inconsistencies, necessitating human oversight. This paper explores automated dataset augmentation using LLMs as human proxies to condition LLMs for clinician control without increasing cognitive workload. On the BioNLP ACL’24 Discharge Me! Shared Task, we achieve new state-of-the-art results with simpler methods than prior submissions through more efficient training, yielding a 9% relative improvement without augmented training and up to 34% with dataset augmentation. Preliminary human evaluation further supports the effectiveness of our approach, highlighting the potential of augmenting clinical text generation for control to enhance relevance, accuracy, and factual consistency.
pdf
bib
abs
CoDet-M4: Detecting Machine-Generated Code in Multi-Lingual, Multi-Generator and Multi-Domain Settings
Daniil Orel
|
Dilshod Azizov
|
Preslav Nakov
Large Language Models (LLMs) have revolutionized code generation, automating programming with remarkable efficiency. However, this has had important consequences for programming skills, ethics, and assessment integrity, thus making the detection of LLM-generated code essential for maintaining accountability and standards. While there has been some previous research on this problem, it generally lacks domain coverage and robustness, and only covers a small number of programming languages. Here, we aim to bridge this gap. In particular, we propose a framework capable of distinguishing between human-written and LLM-generated program code across multiple programming languages, code generators, and domains. We use a large-scale dataset from renowned platforms and LLM-based code generators, alongside applying rigorous data quality checks, feature engineering, and comparative analysis of traditional machine learning models, pre-trained language models (PLMs), and LLMs for code detection. We perform an evaluation on out-of-domain scenarios, such as detecting authorship and hybrid authorship of generated code and generalizing to unseen models, domains, and programming languages. Our extensive experiments show that our framework effectively distinguishes human-written from LLM-generated program code, setting a new benchmark for the task.
pdf
bib
abs
Q-Mamba: Towards more efficient Mamba models via post-training quantization
Chen Tianqi
|
Yuanteng Chen
|
Peisong Wang
|
Weixiang Xu
|
Zeyu Zhu
|
Jian Cheng
State Space Models (SSMs), such as Mamba, have recently demonstrated potential in language understanding tasks, positioning them as competitors to transformer architectures. However, our investigations reveal that the Mamba architecture still has room for further optimization—not only in linear projections but also in state caches, which contribute significantly to memory consumption, particularly after quantizing the former into low bits. After a theoretical analysis of the causes of outliers in states, we propose Decoupled Scale Quantization (DSQ), which mitigates outliers in both the state and channel dimensions by applying separate quantization scales. To preserve the selective ability of quantized Mamba, we introduce Efficient Selectivity Reconstruction (ESR), a novel quantization simulation scheme in block-wise reconstruction that enables fast parallel scan algorithms with the non-linear quantization function. We demonstrate the effectiveness of Q-Mamba across various quantization settings, model sizes, and both generation and zero-shot tasks. In particular, for Mamba2-2.7B with W8A8H4 (8-bit weights and activations, 4-bit state caches) quantization, Q-Mamba achieves a 50% reduction in memory consumption with only a 2.13% average accuracy degradation on zero-shot tasks.
pdf
bib
abs
P²Net: Parallel Pointer-based Network for Key Information Extraction with Complex Layouts
Kaiwen Wei
|
Jie Yao
|
Jiang Zhong
|
Yangyang Kang
|
Jingyuan Zhang
|
Changlong Sun
|
Xin Zhang
|
Fengmao Lv
|
Li Jin
Key Information Extraction (KIE) is a challenging multimodal task aimed at extracting structured value entities from visually rich documents. Despite recent advancements, two major challenges remain. First, existing datasets typically feature fixed layouts and a limited set of entity categories, while current methods are based on a full-shot setting that is difficult to apply in real-world scenarios, where new entity categories frequently emerge. Secondly, current methods often treat key entities simply as parts of the OCR-parsed context, neglecting the positive impact of the relationships between key-value entities. To address the first challenge, we introduce a new large-scale, human-annotated dataset, Complex Layout document for Key Information Extraction (CLEX). Comprising 5,860 images with 1,162 entity categories, CLEX is larger and more complex than existing datasets. It also primarily focuses on the zero-shot and few-shot KIE tasks, which are more aligned with real-world applications. To tackle the second challenge, we propose the Parallel Pointer-based Network (P²Net). This model frames KIE as a pointer-based classification task and effectively leverages implicit relationships between key-value entities to enhance extraction. Its parallel extraction mechanism enables simultaneous and efficient extraction of multiple results. Experiments on widely-used datasets, including SROIE, CORD, and the newly introduced CLEX, demonstrate that P²Net outperforms existing state-of-the-art methods (including GPT-4V) while maintaining fast inference speeds.
pdf
bib
abs
Refining Sentence Embedding Model through Ranking Sentences Generation with Large Language Models
Liyang He
|
Chenglong Liu
|
Rui Li
|
Zhenya Huang
|
Shulan Ruan
|
Jun Zhou
|
Enhong Chen
Sentence embedding is essential for many NLP tasks, with contrastive learning methods achieving strong performance using annotated datasets like NLI. Yet, the reliance on manual labels limits scalability. Recent studies leverage large language models (LLMs) to generate sentence pairs, reducing annotation dependency. However, they overlook ranking information crucial for fine-grained semantic distinctions. To tackle this challenge, we propose a method for controlling the generation direction of LLMs in the latent space. Unlike unconstrained generation, the controlled approach ensures meaningful semantic divergence. Then, we refine existing sentence embedding models by integrating ranking information and semantic information. Experiments on multiple benchmarks demonstrate that our method achieves new SOTA performance with a modest cost in ranking sentence synthesis.
pdf
bib
abs
RQT: Hierarchical Residual Quantization for Multi-Model Compression
Chen Tianqi
|
Peisong Wang
|
Weixiang Xu
|
Zeyu Zhu
|
Jian Cheng
Delta compression methods focus on efficiently serving multiple uniquely fine-tuned models, each tailored to specific tasks and user requirements. These approaches decompose a fine-tuned LLM into a base model and corresponding delta weights, which are compressed using low-rank or low-bit representations to reduce storage costs. However, their effectiveness is highly sensitive to the magnitude of the model deltas—a factor directly influenced by the scale of the training data. We propose the Residual Quantization Tree (RQT), a hierarchical quantization framework that automatically shares low-bit integer weights across similar fine-tuned models. The RQT construction employs a two-phase greedy algorithm: a bottom-up aggregation of models based on weight matrix similarity, and top-down residual quantization, in which each node optimizes the quantization parameters and then delegates residual errors to child nodes. We evaluate RQT on fine-tuned models across mathematics, coding, chatbot, and Chinese LLMs. The results show that RQT achieves an average accuracy degradation of approximately 3% (comparable to previous 4-bit post-training quantization) while maintaining an effective bitwidth of around 2 bits.
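To make the top-down residual step concrete, the sketch below builds a two-level tree in which a parent node quantizes weights shared by two models and each child quantizes only the residual left by its parent. Using the element-wise mean as the shared weights and a max-abs symmetric quantizer are simplifying assumptions chosen for illustration; the abstract does not specify how RQT aggregates models or optimizes quantization parameters.

```python
def quantize(values, bits):
    """Symmetric uniform quantization; returns (integer codes, scale)."""
    levels = 2 ** (bits - 1) - 1  # e.g. 1 level per sign for 2 bits
    max_abs = max(abs(v) for v in values)
    scale = max_abs / levels if max_abs > 0 else 1.0
    codes = [max(-levels, min(levels, round(v / scale))) for v in values]
    return codes, scale


def dequantize(codes, scale):
    return [c * scale for c in codes]


def two_level_residual(models, bits=2):
    """Parent quantizes the shared (mean) weights; each child quantizes
    the residual between its model and the dequantized parent."""
    shared = [sum(column) / len(models) for column in zip(*models)]
    parent = dequantize(*quantize(shared, bits))
    reconstructions = []
    for weights in models:
        residual = [w - p for w, p in zip(weights, parent)]
        child = dequantize(*quantize(residual, bits))
        reconstructions.append([p + c for p, c in zip(parent, child)])
    return parent, reconstructions


def max_error(weights, approx):
    return max(abs(w - a) for w, a in zip(weights, approx))


# Toy weight vectors standing in for two similar fine-tuned models.
models = [[0.9, -0.4, 0.1], [1.1, -0.6, 0.3]]
parent, recon = two_level_residual(models, bits=2)
```

In this toy setup the parent's integer codes are stored once and shared by both models, while each model adds only its own low-bit residual codes.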
pdf
bib
abs
taz2024full: Analysing German Newspapers for Gender Bias and Discrimination across Decades
Stefanie Urchs
|
Veronika Thurner
|
Matthias Aßenmacher
|
Christian Heumann
|
Stephanie Thiemichen
Open-access corpora are essential for advancing natural language processing (NLP) and computational social science (CSS). However, large-scale resources for German remain limited, restricting research on linguistic trends and societal issues such as gender bias. We present taz2024full, the largest publicly available corpus of German newspaper articles to date, comprising over 1.8 million texts from taz, spanning 1980 to 2024. As a demonstration of the corpus’s utility for bias and discrimination research, we analyse gender representation across four decades of reporting. We find a consistent overrepresentation of men, but also a gradual shift toward more balanced coverage in recent years. Using a scalable, structured analysis pipeline, we provide a foundation for studying actor mentions, sentiment, and linguistic framing in German journalistic texts. The corpus supports a wide range of applications, from diachronic language analysis to critical media studies, and is freely available to foster inclusive and reproducible research in German-language NLP.
pdf
bib
abs
LCFO: Long Context and Long Form Output Dataset and Benchmarking
Marta R. Costa-jussà
|
Pierre Andrews
|
Mariano Coria Meglioli
|
Joy Chen
|
Joe Chuang
|
David Dale
|
Christophe Ropers
|
Alexandre Mourachko
|
Eduardo Sánchez
|
Holger Schwenk
|
Tuan A. Tran
|
Arina Turkatenko
|
Carleigh Wood
This paper presents the Long Context and Form Output (LCFO) benchmark, a novel evaluation framework for assessing gradual summarization and summary expansion capabilities across diverse domains. LCFO consists of long input documents (5k words average length), each of which comes with three summaries of different lengths (20%, 10%, and 5% of the input text), as well as approximately 15 questions and answers (QA) related to the input content. Notably, LCFO also provides alignments between specific QA pairs and corresponding summaries in 7 domains. The primary motivation behind providing summaries of different lengths is to establish a controllable framework for generating long texts from shorter inputs, i.e. summary expansion. To establish an evaluation metric framework for summarization and summary expansion, we provide human evaluation scores for human-generated outputs, as well as results from various state-of-the-art large language models (LLMs). GPT-4o-mini achieves best human scores among automatic systems in both summarization and summary expansion tasks (≈ +10% and +20%, respectively). It even surpasses human output quality in the case of short summaries (≈ +7%). Overall automatic metrics achieve low correlations with human evaluation scores (≈ 0.4) but moderate correlation on specific evaluation aspects such as fluency and attribution (≈ 0.6).
pdf
bib
abs
Span-based Semantic Role Labeling as Lexicalized Constituency Tree Parsing
Yang Hou
|
Zhenghua Li
Semantic Role Labeling (SRL) is a critical task that focuses on identifying predicate-argument structures in sentences. Span-based SRL, a prominent paradigm, is often tackled using BIO-based or graph-based methods. However, these approaches often fail to capture the inherent relationship between syntax and semantics. While syntax-aware models have been proposed to address this limitation, they heavily rely on pre-existing syntactic resources, limiting their general applicability. In this work, we propose a lexicalized tree representation for span-based SRL, which integrates constituency and dependency parsing to explicitly model predicate-argument structures. By structurally representing predicates as roots and arguments as subtrees directly linked to the predicate, our approach bridges the gap between syntactic and semantic representations. Experiments on standard English benchmarks (CoNLL05 and CoNLL12) demonstrate that our model achieves competitive performance, with particular improvement in predicate-given settings.
pdf
bib
abs
Learning from Negative Samples in Biomedical Generative Entity Linking
Chanhwi Kim
|
Hyunjae Kim
|
Sihyeon Park
|
Jiwoo Lee
|
Mujeen Sung
|
Jaewoo Kang
Generative models have become widely used in biomedical entity linking (BioEL) due to their excellent performance and efficient memory usage. However, these models are usually trained only with positive samples—entities that match the input mention’s identifier—and do not explicitly learn from hard negative samples, which are entities that look similar but have different meanings. To address this limitation, we introduce ANGEL (Learning from Negative Samples in Biomedical Generative Entity Linking), the first framework that trains generative BioEL models using negative samples. Specifically, a generative model is initially trained to generate positive entity names from the knowledge base for given input entities. Subsequently, both correct and incorrect outputs are gathered from the model’s top-k predictions. Finally, the model is updated to prioritize the correct predictions through preference optimization. Our models fine-tuned with ANGEL outperform the previous best baseline models by up to an average top-1 accuracy of 1.4% on five benchmarks. When incorporating our framework into pre-training, the performance improvement increases further to 1.7%, demonstrating its effectiveness in both the pre-training and fine-tuning stages. The code and model weights are available at https://github.com/dmis-lab/ANGEL.
pdf
bib
abs
Self-play through Computational Runtimes improves Chart Reasoning
Tautvydas Misiūnas
|
Hassan Mansoor
|
Jasper Uijlings
|
Oriana Riva
|
Victor Carbune
Vision-language models (VLMs) achieve impressive zero-shot performance on multimodal reasoning tasks. Typically, the best reported performance is achieved with a zero- or a few-shot prompt. We observe that asking the model to take other routes to solving the same task, such as through code generation, hurts performance. Furthermore, training sets are typically no longer useful for improving model performance through few-shot learning, due to their use in training. Indeed, we observe that auto-prompting techniques such as DSPy (CITATION), when applied on training sets, do not produce few-shot examples that further improve validation performance. Further, when used in conjunction with program-of-thought, performance becomes even worse. Our work overcomes these limitations by introducing a novel self-play programming interface which leverages the ability of VLMs to first generate code that decomposes a complex visual reasoning task into sub-tasks, then use the same model, or other models, as a tool to solve the decomposed tasks. Our approach enables DSPy to avoid performance drops when applied iteratively on training sets. Furthermore, it outperforms zero-shot baselines on difficult chart reasoning benchmarks. We report the performance of our approach on ChartQA, PlotQA and ChartFC. This enables large models, such as Gemini or GPT, to autonomously learn how to use themselves as tools and iteratively improve without the need for additional data.
pdf
bib
abs
Towards Better Chain-of-Thought: A Reflection on Effectiveness and Faithfulness
Jiachun Li
|
Pengfei Cao
|
Yubo Chen
|
Jiexin Xu
|
Huaijun Li
|
Xiaojian Jiang
|
Kang Liu
|
Jun Zhao
Chain-of-thought (CoT) prompting demonstrates varying performance under different reasoning tasks. Previous work attempts to evaluate it but falls short in providing an in-depth analysis of patterns that influence the CoT. In this paper, we study the CoT performance from the perspective of effectiveness and faithfulness. For the former, we identify key factors that influence CoT effectiveness on performance improvement, including problem difficulty, information gain, and information flow. For the latter, we interpret the unfaithful CoT issue by conducting a joint analysis of the information interaction among the question, CoT, and answer. The result demonstrates that, when the LLM predicts answers, it can recall correct information missing in the CoT from the question, leading to the problem. Finally, we propose a novel algorithm to mitigate this issue, in which we recall extra information from the question to enhance the CoT generation and evaluate CoTs based on their information gain. Extensive experiments demonstrate that our approach enhances both the faithfulness and effectiveness of CoT.
pdf
bib
abs
A Couch Potato is not a Potato on a Couch: Prompting Strategies, Image Generation, and Compositionality Prediction for Noun Compounds
Sinan Kurtyigit
|
Diego Frassinelli
|
Carina Silberer
|
Sabine Schulte Im Walde
We explore the role of the visual modality and of vision transformers in predicting the compositionality of English noun compounds. Crucially, we contribute a framework to address the challenge of obtaining adequate images that represent non-compositional compounds (such as “couch potato”), making it relevant for any image-based approach targeting figurative language. Our method uses prompting strategies and diffusion models to generate images. Comparing and combining our approach with a state-of-the-art text-based approach reveals complementary contributions regarding features as well as degrees of abstractness in compounds.
pdf
bib
abs
A Rose by Any Other Name: LLM-Generated Explanations Are Good Proxies for Human Explanations to Collect Label Distributions on NLI
Beiduo Chen
|
Siyao Peng
|
Anna Korhonen
|
Barbara Plank
Disagreement in human labeling is ubiquitous, and can be captured in human judgment distributions (HJDs). Recent research has shown that explanations provide valuable information for understanding human label variation (HLV) and that large language models (LLMs) can approximate HJDs from a few human-provided label-explanation pairs. However, collecting explanations for every label is still time-consuming. This paper examines whether LLMs can be used to replace humans in generating explanations for approximating HJDs. Specifically, we use LLMs as annotators to generate model explanations for a few given human labels. We test ways to obtain and combine these label-explanations with the goal of approximating human judgment distributions. We further compare the resulting human explanations with model-generated explanations, and test automatic and human explanation selection. Our experiments show that LLM explanations are promising for NLI: to estimate HJDs, generated explanations yield results comparable to human ones when provided with human labels. Importantly, our results generalize from datasets with human explanations to i) datasets where they are not available and ii) challenging out-of-distribution test sets.
pdf
bib
abs
Measuring What Matters: Evaluating Ensemble LLMs with Label Refinement in Inductive Coding
Angelina Parfenova
|
Jürgen Pfeffer
Inductive coding traditionally relies on labor-intensive effort by human coders, who are prone to inconsistencies and individual biases. Although large language models (LLMs) offer promising automation capabilities, their standalone use often results in inconsistent outputs, limiting their reliability. In this work, we propose a framework that combines ensemble methods with code refinement methodology to address these challenges. Our approach integrates multiple smaller LLMs, fine-tuned via Low-Rank Adaptation (LoRA), and employs a moderator-based mechanism to simulate human consensus. To address the limitations of metrics like ROUGE and BERTScore, we introduce a composite evaluation metric that combines code conciseness and contextual similarity. The validity of this metric is confirmed through correlation analysis with human expert ratings. Results demonstrate that smaller ensemble models with refined outputs consistently outperform other ensembles, individual models, and even large-scale LLMs like GPT-4. Our evidence suggests that smaller ensemble models significantly outperform larger standalone language models, pointing out the risk of relying solely on a single large model for qualitative analysis.
pdf
bib
abs
Dynamic Evil Score-Guided Decoding: An Efficient Decoding Framework For Red-Team Model
Cong Gao
|
Bo Zhang
|
Linkang Yang
|
Minghao Hu
|
Zhunchen Luo
|
Xiaoying Bai
|
Guotong Geng
|
Jun Zhang
|
Yunhua Xue
Large language models (LLMs) have achieved significant advances but can potentially generate harmful content such as social biases, extremism, and misinformation. Red teaming is a promising approach to enhance model safety by creating adversarial prompts to test and improve model robustness. However, existing red-teaming methods often require expensive fine-tuning, especially for large LLMs. We propose the Dynamic Evil Score-Guided Decoding framework (DESGD), an efficient red-teaming method whose computational cost does not increase with the target model size. DESGD introduces the concept of an ‘evil score’ to dynamically evaluate the potential of tokens to contribute to harmful outputs during decoding. This framework constructs a small unsafe model using an adversarial dataset and adjusts the logits vector of the target model based on the evil score. Experiments show that DESGD achieves an ASR of 92.83% on the Llama-3.2-3B-Instruct model, compared to 83.48% with adversarial fine-tuning, while using fewer computational resources. Similarly, on the Qwen2.5-3B-Instruct model, DESGD reaches an ASR of 88.62%, outperforming adversarial fine-tuning (77.56%).
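The logit adjustment described above can be illustrated with a minimal sketch. This is not the DESGD implementation: the abstract does not define how the evil score is computed or combined with the logits, so the additive rule, the `alpha` scaling factor, and the function name `guided_logits` are all assumptions chosen for illustration, and the scores are made-up numbers standing in for the output of a small auxiliary model.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def guided_logits(target_logits, evil_scores, alpha=1.0):
    """Hypothetical guidance rule: shift each token's target-model logit
    by a scaled per-token score. The additive form is an assumption."""
    return [t + alpha * e for t, e in zip(target_logits, evil_scores)]

# Toy vocabulary of three tokens (illustrative values only).
target_logits = [2.0, 1.0, 0.5]   # from the large target model
evil_scores = [-1.0, 0.0, 2.0]    # from a small auxiliary model (assumed)

adjusted = guided_logits(target_logits, evil_scores, alpha=1.5)
before = softmax(target_logits)
after = softmax(adjusted)

# The most likely token moves from index 0 to index 2 after guidance.
top_before = before.index(max(before))
top_after = after.index(max(after))
```

Under this assumed rule the target model is only run for inference; any training is confined to the small auxiliary model, which is consistent with the abstract's claim that cost does not grow with target model size.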
pdf
bib
abs
CoMuMDR: Code-mixed Multi-modal Multi-domain corpus for Discourse paRsing in conversations
Divyaksh Shukla
|
Ritesh Baviskar
|
Dwijesh Gohil
|
Aniket Tiwari
|
Atul Shree
|
Ashutosh Modi
Discourse parsing is an important task useful for NLU applications such as summarization, machine comprehension, and emotion recognition. The current discourse parsing datasets based on conversations consist of written English dialogues restricted to a single domain. In this resource paper, we introduce CoMuMDR: Code-mixed Multi-modal Multi-domain corpus for Discourse paRsing in conversations. The corpus (code-mixed in Hindi and English) has both audio and transcribed text and is annotated with nine discourse relations. We experiment with various SoTA baseline models; the poor performance of SoTA models highlights the challenges of a multi-domain code-mixed corpus, pointing towards the need for developing better models for such realistic settings.
pdf
bib
abs
Multi-word Measures: Modeling Semantic Change in Compound Nouns
Chris Jenkins
|
Filip Miletić
|
Sabine Schulte Im Walde
Compound words (e.g. shower thought) provide a multifaceted challenge for diachronic models of semantic change. Datasets describing noun compound semantics tend to describe only the predominant sense of a compound, which is limiting, especially in diachronic settings where senses may shift over time. We create a novel dataset of relatedness judgements of noun compounds in English and German, the first to capture diachronic meaning changes for multi-word expressions without prematurely condensing individual senses into an aggregate value. Furthermore, we introduce a novel, sense-targeting approach for noun compounds that evaluates two contrasting vector representations in their ability to cluster example sentence pairs. Our clustering approach targets both noun compounds and their constituent parts, to model the interdependence of these terms over time. We calculate time-delineated distributions of these clusters and compare them against measures of semantic change aggregated from the human relatedness annotations.
pdf
bib
abs
Bridge-Coder: Transferring Model Capabilities from High-Resource to Low-Resource Programming Language
Jipeng Zhang
|
Jianshu Zhang
|
Yuanzhe Li
|
Renjie Pi
|
Rui Pan
|
Runtao Liu
|
Zheng Ziqiang
|
Tong Zhang
Most LLMs excel at generating code for high-resource programming languages (HRPLs) like Python, a capability that has become standard due to the abundance of training data. However, they struggle significantly with low-resource programming languages (LRPLs) such as D, exacerbating the digital divide. This gap prevents developers using LRPLs from benefiting equally and hinders innovation within underrepresented programming communities. To make matters worse, manually generating data for LRPLs is highly labor intensive and requires expensive expert effort. In this work, we begin by analyzing the NL-PL Gap, where LLMs’ directly generated LRPL data often suffers from subpar quality due to the misalignment between natural language (NL) instructions and programming language (PL) outputs. To address this issue, we introduce Bridge-Assist Generation, a method to generate LRPL data utilizing LLMs’ general knowledge, HRPL proficiency, and in-context learning capabilities. To further maximize the utility of the generated data, we propose Bridged Alignment to obtain Bridge-Coder. To thoroughly evaluate our approach, we select four relatively low-resource programming languages: R, D, Racket, and Bash. Experimental results reveal that Bridge-Coder achieves significant improvements over the original model, with average gains of 18.71 and 10.81 on two comprehensive benchmarks, M-HumanEval and M-MBPP.
pdf
bib
abs
ProBench: Judging Multimodal Foundation Models on Open-ended Multi-domain Expert Tasks
Yan Yang
|
Dongxu Li
|
Haoning Wu
|
Bei Chen
|
Liu Liu
|
Liyuan Pan
|
Junnan Li
Solving expert-level multimodal tasks is a key milestone in general intelligence. As the capabilities of multimodal large language models (MLLMs) continue to evolve, evaluation of frontier multimodal intelligence becomes necessary yet challenging. In this work, we introduce ProBench, a benchmark of open-ended user queries encapsulating professional expertise and advanced reasoning. ProBench consists of 4,000 high-quality samples independently collected from professionals based on their productivity demands. It spans across 10 fields and 56 sub-fields, including science, arts, humanities, coding, mathematics, and creative writing. Experimentally, we evaluate and compare 24 latest models using MLLM-as-a-Judge. Our results reveal that although the best open-source models rival the proprietary ones, they all face significant challenges in visual perception, textual understanding, domain knowledge, and advanced reasoning. Our benchmark is publicly accessible at TBC.
pdf
bib
abs
2M-BELEBELE: Highly Multilingual Speech and American Sign Language Comprehension Dataset
Marta R. Costa-jussà
|
Bokai Yu
|
Pierre Andrews
|
Belen Alastruey
|
Necati Cihan Camgoz
|
Joe Chuang
|
Jean Maillard
|
Christophe Ropers
|
Arina Turkatenko
|
Carleigh Wood
We introduce the first highly multilingual speech and American Sign Language (ASL) comprehension dataset by extending BELEBELE. Our dataset covers 91 spoken languages at the intersection of BELEBELE and FLEURS, and one sign language (ASL). As a by-product, we also extend the Automatic Speech Recognition benchmark, FLEURS, by 20%. We evaluate the 2M-BELEBELE dataset in both 5-shot and zero-shot settings; across languages, speech comprehension accuracy is on average ≈ 10% lower than reading comprehension accuracy.
pdf
bib
abs
LSC-Eval: A General Framework to Evaluate Methods for Assessing Dimensions of Lexical Semantic Change Using LLM-Generated Synthetic Data
Naomi Baes
|
Raphael Merx
|
Nick Haslam
|
Ekaterina Vylomova
|
Haim Dubossarsky
Lexical Semantic Change (LSC) provides insight into cultural and social dynamics. Yet, the validity of methods for measuring different kinds of LSC remains unestablished due to the absence of historical benchmark datasets. To address this gap, we propose LSC-Eval, a novel three-stage general-purpose evaluation framework to: (1) develop a scalable methodology for generating synthetic datasets that simulate theory-driven LSC using In-Context Learning and a lexical database; (2) use these datasets to evaluate the sensitivity of computational methods to synthetic change; and (3) assess their suitability for detecting change in specific dimensions and domains. We apply LSC-Eval to simulate changes along the Sentiment, Intensity, and Breadth (SIB) dimensions, as defined in the SIBling framework, using examples from psychology. We then evaluate the ability of selected methods to detect these controlled interventions. Our findings validate the use of synthetic benchmarks, demonstrate that tailored methods effectively detect changes along SIB dimensions, and reveal that a state-of-the-art LSC model faces challenges in detecting affective dimensions of LSC. LSC-Eval offers a valuable tool for dimension- and domain-specific benchmarking of LSC methods, with particular relevance to the social sciences.
pdf
bib
abs
Chain-of-Jailbreak Attack for Image Generation Models via Step by Step Editing
Wenxuan Wang
|
Kuiyi Gao
|
Youliang Yuan
|
Jen-tse Huang
|
Qiuzhi Liu
|
Shuai Wang
|
Wenxiang Jiao
|
Zhaopeng Tu
Text-based image generation models, such as Stable Diffusion and DALL-E 3, hold significant potential in content creation and publishing workflows, making them a focus of attention in recent years. Despite their remarkable capability to generate diverse and vivid images, considerable efforts are being made to prevent the generation of harmful content, such as abusive, violent, or pornographic material. To assess the safety of existing models, we introduce a novel jailbreaking method called Chain-of-Jailbreak (CoJ) attack, which compromises image generation models through a step-by-step editing process. Specifically, for malicious queries that cannot bypass the safeguards with a single prompt, we intentionally decompose the query into multiple sub-queries. The image generation models are then prompted to generate and iteratively edit images based on these sub-queries. To evaluate the effectiveness of our CoJ attack method, we constructed a comprehensive dataset, CoJ-Bench, including nine safety scenarios, three types of editing operations, and three editing elements. Experiments on four widely-used image generation services provided by GPT-4V, GPT-4o, Gemini 1.5 and Gemini 1.5 Pro demonstrate that our CoJ attack method can successfully bypass the safeguards of models in over 60% of cases, which significantly outperforms other jailbreaking methods (i.e., 14%). Further, to enhance these models’ safety against our CoJ attack method, we also propose an effective prompting-based method, Think-Twice Prompting, that can successfully defend against over 95% of CoJ attacks. Our dataset and code are included in the supplementary materials and will be made publicly available upon publication.
pdf
bib
abs
Tokenization is Sensitive to Language Variation
Anna Wegmann
|
Dong Nguyen
|
David Jurgens
Variation in language is ubiquitous and often systematically linked to regional, social, and contextual factors. Tokenizers split texts into smaller units and might behave differently for less common linguistic forms. This might affect downstream LLM performance differently on two types of tasks: Tasks where the model should be robust to language variation (e.g., for semantic tasks like NLI, labels do not depend on whether a text uses British or American spelling) and tasks where the model should be sensitive to language variation (e.g., for form-based tasks like authorship verification, labels depend on whether a text uses British or American spelling). We pre-train BERT base models with the popular Byte-Pair Encoding algorithm to investigate how key tokenization design choices impact the performance of downstream models: the corpus used to train the tokenizer, the pre-tokenizer and the vocabulary size. We find that the best tokenizer varies on the two task types and that the pre-tokenizer has the biggest overall impact on performance. Further, we introduce a new approach to estimate tokenizer impact on downstream LLM performance, showing substantial improvement over metrics like Rényi efficiency. We encourage more work on language variation and its relation to tokenizers and thus LLM performance.
pdf
bib
abs
WirelessMathBench: A Mathematical Modeling Benchmark for LLMs in Wireless Communications
Xin Li
|
Mengbing Liu
|
Li Wei
|
Jiancheng An
|
Merouane Abdelkader Debbah
|
Chau Yuen
Large Language Models (LLMs) have achieved impressive results across a broad array of tasks, yet their capacity for complex, domain-specific mathematical reasoning, particularly in wireless communications, remains underexplored. In this work, we introduce WirelessMathBench, a novel benchmark specifically designed to evaluate LLMs on mathematical modeling challenges in wireless communications engineering. Our benchmark consists of 587 meticulously curated questions sourced from 40 state-of-the-art research papers, encompassing a diverse spectrum of tasks ranging from basic multiple-choice questions to complex equation completion tasks, including both partial and full completions, all of which rigorously adhere to physical and dimensional constraints. Through extensive experimentation with leading LLMs, we observe that while many models excel in basic recall tasks, their performance degrades significantly when reconstructing partially or fully obscured equations, exposing fundamental limitations in current LLMs. Even DeepSeek-R1, the best performer on our benchmark, achieves an average accuracy of only 38.05%, with a mere 7.83% success rate in full equation completion. By publicly releasing WirelessMathBench along with the evaluation toolkit, we aim to advance the development of more robust, domain-aware LLMs for wireless system analysis and broader engineering applications.
pdf
bib
abs
Self-Improvement Towards Pareto Optimality: Mitigating Preference Conflicts in Multi-Objective Alignment
Moxin Li
|
Yuantao Zhang
|
Wenjie Wang
|
Wentao Shi
|
Zhuo Liu
|
Fuli Feng
|
Tat-Seng Chua
Multi-Objective Alignment (MOA) aims to align LLMs’ responses with multiple human preference objectives, with Direct Preference Optimization (DPO) emerging as a prominent approach. However, we find that DPO-based MOA approaches suffer from widespread preference conflicts in the data, where different objectives favor different responses. This results in conflicting optimization directions, hindering the optimization on the Pareto Front. To address this, we propose to construct Pareto-optimal responses to resolve preference conflicts. To efficiently obtain and utilize such responses, we propose a self-improving DPO framework that enables LLMs to self-generate and select Pareto-optimal responses for self-supervised preference alignment. Extensive experiments on two datasets demonstrate the superior Pareto Front achieved by our framework compared to various baselines.
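To make the notion of a Pareto-optimal response concrete, the following sketch selects the non-dominated candidates from a set of responses scored on several objectives. It shows only the standard Pareto-dominance definition, not the paper's selection procedure; the response names, the two objectives, and the scores are hypothetical.

```python
from typing import Dict, List, Sequence

def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """a dominates b if it is at least as good on every objective
    and strictly better on at least one (higher is better)."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def pareto_front(scores: Dict[str, List[float]]) -> List[str]:
    """Return the names of all candidates that no other candidate dominates."""
    return [
        name
        for name, s in scores.items()
        if not any(dominates(t, s) for other, t in scores.items() if other != name)
    ]

# Hypothetical reward scores per response: [helpfulness, harmlessness].
candidates = {
    "resp_a": [0.9, 0.2],
    "resp_b": [0.6, 0.7],
    "resp_c": [0.5, 0.6],  # worse than resp_b on both objectives
    "resp_d": [0.3, 0.9],
}

front = pareto_front(candidates)
```

Here `resp_a` and `resp_d` illustrate a preference conflict: each wins on one objective and loses on the other, so neither dominates the other and both stay on the front, while `resp_c` is dropped.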
pdf
bib
abs
Investigating and Scaling up Code-Switching for Multilingual Language Model Pre-Training
Zhijun Wang
|
Jiahuan Li
|
Hao Zhou
|
Rongxiang Weng
|
Jingang Wang
|
Xin Huang
|
Xue Han
|
Junlan Feng
|
Chao Deng
|
Shujian Huang
Large language models (LLMs) exhibit remarkable multilingual capabilities despite the extreme language imbalance in the pre-training data. In this paper, we closely examine the reasons behind this phenomenon, focusing on the pre-training corpus. We find that the existence of code-switching, alternating between different languages within a context, is key to multilingual capabilities. We conduct an analysis to investigate code-switching in the pre-training corpus, examining its presence and categorizing it into four types within two quadrants. We then assess its impact on multilingual performance. These types of code-switching data are unbalanced in proportions and demonstrate different effects on facilitating language transfer. To better explore the power of code-switching for language alignment during pre-training, we investigate the strategy of synthetic code-switching. We continuously scale up the synthetic code-switching data and observe remarkable improvements in both benchmarks and representation space. Extensive experiments indicate that incorporating synthetic code-switching data enables better language alignment and generalizes well to high, medium, and low-resource languages with pre-training corpora of varying qualities.
pdf
bib
abs
User Behavior Prediction as a Generic, Robust, Scalable, and Low-Cost Evaluation Strategy for Estimating Generalization in LLMs
Sougata Saha
|
Monojit Choudhury
Measuring the generalization ability of Large Language Models (LLMs) is challenging due to data contamination. As models grow and computation becomes cheaper, ensuring tasks and test cases are unseen during training phases will become nearly impossible. We argue that knowledge-retrieval and reasoning tasks are not ideal for measuring generalization, as LLMs are not trained for specific tasks. Instead, we propose user behavior prediction, also a key aspect of personalization, as a theoretically sound, scalable, and robust alternative. We introduce a novel framework for this approach and test it on movie and music recommendation datasets for GPT-4o, GPT-4o-mini, and Llama-3.1-8B-Instruct. Results align with our framework’s predictions, showing GPT-4o outperforms GPT-4o-mini and Llama, though all models have much room for improvement, especially Llama.
pdf
bib
abs
Beyond Browsing: API-Based Web Agents
Yueqi Song
|
Frank F. Xu
|
Shuyan Zhou
|
Graham Neubig
Web browsers are a portal to the internet, where much of human activity is undertaken. Thus, there has been significant research work in AI agents that interact with the internet through web browsing. However, there is also another interface designed specifically for machine interaction with online content: application programming interfaces (APIs). In this paper we ask – *what if we were to take tasks traditionally tackled by Browsing Agents, and give AI agents access to APIs*? To do so, we propose two varieties of agents: (1) an API-calling agent that attempts to perform online tasks through APIs only, similar to traditional coding agents, and (2) a Hybrid Agent that can interact with online data through both web browsing and APIs. In experiments on WebArena, a widely-used and realistic benchmark for web navigation tasks, we find that API-Based Agents outperform web Browsing Agents. Hybrid Agents outperform both others nearly uniformly across tasks, resulting in a more than 24.0% absolute improvement over web browsing alone, achieving a success rate of 38.9%, the SOTA performance among task-agnostic agents. These results strongly suggest that when APIs are available, they present an attractive alternative to relying on web browsing alone.
pdf
bib
abs
MiLiC-Eval: Benchmarking Multilingual LLMs for China’s Minority Languages
Chen Zhang
|
Mingxu Tao
|
Zhiyuan Liao
|
Yansong Feng
Large language models (LLMs) excel in high-resource languages but struggle with low-resource languages (LRLs), particularly those spoken by minority communities in China, such as Tibetan, Uyghur, Kazakh, and Mongolian. To systematically track the progress in these languages, we introduce MiLiC-Eval, a benchmark designed for minority languages in China, featuring 24K instances across 9 tasks. MiLiC-Eval focuses on underrepresented writing systems. Its parallelism between tasks and languages can provide a faithful and fine-grained assessment of linguistic and problem-solving skills. Our evaluation reveals that open-source LLMs perform poorly on syntax-intensive tasks and multi-script languages. We further demonstrate how MiLiC-Eval can help advance LRL research in handling diverse writing systems and understanding the process of language adaptation.
pdf
bib
abs
ArgInstruct: Specialized Instruction Fine-Tuning for Computational Argumentation
Maja Stahl
|
Timon Ziegenbein
|
Joonsuk Park
|
Henning Wachsmuth
Training large language models (LLMs) to follow instructions has significantly enhanced their ability to tackle unseen tasks. However, despite their strong generalization capabilities, instruction-following LLMs encounter difficulties when dealing with tasks that require domain knowledge. This work introduces a specialized instruction fine-tuning for the domain of computational argumentation (CA). The goal is to enable an LLM to effectively tackle any unseen CA tasks while preserving its generalization capabilities. Reviewing existing CA research, we crafted natural language instructions for 105 CA tasks to this end. On this basis, we developed a CA-specific benchmark for LLMs that allows for a comprehensive evaluation of LLMs’ capabilities in solving various CA tasks. We synthesized 52k CA-related instructions, adapting the self-instruct process to train a CA-specialized instruction-following LLM. Our experiments suggest that CA-specialized instruction fine-tuning significantly enhances the LLM on both seen and unseen CA tasks. At the same time, performance on the general NLP tasks of the SuperNI benchmark remains stable.
pdf
bib
abs
Crabs: Consuming Resource via Auto-generation for LLM-DoS Attack under Black-box Settings
Yuanhe Zhang
|
Zhenhong Zhou
|
Wei Zhang
|
Xinyue Wang
|
Xiaojun Jia
|
Yang Liu
|
Sen Su
Large Language Models (LLMs) have demonstrated remarkable performance across diverse tasks yet remain vulnerable to external threats, particularly LLM Denial-of-Service (LLM-DoS) attacks. Specifically, LLM-DoS attacks aim to exhaust computational resources and block services. However, existing studies predominantly focus on white-box attacks, leaving black-box scenarios underexplored. In this paper, we introduce Auto-Generation for LLM-DoS (AutoDoS) attack, an automated algorithm designed for black-box LLMs. AutoDoS constructs the DoS Attack Tree and expands the node coverage to achieve effectiveness under black-box conditions. By transferability-driven iterative optimization, AutoDoS could work across different models in one prompt. Furthermore, we reveal that embedding the Length Trojan allows AutoDoS to bypass existing defenses more effectively. Experimental results show that AutoDoS significantly amplifies service response latency by over 250×, leading to severe resource consumption in terms of GPU utilization and memory usage. Our work provides a new perspective on LLM-DoS attacks and security defenses.
pdf
bib
abs
Probabilistic Aggregation and Targeted Embedding Optimization for Collective Moral Reasoning in Large Language Models
Chenchen Yuan
|
Zheyu Zhang
|
Shuo Yang
|
Bardh Prenkaj
|
Gjergji Kasneci
Large Language Models (LLMs) have shown impressive moral reasoning abilities. Yet they often diverge when confronted with complex, multi-factor moral dilemmas. To address these discrepancies, we propose a framework that synthesizes multiple LLMs’ moral judgments into a collectively formulated moral judgment, realigning models that deviate significantly from this consensus. Our aggregation mechanism fuses continuous moral acceptability scores (beyond binary labels) into a collective probability, weighting contributions by model reliability. For misaligned models, a targeted embedding-optimization procedure fine-tunes token embeddings for moral philosophical theories, minimizing JS divergence to the consensus while preserving semantic integrity. Experiments on a large-scale social moral dilemma dataset show our approach builds robust consensus and improves individual model fidelity. These findings highlight the value of data-driven moral alignment across multiple models and its potential for safer, more consistent AI systems.
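The two quantities named above, a reliability-weighted consensus and the Jensen-Shannon (JS) divergence from it, can be sketched as follows. The JS divergence is the standard definition; the weighted-mean aggregation, the example scores, and the reliability weights are assumptions for illustration and may differ from the paper's actual aggregation mechanism.

```python
import math
from typing import List, Sequence

def kl(p: Sequence[float], q: Sequence[float]) -> float:
    """Kullback-Leibler divergence KL(p || q) in nats; terms with p_i = 0 contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """Jensen-Shannon divergence: average KL of p and q to their midpoint m."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def aggregate(scores: Sequence[float], weights: Sequence[float]) -> float:
    """Assumed aggregation rule: reliability-weighted mean of acceptability scores."""
    return sum(s * w for s, w in zip(scores, weights)) / sum(weights)

def bernoulli(p: float) -> List[float]:
    """Two-outcome distribution [acceptable, not acceptable]."""
    return [p, 1.0 - p]

# Hypothetical acceptability scores from three models and their reliability weights.
scores = [0.9, 0.8, 0.2]
weights = [0.5, 0.3, 0.2]

consensus = aggregate(scores, weights)
divergences = [js_divergence(bernoulli(s), bernoulli(consensus)) for s in scores]

# The model with the largest divergence is the candidate for realignment.
most_misaligned = divergences.index(max(divergences))
```

JS divergence is symmetric and bounded by ln 2 in nats, which makes it a convenient target when pulling a single model toward a consensus distribution.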
pdf
bib
abs
Unlocking Recursive Thinking of LLMs: Alignment via Refinement
Haoke Zhang
|
Xiaobo Liang
|
Cunxiang Wang
|
Juntao Li
|
Min Zhang
The OpenAI o1-series models have demonstrated that leveraging long-form Chain of Thought (CoT) can substantially enhance performance. However, the recursive thinking capabilities of Large Language Models (LLMs) remain limited, particularly in the absence of expert-curated data for distillation. In this paper, we propose AvR: Alignment via Refinement, a novel method aimed at unlocking the potential of LLMs for recursive reasoning through long-form CoT. AvR introduces a refinement process that integrates criticism and improvement actions, guided by differentiable learning techniques to optimize refinement-aware rewards. As a result, the synthesized multi-round data can be organized as a long refinement thought, further enabling test-time scaling. Experimental results show that AvR significantly outperforms conventional preference optimization methods. Notably, with only 3k synthetic samples, our method boosts the performance of the LLaMA-3-8B-Instruct model by over 20% in win rate on AlpacaEval 2.0. Our code is available on GitHub.
pdf
bib
abs
CitaLaw: Enhancing LLM with Citations in Legal Domain
Kepu Zhang
|
Weijie Yu
|
Sunhao Dai
|
Jun Xu
In this paper, we propose CitaLaw, the first benchmark designed to evaluate LLMs’ ability to produce legally sound responses with appropriate citations. CitaLaw features a diverse set of legal questions for both laypersons and practitioners, paired with a comprehensive corpus of law articles and precedent cases as a reference pool. This framework enables LLM-based systems to retrieve supporting citations from the reference corpus and align these citations with the corresponding sentences in their responses. Moreover, we introduce syllogism-inspired evaluation methods to assess the legal alignment between retrieved references and LLM-generated responses, as well as their consistency with user questions. Extensive experiments on 2 open-domain and 7 legal-specific LLMs demonstrate that integrating legal references substantially enhances response quality. Furthermore, our proposed syllogism-based evaluation method exhibits strong agreement with human judgments.
pdf
bib
abs
MEGen: Generative Backdoor into Large Language Models via Model Editing
Jiyang Qiu
|
Xinbei Ma
|
Zhuosheng Zhang
|
Hai Zhao
|
Yun Li
|
Qianren Wang
Large language models (LLMs) have exhibited remarkable versatility and adaptability, while their widespread adoption across various applications also raises critical safety concerns. This paper focuses on the impact of backdoored LLMs. Traditional backdoor injection methods are primarily limited to yes-or-no discriminative tasks, leading users to underestimate the potential risks of backdoored LLMs. Given the inherently generative nature of LLMs, this paper reveals that a generative backdoor injected into LLMs can expose the true safety risks in their applications. We propose an editing-based generative backdoor, named MEGen, aiming to expand the backdoor to generative tasks in a unified format of any text to any text, leading to natural generations with a specific intention. Experiments show that MEGen achieves a high attack success rate by adjusting only a small set of local parameters with few-shot samples. Notably, we show that the backdoored model, when triggered, can freely output pre-set dangerous information while completing downstream tasks. Our work highlights that MEGen enables backdoors in LLMs to exhibit generative capabilities, causing potential safety risks by altering the generative style. The code is available at https://github.com/MonoQ-hub/MEGen.
pdf
bib
abs
Social Bias Benchmark for Generation: A Comparison of Generation and QA-Based Evaluations
Jiho Jin
|
Woosung Kang
|
Junho Myung
|
Alice Oh
Measuring social bias in large language models (LLMs) is crucial, but existing bias evaluation methods struggle to assess bias in long-form generation. We propose a Bias Benchmark for Generation (BBG), an adaptation of the Bias Benchmark for QA (BBQ), designed to evaluate social bias in long-form generation by having LLMs generate continuations of story prompts. Building our benchmark in English and Korean, we measure the probability of neutral and biased generations across ten LLMs. We also compare our long-form story generation evaluation results with multiple-choice BBQ evaluation, showing that the two approaches produce inconsistent results.
pdf
bib
abs
Generating Pedagogically Meaningful Visuals for Math Word Problems: A New Benchmark and Analysis of Text-to-Image Models
Junling Wang
|
Anna Rutkiewicz
|
April Wang
|
Mrinmaya Sachan
Visuals are valuable tools for teaching math word problems (MWPs), helping young learners interpret textual descriptions into mathematical expressions before solving them. However, creating such visuals is labor-intensive and there is a lack of automated methods to support this process. In this paper, we present Math2Visual, an automatic framework for generating pedagogically meaningful visuals from MWP text descriptions. Math2Visual leverages a pre-defined visual language and a design space grounded in interviews with math teachers to illustrate the core mathematical relationships in MWPs. Using Math2Visual, we construct an annotated dataset of 1,903 visuals and evaluate Text-to-Image (TTI) models for their ability to generate visuals that align with our design. We further fine-tune several TTI models with our dataset, demonstrating improvements in educational visual generation. Our work establishes a new benchmark for automated generation of pedagogically meaningful visuals and offers insights into key challenges in producing multimodal educational content, such as the misrepresentation of mathematical relationships and the omission of essential visual elements.
pdf
bib
abs
RASPberry: Retrieval-Augmented Monte Carlo Tree Self-Play with Reasoning Consistency for Multi-Hop Question Answering
Baixuan Li
|
Yunlong Fan
|
Tianyi Ma
|
Miao Gao
|
Chuanqi Shi
|
Zhiqiang Gao
Complex multi-hop question answering requires large language models (LLMs) not only to retrieve external knowledge but also to reason over the retrieved information in order to arrive at the final solution. This involves two key challenges: (i) how to effectively explore the solution space and generate more potentially correct solution candidates, and (ii) how to select the optimal solution from multiple solution candidates, both of which require a training-free approach without introducing a more powerful teacher model. To address these challenges, we propose Retrieval-Augmented Monte Carlo Tree Self-Play with Reasoning Consistency (RASPberry), which introduces a more flexible action-level sampling granularity compared to existing methods, leverages Monte Carlo Tree Search for efficient solution space exploration, and utilizes an enhanced version of reasoning consistency to guide the selection of the optimal solution. Experimental results demonstrate that our proposed RASPberry effectively tackles the two challenges outlined above, achieving more efficient RAG inference-time scaling. Our code is available at https://github.com/BaixuanLi/RASPberry.
pdf
bib
abs
All That Glitters is Not Gold: Improving Robust Retrieval-Augmented Language Models with Fact-Centric Preference Alignment
Jia Hao
|
Chunhong Zhang
|
Jiarun Liu
|
Haiyu Zhao
|
Zhiqiang Zhan
|
Zheng Hu
Retrieval-augmented language models (RALMs) rely on retrieved external knowledge to generate responses, which makes them vulnerable to retrieval results that contain noisy documents. Previous works integrate additional filters or fine-tune Large Language Models (LLMs) to learn adaptive retrieval in order to reduce the performance damage caused by noisy documents. However, prior noise filtering may lead to the loss of crucial information, and these methods do not focus on distracting documents with high semantic relevance, which is the most challenging problem. In this study, we propose a training method for fact-centric preference alignment (FPA) to improve the ability of LLMs to directly extract useful information from noisy retrieval results without prior filtering. Our method performs positive document mining based on factual consistency and uses LLMs' self-generated synthetic data as training data without manual annotation. We evaluate our FPA on four question answering benchmarks, and the experimental results demonstrate that our method achieves significant improvement with a small scale of training data.
pdf
bib
abs
FairSteer: Inference Time Debiasing for LLMs with Dynamic Activation Steering
Yichen Li
|
Zhiting Fan
|
Ruizhe Chen
|
Xiaotang Gai
|
Luqi Gong
|
Yan Zhang
|
Zuozhu Liu
Large language models (LLMs) are prone to capturing biases from training corpus, leading to potential negative social impacts. Existing prompt-based debiasing methods exhibit instability due to their sensitivity to prompt changes, while fine-tuning-based techniques incur substantial computational overhead and catastrophic forgetting. In this paper, we propose FairSteer, a novel inference-time debiasing framework without requiring customized prompt design or model retraining. Motivated by the linear representation hypothesis, our preliminary investigation demonstrates that fairness-related features can be encoded into separable directions in the hidden activation space. FairSteer operates in three steps: biased activation detection, debiasing steering vector (DSV) computation, and dynamic activation steering. Specifically, it first trains a lightweight linear classifier to detect bias signatures in activations, and then computes DSVs as intervention directions derived from small contrastive prompt pairs. Subsequently, it performs debiasing by adjusting activations with DSVs in the inference stage. Comprehensive evaluation with six LLMs demonstrates the superiority of FairSteer across question-answering, counterfactual input evaluation and open-ended text generation tasks. Code will be released.
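The three steps named in the abstract (bias detection, steering-vector computation, conditional steering) can be illustrated with a toy sketch. This is not the authors' implementation: the mean-difference construction of the steering vector, the probe weights, the `alpha` scale, and the 2-d activations are all assumptions chosen for illustration.

```python
# Illustrative sketch only; every name and number here is hypothetical.

def mean_vector(vectors):
    """Element-wise mean of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def steering_vector(unbiased_acts, biased_acts):
    """Assumed construction: mean unbiased activation minus mean biased one."""
    mu_u = mean_vector(unbiased_acts)
    mu_b = mean_vector(biased_acts)
    return [u - b for u, b in zip(mu_u, mu_b)]

def is_biased(activation, weights, bias_term):
    """Linear probe: flag the activation when its score is positive."""
    score = sum(w * a for w, a in zip(weights, activation)) + bias_term
    return score > 0.0

def steer(activation, dsv, weights, bias_term, alpha=1.0):
    """Add the steering vector only when the probe detects bias."""
    if not is_biased(activation, weights, bias_term):
        return list(activation)
    return [a + alpha * d for a, d in zip(activation, dsv)]

# Toy activations from contrastive prompt pairs (made up for the example).
biased = [[1.0, 0.0], [3.0, 0.0]]
unbiased = [[0.0, 1.0], [0.0, 3.0]]
dsv = steering_vector(unbiased, biased)

# Hypothetical probe parameters; in practice these would be learned.
probe_w, probe_b = [1.0, -1.0], 0.0
steered = steer([2.0, 0.0], dsv, probe_w, probe_b)    # probe fires, vector added
untouched = steer([0.0, 2.0], dsv, probe_w, probe_b)  # probe silent, unchanged
```

Gating the intervention on the probe is what makes the steering "dynamic" in this sketch: activations the probe does not flag pass through unmodified.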
pdf
bib
abs
Listen, Watch, and Learn to Feel: Retrieval-Augmented Emotion Reasoning for Compound Emotion Generation
Zhuofan Wen
|
Zheng Lian
|
Shun Chen
|
Hailiang Yao
|
Longjiang Yang
|
Bin Liu
|
Jianhua Tao
The ability to comprehend human emotion using multimodal large language models (MLLMs) is essential for advancing human-AI interaction and multimodal sentiment analysis. While psychology theory-based human annotations have contributed to multimodal emotion tasks, the subjective nature of emotional perception often leads to inconsistent annotations, limiting the robustness of current models. Addressing these challenges requires more fine-grained methods and evaluation frameworks. In this paper, we propose the Retrieval-Augmented Emotion Reasoning (RAER) framework, a plug-and-play module that enhances MLLMs’ ability to tackle compound and context-rich emotion tasks. To systematically evaluate model performance, we introduce the Stimulus-Armed Bandit (SAB) framework, designed to benchmark emotional reasoning capabilities. Additionally, we construct the Compound Emotion QA dataset, an AI-generated multimodal dataset aimed at strengthening emotion understanding in MLLMs. Experimental results demonstrate the effectiveness of RAER across both traditional benchmarks and SAB evaluations, highlighting its potential to enhance emotional intelligence in multimodal AI systems.
pdf
bib
abs
GLTW: Joint Improved Graph Transformer and LLM via Three-Word Language for Knowledge Graph Completion
Kangyang Luo
|
Yuzhuo Bai
|
Cheng Gao
|
Shuzheng Si
|
Zhu Liu
|
Yingli Shen
|
Zhitong Wang
|
Cunliang Kong
|
Wenhao Li
|
Yufei Huang
|
Ye Tian
|
Xuantang Xiong
|
Lei Han
|
Maosong Sun
Knowledge Graph Completion (KGC), which aims to infer missing or incomplete facts, is a crucial task for KGs. However, integrating the vital structural information of KGs into Large Language Models (LLMs) and outputting predictions deterministically remains challenging. To address this, we propose a new method called GLTW, which encodes the structural information of KGs and merges it with LLMs to enhance KGC performance. Specifically, we introduce an improved Graph Transformer (iGT) that effectively encodes subgraphs with both local and global structural information and inherits the characteristics of language models, bypassing training from scratch. Also, we develop a subgraph-based multi-classification training objective, using all entities within the KG as classification objects, to boost learning efficiency. Importantly, we combine iGT with an LLM that takes KG language prompts as input. Our extensive experiments on various KG datasets show that GLTW achieves significant performance gains compared to SOTA baselines.
pdf
bib
abs
Learning to Select In-Context Demonstration Preferred by Large Language Model
Zheng Zhang
|
Shaocheng Lan
|
Lei Song
|
Jiang Bian
|
Yexin Li
|
Kan Ren
In-context learning (ICL) enables large language models (LLMs) to adapt to new tasks during inference using only a few demonstrations. However, ICL performance is highly dependent on the selection of these demonstrations. Recent work explores retrieval-based methods for selecting query-specific demonstrations, but these approaches often rely on surrogate objectives such as metric learning, failing to directly optimize ICL performance. Consequently, they struggle to identify truly beneficial demonstrations. Moreover, their discriminative retrieval paradigm is ineffective when the candidate pool lacks sufficient high-quality demonstrations. To address these challenges, we propose GenICL, a novel generative preference learning framework that leverages LLM feedback to directly optimize demonstration selection for ICL. Experiments on 19 datasets across 11 task categories demonstrate that GenICL outperforms existing methods in selecting the most effective demonstrations, leading to better ICL performance.
pdf
bib
abs
Beyond the Spelling Miracle: Investigating Substring Awareness in Character-Blind Language Models
Cristiano Ciaccio
|
Marta Sartor
|
Alessio Miaschi
|
Felice Dell’Orletta
Correctly identifying characters and substrings of words should be a basic but essential ability of any Language Model that aims to proficiently understand and produce language. Despite this, the majority of Pre-trained Language Models (PLMs) are “character-blind” and struggle in spelling tasks, although they still seem to acquire some character knowledge during pre-training, a phenomenon dubbed the Spelling Miracle. To shed light on this phenomenon, we systematically evaluate a range of PLMs with different parameter sizes using a controlled binary substring identification task. Through a series of experiments, we propose the first comprehensive investigation of where, when, and how PLMs develop awareness of characters and substrings, with a particular linguistic focus on morphemic units such as prefixes, suffixes, and roots.
pdf
bib
abs
DEMO: Reframing Dialogue Interaction with Fine-grained Element Modeling
Minzheng Wang
|
Xinghua Zhang
|
Kun Chen
|
Nan Xu
|
Haiyang Yu
|
Fei Huang
|
Wenji Mao
|
Yongbin Li
Dialogue systems enabled by large language models (LLMs) have become one of the central modes of human-machine interaction, bringing about vast amounts of conversation logs and an increasing demand for dialogue generation. The dialogue’s life-cycle spans from Prelude through Interlocution to Epilogue, encompassing rich dialogue elements. Despite the large volume of dialogue-related studies, there is a lack of systematic investigation into the dialogue stages to frame benchmark construction that covers comprehensive dialogue elements. This hinders the precise modeling, generation and assessment of LLM-based dialogue systems. To bridge this gap, in this paper, we introduce a new research task, Dialogue Element MOdeling, including Element Awareness and Dialogue Agent Interaction, and propose a novel benchmark, DEMO, designed for comprehensive dialogue modeling and assessment. On this basis, we further build the DEMO agent with the adept ability to model dialogue elements via imitation learning. Extensive experiments on DEMO indicate that current representative LLMs still have considerable potential for enhancement, and our DEMO agent performs well in both dialogue element modeling and out-of-domain tasks.
pdf
bib
abs
InfiniteICL: Breaking the Limit of Context Window Size via Long Short-term Memory Transformation
Bowen Cao
|
Deng Cai
|
Wai Lam
In-context learning (ICL) is critical for large language models (LLMs), but its effectiveness is constrained by finite context windows, particularly in ultra-long contexts. To overcome this, we introduce InfiniteICL, a framework that parallels context and parameters in LLMs with short- and long-term memory in human cognitive systems, focusing on transforming temporary context knowledge into permanent parameter updates. This approach significantly reduces memory usage, maintains robust performance across varying input lengths, and theoretically enables infinite context integration through the principles of context knowledge elicitation, selection, and consolidation. Evaluations demonstrate that our method reduces context length by 90% while achieving 103% average performance of full-context prompting across fact recall, grounded reasoning, and skill acquisition tasks. When conducting sequential multi-turn transformations on complex, real-world contexts (with length up to 2M tokens), our approach surpasses full-context prompting while using only 0.4% of the original contexts. These findings highlight InfiniteICL’s potential to enhance the scalability and efficiency of LLMs by breaking the limitations of conventional context window sizes.
pdf
bib
abs
M3HG: Multimodal, Multi-scale, and Multi-type Node Heterogeneous Graph for Emotion Cause Triplet Extraction in Conversations
Qiao Liang
|
Ying Shen
|
Tiantian Chen
|
Lin Zhang
Emotion Cause Triplet Extraction in Multimodal Conversations (MECTEC) has recently gained significant attention in social media analysis, aiming to extract emotion utterances, cause utterances, and emotion categories simultaneously. However, the scarcity of related datasets, with only one published dataset featuring highly uniform dialogue scenarios, hinders model development in this field. To address this, we introduce MECAD, the first multimodal, multi-scenario MECTEC dataset, comprising 989 conversations from 56 TV series spanning a wide range of dialogue contexts. In addition, existing MECTEC methods fail to explicitly model emotional and causal contexts and neglect the fusion of semantic information at different levels, leading to performance degradation. In this paper, we propose M3HG, a novel model that explicitly captures emotional and causal contexts and effectively fuses contextual information at both inter- and intra-utterance levels via a multimodal heterogeneous graph. Extensive experiments demonstrate the effectiveness of M3HG compared with existing state-of-the-art methods. Codes are available at https://anonymous.4open.science/r/M3HG-6B34.
pdf
bib
abs
Large Language Models Are Natural Video Popularity Predictors
Pratik Kayal
|
Pascal Mettes
|
Nima Dehmamy
|
Minsu Park
Predicting video popularity is often framed as a supervised learning task, relying heavily on meta-information and aggregated engagement data. However, video popularity is shaped by complex cultural and social factors that such approaches often overlook. We argue that Large Language Models (LLMs), with their deep contextual awareness, can better capture these nuances. To bridge the gap between pixel-based video data and token-based LLMs, we convert frame-level visuals into sequential text representations using Vision-Language Models. This enables LLMs to process multimodal content—titles, frame-based descriptions, and captions—capturing both engagement intensity (view count) and geographic spread (number of countries where a video trends). On 13,639 popular videos, a supervised neural network using content embeddings achieves 80% accuracy, while our LLM-based approach reaches 82% without fine-tuning. Combining the neural network’s predictions with the LLM further improves accuracy to 85.5%. Moreover, the LLM generates interpretable, attribute-based explanations for its predictions. Manual validations confirm the quality of these hypotheses and address concerns about hallucinations in the video-to-text conversion process. Overall, our findings suggest that LLMs, equipped with text-based multimodal representations, offer a powerful, interpretable, and data-efficient solution for tasks requiring rich contextual insight, such as video popularity prediction.
pdf
bib
abs
DELMAN: Dynamic Defense Against Large Language Model Jailbreaking with Model Editing
Yi Wang
|
Fenghua Weng
|
Sibei Yang
|
Zhan Qin
|
Minlie Huang
|
Wenjie Wang
Large Language Models (LLMs) are widely applied in decision making, but their deployment is threatened by jailbreak attacks, where adversarial users manipulate model behavior to bypass safety measures. Existing defense mechanisms, such as safety fine-tuning and model editing, either require extensive parameter modifications or lack precision, leading to performance degradation on general tasks, which makes them unsuitable for post-deployment safety alignment. To address these challenges, we propose DELMAN (Dynamic Editing for LLMs JAilbreak DefeNse), a novel approach leveraging direct model editing for precise, dynamic protection against jailbreak attacks. DELMAN directly updates a minimal set of relevant parameters to neutralize harmful behaviors while preserving the model’s utility. To avoid triggering a safe response in benign contexts, we incorporate KL-divergence regularization to ensure the updated model remains consistent with the original model when processing benign queries. Experimental results demonstrate that DELMAN outperforms baseline methods in mitigating jailbreak attacks while preserving the model’s utility, and adapts seamlessly to new attack instances, providing a practical and efficient solution for post-deployment model protection.
pdf
bib
abs
You need to MIMIC to get FAME: Solving Meeting Transcript Scarcity with Multi-Agent Conversations
Frederic Kirstein
|
Muneeb Khan
|
Jan Philip Wahle
|
Terry Ruas
|
Bela Gipp
Meeting summarization suffers from limited high-quality data, mainly due to privacy restrictions and expensive collection processes. We address this gap with FAME, a dataset of 500 meetings in English and 300 in German produced by MIMIC, our new multi-agent meeting synthesis framework that generates meeting transcripts on a given knowledge source by defining psychologically grounded participant profiles, outlining the conversation, and orchestrating a large language model (LLM) debate. A modular post-processing step refines these outputs, mitigating potential repetitiveness and overly formal tones, ensuring coherent, credible dialogues at scale. We also propose a psychologically grounded evaluation framework assessing naturalness, social behavior authenticity, and transcript difficulties. Human assessments show that FAME approximates real-meeting spontaneity (4.5/5 in naturalness), preserves speaker-centric challenges (3/5 in spoken language), and introduces richer information-oriented difficulty (4/5 points in difficulty). These findings show FAME is a good and scalable proxy for real-world meeting conditions. It enables new test scenarios for meeting summarization research and other conversation-centric applications in tasks requiring conversation data or simulating social scenarios under behavioral constraints.
pdf
bib
abs
Code-Switching and Syntax: A Large-Scale Experiment
Igor Sterner
|
Simone Teufel
The theoretical code-switching (CS) literature provides numerous pointwise investigations that aim to explain patterns in CS, i.e. why bilinguals switch language in certain positions in a sentence more often than in others. A resulting consensus is that CS can be explained by the syntax of the contributing languages. There is however no large-scale, multi-language, cross-phenomena experiment that tests this claim. When designing such an experiment, we need to make sure that the system that is predicting where bilinguals tend to switch has access only to syntactic information. We provide such an experiment here. Results show that syntax alone is sufficient for an automatic system to distinguish between sentences in minimal pairs of CS, to the same degree as bilingual humans. Furthermore, the learnt syntactic patterns generalise well to unseen language pairs.
pdf
bib
abs
Optima: Optimizing Effectiveness and Efficiency for LLM-Based Multi-Agent System
Weize Chen
|
Jiarui Yuan
|
Chen Qian
|
Cheng Yang
|
Zhiyuan Liu
|
Maosong Sun
Large Language Model (LLM) based multi-agent systems (MAS) show remarkable potential in collaborative problem-solving, yet they still face critical challenges: low communication efficiency, poor scalability, and a lack of effective parameter-updating optimization methods. We present Optima, a novel framework that addresses these issues by significantly enhancing both communication efficiency and task effectiveness in LLM-based MAS through training. Optima employs an iterative generate, rank, select, and train paradigm with a reward function balancing task performance, token efficiency, and communication readability. We explore various algorithms, including Supervised Fine-Tuning, Direct Preference Optimization, and their hybrid approaches, providing insights into their effectiveness-efficiency trade-offs. We integrate Monte Carlo Tree Search-inspired techniques for DPO data generation, treating conversation turns as tree nodes to explore diverse interaction paths. Evaluated on common multi-agent tasks, including information-asymmetric question answering and complex reasoning, Optima shows consistent and substantial improvements over single-agent baselines and vanilla MAS based on Llama 3 8B / 3.2 3B, achieving up to 2.8x performance gain with less than 10% tokens on tasks requiring heavy information exchange. Moreover, Optima’s efficiency gains enable more effective compute utilization during inference, leading to improved inference-time scaling laws. By addressing fundamental challenges in LLM-based MAS, Optima shows the potential towards scalable, efficient, and effective MAS.
pdf
bib
abs
Generating Domain-Specific Knowledge Graphs from Large Language Models
Marinela Parović
|
Ze Li
|
Jinhua Du
Knowledge graphs (KGs) have been a cornerstone of search and recommendation due to their ability to store factual knowledge about any domain in a structured form enabling easy search and retrieval. Large language models (LLMs) have shown impressive world knowledge across different benchmarks and domains but their knowledge is inconveniently scattered across their billions of parameters. In this paper, we propose a prompt-based method to construct domain-specific KGs by extracting knowledge solely from LLMs’ parameters. First, we use an LLM to create a schema for a specific domain, which contains a set of domain-representative entities and relations. After that, we use the schema to guide the LLM through an iterative data generation process equipped with Chain-of-Verification (CoVe) for increased data quality. Using this method, we construct KGs for two domains: books and landmarks, which we then evaluate against Wikidata, an open-source human-created KG. Our results show that LLMs can generate large domain-specific KGs containing tens of thousands of entities and relations. However, due to the increased hallucination rates as the procedure evolves, the utility of large-scale LLM-generated KGs in practical applications could remain limited.
pdf
bib
abs
Large Language Models are Miscalibrated In-Context Learners
Chengzu Li
|
Han Zhou
|
Goran Glavaš
|
Anna Korhonen
|
Ivan Vulić
When adapting ICL with or without fine-tuning, we are curious about whether the instruction-tuned language model is able to achieve well-calibrated results without suffering from the problem of overconfidence (i.e., miscalibration) considering its strong instruction following ability, especially in such limited data setups. In this work, we deliver an in-depth analysis of the behavior across different choices of learning methods from the perspective of both performance and calibration. Through extensive controlled experiments, we observe that the miscalibration problem exists across all learning methods in low-resource setups. To achieve simultaneous gain for both in-task performance and calibration, we then study the potential of self-ensembling applied at different modeling stages (e.g., variations of in-context examples or variations in prompts or different ensembling strategies) to make the predictions more calibrated and have comparable or even better performance. We find that self-ensembling with max probability produces robust and calibrated predictions. Our work reveals the potential calibration problem of using ICL despite the improvements in task performance and sheds light on which learning paradigm to choose. We also provide practical guidelines for choosing learning paradigms depending on whether the data has been seen by the model before and a worthwhile solution via self-ensembling on how to enhance both task performance and calibration of LMs, which we hope could encourage further study.
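As a rough illustration of what self-ensembling with max probability could look like, the sketch below pools label distributions obtained from several prompt or in-context-example variations. The pooling rule (per-label maximum followed by renormalisation), the label names, and the probabilities are assumptions made for the example; the abstract does not specify the exact formulation.

```python
def max_probability_ensemble(distributions):
    """For each label keep the highest probability any ensemble member
    assigned to it, then renormalise so the pooled scores sum to 1."""
    labels = distributions[0].keys()
    pooled = {label: max(d[label] for d in distributions) for label in labels}
    total = sum(pooled.values())
    return {label: p / total for label, p in pooled.items()}

# Hypothetical outputs of the same model under two prompt variations.
runs = [
    {"pos": 0.5, "neg": 0.5},
    {"pos": 0.75, "neg": 0.25},
]
ensembled = max_probability_ensemble(runs)
prediction = max(ensembled, key=ensembled.get)
```

The renormalised score of the predicted label can then be used as the confidence when measuring calibration.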
pdf
bib
abs
STeCa: Step-level Trajectory Calibration for LLM Agent Learning
Hanlin Wang
|
Jian Wang
|
Chak Tou Leong
|
Wenjie Li
Large language model (LLM)-based agents have shown promise in tackling complex tasks by interacting dynamically with the environment. Existing work primarily focuses on behavior cloning from expert demonstrations or preference learning through exploratory trajectory sampling. However, these methods often struggle to address long-horizon tasks, where suboptimal actions accumulate step by step, causing agents to deviate from correct task trajectories. To address this, we highlight the importance of timely calibration and the need to automatically construct calibration trajectories for training agents. We propose Step-Level Trajectory Calibration (STeCa), a novel framework for LLM agent learning. Specifically, STeCa identifies suboptimal actions through a step-level reward comparison during exploration. It constructs calibrated trajectories using LLM-driven reflection, enabling agents to learn from improved decision-making processes. We finally leverage these calibrated trajectories with successful trajectories for reinforced training. Extensive experiments demonstrate that STeCa significantly outperforms existing methods. Further analysis highlights that timely calibration enables agents to complete tasks with greater robustness. Our code and data are available at https://github.com/WangHanLinHenry/STeCa.
pdf
bib
abs
LEMMA: Learning from Errors for MatheMatical Advancement in LLMs
Zhuoshi Pan
|
Yu Li
|
Honglin Lin
|
Qizhi Pei
|
Zinan Tang
|
Wei Wu
|
Chenlin Ming
|
H. Vicky Zhao
|
Conghui He
|
Lijun Wu
Large language models (LLMs) have demonstrated remarkable reasoning capability in solving mathematical problems. However, existing approaches primarily focus on improving the quality of correct training data, e.g., distilling high-quality correct solutions from advanced models, neglecting the value contained in error data, potentially hindering the model’s reflective ability. Though some studies have attempted to leverage error data, they often involve complex mechanisms, such as Monte Carlo Tree Search (MCTS) to explore error nodes. In this work, we propose to enhance LLMs’ reasoning ability by Learning from Errors for MatheMatical Advancement (LEMMA). LEMMA constructs data consisting of an incorrect solution with an erroneous step and a reflection connection to a correct solution for fine-tuning. Specifically, we systematically analyze the model-generated error types and introduce an error-type grounded mistake augmentation method to collect diverse and representative errors. Correct solutions are obtained either by fixing the errors or by generating a fresh start. By fine-tuning on the constructed dataset, the model is able to self-correct errors autonomously within the generation process without relying on external critique models. Experimental results demonstrate that LEMMA achieves significant performance improvements over other strong models with less than 90k data.
pdf
bib
abs
Voting or Consensus? Decision-Making in Multi-Agent Debate
Lars Benedikt Kaesberg
|
Jonas Becker
|
Jan Philip Wahle
|
Terry Ruas
|
Bela Gipp
Much of the success of multi-agent debates depends on carefully choosing the right parameters. The decision-making protocol stands out as it can highly impact final model answers, depending on how decisions are reached. Systematic comparison of decision protocols is difficult because many studies alter multiple discussion parameters beyond the protocol. So far, it has been largely unknown how decision-making influences different tasks. This work systematically evaluates the impact of seven decision protocols (e.g., majority voting, unanimity consensus). We change only one variable at a time - the decision protocol - to analyze how different methods affect the collaboration between agents and measure differences in knowledge and reasoning tasks. Our results show that voting protocols improve performance by 13.2% in reasoning tasks and consensus protocols by 2.8% in knowledge tasks compared to other decision protocols. Increasing the number of agents improves performance, while more discussion rounds before voting reduce it. To improve decision-making by increasing answer diversity, we propose two new methods, All-Agents Drafting (AAD) and Collective Improvement (CI). Our methods improve task performance by up to 3.3% with AAD and up to 7.4% with CI. This work demonstrates the importance of decision-making in multi-agent debates beyond scaling.
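Two of the decision protocols named in the abstract, majority voting and unanimity consensus, can be sketched as interchangeable functions so that the protocol is the only variable that changes between runs. This is a minimal illustration, not the paper's code; the `debate` helper, the early-stopping rule, and the toy answers are hypothetical.

```python
from collections import Counter

def majority_vote(answers):
    """Return the answer held by more than half of the agents, else None."""
    answer, count = Counter(answers).most_common(1)[0]
    return answer if count > len(answers) / 2 else None

def unanimity_consensus(answers):
    """Return the shared answer only if every agent agrees, else None."""
    return answers[0] if len(set(answers)) == 1 else None

def debate(rounds, protocol):
    """Stop at the first round in which the protocol yields a decision.

    `rounds` holds the agents' answers after each discussion round.
    Returns (decision, round_number); decision is None if none was reached.
    """
    for turn, answers in enumerate(rounds, start=1):
        decision = protocol(answers)
        if decision is not None:
            return decision, turn
    return None, len(rounds)

# Made-up answers of three agents over three discussion rounds.
rounds = [["A", "B", "C"], ["A", "A", "B"], ["A", "A", "A"]]
vote_result = debate(rounds, majority_vote)
consensus_result = debate(rounds, unanimity_consensus)
```

On this toy input the voting protocol decides one round earlier than unanimity, which shows how the choice of protocol alone changes when, and on what evidence, a debate terminates.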
pdf
bib
abs
Rhetorical Device-Aware Sarcasm Detection with Counterfactual Data Augmentation
Qingqing Hong
|
Dongyu Zhang
|
Jiayi Lin
|
Dapeng Yin
|
Shuyue Zhu
|
Junli Wang
Sarcasm is a complex form of sentiment expression widely used in human daily life. Previous work primarily defines sarcasm as a form of verbal irony, which covers only a subset of real-world sarcastic expressions. However, sarcasm serves multifaceted functions and manifests itself through various rhetorical devices, such as echoic mention, rhetorical question and hyperbole. To fully capture its complexity, this paper investigates fine-grained sarcasm classification through the lens of rhetorical devices, and introduces RedSD, a RhEtorical Device-Aware Sarcasm Dataset with counterfactually augmented data. To construct the dataset, we extract sarcastic dialogues from situation comedies (i.e., sitcoms), and summarize nine rhetorical devices commonly employed in sarcasm. We then propose a rhetorical device-aware counterfactual data generation pipeline facilitated by both Large Language Models (LLMs) and human revision. Additionally, we propose duplex counterfactual augmentation that generates counterfactuals for both sarcastic and non-sarcastic dialogues, to further enhance the scale and diversity of the dataset. Experimental results on the dataset demonstrate that fine-tuned models exhibit a more balanced performance compared to zero-shot models, including GPT-3.5 and LLaMA 3.1, underscoring the importance of integrating various rhetorical devices in sarcasm detection. Our dataset is available at https://github.com/qqHong73/RedSD.
pdf
bib
abs
Selecting Demonstrations for Many-Shot In-Context Learning via Gradient Matching
Jianfei Zhang
|
Bei Li
|
Jun Bai
|
Rumei Li
|
Yanmeng Wang
|
Chenghua Lin
|
Wenge Rong
In-Context Learning (ICL) empowers Large Language Models (LLMs) for rapid task adaptation without Fine-Tuning (FT), but its reliance on demonstration selection remains a critical challenge. While many-shot ICL shows promising performance through scaled demonstrations, the selection method for many-shot demonstrations remains limited to random selection in existing work. Since the conventional instance-level retrieval is not suitable for many-shot scenarios, we hypothesize that the data requirements for in-context learning and fine-tuning are analogous. To this end, we introduce a novel gradient matching approach that selects demonstrations by aligning fine-tuning gradients between the entire training set of the target task and the selected examples, so as to approach the learning effect on the entire training set within the selected examples. Through gradient matching on relatively small models, e.g., Qwen2.5-3B or Llama3-8B, our method consistently outperforms random selection on larger LLMs from 4-shot to 128-shot scenarios across 9 diverse datasets. For instance, it surpasses random selection by 4% on Qwen2.5-72B and Llama3-70B, and by around 2% on 5 closed-source LLMs. This work unlocks more reliable and effective many-shot ICL, paving the way for its broader application.
pdf
bib
abs
Cheap Character Noise for OCR-Robust Multilingual Embeddings
Andrianos Michail
|
Juri Opitz
|
Yining Wang
|
Robin Meister
|
Rico Sennrich
|
Simon Clematide
The large number of text collections digitized by imperfect OCR systems requires semantic search models that perform robustly on noisy input. Such collections are highly heterogeneous, with varying degrees of OCR quality, spelling conventions and other inconsistencies, all phenomena that are underrepresented in the training data of standard embedding models, with ramifications for their generalization. In our paper, we show that this problem can be alleviated with a simple and inexpensive method that does not require supervision or in-domain training. Specifically, we fine-tune existing multilingual models using noisy texts and a contrastive loss. We find that these models show considerable improvements across different noise conditions. Control experiments indicate minimal, and occasionally positive, impact on standard similarity tasks. These findings suggest that embedding models can be inexpensively adapted for cross-lingual semantic search in heterogeneous, digitized corpora. We publicly release our code, datasets, and models at https://github.com/impresso/ocr-robust-multilingual-embeddings.
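One way to picture the "noisy texts" used for contrastive fine-tuning is to corrupt each clean sentence at the character level and treat the (clean, noisy) pair as a positive pair. The sketch below is an assumption-laden illustration, not the released code: the three noise operations, the `rate` parameter, and the helper names are hypothetical, and the contrastive training step itself is omitted.

```python
import random
import string

def add_char_noise(text, rate=0.1, seed=None):
    """Corrupt roughly `rate` of the characters by substitution,
    deletion, or insertion of a random lowercase letter."""
    rng = random.Random(seed)
    out = []
    for ch in text:
        if rng.random() >= rate:
            out.append(ch)  # keep the character unchanged
            continue
        op = rng.choice(("substitute", "delete", "insert"))
        if op == "substitute":
            out.append(rng.choice(string.ascii_lowercase))
        elif op == "insert":
            out.append(ch)
            out.append(rng.choice(string.ascii_lowercase))
        # "delete": append nothing
    return "".join(out)

def make_pairs(texts, rate=0.1, seed=0):
    """Build (clean, noisy) positive pairs for a contrastive objective."""
    rng = random.Random(seed)
    return [(t, add_char_noise(t, rate, rng.randrange(2**32))) for t in texts]

pairs = make_pairs(["semantic search on digitized newspapers"], rate=0.2)
```

Because the noise is synthetic and needs no labels, pairs like these can be generated for any language the base model covers.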
pdf
bib
abs
Physics: Benchmarking Foundation Models on University-Level Physics Problem Solving
Kaiyue Feng
|
Yilun Zhao
|
Yixin Liu
|
Tianyu Yang
|
Chen Zhao
|
John Sous
|
Arman Cohan
We introduce Physics, a comprehensive benchmark for university-level physics problem solving. It contains 1,297 expert-annotated problems covering six core areas: classical mechanics, quantum mechanics, thermodynamics and statistical mechanics, electromagnetism, atomic physics, and optics. Each problem requires advanced physics knowledge and mathematical reasoning. We develop a robust automated evaluation system for precise and reliable validation. Our evaluation of leading foundation models reveals substantial limitations. Even the most advanced model, o3-mini, achieves only 59.9% accuracy, highlighting significant challenges in solving high-level scientific problems. Through comprehensive error analysis, exploration of diverse prompting strategies, and Retrieval-Augmented Generation (RAG)-based knowledge augmentation, we identify key areas for improvement, laying the foundation for future advancements.
pdf
bib
abs
DOVE: A Large-Scale Multi-Dimensional Predictions Dataset Towards Meaningful LLM Evaluation
Eliya Habba
|
Ofir Arviv
|
Itay Itzhak
|
Yotam Perlitz
|
Elron Bandel
|
Leshem Choshen
|
Michal Shmueli-Scheuer
|
Gabriel Stanovsky
Recent work found that LLMs are sensitive to a wide range of arbitrary prompt dimensions, including the type of delimiters, answer enumerators, instruction wording, and more. This throws into question popular single-prompt evaluation practices. We present DOVE (Dataset Of Variation Evaluation), a large-scale dataset containing prompt perturbations of various evaluation benchmarks. In contrast to previous work, we examine LLM sensitivity from a holistic perspective, and assess the joint effects of perturbations along various dimensions, resulting in thousands of perturbations per instance. We evaluate several model families against DOVE, leading to several findings, including efficient methods for choosing well-performing prompts, observing that few-shot examples reduce sensitivity, and identifying instances which are inherently hard across all perturbations. DOVE consists of more than 250M prompt perturbations and model outputs, which we make publicly available to spur a community-wide effort toward meaningful, robust, and efficient evaluation. Browse the data, contribute, and more at: https://slab-nlp.github.io/DOVE
pdf
bib
abs
ALPS: Attention Localization and Pruning Strategy for Efficient Adaptation of Large Language Models
Hao Chen
|
Haoze Li
|
Zhiqing Xiao
|
Lirong Gao
|
Qi Zhang
|
Xiaomeng Hu
|
Ningtao Wang
|
Xing Fu
|
Junbo Zhao
Aligning general-purpose large language models (LLMs) to downstream tasks often incurs significant training adjustment costs. Prior research has explored various avenues to enhance alignment efficiency, primarily through minimal-data training or data-driven activations to identify key attention heads. However, these approaches inherently introduce data dependency, which hinders generalization and reusability. To address this issue and enhance model alignment efficiency, we propose the Attention Localization and Pruning Strategy (ALPS), an efficient algorithm that localizes the most task-sensitive attention heads and prunes by restricting attention training updates to these heads, thereby reducing alignment costs. Experimental results demonstrate that our method activates only 10% of attention parameters during fine-tuning while achieving a 2% performance improvement over baselines on three tasks. Moreover, the identified task-specific heads are transferable across datasets and mitigate knowledge forgetting. Our work and findings provide a novel perspective on efficient LLM alignment.
pdf
bib
abs
DeTAM: Defending LLMs Against Jailbreak Attacks via Targeted Attention Modification
Yu Li
|
Han Jiang
|
Zhihua Wei
With the widespread adoption of Large Language Models (LLMs), jailbreak attacks have become an increasingly pressing safety concern. While safety-aligned LLMs can effectively defend against normal harmful queries, they remain vulnerable to such attacks. Existing defense methods primarily rely on fine-tuning or input modification, which often suffer from limited generalization and reduced utility. To address this, we introduce DeTAM, a finetuning-free defense approach that improves the defensive capabilities of LLMs against jailbreak attacks via targeted attention modification. Specifically, we analyze the differences in attention scores between successful and unsuccessful defenses to identify the attention heads sensitive to jailbreak attacks. During inference, we reallocate attention to emphasize users’ core intentions, minimizing interference from attack tokens. Our experimental results demonstrate that DeTAM outperforms various baselines in jailbreak defense and exhibits robust generalization across different attacks and models, maintaining its effectiveness even on in-the-wild jailbreak data. Furthermore, we compare DeTAM with the baselines on over-defense datasets, further validating its superior balance between helpfulness and harmlessness.
pdf
bib
abs
A Survey of Mathematical Reasoning in the Era of Multimodal Large Language Model: Benchmark, Method & Challenges
Yibo Yan
|
Jiamin Su
|
Jianxiang He
|
Fangteng Fu
|
Xu Zheng
|
Yuanhuiyi Lyu
|
Kun Wang
|
Shen Wang
|
Qingsong Wen
|
Xuming Hu
Mathematical reasoning, a core aspect of human cognition, is vital across many domains, from educational problem-solving to scientific advancements. As artificial general intelligence (AGI) progresses, integrating large language models (LLMs) with mathematical reasoning tasks is becoming increasingly significant. This survey provides **the first comprehensive analysis of mathematical reasoning in the era of multimodal large language models (MLLMs)**. We review over 200 studies published since 2021, and examine the state-of-the-art developments in Math-LLMs, with a focus on multimodal settings. We categorize the field into three dimensions: benchmarks, methodologies, and challenges. In particular, we explore the multimodal mathematical reasoning pipeline, as well as the role of (M)LLMs and the associated methodologies. Finally, we identify five major challenges hindering the realization of AGI in this domain, offering insights into the future direction for enhancing multimodal reasoning capabilities. This survey serves as a critical resource for the research community in advancing the capabilities of LLMs to tackle complex multimodal reasoning tasks.
pdf
bib
abs
Fast-and-Frugal Text-Graph Transformers are Effective Link Predictors
Andrei Catalin Coman
|
Christos Theodoropoulos
|
Marie-Francine Moens
|
James Henderson
We propose Fast-and-Frugal Text-Graph (FnF-TG) Transformers, a Transformer-based framework that unifies textual and structural information for inductive link prediction in text-attributed knowledge graphs. We demonstrate that, by effectively encoding ego-graphs (1-hop neighbourhoods), we can reduce the reliance on resource-intensive textual encoders. This makes the model both fast at training and inference time, as well as frugal in terms of cost. We perform a comprehensive evaluation on three popular datasets and show that FnF-TG can achieve superior performance compared to previous state-of-the-art methods. We also extend inductive learning to a fully inductive setting, where relations don’t rely on transductive (fixed) representations, as in previous work, but are a function of their textual description. Additionally, we introduce new variants of existing datasets, specifically designed to test the performance of models on unseen relations at inference time, thus offering a new test-bench for fully inductive link prediction.
pdf
bib
abs
NeoQA: Evidence-based Question Answering with Generated News Events
Max Glockner
|
Xiang Jiang
|
Leonardo F. R. Ribeiro
|
Iryna Gurevych
|
Markus Dreyer
Evaluating Retrieval-Augmented Generation (RAG) in large language models (LLMs) is challenging because benchmarks can quickly become stale. Questions initially requiring retrieval may become answerable from pretraining knowledge as newer models incorporate more recent information during pretraining, making it difficult to distinguish evidence-based reasoning from recall. We introduce NeoQA (News Events for Out-of-training Question Answering), a benchmark designed to address this issue. To construct NeoQA, we generated timelines and knowledge bases of fictional news events and entities along with news articles and Q&A pairs to prevent LLMs from leveraging pretraining knowledge, ensuring that no prior evidence exists in their training data. We propose our dataset as a new platform for evaluating evidence-based question answering, as it requires LLMs to generate responses exclusively from retrieved evidence and only when sufficient evidence is available. NeoQA enables controlled evaluation across various evidence scenarios, including cases with missing or misleading details. Our findings indicate that LLMs struggle to distinguish subtle mismatches between questions and evidence, and suffer from short-cut reasoning when key information required to answer a question is missing from the evidence, underscoring key limitations in evidence-based reasoning.
pdf
bib
abs
ChatMap: Mining Human Thought Processes for Customer Service Chatbots via Multi-Agent Collaboration
Xinyi Jiang
|
Tianyi Hu
|
Yuheng Qin
|
Guoming Wang
|
Zhou Huan
|
Kehan Chen
|
Gang Huang
|
Rongxing Lu
|
Siliang Tang
Leveraging Large Language Models (LLMs) to build domain-specific conversational agents, especially for e-commerce customer service chatbots, is a growing focus. While existing methods enhance dialogue performance by extracting core patterns from dialogue data and integrating them into models, two key challenges persist: (1) heavy reliance on human experts for dialogue strategy induction, and (2) LLM-based automatic extraction often focuses on summarizing specific behaviors, neglecting the underlying thought processes behind strategy selection. In this paper, we present ChatMap, which focuses on enhancing customer service chatbots by mining thought processes using a Multi-Agent aPproach. Specifically, the process begins by extracting customer requests and solutions from a raw dialogue dataset, followed by clustering similar requests, analyzing the thought processes behind solutions, and refining service thoughts. Through a quality inspection and reflection mechanism, the final service thought dataset is generated, helping chatbots provide more appropriate responses. Offline experimental results show that ChatMap performs comparably to manually annotated thought processes and significantly outperforms other baselines, demonstrating its ability to automate human annotation and enhance dialogue capabilities through strategic understanding. Online A/B tests on Taobao, a popular e-commerce platform in China, reveal that ChatMap can better improve customer satisfaction and address customer requests from a business perspective.
pdf
bib
abs
P3: Prompts Promote Prompting
Xinyu Zhang
|
Yuanquan Hu
|
Fangchao Liu
|
Zhicheng Dou
Current large language model (LLM) applications often employ multi-component prompts, comprising both system and user prompts, to guide model behaviors. While recent advancements have demonstrated the efficacy of automatically optimizing either the system or user prompt to boost performance, such unilateral approaches often yield suboptimal outcomes due to the interdependent nature of these components. In this work, we introduce P3, a novel self-improvement framework that concurrently optimizes both system and user prompts through an iterative process. The offline optimized prompts are further leveraged to promote online prompting by performing query-dependent prompt optimization. Extensive experiments on general tasks (e.g., Arena-hard and Alpaca-eval) and reasoning tasks (e.g., GSM8K and GPQA) demonstrate that P3 achieves superior performance in the realm of automatic prompt optimization. Our results highlight the effectiveness of a holistic optimization strategy in enhancing LLM performance across diverse domains.
pdf
bib
abs
VAQUUM: Are Vague Quantifiers Grounded in Visual Data?
Hugh Mee Wong
|
Rick Nouwen
|
Albert Gatt
Vague quantifiers such as “a few” and “many” are influenced by various contextual factors, including the number of objects present in a given context. In this work, we evaluate the extent to which vision-and-language models (VLMs) are compatible with humans when producing or judging the appropriateness of vague quantifiers in visual contexts. We release a novel dataset, VAQUUM, containing 20,300 human ratings on quantified statements across a total of 1,089 images. Using this dataset, we compare human judgments and VLM predictions using three different evaluation methods. Our findings show that VLMs, like humans, are influenced by object counts in vague quantifier use. However, we find significant inconsistencies across models in different evaluation settings, suggesting that judging and producing vague quantifiers rely on two different processes. We release our dataset and code at https://github.com/hughmee/vaquum.
pdf
bib
abs
Forgotten Polygons: Multimodal Large Language Models are Shape-Blind
William Rudman
|
Michal Golovanevsky
|
Amir Bar
|
Vedant Palit
|
Yann LeCun
|
Carsten Eickhoff
|
Ritambhara Singh
Despite strong performance on vision-language tasks, Multimodal Large Language Models (MLLMs) struggle with mathematical problem-solving, with both open-source and state-of-the-art models falling short of human performance on visual-math benchmarks. To systematically examine visual-mathematical reasoning in MLLMs, we (1) evaluate their understanding of geometric primitives, (2) test multi-step reasoning, and (3) explore a potential solution to improve visual reasoning capabilities. Our findings reveal fundamental shortcomings in shape recognition, with top models achieving under 50% accuracy in identifying regular polygons. We analyze these failures through the lens of dual-process theory and show that MLLMs rely on System 1 (intuitive, memorized associations) rather than System 2 (deliberate reasoning). Consequently, MLLMs fail to count the sides of both familiar and novel shapes, suggesting they have neither learned the concept of “sides” nor effectively processed visual inputs. Finally, we propose Visually Cued Chain-of-Thought (VC-CoT) prompting, which enhances multi-step mathematical reasoning by explicitly referencing visual annotations in diagrams, boosting GPT-4o’s accuracy on an irregular polygon side-counting task from 7% to 93%. Our findings suggest that System 2 reasoning in MLLMs remains an open problem, and visually-guided prompting is essential for successfully engaging visual reasoning.
pdf
bib
abs
MindBridge: Scalable and Cross-Model Knowledge Editing via Memory-Augmented Modality
Shuaike Li
|
Kai Zhang
|
Qi Liu
|
Enhong Chen
Knowledge editing is a technique for efficiently and accurately updating the knowledge of large language models (LLMs) to alleviate obsolescence and correct errors. However, most existing methods overfit to specific models, causing edited knowledge to be discarded during each LLM update and requiring frequent re-editing, which is particularly burdensome in today’s rapidly evolving open-source community. To address this issue, we propose the problem of cross-model knowledge editing and introduce **MindBridge**, a scalable solution inspired by the low coupling between modality processing and LLMs in multi-modal models. MindBridge introduces the novel concept of **memory modality**, which encodes edited knowledge as an independent modality. It first performs LLM-agnostic pre-training of the memory modality and then integrates it with various LLMs. Extensive experiments on multiple LLMs and popular knowledge editing datasets demonstrate that MindBridge achieves superior performance even in editing tens of thousands of knowledge entries and can flexibly adapt to different LLMs. Our code is available at https://github.com/CrashBugger/MindBridge.
pdf
bib
abs
FIHA: Automated Fine-grained Hallucinations Evaluations in Large Vision Language Models with Davidson Scene Graphs
Bowen Yan
|
Zhengsong Zhang
|
Liqiang Jing
|
Eftekhar Hossain
|
Xinya Du
The rapid development of Large Vision-Language Models (LVLMs) often comes with widespread hallucination issues, making cost-effective and comprehensive assessments increasingly vital. Current approaches mainly rely on costly annotations and are not comprehensive in terms of evaluating all aspects, such as relations, attributes, and dependencies between aspects. Therefore, we introduce FIHA (automated Fine-graIned Hallucination evAluation in LVLMs), which can assess LVLM hallucination in an LLM-free and annotation-free way and model the dependency between different types of hallucinations. FIHA can generate Q&A pairs on any image dataset at minimal cost, enabling hallucination assessment from both image and caption. Based on this approach, we introduce a benchmark called FIHA-v1, which consists of diverse questions on various images from three datasets. Furthermore, we use the Davidson Scene Graph (DSG) to organize the structure among Q&A pairs, which increases the reliability of the evaluation. We evaluate representative models using FIHA-v1, highlighting their limitations and challenges. We released our code and data at https://github.com/confidentzzzs/FIHA.
pdf
bib
abs
On the Role of Semantic Proto-roles in Semantic Analysis: What do LLMs know about agency?
Elizabeth Spaulding
|
Shafiuddin Rehan Ahmed
|
James Martin
Large language models (LLMs) are increasingly used in decision-making contexts, yet their ability to reason over event structure—an important component in the situational awareness needed to make complex decisions—is not well understood. By operationalizing proto-role theory, which characterizes agents via properties such as *instigation* and *volition* and patients via properties such as *change of state*, we examine the ability of LLMs to answer questions that require complex, multi-step event reasoning. Specifically, we investigate the extent to which LLMs capture semantic roles such as “agent” and “patient” through zero-shot prompts, and whether incorporating semantic proto-role labeling (SPRL) context improves semantic role labeling (SRL) performance in a zero-shot setting. We find that, while SPRL context sometimes degrades SRL accuracy in high-performing models (e.g., GPT-4o), it also uncovers an internal consistency between SPRL and SRL predictions that mirrors linguistic theory, and provides evidence that LLMs implicitly encode consistent multi-dimensional event role knowledge. Furthermore, our experiments support prior work showing that LLMs underperform human annotators in complex semantic analysis.
pdf
bib
abs
GeAR: Graph-enhanced Agent for Retrieval-augmented Generation
Zhili Shen
|
Chenxin Diao
|
Pavlos Vougiouklis
|
Pascual Merita
|
Shriram Piramanayagam
|
Enting Chen
|
Damien Graux
|
Andre Melo
|
Ruofei Lai
|
Zeren Jiang
|
Zhongyang Li
|
Ye Qi
|
Yang Ren
|
Dandan Tu
|
Jeff Z. Pan
Retrieval-augmented Generation (RAG) relies on effective retrieval capabilities, yet traditional sparse and dense retrievers inherently struggle with multi-hop retrieval scenarios. In this paper, we introduce GeAR, a system that advances RAG performance through two key innovations: (i) an efficient graph expansion mechanism that augments any conventional base retriever, such as BM25, and (ii) an agent framework that incorporates the resulting graph-based retrieval into a multi-step retrieval framework. Our evaluation demonstrates GeAR’s superior retrieval capabilities across three multi-hop question answering datasets. Notably, our system achieves state-of-the-art results with improvements exceeding 10% on the challenging MuSiQue dataset, while consuming fewer tokens and requiring fewer iterations than existing multi-step retrieval systems. The project page is available at https://gear-rag.github.io.
pdf
bib
abs
WebNLG-IT: Construction of an aligned RDF-Italian corpus through Machine Translation techniques
Michael Oliverio
|
Pier Felice Balestrucci
|
Alessandro Mazzei
|
Valerio Basile
The main goal of this work is the creation of the Italian version of the WebNLG corpus through the application of Neural Machine Translation (NMT) and post-editing with hand-written rules. To achieve this goal, in a first step, several existing NMT models were analysed and compared in order to identify the system with the highest performance on the original corpus. In a second step, after using the best NMT system, we semi-automatically designed and applied a number of rules to refine and improve the quality of the produced resource, creating a new corpus named WebNLG-IT. We used this resource for fine-tuning several LLMs for RDF-to-text tasks. In this way, comparing the performance of LLM-based generators on both Italian and English, we have (1) evaluated the quality of WebNLG-IT with respect to the original English version, (2) released the first fine-tuned LLM-based system for generating Italian from semantic web triples and (3) introduced an Italian version of a modular generation pipeline for RDF-to-text.
pdf
bib
abs
Towards Adapting Open-Source Large Language Models for Expert-Level Clinical Note Generation
Hanyin Wang
|
Chufan Gao
|
Bolun Liu
|
Qiping Xu
|
Guleid Hussein
|
Mohamad El Labban
|
Kingsley Iheasirim
|
Hariprasad Reddy Korsapati
|
Chuck Outcalt
|
Jimeng Sun
Proprietary Large Language Models (LLMs) such as GPT-4 and Gemini have demonstrated promising capabilities in clinical text summarization tasks. However, due to patient data privacy concerns and computational costs, many healthcare providers prefer using small, locally-hosted models over external generic LLMs. This study presents a comprehensive domain- and task-specific adaptation process for the open-source LLaMA-2 13 billion parameter model, enabling it to generate high-quality clinical notes from outpatient patient-doctor dialogues. Our process incorporates continued pre-training, supervised fine-tuning, and reinforcement learning from both AI and human feedback. We introduced a new approach, DistillDirect, for performing on-policy reinforcement learning with Gemini 1.0 Pro as the teacher model. Our resulting model, LLaMA-Clinic, can generate clinical notes comparable in quality to those authored by physicians. In a blinded physician reader study, the majority (92.8%) of individual evaluations rated the notes generated by LLaMA-Clinic as “acceptable” or higher across all three criteria: real-world readiness, completeness, and accuracy. In the more challenging “Assessment and Plan” section, LLaMA-Clinic received the same score as the notes authored by physicians. We highlight key considerations for future clinical note-generation tasks, emphasizing the importance of pre-defining a best-practice note format, rather than relying on LLMs to determine this for clinical practice.
pdf
bib
abs
Bridging Robustness and Generalization Against Word Substitution Attacks in NLP via the Growth Bound Matrix Approach
Mohammed Bouri
|
Adnane Saoud
Despite advancements in Natural Language Processing (NLP), models remain vulnerable to adversarial attacks, such as synonym substitutions. While prior work has focused on improving robustness for feed-forward and convolutional architectures, the robustness of recurrent networks and modern state space models (SSMs), such as S4, remains understudied. These architectures pose unique challenges due to their sequential processing and complex parameter dynamics. In this paper, we introduce a novel regularization technique based on Growth Bound Matrices (GBM) to improve NLP model robustness by reducing the impact of input perturbations on model outputs. We focus on computing the GBM for three architectures: Long Short-Term Memory (LSTM), State Space models (S4), and Convolutional Neural Networks (CNN). Our method aims to (1) enhance resilience against word substitution attacks, (2) improve generalization on clean text, and (3) provide the first systematic analysis of SSM (S4) robustness. Extensive experiments across multiple architectures and benchmark datasets demonstrate that our method improves adversarial robustness by up to 8.8% over existing baselines. These results highlight the effectiveness of our approach, outperforming several state-of-the-art methods in adversarial defense. Code is available at https://github.com/BouriMohammed/GBM
pdf
bib
abs
Neuro-Symbolic Query Compiler
Yuyao Zhang
|
Zhicheng Dou
|
Xiaoxi Li
|
Jiajie Jin
|
Yongkang Wu
|
Zhonghua Li
|
Ye Qi
|
Ji-Rong Wen
Precise recognition of search intent in Retrieval-Augmented Generation (RAG) systems remains a challenging goal, especially under resource constraints and for complex queries with nested structures and dependencies. This paper presents **QCompiler**, a neuro-symbolic framework inspired by linguistic grammar rules and compiler design, to bridge this gap. It theoretically presents a minimal yet sufficient Backus-Naur Form (BNF) grammar G[q] to formalize complex queries. Unlike previous methods, this grammar maintains completeness while minimizing redundancy. Based on this, QCompiler includes a query expression translator, a Lexical syntax parser, and a Recursive Descent Processor to compile queries into Abstract Syntax Trees (ASTs) for execution. The atomicity of the sub-queries in the leaf nodes ensures more precise document retrieval and response generation, significantly improving the RAG system’s ability to address complex queries.
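The compile-to-AST step can be pictured with a toy recursive descent parser. The grammar below is a hypothetical stand-in, not the paper's G[q]: it assumes `query := term ("AND" term)*` and `term := "(" query ")" | word+`, and all function names are invented for illustration. Leaf nodes hold atomic sub-queries that a retriever could execute independently.

```python
import re


def tokenize(text):
    """Split a query into parentheses and whitespace-delimited words."""
    return re.findall(r"\(|\)|[^\s()]+", text)


def parse(tokens):
    """Parse a full token list into a nested-tuple AST."""
    ast, pos = parse_query(tokens, 0)
    if pos != len(tokens):
        raise SyntaxError(f"unexpected token at position {pos}")
    return ast


def parse_query(tokens, pos):
    # query := term ("AND" term)*
    node, pos = parse_term(tokens, pos)
    while pos < len(tokens) and tokens[pos] == "AND":
        right, pos = parse_term(tokens, pos + 1)
        node = ("AND", node, right)
    return node, pos


def parse_term(tokens, pos):
    # term := "(" query ")" | word+
    if pos >= len(tokens):
        raise SyntaxError("unexpected end of query")
    if tokens[pos] == "(":
        node, pos = parse_query(tokens, pos + 1)
        if pos >= len(tokens) or tokens[pos] != ")":
            raise SyntaxError("missing closing parenthesis")
        return node, pos + 1
    words = []
    while pos < len(tokens) and tokens[pos] not in ("AND", "(", ")"):
        words.append(tokens[pos])
        pos += 1
    if not words:
        raise SyntaxError(f"expected a sub-query at position {pos}")
    return ("ATOM", " ".join(words)), pos
```

For example, `parse(tokenize("capital of France AND ( population AND area )"))` yields an `AND` node whose left child is the atomic sub-query `capital of France` and whose right child is a nested `AND` over two further atoms.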
pdf
bib
abs
Revealing and Mitigating the Local Pattern Shortcuts of Mamba
WangJie You
|
Zecheng Tang
|
Juntao Li
|
Lili Yao
|
Min Zhang
Large language models (LLMs) have advanced significantly due to the attention mechanism, but their quadratic complexity and linear memory demands limit their performance on long-context tasks. Recently, researchers introduced Mamba, an advanced model built upon State Space Models (SSMs) that offers linear complexity and constant memory. Although Mamba is reported to match or surpass the performance of attention-based models, our analysis reveals a performance gap: Mamba excels in tasks that involve localized key information but faces challenges with tasks that require handling distributed key information. Our controlled experiments suggest that the inconsistency arises from Mamba’s reliance on **local pattern shortcuts** across model scales (10M to 1.4B), which enable Mamba to remember local key information within its limited memory but hinder its ability to retain more dispersed information. Therefore, we introduce a global gate module into the Mamba model to address this issue. Experiments on extensive synthetic tasks, as well as real-world tasks, demonstrate the effectiveness of our method. Notably, with the introduction of only 4M extra parameters, our approach enables the Mamba model (130M) to achieve a significant improvement on tasks with distributed information, increasing its performance from **below 5% to 80%**.
pdf
bib
abs
Forget the Token and Pixel: Rethinking Gradient Ascent for Concept Unlearning in Multimodal Generative Models
Jiaqi Li
|
Chuanyi Zhang
|
Miaozeng Du
|
Hui Zhang
|
Yongrui Chen
|
Qianshan Wei
|
Junfeng Fang
|
Ruipeng Wang
|
Sheng Bi
|
Guilin Qi
Gradient Ascent (GA) has emerged as a promising approach for concept unlearning in Multimodal Generative Models (MGMs), such as Multimodal Large Language Models (MLLMs) and Stable Diffusion Models (SDMs). Despite its effectiveness in removing undesired knowledge, GA leads to severe utility degradation in MGMs. In this paper, we explore the mechanism behind this degradation by quantifying two distinct forms of knowledge in MGMs: (i) Conceptual Knowledge, which represents specific information about concepts; (ii) Natural Knowledge, which refers to the ability to produce coherent and logically structured outputs. Our analysis reveals that applying GA globally not only removes the targeted Conceptual Knowledge but also inadvertently diminishes Natural Knowledge, resulting in utility collapse. To address this issue, we propose Forget the Token and Pixel (FTTP), a novel approach that selectively applies GA to targeted Conceptual Knowledge while preserving Natural Knowledge through Gradient Descent (GD). FTTP eliminates the need for additional retain sets and a large number of training steps, thereby reducing computational resource costs. Extensive experiments demonstrate FTTP’s efficiency and superior utility-unlearning tradeoff for both text and image generation tasks. Our source code will be released in the near future.
pdf
bib
abs
Slamming: Training a Speech Language Model on One GPU in a Day
Gallil Maimon
|
Avishai Elmakies
|
Yossi Adi
We introduce *Slam*, a recipe for training high-quality Speech Language Models (SLMs) on a single academic GPU in 24 hours. We do so through empirical analysis of model initialisation and architecture, synthetic training data, preference optimisation with synthetic data and tweaking all other components. We empirically demonstrate that this training recipe also scales well with more compute, achieving results on par with leading SLMs at a fraction of the compute cost. We hope these insights will make SLM training and research more accessible. In the context of SLM scaling laws, our results far outperform predicted compute optimal performance, giving an optimistic view of SLM feasibility. Code, data, models, and samples are available at https://pages.cs.huji.ac.il/adiyoss-lab/slamming.
pdf
bib
abs
Boosting LLM Translation Skills without General Ability Loss via Rationale Distillation
Junhong Wu
|
Yang Zhao
|
Yangyifan Xu
|
Bing Liu
|
Chengqing Zong
Large Language Models (LLMs) have achieved impressive results across numerous NLP tasks, and fine-tuning them for Machine Translation (MT) has improved their performance. However, vanilla fine-tuning often leads to catastrophic forgetting, compromising the broad general abilities of LLMs and introducing potential security risks. These abilities, which are developed using proprietary and unavailable training data, make simple data replay methods ineffective. To overcome this issue, we propose a novel approach called **Ra**tionale **Dis**tillation. RaDis harnesses the strong generative capabilities of LLMs to create rationales for training data, which are then “replayed” to prevent forgetting. These rationales connect prior knowledge with new tasks, acting as self-distillation targets to regulate the training process. By jointly training on reference translations and self-generated rationales, the model can learn new translation skills while preserving its general abilities across other tasks. Additionally, RaDis provides a fresh perspective on using rationales in the continual learning (CL) field and has the potential to serve as a general CL method for a variety of tasks.
pdf
bib
abs
Clarifying Underspecified Discourse Relations in Instructional Texts
Berfin Aktas
|
Michael Roth
Discourse relations contribute to the structure of a text and can optionally be realized through explicit connectives such as “but” and “while”. But when are these connectives necessary to avoid possible misunderstandings? We investigate this question by first building a corpus of 4,274 text revisions in each of which a connective was explicitly inserted. For a subset of 250 cases, we collect plausibility annotations on other connectives to check whether they would represent suitable alternative relations. The results of this annotation show that several relations are often perceived as plausible in our data. Furthermore, we analyze the extent to which large language models can identify instances with multiple plausible relations as a possible source of misunderstandings. We find that the models predict plausibility of individual connectives with up to 66% accuracy, but they are not reliable in estimating when multiple relations are plausible.
pdf
bib
abs
WMT24++: Expanding the Language Coverage of WMT24 to 55 Languages & Dialects
Daniel Deutsch
|
Eleftheria Briakou
|
Isaac Rayburn Caswell
|
Mara Finkelstein
|
Rebecca Galor
|
Juraj Juraska
|
Geza Kovacs
|
Alison Lui
|
Ricardo Rei
|
Jason Riesa
|
Shruti Rijhwani
|
Parker Riley
|
Elizabeth Salesky
|
Firas Trabelsi
|
Stephanie Winkler
|
Biao Zhang
|
Markus Freitag
As large language models (LLMs) become more and more capable in languages other than English, it is important to collect benchmark datasets in order to evaluate their multilingual performance, including on tasks like machine translation (MT). In this work, we extend the WMT24 dataset to cover 55 languages by collecting new human-written references and post-edits for 46 new languages/dialects in addition to post-edits of the references in 8 out of 9 languages in the original WMT24 dataset. We benchmark a variety of MT providers and LLMs on the collected dataset using automatic metrics and find that LLMs are the best-performing MT systems in all 55 languages. However, we caution against using our results to reach strong conclusions about MT quality without a human-based evaluation due to limitations of automatic evaluation metrics, which we leave for future work.
pdf
bib
abs
Exploring Graph Representations of Logical Forms for Language Modeling
Michael Sullivan
We make the case for language models over logical forms (LFLMs), arguing that such models are more data-efficient than their textual counterparts. To that end, we introduce the Graph-based Formal-Logical Distributional Semantics (GFoLDS) prototype, a pretrained LM over graph representations of logical forms, as a proof-of-concept of LFLMs. Using GFoLDS, we present strong experimental evidence that LFLMs can leverage the built-in, basic linguistic knowledge inherent in such models to immediately begin learning more complex patterns. On downstream tasks, we show that GFoLDS vastly outperforms textual, transformer LMs (BERT) pretrained on the same data, indicating that LFLMs can learn with substantially less data than models over plain text. Furthermore, we show that the performance of this model is likely to scale with additional parameters and pretraining data, suggesting the viability of LFLMs in real-world applications.
pdf
bib
abs
SEA-HELM: Southeast Asian Holistic Evaluation of Language Models
Yosephine Susanto
|
Adithya Venkatadri Hulagadri
|
Jann Railey Montalan
|
Jian Gang Ngui
|
Xianbin Yong
|
Wei Qi Leong
|
Hamsawardhini Rengarajan
|
Peerat Limkonchotiwat
|
Yifan Mai
|
William Chandra Tjhi
With the rapid emergence of novel capabilities in Large Language Models (LLMs), the need for rigorous multilingual and multicultural benchmarks that are integrated has become more pronounced. Though existing LLM benchmarks are capable of evaluating specific capabilities of LLMs in English as well as in various mid- to low-resource languages, including those in the Southeast Asian (SEA) region, a comprehensive and culturally representative evaluation suite for the SEA languages has not been developed thus far. Here, we present SEA-HELM, a holistic linguistic and cultural LLM evaluation suite that emphasises SEA languages, comprising five core pillars: (1) NLP CLASSICS, (2) LLM-SPECIFICS, (3) SEA LINGUISTICS, (4) SEA CULTURE, (5) SAFETY. SEA-HELM currently supports Filipino, Indonesian, Tamil, Thai, and Vietnamese. We also introduce the SEA-HELM leaderboard, which allows users to understand models’ multilingual and multicultural performance in a systematic and user-friendly manner. We make the SEA-HELM evaluation code publicly available.
pdf
bib
abs
TRANS-ZERO: Self-Play Incentivizes Large Language Models for Multilingual Translation Without Parallel Data
Wei Zou
|
Sen Yang
|
Yu Bao
|
Shujian Huang
|
Jiajun Chen
|
Shanbo Cheng
The rise of Large Language Models (LLMs) has reshaped machine translation (MT), but multilingual MT still relies heavily on parallel data for supervised fine-tuning (SFT), facing challenges like data scarcity for low-resource languages and catastrophic forgetting. To address these issues, we propose TRANS-ZERO, a self-play framework that leverages only monolingual data and the intrinsic multilingual knowledge of LLM. TRANS-ZERO combines Genetic Monte-Carlo Tree Search (G-MCTS) with preference optimization, achieving strong translation performance that rivals supervised methods. Experiments demonstrate that this approach not only matches the performance of models trained on large-scale parallel data but also excels in non-English translation directions. Further analysis reveals that G-MCTS itself significantly enhances translation quality by exploring semantically consistent candidates through iterative translations, providing a robust foundation for the framework’s success.
pdf
bib
abs
A Conformal Risk Control Framework for Granular Word Assessment and Uncertainty Calibration of CLIPScore Quality Estimates
Goncalo Emanuel Cavaco Gomes
|
Bruno Martins
|
Chrysoula Zerva
This study explores current limitations of learned image captioning evaluation metrics, specifically the lack of granular assessments for errors within captions, and the reliance on single-point quality estimates without considering uncertainty. To address the limitations, we propose a simple yet effective strategy for generating and calibrating distributions of CLIPScore values. Leveraging a model-agnostic conformal risk control framework, we calibrate CLIPScore values for task-specific control variables, tackling the aforementioned limitations. Experimental results demonstrate that using conformal risk control, over score distributions produced with simple methods such as input masking, can achieve competitive performance compared to more complex approaches. Our method effectively detects erroneous words, while providing formal guarantees aligned with desired risk levels. It also improves the correlation between uncertainty estimations and prediction errors, thus enhancing the overall reliability of caption evaluation metrics.
pdf
bib
abs
SGDPO: Self-Guided Direct Preference Optimization for Language Model Alignment
Wenqiao Zhu
|
Ji Liu
|
Lulu Wang
|
Jun Wu
|
Yulun Zhang
Direct Preference Optimization (DPO) is broadly utilized for aligning Large Language Models (LLMs) with human values because of its flexibility. Despite its effectiveness, it has been observed that the capability of DPO to generate human-preferred responses is limited and the results of DPO are far from resilient. To address these limitations, in this paper we propose a novel Self-Guided Direct Preference Optimization algorithm, i.e., SGDPO, which incorporates a pilot term to steer the gradient flow during the optimization process, allowing for fine-grained control over the updates of chosen and rejected rewards. We provide a detailed theoretical analysis of our proposed method and elucidate its operational mechanism. Furthermore, we conduct comprehensive experiments on various models and benchmarks. The extensive experimental results demonstrate the consistency between the empirical results and our theoretical analysis and confirm the effectiveness of our proposed approach (up to 9.19% higher score).
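For readers unfamiliar with the "chosen and rejected rewards" mentioned in this abstract, the following is a minimal sketch of the standard DPO objective for a single preference pair. It is not the SGDPO algorithm: the pilot term is not specified in the abstract and is therefore omitted. The function names, argument names, and the default `beta` are illustrative assumptions.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # assumed default, not taken from the paper
) -> float:
    """Standard DPO loss for one (chosen, rejected) response pair.

    Inputs are sequence log-probabilities under the trained policy and
    under the frozen reference model.
    """
    # Implicit rewards: scaled log-ratio of policy to reference.
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = chosen_reward - rejected_reward
    # -log(sigmoid(margin)) equals softplus(-margin).
    return softplus(-margin)
```

When the policy equals the reference, the margin is zero and the loss is log 2; the loss falls as the policy raises the chosen response's log-probability relative to the rejected one.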
pdf
bib
abs
Socratic Style Chain-of-Thoughts Help LLMs to be a Better Reasoner
Jiangbo Pei
|
Peiyu Liu
|
Xin Zhao
|
Aidong Men
|
Yang Liu
Synthetic data generation has emerged as a promising approach to enhance the reasoning capabilities of large language models. However, existing methods remain hindered by high costs—either through expensive API access or additional intermediate training—and are limited in their ability to generalize across different domains. To address these challenges, we propose a multi-agent debate framework based on the Socratic questioning strategy, abbreviated as SoDa. Distinguished from previous methods that prioritize data quantity, we highlight the wisdom of Socratic questioning in augmenting reasoning quality by deepening the thinking process to encourage exploration and broadening it to motivate self-reflection on each question. Combined with our efficient production pipeline, SoDa enables scaling while maintaining affordable costs. We use SoDa to generate diverse datasets for mathematics and code generation tasks with the Qwen2.5-7B-Instruct model, successfully fine-tuning a range of foundation models, from general-purpose ones to OpenAI o1-like ones. For mathematics, the experimental results show that SoDa outperforms existing datasets at the same scale, achieving improvements ranging from 1.3% to 13.5%. Remarkably, SoDa with 30K examples even surpasses the ScaleQuest dataset with 1000K samples, demonstrating significant efficiency. Our findings highlight the potential of SoDa as a universal, scalable, and cost-effective method for enhancing reasoning capabilities in large models across domains.
pdf
bib
abs
Quantile Regression with Large Language Models for Price Prediction
Nikhita Vedula
|
Dushyanta Dhyani
|
Laleh Jalali
|
Boris N. Oreshkin
|
Mohsen Bayati
|
Shervin Malmasi
Large Language Models (LLMs) have shown promise in structured prediction tasks, including regression, but existing approaches primarily focus on point estimates and lack systematic comparison across different methods. We investigate probabilistic regression using LLMs for unstructured inputs, addressing challenging text-to-distribution prediction tasks such as price estimation where both nuanced text understanding and uncertainty quantification are critical. We propose a novel quantile regression approach that enables LLMs to produce full predictive distributions, improving upon traditional point estimates. Through extensive experiments across three diverse price prediction datasets, we demonstrate that a Mistral-7B model fine-tuned with quantile heads significantly outperforms traditional approaches for both point and distributional estimations, as measured by three established metrics each for prediction accuracy and distributional calibration. Our systematic comparison of LLM approaches, model architectures, training approaches, and data scaling reveals that Mistral-7B consistently outperforms encoder architectures, embedding-based methods, and few-shot learning methods. Our experiments also reveal the effectiveness of LLM-assisted label correction in achieving human-level accuracy without systematic bias. Our curated datasets are made available at https://github.com/vnik18/llm-price-quantile-reg/ to support future research.
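Quantile regression is commonly trained with the pinball (quantile) loss, so a minimal sketch of that loss may help illustrate what a "quantile head" would be optimized against. The abstract does not state the exact training objective, so treating the pinball loss as the objective is an assumption here, and all function and argument names are hypothetical.

```python
from typing import Dict, Sequence


def pinball_loss(y_true: float, y_pred: float, q: float) -> float:
    """Pinball loss of one prediction at quantile level q (0 < q < 1)."""
    if not 0.0 < q < 1.0:
        raise ValueError("q must lie strictly between 0 and 1")
    diff = y_true - y_pred
    # Under-prediction is weighted by q, over-prediction by (1 - q).
    return max(q * diff, (q - 1.0) * diff)


def mean_pinball_loss(
    y_true: Sequence[float],
    y_pred: Dict[float, Sequence[float]],
) -> float:
    """Average pinball loss over all targets and all quantile levels.

    y_pred maps each quantile level to one prediction per target.
    """
    total = 0.0
    count = 0
    for q, preds in y_pred.items():
        for target, pred in zip(y_true, preds):
            total += pinball_loss(target, pred, q)
            count += 1
    return total / count
```

At a high quantile level such as 0.9, predicting below the true price costs more than predicting above it by the same amount, which is what pushes that head toward the upper tail of the predictive distribution.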
pdf
bib
abs
Training Turn-by-Turn Verifiers for Dialogue Tutoring Agents: The Curious Case of LLMs as Your Coding Tutors
Jian Wang
|
Yinpei Dai
|
Yichi Zhang
|
Ziqiao Ma
|
Wenjie Li
|
Joyce Chai
Intelligent tutoring agents powered by large language models (LLMs) have been increasingly explored to deliver personalized knowledge in areas such as language learning and science education. However, their capabilities in guiding users to solve complex real-world tasks remain underexplored. To address this limitation, in this work, we focus on coding tutoring, a challenging problem that requires tutors to proactively guide students towards completing predefined coding tasks. We propose a novel agent workflow, Trace-and-Verify (TRAVER), which combines knowledge tracing to estimate a student’s knowledge state and turn-by-turn verification to ensure effective guidance toward task completion. We introduce DICT, an automatic evaluation protocol that assesses tutor agents using controlled student simulation and code generation tests. Extensive experiments reveal the challenges of coding tutoring and demonstrate that TRAVER achieves a significantly higher success rate. Although we use code tutoring as an example in this paper, our approach can be extended beyond coding, providing valuable insights into advancing tutoring agents for human task learning.
pdf
bib
abs
AIGuard: A Benchmark and Lightweight Detection for E-commerce AIGC Risks
Wenhua Zhang
|
Weicheng Li
|
Xuanrong Rao
|
Lixin Zou
|
Xiangyang Luo
|
Chubin Zhuang
|
Yongjie Hong
|
Zhen Qin
|
Hengyu Chang
|
Chenliang Li
|
Bo Zheng
Recent advancements in AI-generated content (AIGC) have heightened concerns about harmful outputs, such as misinformation and malicious misuse. Existing detection methods face two key limitations: (1) lacking real-world AIGC scenarios and corresponding risk datasets, and (2) both traditional and multimodal large language models (MLLMs) struggle to detect risks in AIGC. Towards this end, we introduce **AIGuard**, the first benchmark for AIGC risk detection in real-world e-commerce. It includes 253,420 image-text pairs (i.e., the risk content and risk description) across four critical categories: *abnormal body*, *violating physical laws*, *misleading or illogical context*, and *harmful or problematic message*. To effectively detect these risks, we propose distilling text annotations into dense soft prompts and identifying risk content through image soft prompt matching during inference. Experiments on the benchmark show that this method achieves a 9.68% higher recall than leading multimodal models while using only 25% of the training resources and improving inference speed by 37.8 times. For further research, our benchmark and code are available at [https://github.com/wenh-zhang/aiguard-dataset](https://github.com/wenh-zhang/aiguard-dataset).
pdf
bib
abs
A2ATS: Retrieval-Based KV Cache Reduction via Windowed Rotary Position Embedding and Query-Aware Vector Quantization
Junhui He
|
Junna Xing
|
Nan Wang
|
Rui Xu
|
Shangyu Wu
|
Peng Zhou
|
Qiang Liu
|
Chun Jason Xue
|
Qingan Li
Long context large language models (LLMs) pose significant challenges for efficient serving due to the large memory footprint and high access overhead of KV cache. Retrieval-based KV cache reduction methods can mitigate these challenges, typically by offloading the complete KV cache to CPU and retrieving necessary tokens on demand during inference. However, these methods still suffer from unsatisfactory accuracy degradation and extra retrieval overhead. To address these limitations, this paper proposes A2ATS, a novel retrieval-based KV cache reduction method. A2ATS aims to obtain an accurate approximation of attention scores by applying the vector quantization technique to key states, thereby enabling efficient and precise retrieval of the top-K tokens. First, we propose Windowed Rotary Position Embedding, which decouples the positional dependency from query and key states after position embedding. Then, we propose query-aware vector quantization that optimizes the objective of attention score approximation directly. Finally, we design the heterogeneous inference architecture for KV cache offloading, enabling long context serving with larger batch sizes. Experimental results demonstrate that A2ATS can achieve a lower performance degradation with similar or lower overhead compared to existing methods, thereby increasing long context serving throughput by up to 2.7×.
pdf
bib
abs
TransBench: Breaking Barriers for Transferable Graphical User Interface Agents in Dynamic Digital Environments
Yuheng Lu
|
Qian Yu
|
Hongru Wang
|
Zeming Liu
|
Wei Su
|
Yanping Liu
|
Yuhang Guo
|
Maocheng Liang
|
Yunhong Wang
|
Haifeng Wang
Graphical User Interface (GUI) agents, which autonomously operate on digital interfaces through natural language instructions, hold transformative potential for accessibility, automation, and user experience. A critical aspect of their functionality is grounding — the ability to map linguistic intents to visual and structural interface elements. However, existing GUI agents often struggle to adapt to the dynamic and interconnected nature of real-world digital environments, where tasks frequently span multiple platforms and applications while also being impacted by version updates. To address this, we introduce TransBench, the first benchmark designed to systematically evaluate and enhance the transferability of GUI agents across three key dimensions: cross-version transferability (adapting to version updates), cross-platform transferability (generalizing across platforms like iOS, Android, and Web), and cross-application transferability (handling tasks spanning functionally distinct apps). TransBench includes 15 app categories with diverse functionalities, capturing essential pages across versions and platforms to enable robust evaluation. Our experiments demonstrate significant improvements in grounding accuracy, showcasing the practical utility of GUI agents in dynamic, real-world environments. Our code and data will be publicly available at GitHub.
pdf
bib
abs
Order Matters: Investigate the Position Bias in Multi-constraint Instruction Following
Jie Zeng
|
Qianyu He
|
Qingyu Ren
|
Jiaqing Liang
|
Weikang Zhou
|
Zeye Sun
|
Fei Yu
|
Yanghua Xiao
Real-world instructions with multiple constraints pose a significant challenge to existing large language models (LLMs). An observation is that the LLMs exhibit dramatic performance fluctuation when disturbing the order of the incorporated constraints. Yet, none of the existing works has systematically investigated this position bias problem in the field of multi-constraint instruction following. To bridge this gap, we design a probing task where we quantitatively measure the difficulty distribution of the constraints by a novel Difficulty Distribution Index (CDDI). Through the experimental results, we find that LLMs are more performant when presented with the constraints in a “hard-to-easy” order. This preference can be generalized to LLMs with different architecture or different sizes of parameters. Additionally, we conduct an explanation study, providing an intuitive insight into the correlation between the LLM’s attention and constraint orders. Our code and dataset are publicly available at https://github.com/meowpass/PBIF.
pdf
bib
abs
CoT-VTM: Visual-to-Music Generation with Chain-of-Thought Reasoning
Xikang Guan
|
Zheng Gu
|
Jing Huo
|
Tianyu Ding
|
Yang Gao
The application of visual-to-music generation (VTM) is rapidly growing. However, current VTM methods struggle with capturing the relationship between visuals and music in open-domain settings, mainly due to two challenges: the lack of large-scale, high-quality visual-music paired datasets and the absence of direct semantic correspondence between visuals and music. In this work, we propose CoT-VTM, a framework that distills Chain-of-Thought (CoT) reasoning to enable visual-to-music generation without paired data, while efficiently producing music aligned with visual content in open-domain settings. We first bridge the gap between visual, music, and text data using appropriate foundation models. Next, we identify key elements of the visual-music relationship and design a CoT prompt for visual-to-music mapping. To fully distill the reasoning of CoT, we incorporate latent information from intermediate reasoning steps as supervisory signals alongside visual and music supervision. Finally, we design a two-stage mapping distillation training process: the first stage uses discriminative MLP modules, while the second uses a generative embedding diffusion model (EDM). Our model achieves optimal performance on both image-to-music and video-to-music tasks. Project page: https://xxkkxxx.github.io/cot-vtm/
pdf
bib
abs
A Tale of Evaluating Factual Consistency: Case Study on Long Document Summarization Evaluation
Yang Zhong
|
Diane Litman
Ensuring factual consistency in summarization remains a challenge, especially for long-document evaluation. While automated, reference-free evaluation models are essential given the impracticality of large-scale human assessment for lengthy texts, challenges persist in evaluating different systems on how to handle different summary granularities and evolving model generations. In this work, we conduct a systematic study on diverse factual-consistency evaluation systems across four long-document datasets, encompassing summaries generated by models from non-LLMs to proprietary LLMs. Our analysis reveals that fine-grained continuous scores can provide more reliable assessments of different evaluation systems’ capabilities than binary classification. We also examine the relationship between sentence-level and summary-level model performance, highlighting its dependency on dataset characteristics. Moreover, our study reveals that advanced systems can achieve higher recall in error detection for older summaries, yet struggle with false positives and fine-grained error detection. Our analysis and case studies provide further insights into designing robust factuality evaluation systems, which are becoming increasingly in demand as generative models advance rapidly.
pdf
bib
abs
Evaluating Pretrained Causal Language Models for Synonymy
Ioana Ivan
|
Carlos Ramisch
|
Alexis Nasr
The scaling of causal language models in size and training data enabled them to tackle increasingly complex tasks. Despite the development of sophisticated tests to reveal their new capabilities, the underlying basis of these complex skills remains unclear. We argue that complex skills might be explained using simpler ones, represented by linguistic concepts. As an initial step in exploring this hypothesis, we focus on the lexical-semantic concept of synonymy, laying the groundwork for research into its relationship with more complex skills. We develop a comprehensive test suite to assess various aspects of synonymy under different conditions, and evaluate causal open-source models ranging up to 10 billion parameters. We find that these models effectively recognize synonymy but struggle to generate synonyms when prompted with relevant context.
pdf
bib
abs
MDIT-Bench: Evaluating the Dual-Implicit Toxicity in Large Multimodal Models
Bohan Jin
|
Shuhan Qi
|
Kehai Chen
|
Xinyi Guo
|
Xuan Wang
The widespread use of Large Multimodal Models (LMMs) has raised concerns about model toxicity. However, current research mainly focuses on explicit toxicity, with less attention to some more implicit toxicity regarding prejudice and discrimination. To address this limitation, we introduce a subtler type of toxicity named dual-implicit toxicity and a novel toxicity benchmark termed MDIT-Bench: Multimodal Dual-Implicit Toxicity Benchmark. Specifically, we first create the MDIT-Dataset with dual-implicit toxicity using the proposed Multi-stage Human-in-loop In-context Generation method. Based on this dataset, we construct the MDIT-Bench, a benchmark for evaluating the sensitivity of models to dual-implicit toxicity, with 317,638 questions covering 12 categories, 23 subcategories, and 780 topics. MDIT-Bench includes three difficulty levels, and we propose a metric to measure the toxicity gap exhibited by the model across them. In the experiment, we conducted MDIT-Bench on 13 prominent LMMs, and the results show that these LMMs cannot handle dual-implicit toxicity effectively. The model’s performance drops significantly in hard level, revealing that these LMMs still contain a significant amount of hidden but activatable toxicity. The data will be released upon the paper’s acceptance.
pdf
bib
abs
CoVE: Compressed Vocabulary Expansion Makes Better LLM-based Recommender Systems
Haochen Zhang
|
Tianyi Zhang
|
Junze Yin
|
Oren Gal
|
Anshumali Shrivastava
|
Vladimir Braverman
Recommender systems play a pivotal role in providing relevant content to users. With the rapid development of large language models (LLMs), researchers have begun utilizing LLMs to build more powerful recommender systems. However, existing approaches that focus on aligning LLMs with recommendation tasks do not fully leverage their sequential information processing capabilities, leading to suboptimal performance. In this paper, we propose a novel system called compressed vocabulary expansion (CoVE). In CoVE, each item is assigned a unique ID within the expanded vocabulary. Our framework effectively capitalizes on sequence understanding abilities of LLMs, significantly enhancing their performance on recommendation tasks. Additionally, we compress the embedding layer, making CoVE practical for large-scale industrial applications. The effectiveness and performance of CoVE are demonstrated through comprehensive experiments on multiple recommendation datasets and comparisons with prior works. Our code can be found at https://github.com/HaochenZhang717/CoVE-official-Repo.
pdf
bib
abs
CtrlA: Adaptive Retrieval-Augmented Generation via Inherent Control
Liu Huanshuo
|
Hao Zhang
|
Zhijiang Guo
|
Jing Wang
|
Kuicai Dong
|
Xiangyang Li
|
Yi Quan Lee
|
Cong Zhang
|
Yong Liu
Retrieval-augmented generation (RAG) has emerged as a promising solution for mitigating hallucinations of large language models (LLMs) with retrieved external knowledge. Adaptive RAG enhances this approach by enabling dynamic retrieval during generation, activating retrieval only when the query exceeds LLM’s internal knowledge. Existing methods primarily focus on detecting LLM’s confidence via statistical uncertainty. Instead, we present the first attempts to solve adaptive RAG from a representation perspective and develop an inherent control-based framework, termed CtrlA. Specifically, we extract the features that represent the honesty and confidence directions of LLM and adopt them to control LLM behavior and guide retrieval timing decisions. We also design a simple yet effective query formulation strategy to support adaptive retrieval. Experiments show that CtrlA is superior to existing adaptive RAG methods on a diverse set of tasks. Honesty steering can effectively make LLMs more honest and confidence monitoring is a promising indicator of retrieval trigger.
pdf
bib
abs
Maximum Score Routing For Mixture-of-Experts
Bowen Dong
|
Yilong Fan
|
Yutao Sun
|
Zhenyu Li
|
Tengyu Pan
|
Zhou Xun
|
Jianyong Wang
Routing networks in sparsely activated mixture-of-experts (MoE) dynamically allocate input tokens to top-k experts through differentiable sparse transformations, enabling scalable model capacity while preserving computational efficiency. Traditional MoE networks impose an expert capacity constraint to ensure GPU-friendly computation. However, this leads to token dropping when capacity is saturated and results in low hardware efficiency due to padding in underutilized experts. Removing the capacity constraint, in turn, compromises load balancing and computational efficiency. To address these issues, we propose Maximum Score Routing (**MaxScore**), a novel MoE routing paradigm that models routing as a minimum-cost maximum-flow problem and integrates a SoftTopk operator. MaxScore resolves the fundamental limitations of iterative rerouting and optimal transport formulations, achieving lower training losses and higher evaluation scores at equivalent FLOPs compared to both constrained and unconstrained baselines.
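The token-dropping problem described in this abstract can be illustrated with a minimal sketch of conventional top-k routing under a fixed expert capacity. This is the greedy, first-come baseline behaviour, not MaxScore: the min-cost max-flow formulation and the SoftTopk operator are not reproduced here. The function name and the score layout (one row of router scores per token) are assumptions.

```python
from typing import List, Sequence, Tuple


def route_top_k_with_capacity(
    scores: Sequence[Sequence[float]],
    k: int,
    capacity: int,
) -> Tuple[List[List[int]], List[Tuple[int, int]]]:
    """Greedy top-k routing where each expert accepts at most `capacity` tokens.

    scores[t][e] is the router score of token t for expert e.
    Returns the experts that accepted each token, plus the
    (token, expert) assignments dropped because the expert was full.
    """
    num_experts = len(scores[0])
    load = [0] * num_experts
    assignments: List[List[int]] = []
    dropped: List[Tuple[int, int]] = []
    for token, row in enumerate(scores):
        # Experts ranked by descending score for this token.
        ranked = sorted(range(num_experts), key=lambda e: row[e], reverse=True)
        kept = []
        for expert in ranked[:k]:
            if load[expert] < capacity:
                load[expert] += 1
                kept.append(expert)
            else:
                # Capacity saturated: this assignment is lost.
                dropped.append((token, expert))
        assignments.append(kept)
    return assignments, dropped
```

With three tokens that all prefer expert 0, `k=1`, and `capacity=2`, the third token is dropped while expert 1 stays idle, which is the combination of token dropping and underutilization the abstract refers to.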
pdf
bib
abs
Time Course MechInterp: Analyzing the Evolution of Components and Knowledge in Large Language Models
Ahmad Dawar Hakimi
|
Ali Modarressi
|
Philipp Wicke
|
Hinrich Schuetze
Understanding how large language models (LLMs) acquire and store factual knowledge is crucial for enhancing their interpretability, reliability, and efficiency. In this work, we analyze the evolution of factual knowledge representation in the OLMo-7B model by tracking the roles of its Attention Heads and Feed Forward Networks (FFNs) over training. We classify these components into four roles—general, entity, relation-answer, and fact-answer specific—and examine their stability and transitions. Our results show that LLMs initially depend on broad, general-purpose components, which later specialize as training progresses. Once the model reliably predicts answers, some components are repurposed, suggesting an adaptive learning process. Notably, answer-specific attention heads display the highest turnover, whereas FFNs remain stable, continually refining stored knowledge. These insights offer a mechanistic view of knowledge formation in LLMs and have implications for model pruning, optimization, and transparency.
pdf
bib
abs
Well Begun is Half Done: Low-resource Preference Alignment by Weak-to-Strong Decoding
Feifan Song
|
Shaohang Wei
|
Wen Luo
|
Yuxuan Fan
|
Tianyu Liu
|
Guoyin Wang
|
Houfeng Wang
Large Language Models (LLMs) require alignment with human preferences to avoid generating offensive, false, or meaningless content. Recently, low-resource methods for LLM alignment have been popular, while still facing challenges in obtaining both high-quality and aligned content. Motivated by the observation that the difficulty of generating aligned responses is concentrated at the beginning of decoding, we propose a novel framework, Weak-to-Strong Decoding (WSD), to enhance the alignment ability of base models by the guidance of a small aligned model. The small model first drafts well-aligned beginnings, followed by the large base model to continue the rest, controlled by a well-designed auto-switch mechanism. We also collect a new dataset, GenerAlign, to fine-tune a small-sized Pilot-3B as the draft model, which effectively enhances different base models under the WSD framework to outperform all baseline methods, while avoiding degradation on downstream tasks, termed as the alignment tax. Extensive experiments are further conducted to examine the impact of different settings and time efficiency, as well as analyses on the intrinsic mechanisms of WSD in depth.
pdf
bib
abs
Disentangling Text and Math in Word Problems: Evidence for the Bidimensional Structure of Large Language Models’ Reasoning
Pedro Calais
|
Gabriel Franco
|
Zilu Tang
|
Themistoklis Nikas
|
Wagner Meira Jr.
|
Evimaria Terzi
|
Mark Crovella
Do LLMs process text and mathematics as a unified skill, or do these components rely on distinct underlying mechanisms? We investigate this question by disentangling the textual interpretation and mathematical solving steps in word problems drawn from Brazil’s largest college entrance exam (ENEM) and GSM8K, a popular grade school-level benchmark. Using the symbolic solver SymPy, we transform word problems into equivalent purely mathematical representations, isolating equation formulation from textual comprehension. Our extended benchmarks enable a structured analysis of LLM performance across these two dimensions. Through empirical evaluations, we find that small-scale LLMs struggle significantly more with text interpretation than with equation solving, with accuracy dropping by a factor of 2 to 7 when solving full word problems compared to their math-only counterparts. Exploratory factor analysis confirms a bidimensional structure in LLM reasoning, where models exhibit distinct proficiencies in textual and mathematical components, underscoring the need for targeted improvements in language comprehension. By analyzing the latent factors associated with each model, our findings provide a framework for researchers and practitioners to make informed choices when selecting models based on computational costs and the nature of their tasks.
pdf
bib
abs
Human-LLM Coevolution: Evidence from Academic Writing
Mingmeng Geng
|
Roberto Trotta
With a statistical analysis of arXiv paper abstracts, we report a marked drop in the frequency of several words previously identified as overused by ChatGPT, such as “delve”, starting soon after they were pointed out in early 2024. The frequency of certain other words favored by ChatGPT, such as “significant”, has instead kept increasing. These phenomena suggest that some authors of academic papers have adapted their use of large language models (LLMs), for example, by selecting outputs or applying modifications to the LLM-generated content. Such coevolution and cooperation of humans and LLMs thus introduce additional challenges to the detection of machine-generated text in real-world scenarios. Estimating the impact of LLMs on academic writing by examining word frequency remains feasible, and more attention should be paid to words that were already frequently employed, including those that have decreased in frequency due to LLMs’ disfavor. The coevolution between humans and LLMs also merits further study.
pdf
bib
abs
Disentangled Multi-span Evolutionary Network against Temporal Knowledge Graph Reasoning
Hao Dong
|
Ziyue Qiao
|
Zhiyuan Ning
|
Qi Hao
|
Yi Du
|
Pengyang Wang
|
Yuanchun Zhou
Temporal Knowledge Graphs (TKGs) incorporate the temporal feature to express the transience of knowledge by describing when facts occur. TKG extrapolation aims to infer possible future facts based on known history, which has garnered significant attention in recent years. Some existing methods treat TKG as a sequence of independent subgraphs to model temporal evolution patterns, demonstrating impressive reasoning performance. However, they still have limitations: 1) In modeling subgraph semantic evolution, they usually neglect the internal structural interactions between subgraphs, which are actually crucial for encoding TKGs. 2) They overlook the potential smooth features that do not lead to semantic changes, which should be distinguished from the semantic evolution process. Therefore, we propose Disentangled Multi-span Evolutionary Network (DiMNet) for TKG reasoning. Specifically, we design a multi-span evolution strategy that captures local neighbor features while perceiving historical neighbor semantic information, thus enabling internal interactions between subgraphs during the evolution process. To maximize the capture of semantic change patterns, we design a disentangle component that adaptively separates nodes’ active and stable features, used to dynamically control the influence of historical semantics on future evolution. Extensive experiments demonstrate that DiMNet achieves substantial performance in TKG reasoning, outperforming the state of the art by up to 22.7% in MRR.
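For readers unfamiliar with the reported metric, mean reciprocal rank (MRR) averages the inverse rank of the correct answer over all queries. The sketch below is a generic definition, not the authors' evaluation code; the example ranks are made up.

```python
def mean_reciprocal_rank(ranks):
    """ranks: 1-based rank of the correct entity for each query."""
    if not ranks:
        raise ValueError("ranks must be non-empty")
    return sum(1.0 / r for r in ranks) / len(ranks)

# Hypothetical ranks for three queries: (1 + 1/2 + 1/4) / 3
mrr = mean_reciprocal_rank([1, 2, 4])
```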
pdf
bib
abs
GRAF: Graph Retrieval Augmented by Facts for Romanian Legal Multi-Choice Question Answering
Cristian-George Craciun
|
Răzvan-Alexandru Smădu
|
Dumitru-Clementin Cercel
|
Mihaela-Claudia Cercel
Pre-trained language models have shown remarkable performance in recent years, setting a new paradigm for natural language processing (NLP) research. The legal domain has received some attention from the NLP community, in part due to its textual nature. Question answering (QA) systems represent some of the tasks in this domain. This work explores the legal multiple-choice QA (MCQA) for Romanian. The contribution of this work is multi-fold. We introduce JuRO, the first openly available Romanian legal MCQA dataset, comprising 10,836 questions from three examinations. Along with this dataset, we introduce CROL, an organized corpus of laws comprising a total of 93 distinct documents with their modifications over 763 time spans, which we used for information retrieval techniques in this work. Additionally, we construct Law-RoG, the first graph of legal knowledge for the Romanian language, derived from the aforementioned corpus. Lastly, we propose a novel approach for MCQA, namely Graph Retrieval Augmented by Facts (GRAF), which achieves competitive results with generally accepted state-of-the-art methods and even exceeds them in most settings.
pdf
bib
abs
Express What You See: Can Multimodal LLMs Decode Visual Ciphers with Intuitive Semiosis Comprehension?
Jiayi Kuang
|
Yinghui Li
|
Chen Wang
|
Haohao Luo
|
Ying Shen
|
Wenhao Jiang
Bridging the gap between vision and language remains a pivotal challenge for the multimodal community. Traditional VQA benchmarks encounter a modality gap and over-reliance on language priors, whereas human cognition excels at intuitive semiosis, associating abstract visual symbols with linguistic semantics. Inspired by this neurocognitive mechanism, we focus on emojis, the visual cipher conveying abstract textual semantics. Specifically, we propose a novel task of generating abstract linguistics from emoji sequence images, where such reasoning underpins critical applications in cryptography, thus challenging MLLMs’ ability to decode the complex semantics of visual ciphers. We introduce eWe-bench (Express What you SeE), assessing MLLMs’ capability of intuitive semiosis like humans. Our data construction framework ensures high visual sensitivity and data quality, which can be extended to future data enhancement. Evaluation results on advanced MLLMs highlight critical deficiencies in visual intuitive symbolic reasoning. We believe our interesting insights for advancing visual semiosis in MLLMs will pave the way for cryptographic analysis and high-level intuitive cognition intelligence of MLLMs.
pdf
bib
abs
ConFit v2: Improving Resume-Job Matching using Hypothetical Resume Embedding and Runner-Up Hard-Negative Mining
Xiao Yu
|
Ruize Xu
|
Chengyuan Xue
|
Jinzhong Zhang
|
Xu Ma
|
Zhou Yu
A reliable resume-job matching system helps a company recommend suitable candidates from a pool of resumes and helps a job seeker find relevant jobs from a list of job posts. However, since job seekers apply only to a few jobs, interaction labels in resume-job datasets are sparse. We introduce ConFit v2, an improvement over ConFit to tackle this sparsity problem. We propose two techniques to enhance the encoder’s contrastive training process: augmenting job data with a hypothetical reference resume generated by a large language model; and creating high-quality hard negatives from unlabeled resume/job pairs using a novel hard-negative mining strategy. This method also simplifies the representation space of the encoder. We evaluate ConFit v2 on two real-world datasets and demonstrate that it outperforms ConFit and prior methods (including BM25 and OpenAI text-embedding-003), achieving an average absolute improvement of 13.8% in recall and 17.5% in nDCG across job-ranking and resume-ranking tasks.
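The two reported ranking metrics can be sketched as below. This is a generic illustration assuming binary relevance labels and a ranked list without duplicate IDs; the cutoffs and IDs are hypothetical, and the paper's exact evaluation settings may differ.

```python
import math

def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of relevant items that appear in the top-k of the ranking."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    return len(set(ranked_ids[:k]) & relevant) / len(relevant)

def ndcg_at_k(ranked_ids, relevant_ids, k):
    """nDCG with binary gains: DCG of the ranking divided by the ideal DCG."""
    relevant = set(relevant_ids)
    dcg = sum(1.0 / math.log2(i + 2)
              for i, doc in enumerate(ranked_ids[:k]) if doc in relevant)
    ideal = sum(1.0 / math.log2(i + 2) for i in range(min(k, len(relevant))))
    return dcg / ideal if ideal else 0.0

# Hypothetical ranking of resumes for one job post.
ranked = ["r3", "r1", "r7", "r2"]
relevant = {"r1", "r2"}
recall_2 = recall_at_k(ranked, relevant, 2)
ndcg_4 = ndcg_at_k(ranked, relevant, 4)
```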
pdf
bib
abs
Knowing Before Saying: LLM Representations Encode Information About Chain-of-Thought Success Before Completion
Anum Afzal
|
Florian Matthes
|
Gal Chechik
|
Yftah Ziser
We investigate whether the success of a zero-shot Chain-of-Thought (CoT) process can be predicted before completion. Our classifier, based on LLM representations, performs well even before a single token is generated, suggesting that crucial information about the reasoning process is already present in the representations of the initial steps. In contrast, a strong BERT-based baseline, which relies solely on the generated tokens, performs worse, likely because it depends on shallow linguistic cues rather than deeper reasoning dynamics. Surprisingly, using later reasoning steps does not always improve classification. When additional context is unhelpful, earlier representations resemble later ones more, suggesting LLMs encode key information early. This implies reasoning can often stop early without loss. To test this, we conduct early stopping experiments, showing that truncating CoT reasoning still improves performance over not using CoT at all, though a gap remains compared to full reasoning. However, approaches like supervised learning or reinforcement learning designed to shorten CoT chains could leverage our classifier’s guidance to identify when early stopping is effective. Our findings provide insights that may support such methods, helping to optimize CoT’s efficiency while preserving its benefits.
pdf
bib
abs
Grounding Task Assistance with Multimodal Cues from a Single Demonstration
Gabriel Herbert Sarch
|
Balasaravanan Thoravi Kumaravel
|
Sahithya Ravi
|
Vibhav Vineet
|
Andrew D Wilson
A person’s demonstration often serves as a key reference for others learning the same task. However, RGB video, the dominant medium for representing these demonstrations, often fails to capture fine-grained contextual cues such as intent, safety-critical environmental factors, and subtle preferences embedded in human behavior. This sensory gap fundamentally limits the ability of Vision Language Models (VLMs) to reason about why actions occur and how they should adapt to individual users. To address this, we introduce MICA (Multimodal Interactive Contextualized Assistance), a framework that improves conversational agents for task assistance by integrating eye gaze and speech cues. MICA segments demonstrations into meaningful sub-tasks and extracts keyframes and captions that capture fine-grained intent and user-specific cues, enabling richer contextual grounding for visual question answering. Evaluations on questions derived from real-time chat-assisted task replication show that multimodal cues significantly improve response quality over frame-based retrieval. Notably, gaze cues alone achieve 93% of speech performance, and their combination yields the highest accuracy. Task type determines the effectiveness of implicit (gaze) vs. explicit (speech) cues, underscoring the need for adaptable multimodal models. These results highlight the limitations of frame-based context and demonstrate the value of multimodal signals for real-world AI task assistance.
pdf
bib
abs
Awes, Laws, and Flaws From Today’s LLM Research
Adrian de Wynter
We perform a critical examination of the scientific methodology behind contemporary large language model (LLM) research. For this we assess over 2,000 research works released between 2020 and 2024 based on criteria typical of what is considered good research (e.g., presence of statistical tests and reproducibility), and cross-validate them with arguments that are at the centre of controversy (e.g., claims of emergent behaviour). We find multiple trends, such as declines in ethics disclaimers, a rise of LLMs as evaluators, and an increase in claims of LLM reasoning abilities without leveraging human evaluation. We note that conference checklists are effective at curtailing some of these issues, but balancing velocity and rigour in research cannot solely rely on these. We tie all these findings to findings from recent meta-reviews and extend recommendations on how to address what does, does not, and should work in LLM research.
pdf
bib
abs
Dual Debiasing for Noisy In-Context Learning for Text Generation
Siqi Liang
|
Sumyeong Ahn
|
Paramveer Dhillon
|
Jiayu Zhou
In-context learning (ICL) relies heavily on high-quality demonstrations drawn from large annotated corpora. Existing approaches detect noisy annotations by ranking local perplexities, presuming that noisy samples yield higher perplexities than their clean counterparts. However, this assumption breaks down when the noise ratio is high and many demonstrations are flawed. We re-examine the perplexity-based paradigm for text generation under noisy annotations, highlighting two sources of bias in perplexity: the annotation itself and the domain-specific knowledge inherent in large language models (LLMs). To overcome these biases, we introduce a dual-debiasing framework that uses synthesized neighbors to explicitly correct perplexity estimates, yielding a robust Sample Cleanliness Score. This metric uncovers absolute sample cleanliness regardless of the overall corpus noise level. Extensive experiments demonstrate our method’s superior noise-detection capabilities and show that its final ICL performance is comparable to that of a fully clean demonstration corpus. Moreover, our approach remains robust even when noise ratios are extremely high.
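The perplexity-ranking baseline that the abstract re-examines can be sketched as follows. This illustrates only the baseline assumption (noisier samples have higher perplexity), not the dual-debiasing framework itself; the token log-probabilities are hypothetical and would normally come from a language model.

```python
import math

def perplexity(token_logprobs):
    """Perplexity from per-token natural-log probabilities."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def rank_by_perplexity(samples):
    """samples: {sample_id: [token log-probs]}. Highest perplexity (most suspect) first."""
    return sorted(samples, key=lambda s: perplexity(samples[s]), reverse=True)

# Hypothetical log-probabilities for two annotated demonstrations.
samples = {"clean": [-0.1, -0.2], "noisy": [-2.0, -3.0]}
suspects = rank_by_perplexity(samples)
```

Because the ranking is purely relative, it says little about absolute cleanliness when most samples are noisy, which is the failure mode the abstract points to.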
pdf
bib
abs
DRS: Deep Question Reformulation With Structured Output
Zhecheng Li
|
Yiwei Wang
|
Bryan Hooi
|
Yujun Cai
|
Nanyun Peng
|
Kai-Wei Chang
Question answering represents a core capability of large language models (LLMs). However, when individuals encounter unfamiliar knowledge in texts, they often formulate questions that the text itself cannot answer due to insufficient understanding of the underlying information. Recent studies reveal that while LLMs can detect unanswerable questions, they struggle to assist users in reformulating these questions. Even advanced models like GPT-3.5 demonstrate limited effectiveness in this regard. To address this limitation, we propose DRS: Deep Question Reformulation with Structured Output, a novel zero-shot method aimed at enhancing LLMs’ ability to assist users in reformulating questions to extract relevant information from new documents. DRS combines the strengths of LLMs with a DFS-based algorithm to iteratively explore potential entity combinations and constrain outputs using predefined entities. This structured approach significantly enhances the reformulation capabilities of LLMs. Comprehensive experimental evaluations demonstrate that DRS improves the reformulation accuracy of GPT-3.5 from 23.03% to 70.42%, while also enhancing the performance of open-source models, such as Gemma2-9B, from 26.35% to 56.75%.
pdf
bib
abs
Towards Explainable Hate Speech Detection
Happy Khairunnisa Sariyanto
|
Diclehan Ulucan
|
Oguzhan Ulucan
|
Marc Ebner
Recent advancements in deep learning have significantly enhanced the efficiency and accuracy of natural language processing (NLP) tasks. However, these models often require substantial computational resources, which remains a major drawback. Reducing the complexity of deep learning architectures and exploring simpler yet effective approaches can lead to cost-efficient NLP solutions. This is also a step towards explainable AI, i.e., uncovering how a particular task is carried out. For this analysis, we chose the task of hate speech detection. We address hate speech detection by introducing a model that employs a weighted sum of valence, arousal, and dominance (VAD) scores for classification. To determine the optimal weights and classification strategies, we analyze hate speech and non-hate speech words based on both their individual and summed VAD values. Our experimental results demonstrate that this straightforward approach can compete with state-of-the-art neural network methods, including GPT-based models, in detecting hate speech.
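A weighted-sum VAD classifier of the kind described above might look like the sketch below. The lexicon entries, weights, threshold, and decision direction are all hypothetical placeholders chosen for illustration; the paper's actual values and classification strategy are not given in the abstract.

```python
def vad_score(tokens, lexicon, weights):
    """Sum w_v*valence + w_a*arousal + w_d*dominance over tokens found in the lexicon."""
    w_v, w_a, w_d = weights
    total = 0.0
    for tok in tokens:
        if tok in lexicon:
            v, a, d = lexicon[tok]
            total += w_v * v + w_a * a + w_d * d
    return total

def classify(tokens, lexicon, weights, threshold):
    """Flag the text when its weighted VAD score exceeds the threshold (assumed rule)."""
    return vad_score(tokens, lexicon, weights) > threshold

# Hypothetical lexicon: word -> (valence, arousal, dominance), each in [0, 1].
lexicon = {"awful": (0.1, 0.8, 0.4), "friend": (0.9, 0.4, 0.6)}
weights = (-1.0, 0.5, 0.0)  # hypothetical: low valence and high arousal raise the score

flag_a = classify(["awful"], lexicon, weights, threshold=0.0)
flag_b = classify(["friend"], lexicon, weights, threshold=0.0)
```

Since every prediction is a sum of per-word contributions, each decision can be traced back to the words that drove it.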
pdf
bib
abs
BioHopR: A Benchmark for Multi-Hop, Multi-Answer Reasoning in Biomedical Domain
Yunsoo Kim
|
Yusuf Abdulle
|
Honghan Wu
Biomedical reasoning often requires traversing interconnected relationships across entities such as drugs, diseases, and proteins. Despite the increasing prominence of large language models (LLMs), existing benchmarks lack the ability to evaluate multi-hop reasoning in the biomedical domain, particularly for queries involving one-to-many and many-to-many relationships. This gap leaves the critical challenges of biomedical multi-hop reasoning underexplored. To address this, we introduce BioHopR, a novel benchmark designed to evaluate multi-hop, multi-answer reasoning in structured biomedical knowledge graphs. Built from the comprehensive PrimeKG, BioHopR includes 1-hop and 2-hop reasoning tasks that reflect real-world biomedical complexities. Evaluations of state-of-the-art models reveal that O3-mini, a proprietary reasoning-focused model, achieves 37.93% precision on 1-hop tasks and 14.57% on 2-hop tasks, outperforming proprietary models such as GPT4O and open-source biomedical models including HuatuoGPT-o1-70B and Llama-3.3-70B. However, all models exhibit significant declines in multi-hop performance, underscoring the challenges of resolving implicit reasoning steps in the biomedical domain. By addressing the lack of benchmarks for multi-hop reasoning in the biomedical domain, BioHopR sets a new standard for evaluating reasoning capabilities and highlights critical gaps between proprietary and open-source models while paving the way for future advancements in biomedical LLMs. BioHopR is available at https://huggingface.co/datasets/knowlab-research/BioHopR.
pdf
bib
abs
PipeSpec: Breaking Stage Dependencies in Hierarchical LLM Decoding
Bradley McDanel
|
Sai Qian Zhang
|
Yunhai Hu
|
Zining Liu
Speculative decoding accelerates large language model inference by using smaller draft models to generate candidate tokens for parallel verification. However, current approaches are limited by sequential stage dependencies that prevent full hardware utilization. We present PipeSpec, a framework that generalizes speculative decoding to use multiple models arranged in a hierarchical pipeline, enabling asynchronous execution with lightweight coordination for prediction verification and rollback. Our analytical model characterizes token generation rates across pipeline stages and proves guaranteed throughput improvements over traditional decoding for any non-zero acceptance rate. We further derive closed-form expressions for steady-state verification probabilities that explain the empirical benefits of pipeline depth. We validate PipeSpec across text summarization, mathematical reasoning, and code generation tasks using LLaMA 2 and 3 models, demonstrating that pipeline efficiency increases with model depth, providing a scalable approach to accelerating LLM inference on multi-device systems. Our code is available at https://github.com/BradMcDanel/PipeSpec.
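As background, the basic two-model form of speculative decoding that PipeSpec generalizes can be sketched as below. This is a toy greedy version with deterministic stand-in "models" (plain functions, hypothetical) and sequential verification; it is not the PipeSpec pipeline, and real systems verify the drafted tokens in one parallel pass of the target model.

```python
def speculative_decode(target_next, draft_next, prompt, num_tokens, k=4):
    """Greedy speculative decoding: the draft proposes k tokens, the target verifies them."""
    out = list(prompt)
    goal = len(prompt) + num_tokens
    while len(out) < goal:
        # 1) The cheap draft model proposes k tokens autoregressively.
        ctx = list(out)
        proposal = []
        for _ in range(k):
            tok = draft_next(ctx)
            proposal.append(tok)
            ctx.append(tok)
        # 2) The target checks each proposed token in order. The target's own token
        #    is always the one kept, so the output matches plain greedy decoding.
        for tok in proposal:
            expected = target_next(out)
            out.append(expected)
            if expected != tok or len(out) >= goal:
                break  # first mismatch: discard the rest of the draft
    return out[len(prompt):]

# Hypothetical toy models over tokens 0..4: the draft is wrong whenever the last token is 2.
target = lambda ctx: (ctx[-1] + 1) % 5
draft = lambda ctx: 0 if ctx[-1] == 2 else (ctx[-1] + 1) % 5

tokens = speculative_decode(target, draft, prompt=[0], num_tokens=6)
```

The speedup in practice comes from accepting several drafted tokens per target pass; the sketch only demonstrates that the accept-or-correct rule preserves the target model's output.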
pdf
bib
abs
LAM SIMULATOR: Advancing Data Generation for Large Action Model Training via Online Exploration and Trajectory Feedback
Thai Quoc Hoang
|
Kung-Hsiang Huang
|
Shirley Kokane
|
Jianguo Zhang
|
Zuxin Liu
|
Ming Zhu
|
Jake Grigsby
|
Tian Lan
|
Michael S Ryoo
|
Chien-Sheng Wu
|
Shelby Heinecke
|
Huan Wang
|
Silvio Savarese
|
Caiming Xiong
|
Juan Carlos Niebles
Large Action Models (LAMs) for AI Agents offer incredible potential but face challenges due to the need for high-quality training data, especially for multi-step tasks that involve planning, executing tool calls, and responding to feedback. To address these issues, we present LAM SIMULATOR, a comprehensive framework designed for online exploration of agentic tasks with high-quality feedback. Our framework features a dynamic task query generator, an extensive collection of tools, and an interactive environment where Large Language Model (LLM) Agents can call tools and receive real-time feedback. This setup enables LLM Agents to explore and solve tasks autonomously, facilitating the discovery of multiple approaches to tackle any given task. The resulting action trajectory data are then used to create high-quality training datasets for LAMs. Our experiments on popular agentic benchmarks, ToolBench and CRMArena, highlight the effectiveness of LAM SIMULATOR: models trained with self-generated datasets using our framework achieve significant performance gains, up to a 49.3% improvement over their original baselines. LAM SIMULATOR requires minimal human input during dataset creation, highlighting LAM SIMULATOR’s efficiency and effectiveness in speeding up development of AI agents.
pdf
bib
abs
Rank, Chunk and Expand: Lineage-Oriented Reasoning for Taxonomy Expansion
Sahil Mishra
|
Kumar Arjun
|
Tanmoy Chakraborty
Taxonomies are hierarchical knowledge graphs crucial for recommendation systems and web applications. As data grows, expanding taxonomies is essential, but existing methods face key challenges: (1) discriminative models struggle with representation limits and generalization, while (2) generative methods either process all candidates at once, introducing noise and exceeding context limits, or discard relevant entities by selecting noisy candidates. We propose LORex (Lineage-Oriented Reasoning for Taxonomy Expansion), a plug-and-play framework that combines discriminative ranking and generative reasoning for efficient taxonomy expansion. Unlike prior methods, LORex ranks and chunks candidate terms into batches, filtering noise and iteratively refining selections by reasoning over candidates’ hierarchy to ensure contextual efficiency. Extensive experiments across four benchmarks and twelve baselines show that LORex improves accuracy by 12% and Wu & Palmer similarity by 5% over state-of-the-art methods.
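Wu & Palmer similarity, one of the reported metrics, scores two taxonomy nodes by the depth of their lowest common ancestor relative to their own depths. The sketch below assumes a single-rooted tree stored as a child-to-parent map with the root at depth 1 (depth conventions vary); the toy taxonomy is hypothetical.

```python
def path_to_root(node, parent):
    """Return [node, parent, ..., root] for a child -> parent mapping."""
    path = [node]
    while node in parent:
        node = parent[node]
        path.append(node)
    return path

def wu_palmer(a, b, parent):
    """2 * depth(lca) / (depth(a) + depth(b)), with the root at depth 1."""
    path_a = path_to_root(a, parent)
    ancestors_b = set(path_to_root(b, parent))
    # First node on a's path that is also an ancestor of b (or b itself).
    lca = next(n for n in path_a if n in ancestors_b)
    depth = lambda n: len(path_to_root(n, parent))
    return 2.0 * depth(lca) / (depth(a) + depth(b))

# Hypothetical toy taxonomy rooted at "entity".
parent = {"animal": "entity", "plant": "entity", "dog": "animal", "cat": "animal"}
sim_siblings = wu_palmer("dog", "cat", parent)
sim_distant = wu_palmer("dog", "plant", parent)
```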
pdf
bib
abs
Probing Subphonemes in Morphology Models
Gal Astrach
|
Yuval Pinter
Transformers have achieved state-of-the-art performance in morphological inflection tasks, yet their ability to generalize across languages and morphological rules remains limited. One possible explanation for this behavior can be the degree to which these models are able to capture implicit phenomena at the phonological and subphonemic levels. We introduce a language-agnostic probing method to investigate phonological feature encoding in transformers trained directly on phonemes, and perform it across seven morphologically diverse languages. We show that phonological features which are local, such as final-obstruent devoicing in Turkish, are captured well in phoneme embeddings, whereas long-distance dependencies like vowel harmony are better represented in the transformer’s encoder. Finally, we discuss how these findings inform empirical strategies for training morphological models, particularly regarding the role of subphonemic feature acquisition.
pdf
bib
abs
Exploiting Instruction-Following Retrievers for Malicious Information Retrieval
Parishad BehnamGhader
|
Nicholas Meade
|
Siva Reddy
Instruction-following retrievers have been widely adopted alongside LLMs in real-world applications, but little work has investigated the safety risks surrounding their increasing search capabilities. We empirically study the ability of retrievers to satisfy malicious queries, both when used directly and when used in a retrieval augmented generation-based setup. Concretely, we investigate six leading retrievers, including NV-Embed and LLM2Vec, and find that given malicious requests, most retrievers can (for >50% of queries) select relevant harmful passages. For example, LLM2Vec correctly selects passages for 61.35% of our malicious queries. We further uncover an emerging risk with instruction-following retrievers, where highly relevant harmful information can be surfaced by exploiting their instruction-following capabilities. Finally, we show that even safety-aligned LLMs, such as Llama3, can satisfy malicious requests when provided with harmful retrieved passages in-context. In summary, our findings underscore the malicious misuse risks associated with increasing retriever capability.
pdf
bib
abs
Improving Causal Interventions in Amnesic Probing with Mean Projection or LEACE
Alicja Dobrzeniecka
|
Antske Fokkens
|
Pia Sommerauer
Amnesic probing is a technique used to examine the influence of specific linguistic information on the behaviour of a model. This involves identifying and removing the relevant information and then assessing whether the model’s performance on the main task changes. If the removed information is relevant, the model’s performance should decline. The difficulty with this approach lies in removing only the target information while leaving other information unchanged. It has been shown that Iterative Nullspace Projection (INLP), a widely used removal technique, introduces random modifications to representations when eliminating target information. We demonstrate that Mean Projection (MP) and LEACE, two proposed alternatives, remove information in a more targeted manner, thereby enhancing the potential for obtaining behavioural explanations through Amnesic Probing.
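To make the removal step concrete, the sketch below shows a mean-projection style intervention for a binary property, under the assumption that it amounts to projecting out the single direction connecting the two class means. This is an illustrative reading rather than the authors' implementation; the function names and toy vectors are hypothetical, and LEACE is not covered here.

```python
def mean(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def mean_projection(reps, labels):
    """Remove from every representation the direction between the two class means."""
    m0 = mean([r for r, y in zip(reps, labels) if y == 0])
    m1 = mean([r for r, y in zip(reps, labels) if y == 1])
    direction = [a - b for a, b in zip(m1, m0)]
    norm_sq = sum(d * d for d in direction)
    if norm_sq == 0:
        return [list(r) for r in reps]  # class means already coincide
    cleaned = []
    for r in reps:
        coeff = sum(x * d for x, d in zip(r, direction)) / norm_sq
        cleaned.append([x - coeff * d for x, d in zip(r, direction)])
    return cleaned

# Hypothetical 2-D representations: the label is encoded only in the second coordinate.
reps = [[1.0, 0.0], [3.0, 0.0], [1.0, 2.0], [3.0, 2.0]]
labels = [0, 0, 1, 1]
cleaned = mean_projection(reps, labels)
```

In this toy case the first coordinate, which carries no label information, is left untouched, which is the sense in which a single projection is more targeted than repeated random-looking modifications.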
pdf
bib
abs
The Threat of PROMPTS in Large Language Models: A System and User Prompt Perspective
Zixuan Xia
|
Haifeng Sun
|
Jingyu Wang
|
Qi Qi
|
Huazheng Wang
|
Xiaoyuan Fu
|
Jianxin Liao
Prompts, especially high-quality ones, play an invaluable role in assisting large language models (LLMs) to accomplish various natural language processing tasks. However, carefully crafted prompts can also manipulate model behavior. Therefore, the security risks that “prompts themselves face” and those “arising from harmful prompts” cannot be overlooked; we define these as Prompt Threat (PT) issues. In this paper, we review the latest attack methods related to prompt threats, focusing on prompt leakage attacks and prompt jailbreak attacks. Additionally, we summarize the experimental setups of these methods and explore the relationship between prompt threats and prompt injection attacks.
pdf
bib
abs
RoseRAG: Robust Retrieval-augmented Generation with Small-scale LLMs via Margin-aware Preference Optimization
Tianci Liu
|
Haoxiang Jiang
|
Tianze Wang
|
Ran Xu
|
Yue Yu
|
Linjun Zhang
|
Tuo Zhao
|
Haoyu Wang
Large language models (LLMs) have achieved impressive performance but face high computational costs and latency, limiting their deployment in resource-constrained settings. In contrast, small-scale LLMs (SLMs) are more efficient yet struggle to capture evolving real-world knowledge. Retrieval-augmented generation (RAG) helps by integrating external knowledge, but imperfect retrieval can introduce distracting noise that misleads SLMs. We propose RoseRAG, a robust RAG framework for SLMs via Margin-aware Preference Optimization. RoseRAG employs multi-turn prompting for detailed reasoning, rejection sampling for high-quality explanations, and contrastive preference selection to refine responses by maximizing the likelihood gap between preferred and non-preferred outputs. By integrating these components into a margin-aware optimization process, RoseRAG robustly enhances the accuracy and reliability of SLMs for RAG applications. Extensive experiments on three open-domain question answering benchmarks indicate that our innovative RoseRAG surpasses state-of-the-art baselines significantly.
pdf
bib
abs
Instruction-Tuning LLMs for Event Extraction with Annotation Guidelines
Saurabh Srivastava
|
Sweta Pati
|
Ziyu Yao
In this work, we study the effect of annotation guidelines (textual descriptions of event types and arguments) when instruction-tuning large language models for event extraction. We conducted a series of experiments with both human-provided and machine-generated guidelines in both full- and low-data settings. Our results demonstrate the promise of annotation guidelines when there is a decent amount of training data and highlight their effectiveness in improving cross-schema generalization and low-frequency event-type performance.
pdf
bib
abs
mRAKL: Multilingual Retrieval-Augmented Knowledge Graph Construction for Low-Resourced Languages
Hellina Hailu Nigatu
|
Min Li
|
Maartje Ter Hoeve
|
Saloni Potdar
|
Sarah Chasins
Knowledge Graphs represent real-world entities and the relationships between them. Multilingual Knowledge Graph Construction (mKGC) refers to the task of automatically constructing or predicting missing entities and links for knowledge graphs in a multilingual setting. In this work, we reformulate the mKGC task as a Question Answering (QA) task and introduce mRAKL: a Retrieval-Augmented Generation (RAG) based system to perform mKGC. We achieve this by using the head entity and linking relation in a question, and having our model predict the tail entity as an answer. Our experiments focus primarily on two low-resourced languages: Tigrinya and Amharic. We experiment with using higher-resourced languages, Arabic and English, to utilize cross-lingual transfer for mKGC. With a BM25 retriever, we find that the RAG-based approach improves performance over a no-context setting. Further, our ablation studies show that with an idealized retrieval system, mRAKL improves accuracy by up to 4.92 and 8.79 percentage points for Tigrinya and Amharic, respectively.
pdf
bib
abs
Mechanistic Interpretability of Emotion Inference in Large Language Models
Ala N. Tak
|
Amin Banayeeanzade
|
Anahita Bolourani
|
Mina Kian
|
Robin Jia
|
Jonathan Gratch
Large language models (LLMs) show promising capabilities in predicting human emotions from text. However, the mechanisms through which these models process emotional stimuli remain largely unexplored. Our study addresses this gap by investigating how autoregressive LLMs infer emotions, showing that emotion representations are functionally localized to specific regions in the model. Our evaluation includes diverse model families and sizes, and is supported by robustness checks. We then show that the identified representations are psychologically plausible by drawing on cognitive appraisal theory—a well-established psychological framework positing that emotions emerge from evaluations (appraisals) of environmental stimuli. By causally intervening on construed appraisal concepts, we steer the generation and show that the outputs align with theoretical and intuitive expectations. This work highlights a novel way to causally intervene and control emotion inference, potentially benefiting safety and alignment in sensitive affective domains.
pdf
bib
abs
RL-Guider: Leveraging Historical Decisions and Feedback for Drug Editing with Large Language Models
Xufeng Liu
|
Yixuan Ding
|
Jingxiang Qu
|
Yichi Zhang
|
Wenhan Gao
|
Yi Liu
Recent success of large language models (LLMs) in diverse domains showcases their potential to revolutionize scientific fields, including drug editing. Traditional drug editing relies on iterative conversations with domain experts, refining the drug until the desired property is achieved. This interactive and iterative process mirrors the strengths of LLMs, making them well-suited for drug editing. In existing works, LLMs edit each molecule independently without leveraging knowledge from past edits. However, human experts develop intuition about effective modifications over time through historical experience; accumulating past knowledge is pivotal for human experts, and so it is for LLMs. In this work, we propose RL-Guider, a reinforcement-learning agent to provide suggestions to LLMs; it uses the rich information provided from evaluating editing results made by the LLM based on the recommendations to improve itself over time. RL-Guider is the first work that leverages both the comprehensive “world-level” knowledge of LLMs and the knowledge accumulated from historical feedback. As a result, RL-Guider mitigates several shortcomings of existing approaches and demonstrates superior performance. The code is available at https://github.com/xufliu/RL-Guider.
pdf
bib
abs
BriefMe: A Legal NLP Benchmark for Assisting with Legal Briefs
Jesse Woo
|
Fateme Hashemi Chaleshtori
|
Ana Marasovic
|
Kenneth Marino
A core part of legal work that has been underexplored in Legal NLP is the writing and editing of legal briefs. This requires not only a thorough understanding of the law of a jurisdiction, from judgments to statutes, but also the ability to make novel and creative arguments that try to expand the law in a new direction and are persuasive to judges. To capture and evaluate these legal skills in language models, we introduce BRIEFME, a new dataset focused on legal briefs. It contains three tasks for language models to assist legal professionals in writing briefs: argument summarization, argument completion, and case retrieval. In this work, we describe the creation of these tasks, analyze them, and show how current models perform. We see that today’s large language models (LLMs) are already quite good at the summarization and guided completion tasks, even beating human-generated headings. Yet, they perform poorly on other tasks in our benchmark: realistic argument completion and retrieving relevant legal cases. We hope this dataset encourages more development in Legal NLP in ways that will specifically aid people in performing legal work.
pdf
bib
abs
I see what you mean: Co-Speech Gestures for Reference Resolution in Multimodal Dialogue
Esam Ghaleb
|
Bulat Khaertdinov
|
Asli Ozyurek
|
Raquel Fernández
In face-to-face interaction, we use multiple modalities, including speech and gestures, to communicate information and resolve references to objects. However, how representational co-speech gestures refer to objects remains understudied from a computational perspective. In this work, we address this gap by introducing a multimodal reference resolution task centred on representational gestures, while simultaneously tackling the challenge of learning robust gesture embeddings. We propose a self-supervised pre-training approach to gesture representation learning that grounds body movements in spoken language. Our experiments show that the learned embeddings align with expert annotations and have significant predictive power. Moreover, reference resolution accuracy further improves when (1) using multimodal gesture representations, even when speech is unavailable at inference time, and (2) leveraging dialogue history. Overall, our findings highlight the complementary roles of gesture and speech in reference resolution, offering a step towards more naturalistic models of human-machine interaction.
pdf
bib
abs
World Knowledge Resolves Some Aspectual Ambiguity
Katarzyna Pruś
|
Mark Steedman
|
Adam Lopez
Annotating event descriptions with their aspectual features is often seen as a pre-requisite to temporal reasoning. However, a recent study by Pruś et al. (2024) has shown that non-experts’ annotations of the aspectual class of English verb phrases can disagree with both expert linguistic annotations and one another. They hypothesised that people use their world knowledge to tacitly conjure their own contexts, leading to disagreement between them. In this paper, we test that hypothesis by adding context to Pruś et al.’s examples and mirroring their experiment. Our results show that whilst their hypothesis explains some of the disagreement, some examples continue to yield divided responses even with the additional context. Finally, we show that outputs from GPT-4, despite to some degree capturing the aspectual class division, are not an accurate predictor of human answers.
pdf
bib
abs
ACCESS DENIED INC: The First Benchmark Environment for Sensitivity Awareness
Dren Fazlija
|
Arkadij Orlov
|
Sandipan Sikdar
Large language models (LLMs) are increasingly becoming valuable to corporate data management due to their ability to process text from various document formats and facilitate user interactions through natural language queries. However, LLMs must consider the sensitivity of information when communicating with employees, especially given access restrictions. Simple filtering based on user clearance levels can pose both performance and privacy challenges. To address this, we propose the concept of sensitivity awareness (SA), which enables LLMs to adhere to predefined access rights rules. In addition, we developed a benchmarking environment called ACCESS DENIED INC to evaluate SA. Our experimental findings reveal significant variations in model behavior, particularly in managing unauthorized data requests while effectively addressing legitimate queries. This work establishes a foundation for benchmarking sensitivity-aware language models and provides insights to enhance privacy-centric AI systems in corporate environments.
pdf
bib
abs
Spatial Coordinates as a Cell Language: A Multi-Sentence Framework for Imaging Mass Cytometry Analysis
Chi-Jane Chen
|
Yuhang Chen
|
Sukwon Yun
|
Natalie Stanley
|
Tianlong Chen
Imaging mass cytometry (IMC) enables high-dimensional spatial profiling by combining mass cytometry’s analytical power with spatial distributions of cell phenotypes. Recent studies leverage large language models (LLMs) to extract cell states by translating gene or protein expression into biological context. However, existing single-cell LLMs face two major challenges: (1) integration of spatial information: they struggle to generalize spatial coordinates and effectively encode spatial context as text, and (2) treating each cell independently: they overlook cell-cell interactions, limiting their ability to capture biological relationships. To address these limitations, we propose Spatial2Sentence, a novel framework that integrates both single-cell expression and spatial information into natural language using a multi-sentence approach. Given an expression matrix and spatial coordinates, Spatial2Sentence constructs expression similarity and distance matrices, pairing spatially adjacent and expressionally similar cells as positive pairs while using distant and dissimilar cells as negatives. These multi-sentence representations are processed by LLMs, enabling them to learn cellular interactions in both expression and spatial contexts. Equipped with multi-task learning, Spatial2Sentence outperforms existing single-cell LLMs on preprocessed IMC datasets for diabetes and brain tumors, improving cell-type classification by 5.98% and clinical status prediction by 4.18% on the diabetes dataset while enhancing interpretability. The source code can be found here: https://github.com/UNITES-Lab/Spatial2Sentence.
pdf
bib
abs
HumanEval Pro and MBPP Pro: Evaluating Large Language Models on Self-invoking Code Generation Task
Zhaojian Yu
|
Yilun Zhao
|
Arman Cohan
|
Xiao-Ping Zhang
In this paper, we present HumanEval Pro and MBPP Pro, a series of benchmarks to evaluate LLMs on the self-invoking code generation task. This task involves providing LLMs with a base problem alongside a related, more complex problem. The models must solve the base problem and leverage its solution to address the more complex one, thereby showcasing their capacity for progressive reasoning and problem-solving. This work features three key contributions. First, we propose a general recipe for generating more challenging versions of existing benchmarks. Second, from the analysis of experimental results over twenty large language models (LLMs) on our benchmarks, we have two important observations: (i) Most LLMs excel in traditional code generation benchmarks like HumanEval and MBPP, but their performance declines on self-invoking tasks. For example, o1-mini achieves 96.2% pass@1 on HumanEval but only 76.2% on HumanEval Pro. (ii) On the self-invoking code generation task, the instruction-tuned models demonstrate only marginal improvements compared to the base models. Third, we disclose the types of failure modes that exist in our evaluation results. All these results underscore the need for further advancements in this area and provide a new perspective for future research.
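To make the notion of a self-invoking task concrete, the following is a minimal sketch of a base problem paired with a harder problem whose solution calls the base solution. The problem pair and both function names are hypothetical illustrations and are not taken from HumanEval Pro or MBPP Pro.

```python
from typing import List


def is_palindrome(text: str) -> bool:
    """Base problem (hypothetical): does the string read the same both ways?"""
    return text == text[::-1]


def longest_palindromic_word(words: List[str]) -> str:
    """Self-invoking problem (hypothetical): reuse the base solution.

    Returns the longest palindrome in `words`, or "" if there is none.
    On ties, the earliest word wins because max() keeps the first maximum.
    """
    palindromes = [word for word in words if is_palindrome(word)]
    return max(palindromes, key=len, default="")


if __name__ == "__main__":
    print(longest_palindromic_word(["level", "abc", "noon"]))
```

A model is credited on such a pair only if the harder function passes its tests, which in turn requires a correct base function.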
pdf
bib
abs
TCSinger 2: Customizable Multilingual Zero-shot Singing Voice Synthesis
Yu Zhang
|
Wenxiang Guo
|
Changhao Pan
|
Dongyu Yao
|
Zhiyuan Zhu
|
Ziyue Jiang
|
Yuhan Wang
|
Tao Jin
|
Zhou Zhao
Customizable multilingual zero-shot singing voice synthesis (SVS) has various potential applications in music composition and short video dubbing. However, existing SVS models overly depend on phoneme and note boundary annotations, limiting their robustness in zero-shot scenarios and producing poor transitions between phonemes and notes. Moreover, they also lack effective multi-level style control via diverse prompts. To overcome these challenges, we introduce TCSinger 2, a multi-task multilingual zero-shot SVS model with style transfer and style control based on various prompts. TCSinger 2 mainly includes three key modules: 1) the Blurred Boundary Content (BBC) Encoder, which predicts duration, extends content embedding, and applies masking to the boundaries to enable smooth transitions; 2) the Custom Audio Encoder, which uses contrastive learning to extract aligned representations from singing, speech, and textual prompts; and 3) the Flow-based Custom Transformer, which leverages Cus-MOE with F0 supervision to enhance both the synthesis quality and style modeling of the generated singing voice. Experimental results show that TCSinger 2 outperforms baseline models in both subjective and objective metrics across multiple related tasks.
pdf
bib
abs
Compute Optimal Scaling of Skills: Knowledge vs Reasoning
Nicholas Roberts
|
Niladri S. Chatterji
|
Sharan Narang
|
Mike Lewis
|
Dieuwke Hupkes
Scaling laws are a critical component of the LLM development pipeline, most famously as a way to forecast training decisions such as ‘compute-optimally’ trading-off parameter count and dataset size, alongside a more recent growing list of other crucial decisions. In this work, we ask whether compute-optimal scaling behaviour can be skill-dependent. In particular, we examine knowledge and reasoning-based skills such as knowledge-based QA and code generation, and we answer this question in the affirmative: scaling laws are skill-dependent. Next, to understand whether skill-dependent scaling is an artefact of the pretraining datamix, we conduct an extensive ablation of different datamixes and find that, also when correcting for datamix differences, knowledge and code exhibit fundamental differences in scaling behaviour. We conclude with an analysis of how our findings relate to standard compute-optimal scaling using a validation set, and find that a misspecified validation set can impact compute-optimal parameter count by nearly 50%, depending on its skill composition.
pdf
bib
abs
PECAN: LLM-Guided Dynamic Progress Control with Attention-Guided Hierarchical Weighted Graph for Long-Document QA
Xinyu Wang
|
Yanzheng Xiang
|
Lin Gui
|
Yulan He
Long-document QA presents challenges with large-scale text and long-distance dependencies. Recent advances in Large Language Models (LLMs) enable entire documents to be processed in a single pass. However, their computational cost is significantly high. Retrieval-Augmented Generation (RAG) methods split text into smaller chunks, but they often yield inferior results and may lose global context. Recent approaches that integrate LLMs into RAG via iterative summarization either underutilize LLM capabilities or still incur high computational costs. In this paper, we combine the high accuracy of LLMs with the efficiency of RAG and propose LLM-Guided Dynamic Progress Control with Attention-Based Hierarchical Weighted Graph (PECAN). Our method introduces two key improvements: (1) LLM-Guided Dynamic Progress Control: We leverage LLMs to dynamically control the retrieval process, adjusting the amount of retrieved information based on different queries to achieve a better balance of effectiveness and efficiency. (2) Attention-Guided Retrieval: We propose a novel retrieval method that constructs a hierarchical graph where edges are derived by LLM attention weights. Experimental results demonstrate that PECAN achieves LLM-level performance while maintaining computational complexity comparable to that of RAG methods on two single-document and two multi-document QA datasets.
pdf
bib
abs
Lifelong Model Editing with Graph-Based External Memory
Yash Kumar Atri
|
Ahmed Alaa
|
Thomas Hartvigsen
Large language models (LLMs) have revolutionized natural language processing, yet their practical utility is often limited by persistent issues of hallucinations and outdated parametric knowledge. Although post-training model editing offers a pathway for dynamic updates, existing methods frequently suffer from overfitting and catastrophic forgetting. To tackle these challenges, we propose a novel framework that leverages hyperbolic geometry and graph neural networks for precise and stable model edits. We introduce HYPE (HYperbolic Parameter Editing), which comprises three key components: (i) Hyperbolic Graph Construction, which uses Poincaré embeddings to represent knowledge triples in hyperbolic space, preserving hierarchical relationships and preventing unintended side effects by ensuring that edits to parent concepts do not inadvertently affect child concepts; (ii) Möbius-Transformed Updates, which apply hyperbolic addition to propagate edits while maintaining structural consistency within the hyperbolic manifold, unlike conventional Euclidean updates that distort relational distances; and (iii) Dual Stabilization, which combines gradient masking and periodic GNN parameter resetting to prevent catastrophic forgetting by focusing updates on critical parameters and preserving long-term knowledge. Experiments on CounterFact, CounterFact+, and MQuAKE with GPT-J and GPT2-XL demonstrate that HYPE significantly enhances edit stability, factual accuracy, and multi-hop reasoning.
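The hyperbolic addition mentioned in component (ii) can be illustrated with the standard Möbius addition on the Poincaré ball. The sketch below assumes curvature -1 and plain Python lists as vectors; the function name `mobius_add` is hypothetical, and how HYPE actually applies this operation to model parameters is not described in the abstract.

```python
from typing import List


def dot(a: List[float], b: List[float]) -> float:
    """Euclidean inner product."""
    return sum(x * y for x, y in zip(a, b))


def mobius_add(x: List[float], y: List[float]) -> List[float]:
    """Moebius addition on the unit Poincare ball (curvature -1 assumed).

    x (+) y = ((1 + 2<x,y> + |y|^2) x + (1 - |x|^2) y)
              / (1 + 2<x,y> + |x|^2 |y|^2)
    Both inputs must have Euclidean norm strictly below 1.
    """
    xy = dot(x, y)
    x2 = dot(x, x)
    y2 = dot(y, y)
    denominator = 1.0 + 2.0 * xy + x2 * y2
    coeff_x = 1.0 + 2.0 * xy + y2
    coeff_y = 1.0 - x2
    return [(coeff_x * xi + coeff_y * yi) / denominator for xi, yi in zip(x, y)]


if __name__ == "__main__":
    point = [0.3, 0.4]
    update = [0.5, 0.0]
    moved = mobius_add(point, update)
    # Unlike Euclidean addition, the result stays inside the unit ball.
    print(moved, dot(moved, moved) < 1.0)
```

The zero vector acts as the identity and adding the negated point returns the origin, which is why this operation serves as the hyperbolic counterpart of vector addition.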
pdf
bib
abs
Multi-Sense Embeddings for Language Models and Knowledge Distillation
Qitong Wang
|
Mohammed J Zaki
|
Georgios Kollias
|
Vasileios Kalantzis
Transformer-based large language models (LLMs) rely on contextual embeddings which generate different (continuous) representations for the same token depending on its surrounding context. Nonetheless, words and tokens typically have a limited number of senses (or meanings). We propose multi-sense embeddings as a drop-in replacement for each token in order to capture the range of their uses in a language. To construct a sense embedding dictionary, we apply a clustering algorithm to embeddings generated by an LLM and consider the cluster centers as representative sense embeddings. In addition, we propose a novel knowledge distillation method that leverages the sense dictionary to learn a smaller student model that mimics the senses from the much larger base LLM model, offering significant space and inference time savings, while maintaining competitive performance. Via thorough experiments on various benchmarks, we showcase the effectiveness of our sense embeddings and knowledge distillation approach.
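The construction of a sense dictionary described above can be sketched as clustering a token's contextual embeddings and keeping the cluster centers. The snippet below assumes plain k-means with a deterministic initialisation and toy two-dimensional vectors; the abstract does not name the clustering algorithm, and all function names here are hypothetical.

```python
from typing import List

Vector = List[float]


def squared_distance(a: Vector, b: Vector) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(vectors: List[Vector]) -> Vector:
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def kmeans(points: List[Vector], k: int, iterations: int = 20) -> List[Vector]:
    """Plain k-means; the first k points seed the centers (toy choice)."""
    centers = [list(p) for p in points[:k]]
    for _ in range(iterations):
        clusters: List[List[Vector]] = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: squared_distance(p, centers[j]))
            clusters[nearest].append(p)
        # Keep the old center if a cluster happens to be empty.
        centers = [mean(c) if c else centers[j] for j, c in enumerate(clusters)]
    return centers


def nearest_sense(embedding: Vector, sense_dictionary: List[Vector]) -> int:
    """Index of the sense embedding closest to a new contextual embedding."""
    return min(
        range(len(sense_dictionary)),
        key=lambda j: squared_distance(embedding, sense_dictionary[j]),
    )


if __name__ == "__main__":
    # Toy contextual embeddings of one token used in two different senses.
    contextual = [[0.0, 0.1], [5.0, 5.1], [0.2, 0.0], [5.2, 4.9]]
    senses = kmeans(contextual, k=2)
    print(senses, nearest_sense([4.8, 5.0], senses))
```

Replacing each contextual embedding by the index of its nearest center is what turns a continuous representation into a small, discrete set of senses.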
pdf
bib
abs
CodeScientist: End-to-End Semi-Automated Scientific Discovery with Code-based Experimentation
Peter Jansen
|
Oyvind Tafjord
|
Marissa Radensky
|
Pao Siangliulue
|
Tom Hope
|
Bhavana Dalvi Mishra
|
Bodhisattwa Prasad Majumder
|
Daniel S Weld
|
Peter Clark
Despite the surge of interest in autonomous scientific discovery (ASD) of software artifacts (e.g., improved ML algorithms), current ASD systems face two key limitations: (1) they largely explore variants of existing codebases or similarly constrained design spaces, and (2) they produce large volumes of research artifacts (such as automatically generated papers and code) that are typically evaluated using conference-style paper review with limited evaluation of code. In this work we introduce CodeScientist, a novel ASD system that frames ideation and experiment construction as a form of genetic search jointly over combinations of research articles and codeblocks defining common actions in a domain (like prompting a language model). We use this paradigm to conduct hundreds of automated experiments on machine-generated ideas broadly in the domain of agents and virtual environments, with the system returning 19 discoveries, 6 of which were judged as being both at least minimally sound and incrementally novel after a multi-faceted evaluation beyond that typically conducted in prior work, including external (conference-style) review, code review, and replication attempts. Moreover, the discoveries span new tasks, agents, metrics, and data, suggesting a qualitative shift from benchmark optimization to broader discoveries.
pdf
bib
abs
Beyond Factual Accuracy: Evaluating Coverage of Diverse Factual Information in Long-form Text Generation
Chris Samarinas
|
Alexander Krubner
|
Alireza Salemi
|
Youngwoo Kim
|
Hamed Zamani
This paper presents ICAT, an evaluation framework for measuring coverage of diverse factual information in long-form text generation. ICAT breaks down a long output text into a list of atomic claims and not only verifies each claim through retrieval from a (reliable) knowledge source, but also computes the alignment between the atomic factual claims and various aspects expected to be presented in the output. We study three implementations of the ICAT framework, each with a different assumption on the availability of aspects and alignment method. By adopting data from the diversification task in the TREC Web Track and the ClueWeb corpus, we evaluate the ICAT framework. We demonstrate strong correlation with human judgments and provide comprehensive evaluation across multiple state-of-the-art LLMs. Our framework further offers interpretable and fine-grained analysis of diversity and coverage. Its modular design allows for easy adaptation to different domains and datasets, making it a valuable tool for evaluating the qualitative aspects of long-form responses produced by LLMs.
pdf
bib
abs
Continual Quantization-Aware Pre-Training: When to transition from 16-bit to 1.58-bit pre-training for BitNet language models?
Jacob Nielsen
|
Peter Schneider-Kamp
|
Lukas Galke
Large language models (LLMs) require immense resources for training and inference. Quantization, a technique that reduces the precision of model parameters, offers a promising solution for improving LLM efficiency and sustainability. While post-training quantization methods typically achieve 4-8 bits per parameter, recent research suggests that training LLMs with 1.58 bits per weight parameter from scratch can maintain model accuracy while greatly reducing memory requirements and energy consumption at inference time. Here, we investigate a training strategy for quantization-aware pre-training, where the models are first trained with 16-bit precision and then transition into 1.58-bit quantization-aware training. Our results on 11 downstream tasks show that this 16-to-1.58-bit training strategy is preferable to full 1.58-bit training and leaves models closer to those which have undergone 16-bit training. We further investigate the effects of retaining the optimizer state at the transition point and gradually phasing in quantization strength, finding that both techniques alleviate the magnitude of loss spikes, but also that these effects can be compensated through further training.
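The figure of 1.58 bits corresponds to ternary weights, since log2(3) is roughly 1.585. The sketch below assumes an absmean scaling scheme of the kind commonly described for BitNet-style models, plus a linear blend as one hypothetical way to phase in quantization strength. Neither the function names nor the blend schedule are taken from the abstract.

```python
import math
from typing import List, Tuple


def ternary_quantize(weights: List[float], eps: float = 1e-8) -> Tuple[List[int], float]:
    """Map weights to {-1, 0, 1} using the mean absolute value as scale.

    Assumed scheme: divide by the mean |w|, round, then clip to [-1, 1].
    """
    scale = sum(abs(w) for w in weights) / len(weights) + eps
    quantized = [max(-1, min(1, round(w / scale))) for w in weights]
    return quantized, scale


def phase_in(weights: List[float], strength: float) -> List[float]:
    """Hypothetical phase-in: blend full-precision and dequantized weights.

    strength = 0.0 keeps the original weights, 1.0 uses ternary weights only.
    """
    quantized, scale = ternary_quantize(weights)
    return [
        (1.0 - strength) * w + strength * (q * scale)
        for w, q in zip(weights, quantized)
    ]


if __name__ == "__main__":
    w = [0.9, -0.05, 0.4, -1.2]
    print(ternary_quantize(w)[0])   # ternary codes for each weight
    print(math.log2(3))             # bits needed per ternary weight
    print(phase_in(w, 0.5))         # halfway between 16-bit and ternary
```

In an actual training loop the rounding step is not differentiable, so quantization-aware training typically passes gradients through it unchanged; that part is omitted here.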
pdf
bib
abs
When Detection Fails: The Power of Fine-Tuned Models to Generate Human-Like Social Media Text
Hillary Dawkins
|
Kathleen C. Fraser
|
Svetlana Kiritchenko
Detecting AI-generated text is a difficult problem to begin with; detecting AI-generated text on social media is made even more difficult due to the short text length and informal, idiosyncratic language of the internet. It is nonetheless important to tackle this problem, as social media represents a significant attack vector in online influence campaigns, which may be bolstered through the use of mass-produced AI-generated posts supporting (or opposing) particular policies, decisions, or events. We approach this problem with the mindset and resources of a reasonably sophisticated threat actor, and create a dataset of 505,159 AI-generated social media posts from a combination of open-source, closed-source, and fine-tuned LLMs, covering 11 different controversial topics. We show that while the posts can be detected under typical research assumptions about knowledge of and access to the generating models, under the more realistic assumption that an attacker will not release their fine-tuned model to the public, detectability drops dramatically. This result is confirmed with a human study. Ablation experiments highlight the vulnerability of various detection algorithms to fine-tuned LLMs. This result has implications across all detection domains, since fine-tuning is a generally applicable and realistic LLM use case.
pdf
bib
abs
Not quite Sherlock Holmes: Language model predictions do not reliably differentiate impossible from improbable events
James A. Michaelov
|
Reeka Estacio
|
Zhien Zhang
|
Ben Bergen
Can language models reliably predict that possible events are more likely than impossible ones? By teasing apart possibility, typicality, and contextual relatedness, we show that despite the results of previous work, language models’ ability to do this is far from robust. In fact, under certain conditions, all models tested, including Llama 3, Gemma 2, and Mistral NeMo, perform at worse-than-chance level, assigning higher probabilities to impossible sentences such as ‘the car was given a parking ticket by the brake’ than to merely unlikely sentences such as ‘the car was given a parking ticket by the explorer’.
pdf
bib
abs
The Rotary Position Embedding May Cause Dimension Inefficiency in Attention Heads for Long-Distance Retrieval
Ting-Rui Chiang
|
Dani Yogatama
The Rotary Position Embedding (RoPE) is widely used in the attention heads of many large language models (LLMs). It rotates dimensions in the query and the key vectors by different angles according to their positions in the input sequence. For long context modeling, the range of positions may vary a lot, and thus RoPE rotates some dimensions by a great range of angles. We hypothesize that the wide range of rotation angles may prevent LLMs from utilizing those dimensions. To validate this hypothesis, we present a controlled experiment showing that applying RoPE causes low utility of certain dimensions. Our analyses on three LLMs also indicate that these dimensions do not help LLMs do long-context question answering.
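The rotation described above can be sketched in a few lines. The snippet assumes the common formulation in which consecutive pairs of dimensions are rotated with frequencies base^(-i/d) and base = 10000; the interleaved pairing and the function names are illustrative choices, not details given in the abstract.

```python
import math
from typing import List


def rope_rotate(vec: List[float], position: int, base: float = 10000.0) -> List[float]:
    """Rotate each consecutive pair of dimensions by a position-dependent angle.

    The pair starting at even index i uses angle position * base ** (-i / d),
    so early pairs rotate quickly and late pairs rotate slowly.
    """
    d = len(vec)
    if d % 2 != 0:
        raise ValueError("RoPE needs an even number of dimensions")
    rotated: List[float] = []
    for i in range(0, d, 2):
        theta = position * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        rotated.extend([x * c - y * s, x * s + y * c])
    return rotated


def dot(a: List[float], b: List[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


if __name__ == "__main__":
    q = [1.0, 2.0, 3.0, 4.0]
    k = [0.5, -1.0, 2.0, 0.0]
    # The attention score depends only on the offset between positions (2 here).
    print(dot(rope_rotate(q, 5), rope_rotate(k, 3)))
    print(dot(rope_rotate(q, 12), rope_rotate(k, 10)))
```

Under this formulation the first pair turns by one radian per position, so over a long context it sweeps through many full rotations, while the last pairs barely move. That contrast in angle range is the property the hypothesis above is about.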
pdf
bib
abs
IDEA: Enhancing the Rule Learning Ability of Large Language Model Agent through Induction, Deduction, and Abduction
Kaiyu He
|
Mian Zhang
|
Shuo Yan
|
Peilin Wu
|
Zhiyu Chen
While large language models (LLMs) have been thoroughly evaluated for deductive and inductive reasoning, their proficiency in holistic rule learning in interactive environments remains less explored. We introduce RULEARN, a novel benchmark to assess the rule-learning abilities of LLM agents in interactive settings. In RULEARN, agents strategically interact with simulated environments to gather observations, discern patterns, and solve complex problems. To enhance the rule-learning capabilities for LLM agents, we propose IDEA, a novel reasoning framework that integrates the process of **I**nduction, **De**duction, and **A**bduction. The IDEA agent generates initial hypotheses from limited observations through abduction, devises plans to validate these hypotheses or leverages them to solve problems via deduction, and refines previous hypotheses through induction, dynamically establishing and applying rules that mimic human rule-learning behaviors. Our evaluation of the IDEA framework, which involves five representative LLMs, demonstrates significant improvements over the baseline. Furthermore, our study with human participants reveals notable discrepancies in rule-learning behaviors between humans and LLMs. We believe our benchmark will serve as a valuable and challenging resource, and IDEA will provide crucial insights for the development of LLM agents capable of human-like rule learning in real-world scenarios. Our code and data have been released at: https://github.com/KaiyuHe998/RULEARN_IDEA.
pdf
bib
abs
EnigmaToM: Improve LLMs’ Theory-of-Mind Reasoning Capabilities with Neural Knowledge Base of Entity States
Hainiu Xu
|
Siya Qi
|
Jiazheng Li
|
Yuxiang Zhou
|
Jinhua Du
|
Caroline Catmur
|
Yulan He
Theory-of-Mind (ToM), the ability to infer others’ perceptions and mental states, is fundamental to human interaction but remains challenging for Large Language Models (LLMs). While existing ToM reasoning methods show promise with reasoning via perceptual perspective-taking, they often rely excessively on off-the-shelf LLMs, reducing their efficiency and limiting their applicability to high-order ToM reasoning. To address these issues, we present EnigmaToM, a novel neuro-symbolic framework that enhances ToM reasoning by integrating a Neural Knowledge Base of entity states (Enigma) for (1) a psychology-inspired iterative masking mechanism that facilitates accurate perspective-taking and (2) knowledge injection that elicits key entity information. Enigma generates structured knowledge of entity states to build spatial scene graphs for belief tracking across various ToM orders and enrich events with fine-grained entity state details. Experimental results on ToMi, HiToM, and FANToM benchmarks show that EnigmaToM significantly improves ToM reasoning across LLMs of varying sizes, particularly excelling in high-order reasoning scenarios.
pdf
bib
abs
ReasonerRank: Redefining Language Model Evaluation with Ground-Truth-Free Ranking Frameworks
Jiamu Zhang
|
Jiayi Yuan
|
Andrew Wen
|
Hoang Anh Duy Le
|
Yu-Neng Chuang
|
Soo-Hyun Choi
|
Rui Chen
|
Xia Hu
Large Language Models (LLMs) are increasingly adopted across real-world applications, yet traditional evaluations rely on expensive, domain-specific ground-truth labels that are often unavailable or infeasible. We introduce a ground-truth-free evaluation framework focused on reasoning consistency and instruction following, shifting the emphasis from correctness—which is elusive without labels—to transparent, coherent, evidence-based reasoning. Each model response must include a direct answer, a structured multi-step explanation, and supporting evidence, all assessed via semantic similarity and output adherence checks. We further propose TopK-ReRank, which refines rankings by constructing a consensus answer from the most reliable models, reducing ambiguity across diverse reasoning styles. Experiments show that our framework outperforms existing label-free methods, including majority voting, triplet ranking, and peer-review approaches, providing a more interpretable and efficient alternative for evaluating LLMs in the absence of ground-truth labels.
pdf
bib
abs
HyGenar: An LLM-Driven Hybrid Genetic Algorithm for Few-Shot Grammar Generation
Weizhi Tang
|
Yixuan Li
|
Chris Sypherd
|
Elizabeth Polgreen
|
Vaishak Belle
Grammar plays a critical role in natural language processing and text/code generation by enabling the definition of syntax, the creation of parsers, and guiding structured outputs. Although large language models (LLMs) demonstrate impressive capabilities across domains, their ability to infer and generate grammars has not yet been thoroughly explored. In this paper, we aim to study and improve the ability of LLMs for few-shot grammar generation, where grammars are inferred from sets of a small number of positive and negative examples and generated in Backus-Naur Form. To explore this, we introduce a novel dataset comprising 540 structured grammar generation challenges, devise 6 metrics, and evaluate 8 LLMs against it. Our findings reveal that existing LLMs perform sub-optimally in grammar generation. To address this, we propose an LLM-driven hybrid genetic algorithm, namely HyGenar, to optimize grammar generation. HyGenar achieves substantial improvements in both the syntactic and semantic correctness of generated grammars across LLMs.
pdf
bib
abs
Can Large Language Models Understand Argument Schemes?
Elfia Bezou-Vrakatseli
|
Oana Cocarascu
|
Sanjay Modgil
Argument schemes represent stereotypical patterns of reasoning that occur in everyday arguments. However, despite their usefulness, argument scheme classification, that is classifying natural language arguments according to the schemes they are instances of, is an under-explored task in NLP. In this paper we present a systematic evaluation of large language models (LLMs) for classifying argument schemes based on Walton’s taxonomy. We experiment with seven LLMs in zero-shot, few-shot, and chain-of-thought prompting, and explore two strategies to enhance task instructions: employing formal definitions and LLM-generated descriptions. Our analysis on both manually annotated and automatically generated arguments, including enthymemes, indicates that while larger models exhibit satisfactory performance in identifying argument schemes, challenges remain for smaller models. Our work offers the first comprehensive assessment of LLMs in identifying argument schemes, and provides insights for advancing reasoning capabilities in computational argumentation.
pdf
bib
abs
MMInA: Benchmarking Multihop Multimodal Internet Agents
Shulin Tian
|
Ziniu Zhang
|
Liangyu Chen
|
Ziwei Liu
Autonomous embodied agents live on an Internet of multimedia websites. Can they hop around multimodal websites to complete complex user tasks? Existing benchmarks fail to assess them in a realistic, evolving environment for their embodiment across websites. To answer this question, we present MMInA, a multihop and multimodal benchmark to evaluate the embodied agents for compositional Internet tasks, with several appealing properties: ***1) Evolving real-world multimodal websites.*** Our benchmark uniquely operates on evolving real-world websites, ensuring a high degree of realism and applicability to natural user tasks. Our data includes 1,050 human-written tasks covering various domains such as shopping and travel, with each task requiring the agent to extract multimodal information from web pages as observations autonomously. ***2) Multihop web browsing.*** Our dataset features naturally compositional tasks that require information from or actions on multiple websites to solve, to assess long-range reasoning capabilities on web tasks. ***3) Holistic evaluation.*** We propose a novel protocol for evaluating an agent’s progress in completing multihop tasks. We experiment with both standalone (multimodal) language models and heuristic-based web agents. Extensive experiments demonstrate that while long-chain multihop web tasks are easy for humans, they remain challenging for state-of-the-art web agents. We identify that agents are more likely to fail on the early hops when solving tasks of more hops, which results in lower task success rates. To address this issue, we propose a simple memory augmentation approach replaying past action trajectories to reflect. Our method significantly improves the performance of both the single-hop and multihop web browsing abilities.
pdf
bib
abs
ThinkGuard: Deliberative Slow Thinking Leads to Cautious Guardrails
Xiaofei Wen
|
Wenxuan Zhou
|
Wenjie Jacky Mo
|
Muhao Chen
Ensuring the safety of large language models (LLMs) is critical as they are deployed in real-world applications. Existing guardrails rely on rule-based filtering or single-pass classification, limiting their ability to handle nuanced safety violations. To address this, we propose ThinkGuard, a critique-augmented guardrail model that distills knowledge from high-capacity LLMs by generating structured critiques alongside safety labels. Fine-tuned on critique-augmented data, the captured deliberative thinking ability drastically enhances the guardrail’s cautiousness and interpretability. Evaluated on multiple safety benchmarks, ThinkGuard achieves the highest average F1 and AUPRC, outperforming all baselines. Compared to LLaMA Guard 3, ThinkGuard improves accuracy by 16.1% and macro F1 by 27.0%. Moreover, it surpasses label-only fine-tuned models, confirming that structured critiques enhance both classification precision and nuanced safety reasoning while maintaining computational efficiency.
pdf
bib
abs
Neutralizing Bias in LLM Reasoning using Entailment Graphs
Liang Cheng
|
Tianyi Li
|
Zhaowei Wang
|
Tianyang Liu
|
Mark Steedman
LLMs are often claimed to be capable of Natural Language Inference (NLI), which is widely regarded as a cornerstone of more complex forms of reasoning. However, recent works show that LLMs still suffer from hallucinations in NLI due to attestation bias, where LLMs overly rely on propositional memory to build shortcuts. To solve the issue, we design an unsupervised framework to construct counterfactual reasoning data and fine-tune LLMs to reduce attestation bias. To measure bias reduction, we build bias-adversarial variants of NLI datasets with randomly replaced predicates in premises while keeping hypotheses unchanged. Extensive evaluations show that our framework can significantly reduce hallucinations from attestation bias. Then, we further evaluate LLMs fine-tuned with our framework on original NLI datasets and their bias-neutralized versions, where original entities are replaced with randomly sampled ones. Extensive results show that our framework consistently improves inferential performance on both original and bias-neutralized NLI datasets.
pdf
bib
abs
Dynamic Steering With Episodic Memory For Large Language Models
Van Dai Do
|
Quan Hung Tran
|
Svetha Venkatesh
|
Hung Le
Large Language Models (LLMs) exhibit emergent in-context learning (ICL) capabilities, allowing them to adapt to unseen tasks based on example demonstrations. Traditional ICL embeds examples within the prompt, while activation steering uses a vector derived from examples to guide the latent states of LLMs toward desired behaviors. However, traditional ICL is difficult to control quantitatively and consumes valuable context space. Existing activation steering methods apply a single sentence-level steering vector uniformly across all tokens, ignoring LLMs’ token-wise, auto-regressive nature. This coarse control can lead to inconsistencies and suboptimal adjustments during generation. To address this problem, we introduce Dynamic Steering with Episodic Memory (DSEM), a novel training-free framework that aligns LLMs to given demonstrations by steering at the token level conditioned on the input query. DSEM employs a key-value memory to store associations between generated tokens and steering vectors. During inference, it uses a nearest-neighbor mechanism to dynamically compute steering vectors for each token chunk, enabling more precise and adaptive guidance. Our method surpasses strong baselines across diverse alignment tasks, including safety, style transfer, and role-playing, demonstrating improved alignment as demonstration size scales.
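The key-value memory with nearest-neighbor lookup can be pictured with a small generic sketch. Everything below is an assumption made for illustration: the class name `EpisodicMemory`, cosine similarity as the distance, and a similarity-weighted average of the top-k stored vectors. The abstract does not specify how DSEM builds keys, chunks tokens, or combines retrieved vectors.

```python
import math
from typing import List

Vector = List[float]


def cosine(a: Vector, b: Vector) -> float:
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


class EpisodicMemory:
    """Hypothetical key-value store: key = hidden state, value = steering vector."""

    def __init__(self) -> None:
        self.keys: List[Vector] = []
        self.values: List[Vector] = []

    def write(self, key: Vector, steering_vector: Vector) -> None:
        self.keys.append(key)
        self.values.append(steering_vector)

    def read(self, query: Vector, k: int = 2) -> Vector:
        """Similarity-weighted average of the k nearest stored steering vectors."""
        if not self.keys:
            return []
        scored = sorted(
            ((cosine(query, key), value) for key, value in zip(self.keys, self.values)),
            key=lambda pair: pair[0],
            reverse=True,
        )[:k]
        weights = [max(score, 0.0) for score, _ in scored]
        total = sum(weights)
        dim = len(self.values[0])
        if total == 0.0:
            return [0.0] * dim
        return [
            sum(w * value[i] for w, (_, value) in zip(weights, scored)) / total
            for i in range(dim)
        ]


if __name__ == "__main__":
    memory = EpisodicMemory()
    memory.write([1.0, 0.0], [1.0, 0.0])
    memory.write([0.0, 1.0], [0.0, 1.0])
    memory.write([-1.0, 0.0], [-1.0, 0.0])
    print(memory.read([1.0, 1.0], k=2))
```

Because the lookup is repeated as generation proceeds, the retrieved vector can change from one chunk to the next, in contrast to a single fixed steering vector.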
pdf
bib
abs
Eeyore: Realistic Depression Simulation via Expert-in-the-Loop Supervised and Preference Optimization
Siyang Liu
|
Bianca Brie
|
Wenda Li
|
Laura Biester
|
Andrew Lee
|
James Pennebaker
|
Rada Mihalcea
Large Language Models (LLMs) have been previously explored for mental healthcare training and therapy client simulation, but they still fall short in authentically capturing diverse client traits and psychological conditions. We introduce Eeyore, an 8B model optimized for realistic depression simulation through a structured alignment framework, incorporating expert input at every stage. First, we systematically curate real-world depression-related conversations, extracting depressive traits to guide data filtering and psychological profile construction, and use this dataset to instruction-tune Eeyore for profile adherence. Next, to further enhance realism, Eeyore undergoes iterative preference optimization, first leveraging model-generated preferences and then calibrating with a small set of expert-annotated preferences. Throughout the entire pipeline, we actively collaborate with domain experts, developing interactive interfaces to validate trait extraction and iteratively refine structured psychological profiles for clinically meaningful role-play customization. Despite its smaller model size, the Eeyore depression simulation outperforms GPT-4o with SOTA prompting strategies, both in linguistic authenticity and profile adherence.
pdf
bib
abs
Lost in Translation: Benchmarking Commercial Machine Translation Models for Dyslexic-Style Text
Gregory Price
|
Shaomei Wu
Dyslexia can affect writing, leading to unique patterns such as letter and homophone swapping. As a result, text produced by people with dyslexia often differs from the text typically used to train natural language processing (NLP) models, raising concerns about their effectiveness for dyslexic users. This paper examines the fairness of four commercial machine translation (MT) systems towards dyslexic text through a systematic audit using both synthetically generated dyslexic text and real writing from individuals with dyslexia. By programmatically introducing various dyslexic-style errors into the WMT dataset, we present insights on how dyslexic biases manifest in MT systems as the text becomes more dyslexic, especially with real-word errors. Our results shed light on the NLP biases affecting people with dyslexia – a population that often relies on NLP tools as assistive technologies, highlighting the need for more diverse data and user representation in the development of foundational NLP models.
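As a rough illustration of programmatically introducing dyslexic-style errors, the sketch below applies the two error types named in the abstract (letter swapping and homophone swapping) at a configurable rate. The homophone table, the `rate` and `seed` parameters, and the function names are hypothetical; the paper's actual injection procedure is not specified here.

```python
import random

# Tiny illustrative homophone table (an assumption, not the paper's resource).
HOMOPHONES = {"their": "there", "there": "their", "to": "too", "hear": "here"}


def swap_letters(word, rng):
    """Swap two adjacent interior letters, keeping the first and last letters fixed."""
    if len(word) < 4:
        return word  # too short to have two interior letters
    i = rng.randrange(1, len(word) - 2)
    chars = list(word)
    chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)


def inject_errors(text, rate, seed=0):
    """Corrupt roughly `rate` of the whitespace-separated words in `text`."""
    rng = random.Random(seed)
    out = []
    for word in text.split():
        if rng.random() >= rate:
            out.append(word)                      # leave this word untouched
        elif word.lower() in HOMOPHONES:
            out.append(HOMOPHONES[word.lower()])  # real-word (homophone) error
        else:
            out.append(swap_letters(word, rng))   # letter-swap error
    return " ".join(out)
```

Raising `rate` from 0.0 to 1.0 makes the text progressively "more dyslexic" in the sense used above. The sketch ignores punctuation and capitalization, which a real audit would need to handle.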
pdf
bib
abs
Divide-Verify-Refine: Can LLMs Self-align with Complex Instructions?
Xianren Zhang
|
Xianfeng Tang
|
Hui Liu
|
Zongyu Wu
|
Qi He
|
Dongwon Lee
|
Suhang Wang
Recent studies show LLMs struggle with complex instructions involving multiple constraints (e.g., length, format, sentiment). Existing research enhances open-source LLMs using closed-source guidance (e.g., GPT-4), but this heavily relies on generated data quality. An alternative is leveraging LLMs’ self-correction to refine responses for better constraint adherence. However, this is limited by the feedback quality, as we found LLMs cannot generate reliable feedback or detect errors. Moreover, the self-correction effectiveness relies on few-shot examples illustrating response modifications. As constraints in complex instructions are diverse, manually crafting such examples for each constraint type can be labor-intensive and sub-optimal. To address these two challenges, we propose the Divide-Verify-Refine (DVR) framework with three steps: (1) Divide complex instructions into single constraints and prepare appropriate tools; (2) Verify responses using tools that provide rigorous checks and textual guidance (e.g., Python scripts for format checks or pre-trained classifiers for content analysis); (3) Refine responses: to maximize refinement effectiveness, we propose dynamic few-shot prompting, where a refinement repository collects successful refinements, and these examples are selectively retrieved for future refinements. Recognizing the lack of complexity in existing datasets, we create a new dataset of complex instructions. DVR doubles Llama3.1-8B’s constraint adherence and triples Mistral-7B’s performance.
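The Verify and Refine steps can be sketched as a small loop. This is an illustrative outline under stated assumptions, not the DVR implementation: the Divide step is assumed to have already produced `constraints` as (type, tool) pairs, `refine` stands in for an LLM call, and `check_max_words`, `RefinementRepository`, and `verify_and_refine` are hypothetical names.

```python
from collections import defaultdict


def check_max_words(response, limit):
    """Example verification tool: returns (passed, textual feedback)."""
    n = len(response.split())
    if n <= limit:
        return True, ""
    return False, f"The response has {n} words; shorten it to at most {limit}."


class RefinementRepository:
    """Stores successful refinements per constraint type for later few-shot reuse."""

    def __init__(self):
        self._examples = defaultdict(list)

    def add(self, constraint_type, before, feedback, after):
        self._examples[constraint_type].append((before, feedback, after))

    def retrieve(self, constraint_type, k=2):
        # Assumed retrieval policy: the k most recent examples of the same type.
        return self._examples[constraint_type][-k:]


def verify_and_refine(response, constraints, refine, repo, max_rounds=3):
    """Check each constraint with its tool; refine on the first failure, then re-check."""
    for _ in range(max_rounds):
        failed = None
        for ctype, tool in constraints:
            passed, feedback = tool(response)
            if not passed:
                failed = (ctype, tool, feedback)
                break
        if failed is None:
            break  # every constraint is satisfied
        ctype, tool, feedback = failed
        revised = refine(response, feedback, repo.retrieve(ctype))
        if tool(revised)[0]:
            # Only refinements that the tool confirms are stored as examples.
            repo.add(ctype, response, feedback, revised)
        response = revised
    return response
```

Storing only tool-verified refinements is the design choice that lets the repository grow without manual curation; how examples are selected for retrieval is left as the simplest possible policy here.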
pdf
bib
abs
LlamaPIE: Proactive In-Ear Conversation Assistants
Tuochao Chen
|
Nicholas Scott Batchelder
|
Alisa Liu
|
Noah A. Smith
|
Shyamnath Gollakota
We introduce LlamaPIE, the first real-time proactive assistant designed to enhance human conversations through discreet, concise guidance delivered via hearable devices. Unlike traditional language models that require explicit user invocation, this assistant operates in the background, anticipating user needs without interrupting conversations. We address several challenges, including determining when to respond, crafting concise responses that enhance conversations, leveraging knowledge of the user for context-aware assistance, and real-time, on-device processing. To achieve this, we construct a semi-synthetic dialogue dataset and propose a two-model pipeline: a small model that decides when to respond and a larger model that generates the response. We evaluate our approach on real-world datasets, demonstrating its effectiveness in providing helpful, unobtrusive assistance. User studies with our assistant, implemented on Apple Silicon M2 hardware, show a strong preference for the proactive assistant over both a baseline with no assistance and a reactive AI assistant, highlighting the potential of LlamaPIE to enhance live conversations.
pdf
bib
abs
Task-Oriented Automatic Fact-Checking with Frame-Semantics
Jacob Devasier
|
Akshith Reddy Putta
|
Rishabh Mediratta
|
Chengkai Li
We propose a novel paradigm for automatic fact-checking that leverages frame semantics to enhance the structured understanding of claims and guide the process of fact-checking them. To support this, we introduce a pilot dataset of real-world claims extracted from PolitiFact, specifically annotated for large-scale structured data. This dataset underpins two case studies: the first investigates voting-related claims using the Vote semantic frame, while the second explores various semantic frames based on data sources from the Organisation for Economic Co-operation and Development (OECD). Our findings demonstrate the effectiveness of frame semantics in improving evidence retrieval and explainability for fact-checking. Finally, we conducted a survey of frames evoked in fact-checked claims, identifying high-impact frames to guide future work in this direction.
pdf
bib
abs
Craw4LLM: Efficient Web Crawling for LLM Pretraining
Shi Yu
|
Zhiyuan Liu
|
Chenyan Xiong
Web crawl is a main source of large language models’ (LLMs) pretraining data, but the majority of crawled web pages are discarded in pretraining due to low data quality. This paper presents Craw4LLM, an efficient web crawling method that explores the web graph based on the preference of LLM pretraining. Specifically, it leverages the influence of a webpage in LLM pretraining as the priority score of the web crawler’s scheduler, replacing the standard graph-connectivity-based priority. Our experiments on a web graph containing 900 million webpages from a commercial search engine’s index demonstrate the efficiency of Craw4LLM in obtaining high-quality pretraining data. With just 21% of URLs crawled, LLMs pretrained on Craw4LLM data reach the same downstream performance as previous crawls, significantly reducing the crawling waste and alleviating the burdens on websites. Our code is publicly available at https://github.com/cxcscmu/Craw4LLM.
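The scheduling idea, ordering the crawl frontier by a pretraining-oriented score instead of graph connectivity, can be sketched with a priority queue. This is a toy illustration, not the released code: the `score` callable is a stand-in for whatever pretraining-influence scorer is used, and the in-memory `outlinks` dictionary stands in for fetching and parsing pages.

```python
import heapq


def crawl(seed_urls, outlinks, score, budget):
    """Best-first crawl: always expand the highest-scoring discovered URL next.

    outlinks: dict mapping a URL to the URLs it links to (toy web graph).
    score:    callable returning a priority for a URL (higher is crawled sooner).
    budget:   maximum number of pages to crawl.
    """
    visited = set(seed_urls)
    # heapq is a min-heap, so scores are negated to pop the largest first.
    frontier = [(-score(url), url) for url in seed_urls]
    heapq.heapify(frontier)
    crawled = []
    while frontier and len(crawled) < budget:
        _, url = heapq.heappop(frontier)
        crawled.append(url)
        for nxt in outlinks.get(url, ()):
            if nxt not in visited:
                visited.add(nxt)
                heapq.heappush(frontier, (-score(nxt), nxt))
    return crawled
```

Swapping `score` for an in-degree count would recover a connectivity-based crawler in the same loop, which makes the two priority schemes easy to compare under an identical budget.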
pdf
bib
abs
Be Cautious When Merging Unfamiliar LLMs: A Phishing Model Capable of Stealing Privacy
Guo Zhenyuan
|
Yi Shi
|
Wenlong Meng
|
Chen Gong
|
Chengkun Wei
|
Wenzhi Chen
Model merging is a widespread technology in large language models (LLMs) that integrates multiple task-specific LLMs into a unified one, enabling the merged model to inherit the specialized capabilities of these LLMs. Most task-specific LLMs are sourced from open-source communities and have not undergone rigorous auditing, potentially imposing risks in model merging. This paper highlights an overlooked privacy risk: *an unsafe model could compromise the privacy of other LLMs involved in the model merging*. Specifically, we propose *PhiMM*, a privacy attack approach that trains a phishing model capable of stealing privacy using a crafted privacy phishing instruction dataset. Furthermore, we introduce a novel model cloaking method that mimics a specialized capability to conceal attack intent, luring users into merging the phishing model. Once victims merge the phishing model, the attacker can extract personally identifiable information (PII) or infer membership information (MI) by querying the merged model with the phishing instruction. Experimental results show that merging a phishing model increases the risk of privacy breaches. Compared to the results before merging, PII leakage increased by 3.9% and MI leakage increased by 17.4% on average. We release the code of *PhiMM* through an anonymous link.
pdf
bib
abs
Understand User Opinions of Large Language Models via LLM-Powered In-the-Moment User Experience Interviews
Mengqiao Liu
|
Tevin Wang
|
Cassandra A. Cohen
|
Sarah Li
|
Chenyan Xiong
Which large language model (LLM) is better? Every evaluation tells a story, but what do users really think about current LLMs? This paper presents CLUE, an LLM-powered interviewer that conducts in-the-moment user experience interviews, right after users interact with LLMs, and automatically gathers insights about user opinions from massive interview logs. We conduct a study with thousands of users to understand user opinions on mainstream LLMs, recruiting users to first chat with a target LLM and then be interviewed by CLUE. Our experiments demonstrate that CLUE captures interesting user opinions, e.g., the bipolar views on the displayed reasoning process of DeepSeek-R1 and demands for information freshness and multi-modality. Our code and data are at https://github.com/cxcscmu/LLM-Interviewer.
pdf
bib
abs
HiCOT: Improving Neural Topic Models via Optimal Transport and Contrastive Learning
Hoang Tran Vuong
|
Tue Le
|
Tu Vu
|
Tung Nguyen
|
Linh Ngo Van
|
Sang Dinh
|
Thien Huu Nguyen
Recent advances in neural topic models (NTMs) have improved topic quality but still face challenges: weak document-topic alignment, high inference costs due to large pretrained language models (PLMs), and limited modeling of hierarchical topic structures. To address these issues, we introduce HiCOT (Hierarchical Clustering and Contrastive Learning with Optimal Transport for Neural Topic Modeling), a novel framework that enhances topic coherence and efficiency. HiCOT integrates Optimal Transport to refine document-topic relationships using compact PLM-based embeddings, capturing the semantic structure of the documents. Additionally, it employs hierarchical clustering combined with contrastive learning to disentangle topic-word and topic-topic relationships, ensuring clearer structure and better coherence. Experimental results on multiple benchmark datasets demonstrate HiCOT’s superior effectiveness over existing NTMs in topic coherence, topic performance, representation quality, and computational efficiency.
pdf
bib
abs
FLAG-TRADER: Fusion LLM-Agent with Gradient-based Reinforcement Learning for Financial Trading
Guojun Xiong
|
Zhiyang Deng
|
Keyi Wang
|
Yupeng Cao
|
Haohang Li
|
Yangyang Yu
|
Xueqing Peng
|
Mingquan Lin
|
Kaleb E Smith
|
Xiao-Yang Liu
|
Jimin Huang
|
Sophia Ananiadou
|
Qianqian Xie
Large language models (LLMs) fine-tuned on multimodal financial data have demonstrated impressive reasoning capabilities in various financial tasks. However, they often struggle with multi-step, goal-oriented scenarios in interactive financial markets, such as trading, where complex agentic approaches are required to improve decision-making. To address this, we propose FLAG-Trader, a unified architecture integrating linguistic processing (via LLMs) with gradient-driven reinforcement learning (RL) policy optimization, in which a partially fine-tuned LLM acts as the policy network, leveraging pre-trained knowledge while adapting to the financial domain through parameter-efficient fine-tuning. Through policy gradient optimization driven by trading rewards, our framework not only enhances LLM performance in trading but also improves results on other financial-domain tasks. We present extensive empirical evidence to validate these enhancements.
pdf
bib
abs
The Silent Saboteur: Imperceptible Adversarial Attacks against Black-Box Retrieval-Augmented Generation Systems
Hongru Song
|
Yu-An Liu
|
Ruqing Zhang
|
Jiafeng Guo
|
Jianming Lv
|
Maarten de Rijke
|
Xueqi Cheng
We explore adversarial attacks against retrieval-augmented generation (RAG) systems to identify their vulnerabilities. We focus on generating human-imperceptible adversarial examples and introduce a novel imperceptible retrieve-to-generate attack against RAG. This task aims to find imperceptible perturbations that retrieve a target document, originally excluded from the initial top-k candidate set, in order to influence the final answer generation. To address this task, we propose ReGENT, a reinforcement learning-based framework that tracks interactions between the attacker and the target RAG and continuously refines attack strategies based on relevance-generation-naturalness rewards. Experiments on newly constructed factual and non-factual question-answering benchmarks demonstrate that ReGENT significantly outperforms existing attack methods in misleading RAG systems with small imperceptible text perturbations.
pdf
bib
abs
CROSSAGENTIE: Cross-Type and Cross-Task Multi-Agent LLM Collaboration for Zero-Shot Information Extraction
Meng Lu
|
Yuzhang Xie
|
Zhenyu Bi
|
Shuxiang Cao
|
Xuan Wang
Large language models (LLMs) excel in generating unstructured text. However, they struggle with producing structured output while maintaining accuracy in zero-shot information extraction (IE), such as named entity recognition (NER) and relation extraction (RE). To address these challenges, we propose CROSSAGENTIE, a multi-agent framework that enhances zero-shot IE through multi-agent LLM collaboration. CROSSAGENTIE refines LLM predictions iteratively through two mechanisms: intra-group cross-type debate, which resolves entity-label conflicts through context-based evidence and confidence aggregation, and inter-group cross-task debate, where NER and RE mutually refine outputs via bidirectional feedback. Furthermore, we introduce template fine-tuning, distilling high-confidence multi-agent outputs into a single model, significantly reducing inference cost while preserving accuracy. Experiments across five NER and five RE datasets show that CROSSAGENTIE outperforms state-of-the-art zero-shot baselines by a large margin. CROSSAGENTIE thus addresses LLMs’ limitations in structured prediction with an effective and efficient approach for zero-shot information extraction.
pdf
bib
abs
Decoupling Memories, Muting Neurons: Towards Practical Machine Unlearning for Large Language Models
Lishuai Hou
|
Zixiong Wang
|
Gaoyang Liu
|
Chen Wang
|
Wei Liu
|
Kai Peng
Machine Unlearning (MU) has emerged as a promising solution for removing the influence of data that an owner wishes to unlearn from Large Language Models (LLMs). However, existing MU methods, which require tuning the entire model parameters on the unlearned data with random labels or perturbed gradients, significantly degrade model utility, especially given the difficulty of accessing the original training data. This presents a key challenge: how can we achieve MU using only the unlearned data while preserving model utility? In this paper, we propose NeuMuter, a simple but effective MU method that eliminates the influence of unlearned data from LLMs by modulating the outputs of merely 1% of the neurons in the feed-forward network (FFN) modules within the Transformer blocks, minimizing disruption to the model’s performance. We design a trainable masking scheme that decouples the memorization of different training data within the neurons of LLMs, allowing us to precisely identify and modify neurons associated with the unlearned data. Through comprehensive evaluations on two benchmarks across four different LLMs, we demonstrate that modifying the outputs of a small fraction of the total neurons can effectively achieve MU while preserving the model’s utility across downstream tasks.
pdf
bib
abs
Assimilation and Accommodation: Task-Adaptive Hierarchical Abstraction for Solving Web Tasks
Xinyu Pang
|
Ruixin Hong
|
Hongming Zhang
|
Changshui Zhang
Web tasks, which involve processing data from online resources, challenge agents to generalize beyond fixed knowledge to unseen task contexts. Learning from experience, the ability to derive reusable patterns from past tasks, is crucial for improving generalization. However, existing methods focus on summarizing workflows, i.e., common sub-routines, which may introduce excessive low-level details that distract models. Additionally, the absence of task-specific objectives can lead to inconsistencies between workflows and future task queries, hindering reasoning performance. This paper seeks to mitigate these issues by proposing A2, a framework that derives task-adaptive hierarchical abstraction to enhance web task reasoning. Our approach first extracts general-purpose semantic abstraction from past task-solution pairs. Combined with the next task query, this abstraction forms a task-adaptive episodic abstraction that guides subsequent reasoning. Experiments show that A2 achieves superior performance with competitive cost-efficiency, improving success rates by 0.7% on Mind2web and 4.6% on Webarena.
pdf
bib
abs
SafeLawBench: Towards Safe Alignment of Large Language Models
Chuxue Cao
|
Han Zhu
|
Jiaming Ji
|
Qichao Sun
|
Zhenghao Zhu
|
Wu Yinyu
|
Josef Dai
|
Yaodong Yang
|
Sirui Han
|
Yike Guo
With the growing prevalence of large language models (LLMs), the safety of LLMs has raised significant concerns. However, there is still a lack of definitive standards for evaluating their safety due to the subjective nature of current safety benchmarks. To address this gap, we conducted the first exploration of LLMs’ safety evaluation from a legal perspective by proposing the SafeLawBench benchmark. SafeLawBench categorizes safety risks into three levels based on legal standards, providing a systematic and comprehensive framework for evaluation. It comprises 24,860 multi-choice questions and 1,106 open-domain question-answering (QA) tasks. Our evaluation included 2 closed-source LLMs and 18 open-source LLMs using zero-shot and few-shot prompting, highlighting the safety features of each model. We also evaluated the LLMs’ safety-related reasoning stability and refusal behavior. Additionally, we found that a majority voting mechanism can enhance model performance. Notably, even leading SOTA models like Claude-3.5-Sonnet and GPT-4o have not exceeded 80.5% accuracy in multi-choice tasks on SafeLawBench, while the average accuracy of 20 LLMs remains at 68.8%. We urge the community to prioritize research on the safety of LLMs.
pdf
bib
abs
3DM: Distill, Dynamic Drop, and Merge for Debiasing Multi-modal Large Language Models
Zhaoxi Zhang
|
Sanwoo Lee
|
Zhixiang Wang
|
Yunfang Wu
The rapid advancement of Multi-modal Language Models (MLLMs) has significantly enhanced performance in multimodal tasks, yet these models often exhibit inherent biases that compromise their reliability and fairness. Traditional debiasing methods face a trade-off between the need for extensive labeled datasets and high computational costs. Model merging, which efficiently combines multiple models into a single one, offers a promising alternative but its usage is limited to MLLMs with the same architecture. We propose 3DM, a novel framework integrating Distill, Dynamic Drop, and Merge to address these challenges. 3DM employs knowledge distillation to harmonize models with divergent architectures and introduces a dynamic dropping strategy that assigns parameter-specific drop rates based on their contributions to bias and overall performance. This approach preserves critical weights while mitigating biases, as validated on the MMSD2.0 sarcasm detection dataset. Our key contributions include architecture-agnostic merging, dynamic dropping, and the introduction of the Bias Ratio (BR) metric for systematic bias assessment. Empirical results demonstrate that 3DM outperforms existing methods in balancing debiasing and enhancing the overall performance, offering a practical and scalable solution for deploying fair and efficient MLLMs in real-world applications.
pdf
bib
abs
CausalAbstain: Enhancing Multilingual LLMs with Causal Reasoning for Trustworthy Abstention
Yuxi Sun
|
Aoqi Zuo
|
Wei Gao
|
Jing Ma
Large Language Models (LLMs) often exhibit knowledge disparities across languages. Encouraging LLMs to abstain when faced with knowledge gaps is a promising strategy to reduce hallucinations in multilingual settings. Current abstention strategies for multilingual scenarios primarily rely on generating feedback in various languages using LLMs and performing self-reflection. However, these methods can be adversely impacted by inaccuracies and biases in the generated feedback. To address this, from a causal perspective, we introduce CausalAbstain, a method that helps LLMs determine whether to utilize multiple generated feedback responses and how to identify the most useful ones. Extensive experiments demonstrate that CausalAbstain effectively selects helpful feedback and enhances abstention decisions with interpretability in both native language (Causal-native) and multilingual (Causal-multi) settings, outperforming strong baselines on two benchmark datasets covering encyclopedic and commonsense knowledge QA tasks.
pdf
bib
abs
CapArena: Benchmarking and Analyzing Detailed Image Captioning in the LLM Era
Kanzhi Cheng
|
Wenpo Song
|
Jiaxin Fan
|
Zheng Ma
|
Qiushi Sun
|
Fangzhi Xu
|
Chenyang Yan
|
Nuo Chen
|
Jianbing Zhang
|
Jiajun Chen
Image captioning has been a longstanding challenge in vision-language research. With the rise of LLMs, modern Vision-Language Models (VLMs) generate detailed and comprehensive image descriptions. However, benchmarking the quality of such captions remains unresolved. This paper addresses two key questions: (1) How well do VLMs actually perform on image captioning, particularly compared to humans? We built CapArena, a platform with over 6000 pairwise caption battles and high-quality human preference votes. Our Arena-style evaluation marks a milestone, showing that leading models like GPT-4o achieve or even surpass human performance, while most open-source models lag behind. (2) Can automated metrics reliably assess caption quality? Using human annotations from CapArena, we evaluate traditional and recent captioning metrics, as well as VLM-as-a-Judge. Our analysis reveals that while some metrics (e.g., METEOR) show high caption-level agreement with humans, their systematic biases lead to inconsistencies in model ranking. In contrast, VLM-as-a-Judge demonstrates robust discernment at both the caption and model levels. Building on these insights, we release CapArena-Auto, an accurate and efficient automated benchmark for detailed captioning, achieving 93.4% correlation with human rankings at just $4 per test. All data and evaluation resources have been open-sourced.
pdf
bib
abs
LLM-Empowered Class Imbalanced Graph Prompt Learning for Online Drug Trafficking Detection
Tianyi Ma
|
Yiyue Qian
|
Zehong Wang
|
Zheyuan Zhang
|
Chuxu Zhang
|
Yanfang Ye
As the market for illicit drugs remains extremely profitable, major online platforms have become direct-to-consumer intermediaries for illicit drug trafficking participants. These online activities raise significant social concerns that require immediate action. Existing approaches to combat this challenge are generally impractical due to the scarcity of labeled samples and the imbalance of classes in real-world applications. To this end, we propose a novel Large Language Model-empowered Heterogeneous Graph Prompt Learning framework for illicit Drug Trafficking detection, called LLM-HetGDT, which leverages LLMs to help heterogeneous graph neural networks (HGNNs) effectively identify minority classes, i.e., drug trafficking participants, in class-imbalanced scenarios. Specifically, we first pre-train an HGNN over a contrastive pretext task to capture the inherent node and structure information over an unlabeled drug trafficking heterogeneous graph (HG). Afterward, to alleviate the class-imbalance issue, we leverage LLMs to augment the HG by generating high-quality synthetic user nodes in the minority classes. Then, we fine-tune the soft prompts on the augmented HG to capture the important information in the minority classes for the downstream drug trafficking detection task. To comprehensively study online illicit drug trafficking activities, we collect a new HG dataset over Twitter, called Twitter-HetDrug. Extensive experiments on this dataset demonstrate the effectiveness, efficiency, and applicability of our proposed method by comparing it with state-of-the-art baseline methods. Our source code is available at https://github.com/GraphResearcher/LLM-HetGDT.
pdf
bib
abs
CoLA: Collaborative Low-Rank Adaptation
Yiyun Zhou
|
Chang Yao
|
Jingyuan Chen
The scaling law of Large Language Models (LLMs) reveals a power-law relationship, showing diminishing return on performance as model scale increases. While training LLMs from scratch is resource-intensive, fine-tuning a pre-trained model for specific tasks has become a practical alternative. Full fine-tuning (FFT) achieves strong performance; however, it is computationally expensive and inefficient. Parameter-efficient fine-tuning (PEFT) methods, like LoRA, have been proposed to address these challenges by freezing the pre-trained model and adding lightweight task-specific modules. LoRA, in particular, has proven effective, but its application to multi-task scenarios is limited by interference between tasks. Recent approaches, such as Mixture-of-Experts (MOE) and asymmetric LoRA, have aimed to mitigate these issues but still struggle with sample scarcity and noise interference due to their fixed structure. In response, we propose CoLA, a more flexible LoRA architecture with an efficient initialization scheme, which introduces three collaborative strategies to enhance performance by better utilizing the quantitative relationships between matrices A and B. Our experiments demonstrate the effectiveness and robustness of CoLA, outperforming existing PEFT methods, especially in low-sample scenarios. Our data and code are fully publicly available: https://github.com/zyy-2001/CoLA.
pdf
bib
abs
GLiM: Integrating Graph Transformer and LLM for Document-Level Biomedical Relation Extraction with Incomplete Labeling
Hao Fang
|
Yuejie Zhang
|
Rui Feng
|
Yingwen Wang
|
Qing Wang
|
Wen He
|
Xiaobo Zhang
|
Tao Zhang
|
Shang Gao
Document-level relation extraction (DocRE) identifies relations between entities across an entire document. However, as the number and complexity of entities and entity-pair relations grow, the problem space expands quadratically, causing incomplete annotations and frequent false negatives, especially in biomedical datasets due to high construction costs. This leads to low recall in real-world scenarios. To address this, we propose GLiM, a novel framework that reduces the problem space using a graph-enhanced Transformer-based model and leverages large language models (LLMs) for reasoning. GLiM employs a cascaded approach: first, a graph-enhanced Transformer processes entity-pair relations with finer granularity by dynamically adjusting the graph size based on the number of entities; then, LLM inference handles challenging cases. Experiments show that GLiM boosts average recall and F1 scores by +6.34 and +4.41, respectively, outperforming state-of-the-art models on biomedical benchmarks. These results demonstrate the effectiveness of combining graph-enhanced Transformers with LLM inference for biomedical DocRE. Code will be released at https://github.com/HaoFang10/GLiM.
pdf
bib
abs
AnalyticKWS: Towards Exemplar-Free Analytic Class Incremental Learning for Small-footprint Keyword Spotting
Yang Xiao
|
Peng Tianyi
|
Rohan Kumar Das
|
Yuchen Hu
|
Huiping Zhuang
Keyword spotting (KWS) offers a vital mechanism to identify spoken commands in voice-enabled systems, where user demands often shift, requiring models to learn new keywords continually over time. However, a major problem is catastrophic forgetting, where models lose their ability to recognize earlier keywords. Although several continual learning methods have proven their usefulness for reducing forgetting, most existing approaches depend on storing and revisiting old data to combat catastrophic forgetting. Though effective, these methods face two practical challenges: 1) privacy risks from keeping user data and 2) large memory and time consumption that limit deployment on small devices. To address these issues, we propose an exemplar-free Analytic Continual Learning (AnalyticKWS) method that updates model parameters without revisiting earlier data. Inspired by efficient learning principles, AnalyticKWS computes a closed-form analytical solution for model updates and requires only a single epoch of adaptation for incoming keywords. AnalyticKWS demands fewer computational resources by avoiding gradient-based updates and does not store old data. By eliminating the need for back-propagation during incremental learning, the model remains lightweight and efficient. As a result, AnalyticKWS meets the challenges mentioned earlier and suits resource-limited settings well. Extensive experiments on various datasets and settings show that AnalyticKWS consistently outperforms existing continual learning methods.
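To see how a closed-form update can avoid revisiting earlier data, consider a generic ridge-regression classifier trained on frozen features. This is a textbook illustration under assumed notation, not necessarily the exact AnalyticKWS formulation: $X_t$ are the features of phase $t$, $Y_t$ their one-hot labels (with zero columns for keywords absent from that phase), and $\lambda$ a regularization constant.

```latex
% Running statistics, updated once per phase t (no stored samples):
A_t = A_{t-1} + X_t^{\top} X_t, \qquad
B_t = \begin{bmatrix} B_{t-1} & 0 \end{bmatrix} + X_t^{\top} Y_t, \qquad
A_0 = 0 .

% Closed-form classifier after phase t:
W_t = \left( A_t + \lambda I \right)^{-1} B_t .

% Because A_t = \sum_{i \le t} X_i^{\top} X_i and B_t = \sum_{i \le t} X_i^{\top} Y_i,
% W_t equals the ridge solution fitted jointly on all phases 1..t:
W_t = \left( X_{1:t}^{\top} X_{1:t} + \lambda I \right)^{-1} X_{1:t}^{\top} Y_{1:t} .
```

Only the $d \times d$ matrix $A_t$ and the $d \times c$ matrix $B_t$ are kept between phases, so in this illustration neither raw audio nor gradients are needed to absorb new keywords.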
pdf
bib
abs
Sleepless Nights, Sugary Days: Creating Synthetic Users with Health Conditions for Realistic Coaching Agent Interactions
Taedong Yun
|
Eric Yang
|
Mustafa Safdari
|
Jong Ha Lee
|
Vaishnavi Vinod Kumar
|
S. Sara Mahdavi
|
Jonathan Amar
|
Derek Peyton
|
Reut Aharony
|
Andreas Michaelides PhD
|
Logan Douglas Schneider
|
Isaac Galatzer-Levy
|
Yugang Jia
|
John Canny
|
Arthur Gretton
|
Maja Mataric
We present an end-to-end framework for generating synthetic users for evaluating interactive agents designed to encourage positive behavior changes, such as in health and lifestyle coaching. The synthetic users are grounded in health and lifestyle conditions, specifically sleep and diabetes management in this study, to ensure realistic interactions with the health coaching agent. Synthetic users are created in two stages: first, structured data are generated grounded in real-world health and lifestyle factors in addition to basic demographics and behavioral attributes; second, full profiles of the synthetic users are developed conditioned on the structured data. Interactions between synthetic users and the coaching agent are simulated using generative agent-based models such as Concordia, or directly by prompting a language model. Using two independently-developed agents for sleep and diabetes coaching as case studies, the validity of this framework is demonstrated by analyzing the coaching agent’s understanding of the synthetic users’ needs and challenges. Finally, through multiple blinded evaluations of user-coach interactions by human experts, we demonstrate that our synthetic users with health and behavioral attributes more accurately portray real human users with the same attributes, compared to generic synthetic users not grounded in such attributes. The proposed framework lays the foundation for efficient development of conversational agents through extensive, realistic, and grounded simulated interactions.
pdf
bib
abs
Imagine to Hear: Auditory Knowledge Generation can be an Effective Assistant for Language Models
Suho Yoo
|
Hyunjong Ok
|
Jaeho Lee
Language models pretrained on text-only corpora often struggle with tasks that require auditory commonsense knowledge. Previous work addresses this problem by augmenting the language model to retrieve knowledge from external audio databases. This approach has several limitations, such as the potential lack of relevant audio in databases and the high costs associated with constructing the databases. To address these issues, we propose Imagine to Hear, a novel approach that dynamically generates auditory knowledge using generative models. Our framework detects multiple audio-related textual spans from the given prompt and generates corresponding auditory knowledge. We develop several mechanisms to efficiently process multiple pieces of auditory knowledge, including a CLAP-based rejection sampler and a language-audio fusion module. Our experiments show that our method achieves state-of-the-art performance on AuditoryBench without relying on external databases, highlighting the effectiveness of our generation-based approach.
pdf
bib
abs
SafeEraser: Enhancing Safety in Multimodal Large Language Models through Multimodal Machine Unlearning
Junkai Chen
|
Zhijie Deng
|
Kening Zheng
|
Yibo Yan
|
Shuliang Liu
|
PeiJun Wu
|
Peijie Jiang
|
Jia Liu
|
Xuming Hu
As Multimodal Large Language Models (MLLMs) develop, their potential security issues have become increasingly prominent. **Machine Unlearning (MU)**, as an effective strategy for forgetting specific knowledge in training data, has been widely used in privacy protection. However, *MU for safety in MLLMs has yet to be fully explored*. To address this issue, we propose SafeEraser, a safety unlearning benchmark for MLLMs, consisting of 3,000 images and 28.8K VQA pairs. We comprehensively evaluate unlearning methods from two perspectives: **_forget quality_** and **_model utility_**. Our findings show that existing MU methods struggle to maintain model performance while implementing the forget operation and often suffer from **_over-forgetting_**. Hence, we introduce **Prompt Decouple (PD) Loss** to alleviate over-forgetting by decoupling prompts during the unlearning process. To quantitatively measure the over-forgetting mitigated by PD Loss, we propose a new metric called **Safe Answer Refusal Rate (SARR)**. Experimental results demonstrate that combining PD Loss with existing unlearning methods can effectively prevent over-forgetting and achieve a decrease of 79.5% in the SARR metric of LLaVA-7B and LLaVA-13B, while maintaining forget quality and model utility. Our code and dataset will be released upon acceptance. **Warning: This paper contains examples of harmful language and images, and reader discretion is recommended.**
pdf
bib
abs
Prediction-Augmented Generation for Automatic Diagnosis Tasks
Chan-Yang Ju
|
Dong-Ho Lee
Most large language models (LLMs) adopt an autoregressive architecture, predicting the next word token based on the preceding context. While this approach is robust for language generation tasks such as writing and summarization, it has limitations for high-level reasoning tasks, such as prediction and decision-making. To overcome these limitations, we introduce a new method called Prediction-Augmented Generation (PAG). PAG can improve the generation quality and predictive accuracy of large language models in inference-driven tasks by integrating task-specific predictive models as external tools, enabling more structured and precise reasoning. Moreover, our method does not simply copy the inferences of a predictive model, but improves the inference results with knowledge from the large language model to create better predictions. We comprehensively evaluate our proposed method on diverse datasets for automatic diagnosis tasks requiring extensive domain knowledge and advanced reasoning.
pdf
bib
abs
FedLEKE: Federated Locate-then-Edit Knowledge Editing for Multi-Client Collaboration
Zongkai Zhao
|
Guozeng Xu
|
Xiuhua Li
|
Kaiwen Wei
|
Jiang Zhong
Locate-then-Edit Knowledge Editing (LEKE) is a key technique for updating large language models (LLMs) without full retraining. However, existing methods assume a single-user setting and become inefficient in real-world multi-client scenarios, where decentralized organizations (e.g., hospitals, financial institutions) independently update overlapping knowledge, leading to redundant mediator knowledge vector (MKV) computations and privacy concerns. To address these challenges, we introduce Federated Locate-then-Edit Knowledge Editing (FedLEKE), a novel task that enables multiple clients to collaboratively perform LEKE while preserving privacy and reducing computational overhead. To achieve this, we propose FedEdit, a two-stage framework that optimizes MKV selection and reuse. In the first stage, clients locally apply LEKE and upload the computed MKVs. In the second stage, rather than relying solely on server-based MKV sharing, FedLEKE allows clients to retrieve relevant MKVs based on cosine similarity, enabling knowledge re-edit and minimizing redundant computations. Experimental results on two benchmark datasets demonstrate that FedEdit retains over 96% of the performance of non-federated LEKE while significantly outperforming a FedAvg-based baseline by approximately twofold. Besides, we find that MEMIT performs more consistently than PMET in the FedLEKE task with our FedEdit framework. Our code is available at https://github.com/zongkaiz/FedLEKE.
pdf
bib
abs
DiSCo: Device-Server Collaborative LLM-based Text Streaming Services
Ting Sun
|
Penghan Wang
|
Fan Lai
The rapid rise of large language models (LLMs) in text streaming services has introduced significant cost and Quality of Experience (QoE) challenges in serving millions of daily requests, especially in meeting Time-To-First-Token (TTFT) and Time-Between-Token (TBT) requirements for real-time interactions. Our real-world measurements show that both server-based and on-device deployments struggle to meet diverse QoE demands: server deployments face high costs and last-hop issues (e.g., Internet latency and dynamics), while on-device LLM inference is constrained by resources. We introduce DiSCo, a device-server cooperative scheduler designed to optimize users’ QoE by adaptively routing requests and migrating response generation between endpoints while maintaining cost constraints. DiSCo employs cost-aware scheduling, leveraging the predictable speed of on-device LLM inference with the flexible capacity of server-based inference to dispatch requests on the fly, while introducing a token-level migration mechanism to ensure consistent token delivery during migration. Evaluations on real-world workloads, including commercial services like OpenAI GPT and DeepSeek and open-source deployments such as LLaMA3, show that DiSCo can improve users’ QoE by reducing tail TTFT (11-52%) and mean TTFT (6-78%) across different model-device configurations, while dramatically reducing serving costs by up to 84% through its migration mechanism and maintaining comparable QoE levels.
pdf
bib
abs
Customizing In-context Learning for Dynamic Interest Adaption in LLM-based Recommendation
Keqin Bao
|
Ming Yan
|
Yang Zhang
|
Jizhi Zhang
|
Wenjie Wang
|
Fuli Feng
|
Xiangnan He
Frequently updating Large Language Model (LLM)-based recommender systems to adapt to dynamic user interests, as is done for traditional ones, is impractical due to high training costs, even with acceleration methods. This work explores the possibility of adapting the model to dynamic user interests without any model-level updates via In-context Learning (ICL), which enables adaptation through few-shot examples within input prompts. While using recent user interactions as ICL demonstrations offers a potential solution for dynamic interest adaptation, existing LLM-based recommenders face critical limitations: recommendation-specific tuning often diminishes the model’s in-context learning ability, and the original LLM’s ICL lacks task-specific optimization for recommendations. To bridge this gap, we introduce RecICL, a framework that establishes recommendation-oriented in-context learning by structuring recent user interactions and current inputs into ICL formats. RecICL achieves dual objectives: (1) preserving fundamental ICL capabilities during recommendation adaptation and (2) dynamically capturing user preference evolution through the most recent interactions. Extensive experiments across multiple benchmarks demonstrate RecICL’s superior performance, achieving better results without model updates. Our implementation is publicly available at https://anonymous.4open.science/r/RecICL-8003.
pdf
bib
abs
Robust Data Watermarking in Language Models by Injecting Fictitious Knowledge
Xinyue Cui
|
Johnny Wei
|
Swabha Swayamdipta
|
Robin Jia
Data watermarking in language models injects traceable signals, such as specific token sequences or stylistic patterns, into copyrighted text, allowing copyright holders to track and verify training data ownership. Previous data watermarking techniques primarily focus on effective memorization after pretraining, while overlooking challenges that arise in other stages of the LLM pipeline, such as the risk of watermark filtering during data preprocessing, potential forgetting through post-training, or verification difficulties due to API-only access. We propose a novel data watermarking approach that injects coherent and plausible yet fictitious knowledge into training data using generated passages describing a fictitious entity and its associated attributes. Our watermarks are designed to be memorized by the LLM by integrating seamlessly into its training data, making them harder to detect lexically during preprocessing. We demonstrate that our watermarks can be effectively memorized by LLMs, and that increasing our watermarks’ density, length, and diversity of attributes strengthens their memorization. We further show that our watermarks remain robust throughout LLM development, maintaining their effectiveness after continual pretraining and supervised finetuning. Finally, we show that our data watermarks can be evaluated even under API-only access via question answering.
pdf
bib
abs
LLM-Enhanced Query Generation and Retrieval Preservation for Task-Oriented Dialogue
Jiale Chen
|
Xuelian Dong
|
Wenxiu Xie
|
Ru Peng
|
Kun Zeng
|
Tianyong Hao
Knowledge retrieval and response generation are fundamental to task-oriented dialogue systems. However, dialogue context frequently contains noisy or irrelevant information, leading to sub-optimal results in knowledge retrieval. One possible approach to retrieving knowledge is to manually annotate standard queries for each dialogue. Yet, this approach is hindered by the challenge of data scarcity, as human annotation is costly. To solve this challenge, we propose an LLM-enhanced model of query-guided knowledge retrieval for task-oriented dialogue. It generates high-quality queries for knowledge retrieval in task-oriented dialogue solely using low-resource annotated queries. To strengthen the performance correlation between response generation and knowledge retrieval, we propose a retrieval preservation mechanism by further selecting the most relevant knowledge from retrieved top-K records and explicitly incorporating these as prompts to guide a generator in response generation. Experiments on three standard benchmarks demonstrate that our model and mechanism outperform the previous state of the art by 3.26% on average with two widely used evaluation metrics.
pdf
bib
abs
ClozeMath: Improving Mathematical Reasoning in Language Models by Learning to Fill Equations
Quang Hieu Pham
|
Thuy Duong Nguyen
|
Tung Pham
|
Anh Tuan Luu
|
Dat Quoc Nguyen
The capabilities of large language models (LLMs) have been enhanced by training on data that reflects human thought processes, such as the Chain-of-Thought format. However, evidence suggests that the conventional scheme of next-word prediction may not fully capture how humans learn to think. Inspired by how humans generalize mathematical reasoning, we propose a new approach named ClozeMath to fine-tune LLMs for mathematical reasoning. Our ClozeMath involves a text-infilling task that predicts masked equations from a given solution, analogous to cloze exercises used in human learning. Experiments on GSM8K, MATH, and GSM-Symbolic show that ClozeMath surpasses the strong baseline Masked Thought in performance and robustness, with two test-time scaling decoding algorithms, Beam Search and Chain-of-Thought decoding. Additionally, we conduct an ablation study to analyze the effects of various architectural and implementation choices on our approach.
pdf
bib
abs
Low-Entropy Watermark Detection via Bayes’ Rule Derived Detector
Beining Huang
|
Du Su
|
Fei Sun
|
Qi Cao
|
Huawei Shen
|
Xueqi Cheng
Text watermarking, which modifies tokens to embed a watermark, has proven effective in detecting machine-generated texts. Yet its application to low-entropy texts like code and mathematics presents significant challenges. A fair number of tokens in these texts are hardly modifiable without changing the intended meaning, causing statistical measures to falsely indicate the absence of a watermark. Existing research addresses this issue by relying mainly on a limited number of high-entropy tokens, which are considered flexible for modification and accurately reflect watermarks. However, their detection accuracy remains suboptimal, as they neglect strong watermark evidence embedded in low-entropy tokens modified through watermarking. To overcome this limitation, we introduce the Bayes’ Rule derived Watermark Detector (BRWD), which exploits watermark information from every token by leveraging the posterior probability of a watermark’s presence. We theoretically prove the optimality of our method in terms of detection accuracy, and demonstrate its superiority across various datasets, models, and watermark injection strategies. Notably, our method achieves up to 50% and 70% relative improvements in detection accuracy over the best baselines in code generation and math problem-solving tasks, respectively. Our code is available at https://github.com/cczslp/BRWD.
pdf
bib
abs
CoD, Towards an Interpretable Medical Agent using Chain of Diagnosis
Junying Chen
|
Chi Gui
|
Anningzhe Gao
|
Ke Ji
|
Xidong Wang
|
Xiang Wan
|
Benyou Wang
The field of AI healthcare has undergone a significant transformation with the advent of large language models (LLMs), yet the challenges of interpretability within these models remain largely unaddressed. This study introduces **Chain-of-Diagnosis (CoD)** to enhance the interpretability of medical automatic diagnosis. CoD transforms the diagnostic process into a diagnostic chain that mirrors a physician’s thought process, providing a transparent reasoning pathway. Additionally, CoD outputs the disease confidence distribution to ensure transparency in decision-making. This interpretability makes model diagnostics controllable and aids in identifying critical symptoms for inquiry through the entropy reduction of confidences. With CoD, we developed **DiagnosisGPT**, capable of diagnosing 9,604 diseases for validating CoD. Experimental results demonstrate that DiagnosisGPT outperforms other LLMs on automatic diagnostic tasks across three real-world benchmarks. Moreover, DiagnosisGPT provides interpretability while ensuring controllability in diagnostic rigor.
pdf
bib
abs
DaNet: Dual-Aware Enhanced Alignment Network for Multimodal Aspect-Based Sentiment Analysis
Aoqiang Zhu
|
Min Hu
|
Xiaohua Wang
|
Jiaoyun Yang
|
Yiming Tang
|
Ning An
Multimodal Aspect-Based Sentiment Analysis (MABSA) aims to extract aspect-sentiment pairs from text and image data. While significant progress has been made in image-aspect alignment, the subtlety and complexity of language expressions mean that there are not always explicit aspect words in the language to align with images. Existing methods typically assume a direct alignment between images and aspects, matching the entire image with a corresponding aspect. This rough alignment of images and aspects introduces noise. To address the above issues, this paper proposes a Dual-Aware Enhanced Alignment Network (DaNet) designed for fine-grained multimodal aspect-image alignment and denoising. Specifically, we first introduce a Multimodal Denoising Encoder (MDE) that jointly uses image and text to guide the compression and denoising of visual sequences. Then, aspect-aware and sentiment-aware networks are constructed to jointly enhance fine-grained alignment and denoising of text-image information. To better align implicit aspects, an Implicit Aspect Opinion Generation (IAOG) pretraining is designed under the guidance of a large language model. Extensive experiments across three MABSA subtasks demonstrate that DaNet outperforms existing methods. Code will be available at https://github.com/***/DaNet.
pdf
bib
abs
Exploring Multimodal Challenges in Toxic Chinese Detection: Taxonomy, Benchmark, and Findings
Shujian Yang
|
Shiyao Cui
|
Chuanrui Hu
|
Haicheng Wang
|
Tianwei Zhang
|
Minlie Huang
|
Jialiang Lu
|
Han Qiu
Detecting toxic content using language models is important but challenging. While large language models (LLMs) have demonstrated strong performance in understanding Chinese, recent studies show that simple character substitutions in toxic Chinese text can easily confuse the state-of-the-art (SOTA) LLMs. In this paper, we highlight the multimodal nature of the Chinese language as a key challenge for deploying LLMs in toxic Chinese detection. First, we propose a taxonomy of 3 perturbation strategies and 8 specific approaches in toxic Chinese content. Then, we curate a dataset based on this taxonomy, and benchmark 9 SOTA LLMs (from both the US and China) to assess if they can detect perturbed toxic Chinese text. Additionally, we explore cost-effective enhancement solutions like in-context learning (ICL) and supervised fine-tuning (SFT). Our results reveal two important findings. (1) LLMs are less capable of detecting perturbed multimodal Chinese toxic content. (2) ICL or SFT with a small number of perturbed examples may cause the LLMs to “overcorrect”, misidentifying much normal Chinese content as toxic.
pdf
bib
abs
LDIR: Low-Dimensional Dense and Interpretable Text Embeddings with Relative Representations
Yile Wang
|
Zhanyu Shen
|
Hui Huang
Semantic text representation is a fundamental task in the field of natural language processing. Existing text embeddings (e.g., SimCSE and LLM2Vec) have demonstrated excellent performance, but the values of each dimension are difficult to trace and interpret. Bag-of-words, as a classic sparse interpretable embedding, suffers from poor performance. Recently, Benara et al. (2024) proposed interpretable text embeddings using large language models, which form “0/1” embeddings based on responses to a series of questions. These interpretable text embeddings are typically high-dimensional (larger than 10,000). In this work, we propose Low-dimensional (lower than 500) Dense and Interpretable text embeddings with Relative representations (LDIR). The numerical values of its dimensions indicate semantic relatedness to different anchor texts selected through farthest point sampling, offering both semantic representation and a certain level of traceability and interpretability. We validate LDIR on multiple semantic textual similarity, retrieval, and clustering tasks. Extensive experimental results show that LDIR performs close to the black-box baseline models and outperforms the interpretable embedding baselines with far fewer dimensions.
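The idea of describing a text by its relatedness to a small set of anchors can be illustrated with a minimal sketch. This is not the authors' implementation: the toy 2-D vectors stand in for embeddings from some encoder, cosine similarity is an assumed relatedness measure, and the function names `farthest_point_sampling` and `relative_representation` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def farthest_point_sampling(points, k, start=0):
    """Greedily pick k indices so each new pick is farthest from those already chosen.

    Distance is assumed to be 1 - cosine similarity.
    """
    if not 0 < k <= len(points):
        raise ValueError("k must be between 1 and len(points)")
    chosen = [start]
    # Distance from every point to its nearest chosen anchor so far.
    nearest = [1.0 - cosine(p, points[start]) for p in points]
    while len(chosen) < k:
        nxt = max(range(len(points)), key=lambda i: nearest[i])
        chosen.append(nxt)
        for i, p in enumerate(points):
            nearest[i] = min(nearest[i], 1.0 - cosine(p, points[nxt]))
    return chosen


def relative_representation(x, anchors):
    """One dimension per anchor: the similarity of x to that anchor."""
    return [cosine(x, a) for a in anchors]


# Toy "embeddings" of candidate anchor texts (assumed data).
points = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9], [-1.0, 0.0]]
anchor_ids = farthest_point_sampling(points, k=3)
anchors = [points[i] for i in anchor_ids]
print(anchor_ids)
print(relative_representation([1.0, 1.0], anchors))
```

Each output dimension can be read as "how related is this text to anchor i", which is where the traceability of such a representation comes from.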
pdf
bib
abs
Ranked Voting based Self-Consistency of Large Language Models
Weiqin Wang
|
Yile Wang
|
Hui Huang
Majority voting is considered an effective method to enhance chain-of-thought reasoning, as it selects the answer with the highest “self-consistency” among different reasoning paths (Wang et al., 2023). However, previous chain-of-thought reasoning methods typically generate only a single answer in each trial, thereby ignoring the possibility of other potential answers. As a result, these alternative answers are often overlooked in subsequent voting processes. In this work, we propose to generate ranked answers in each reasoning process and conduct ranked voting among multiple ranked answers from different responses, thereby making the overall self-consistency more reliable. Specifically, we use three ranked voting methods: instant-runoff voting, Borda count voting, and mean reciprocal rank voting. We validate our methods on six datasets, including three multiple-choice and three open-ended question-answering tasks, using both advanced open-source and closed-source large language models. Extensive experimental results indicate that our proposed method outperforms the baselines, showcasing the potential of leveraging the information of ranked answers and using ranked voting to improve reasoning performance. Code and logs will be released.
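The three voting rules named in the abstract can be sketched in their textbook forms. This is a minimal illustration, not the authors' code: the function names, the alphabetical tie-breaking, and the toy ballots are assumptions, and each ballot is assumed to be one response's answers ordered from most to least preferred.

```python
from collections import Counter, defaultdict


def borda_count(ballots):
    """Each ballot gives n points to its first answer, n-1 to the second, and so on."""
    n = len({answer for ballot in ballots for answer in ballot})
    scores = defaultdict(float)
    for ballot in ballots:
        for position, answer in enumerate(ballot):
            scores[answer] += n - position
    # Ties go to the alphabetically first answer (an arbitrary choice).
    return max(sorted(scores), key=lambda a: scores[a])


def mean_reciprocal_rank(ballots):
    """Score each answer by the mean of 1/rank over all ballots (0 if absent)."""
    scores = defaultdict(float)
    for ballot in ballots:
        for position, answer in enumerate(ballot):
            scores[answer] += 1.0 / (position + 1)
    return max(sorted(scores), key=lambda a: scores[a] / len(ballots))


def instant_runoff(ballots):
    """Repeatedly drop the answer with the fewest first-place votes until one has a majority."""
    ballots = [list(b) for b in ballots if b]
    if not ballots:
        raise ValueError("no ballots")
    while True:
        remaining = {answer for ballot in ballots for answer in ballot}
        firsts = Counter({answer: 0 for answer in remaining})
        firsts.update(ballot[0] for ballot in ballots if ballot)
        total = sum(firsts.values())
        leader = max(sorted(firsts), key=lambda a: firsts[a])
        if 2 * firsts[leader] > total or len(remaining) == 1:
            return leader
        loser = min(sorted(firsts, reverse=True), key=lambda a: firsts[a])
        ballots = [[a for a in ballot if a != loser] for ballot in ballots]


# Five sampled responses, each ranking three candidate answers (toy data).
ballots = [
    ["A", "B", "C"],
    ["A", "B", "C"],
    ["B", "C", "A"],
    ["C", "B", "A"],
    ["C", "B", "A"],
]
print(borda_count(ballots))           # B
print(mean_reciprocal_rank(ballots))  # C
print(instant_runoff(ballots))        # C
```

On these toy ballots a plain majority vote over top answers would tie between A and C, while the ranked rules use the lower-ranked answers to break that tie, and they need not agree with one another.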
pdf
bib
abs
SemanticCamo: Jailbreaking Large Language Models through Semantic Camouflage
Jihui Yan
|
Xiaocui Yang
|
Daling Wang
|
Shi Feng
|
Yifei Zhang
|
Yinzhi Zhao
The rapid development and increasingly widespread applications of Large Language Models (LLMs) have made the safety issues of LLMs more prominent and critical. Although safety training is widely used in LLMs, the mismatch between pre-training and safety training still leads to safety vulnerabilities. To expose the safety vulnerabilities in LLMs and improve LLMs’ performance in safety, we propose a novel framework, SemanticCamo, which attacks LLMs through semantic camouflage. SemanticCamo bypasses safety guardrails by replacing the original unsafe content with semantic features, thereby concealing malicious intent while keeping the query’s objectives unchanged. We conduct comprehensive experiments on the state-of-the-art LLMs, including GPT-4o and Claude-3.5, finding that SemanticCamo successfully induces harmful responses from the target models in over 80% of cases on average, outperforming previous counterparts. Additionally, the performance of SemanticCamo against various defenses is evaluated, demonstrating that semantic transformations introduce critical challenges to LLM safety, necessitating targeted alignment strategies to address this vulnerability. Code and data are available at https://github.com/Jihui-Yan/SemanticCamo.
pdf
bib
abs
Assigning Distinct Roles to Quantized and Low-Rank Matrices Toward Optimal Weight Decomposition
Yoonjun Cho
|
Soeun Kim
|
Dongjae Jeon
|
Kyelim Lee
|
Beomsoo Lee
|
Albert No
Decomposing weight matrices into quantization and low-rank components (W ≈ Q + LR) is a widely used technique for compressing large language models (LLMs). Existing joint optimization methods iteratively alternate between quantization and low-rank approximation. However, these methods tend to prioritize one component at the expense of the other, resulting in suboptimal decompositions that fail to leverage each component’s unique strengths. In this work, we introduce Outlier-Driven Low-Rank Initialization (ODLRI), which assigns low-rank components the specific role of capturing activation-sensitive weights. This structured decomposition mitigates outliers’ negative impact on quantization, enabling a more effective balance between quantization and low-rank approximation. Experiments on Llama2 (7B, 13B, 70B), Llama3-8B, and Mistral-7B demonstrate that incorporating ODLRI into the joint optimization framework consistently reduces activation-aware error, minimizes quantization scale, and improves perplexity and zero-shot accuracy in low-bit settings.
pdf
bib
abs
Better Process Supervision with Bi-directional Rewarding Signals
Wenxiang Chen
|
Wei He
|
Zhiheng Xi
|
Honglin Guo
|
Boyang Hong
|
Jiazheng Zhang
|
Nijun Li
|
Tao Gui
|
Yun Li
|
Qi Zhang
|
Xuanjing Huang
Process supervision, i.e., evaluating each step, is critical for complex large language model (LLM) reasoning and test-time searching with increased inference compute. Existing approaches, represented by process reward models (PRMs), primarily focus on rewarding signals up to the current step, exhibiting a one-directional nature and lacking a mechanism to model the distance to the final target. To address this problem, we draw inspiration from the A* algorithm, which states that an effective supervisory signal should simultaneously consider the incurred cost and the estimated cost for reaching the target. Building on this key insight, we introduce BiRM, a novel process supervision model that not only evaluates the correctness of previous steps but also models the probability of future success. We conduct extensive experiments on mathematical reasoning tasks and demonstrate that BiRM provides more precise evaluations of LLM reasoning steps, achieving an improvement of 3.1% on Gaokao2023 over PRM under the Best-of-N sampling method. Besides, in search-based strategies, BiRM provides more comprehensive guidance and outperforms ORM by 5.0% and PRM by 3.8% respectively on MATH-500.
pdf
bib
abs
KnowCoder-X: Boosting Multilingual Information Extraction via Code
Yuxin Zuo
|
Wenxuan Jiang
|
Wenxuan Liu
|
Zixuan Li
|
Long Bai
|
Hanbin Wang
|
Yutao Zeng
|
Xiaolong Jin
|
Jiafeng Guo
|
Xueqi Cheng
Empirical evidence indicates that LLMs exhibit spontaneous cross-lingual alignment. However, although LLMs show promising cross-lingual alignment in Information Extraction (IE), a significant imbalance across languages persists, highlighting an underlying deficiency. To address this, we propose KnowCoder-X, a powerful code LLM with advanced cross-lingual and multilingual capabilities for universal IE. Firstly, it standardizes the representation of multilingual schemas using Python classes, ensuring a consistent ontology across different languages. Then, IE across languages is formulated as a unified code generation task. Secondly, we conduct IE cross-lingual alignment instruction tuning on the translated instance prediction task to enhance the model’s cross-lingual transferability. During this phase, we also construct a high-quality and diverse bilingual IE parallel dataset with 257k samples, called ParallelNER, synthesized by our proposed robust three-stage pipeline, with manual annotation to ensure quality. Although not trained on 29 unseen languages, KnowCoder-X surpasses ChatGPT by 30.17% and SoTA by 20.03%, thereby demonstrating superior cross-lingual IE capabilities. Comprehensive evaluations on 64 IE benchmarks in Chinese and English under various settings demonstrate that KnowCoder-X significantly enhances cross-lingual IE transfer by boosting the IE alignment. Our code and dataset are available at: https://github.com/ICT-GoKnow/KnowCoder.
pdf
bib
abs
MEIT: Multimodal Electrocardiogram Instruction Tuning on Large Language Models for Report Generation
Zhongwei Wan
|
Che Liu
|
Xin Wang
|
Chaofan Tao
|
Hui Shen
|
Jing Xiong
|
Rossella Arcucci
|
Huaxiu Yao
|
Mi Zhang
The electrocardiogram (ECG) is the primary non-invasive diagnostic tool for monitoring cardiac conditions and is crucial in assisting clinicians. Recent studies have concentrated on classifying cardiac conditions using ECG data but have overlooked ECG report generation, which is time-consuming and requires clinical expertise. To automate ECG report generation and ensure its versatility, we propose the Multimodal ECG Instruction Tuning (MEIT) framework, the first attempt to tackle ECG report generation with LLMs and multimodal instructions. To facilitate future research, we establish a benchmark to evaluate MEIT with various LLM backbones across two large-scale ECG datasets. Our approach uniquely aligns the representations of the ECG signal and the report, and we conduct extensive experiments to benchmark MEIT with nine open-source LLMs using more than 800,000 ECG reports. MEIT’s results underscore the superior performance of instruction-tuned LLMs, showcasing their proficiency in quality report generation, zero-shot capabilities, resilience to signal perturbation, and alignment with human expert evaluation. These findings emphasize the efficacy of our MEIT framework and its potential for real-world clinical application.
pdf
bib
abs
Harnessing Large Language Models for Disaster Management: A Survey
Zhenyu Lei
|
Yushun Dong
|
Weiyu Li
|
Rong Ding
|
Qi R. Wang
|
Jundong Li
Large Language Models (LLMs) have demonstrated remarkable capabilities across various domains, including their emerging role in mitigating threats to human life, infrastructure, and the environment during natural disasters. Despite increasing research on disaster-focused LLMs, there remains a lack of systematic reviews and in-depth analyses of their applications in natural disaster management. To address this gap, this paper presents a comprehensive survey of LLMs in disaster response, introducing a taxonomy that categorizes existing works based on disaster phases and application scenarios. By compiling public datasets and identifying key challenges and opportunities, this study aims to provide valuable insights for the research community and practitioners in developing advanced LLM-driven solutions to enhance resilience against natural disasters.
pdf
bib
abs
Towards Medical Complex Reasoning with LLMs through Medical Verifiable Problems
Junying Chen
|
Zhenyang Cai
|
Ke Ji
|
Xidong Wang
|
Wanlong Liu
|
Rongsheng Wang
|
Benyou Wang
The breakthrough of OpenAI o1 highlights the potential of enhancing reasoning to improve LLMs. Yet, most research in reasoning has focused on mathematical tasks, leaving domains like medicine underexplored. The medical domain, though distinct from mathematics, also demands robust reasoning to provide reliable answers, given the high standards of healthcare. However, verifying medical reasoning is challenging, unlike verifying reasoning in mathematics. To address this, we propose **Medical Verifiable Problems** with a medical verifier to check the correctness of model outputs. This verifiable nature enables advancements in medical reasoning through **a two-stage approach**: (1) using the verifier to guide the search for a complex reasoning trajectory for fine-tuning LLMs, (2) applying reinforcement learning (RL) with verifier-based rewards to enhance complex reasoning further. Finally, we introduce HuatuoGPT-o1, a medical LLM capable of complex reasoning, which outperforms general and medical-specific baselines using only 40K verifiable problems. Experiments show complex reasoning improves medical problem-solving and benefits more from RL. We hope our approach inspires advancements in reasoning across medical and other specialized domains. Code, datasets, and models are publicly available at https://github.com/FreedomIntelligence/HuatuoGPT-o1.
pdf
bib
abs
Monitoring Decoding: Mitigating Hallucination via Evaluating the Factuality of Partial Response during Generation
Yurui Chang
|
Bochuan Cao
|
Lu Lin
While large language models have demonstrated exceptional performance across a wide range of tasks, they remain susceptible to hallucinations, that is, generating plausible yet factually incorrect content. Existing methods for mitigating this risk often rely on sampling multiple full-length generations, which introduces significant response latency and becomes ineffective when the model consistently produces hallucinated outputs with high confidence. To address these limitations, we introduce Monitoring Decoding (MD), a novel framework that dynamically monitors the generation process and selectively applies in-process interventions, focusing on revising crucial tokens responsible for hallucinations. Instead of waiting until completion of multiple full-length generations, we identify hallucination-prone tokens during generation using a monitor function, and further refine these tokens through a tree-based decoding strategy. This approach ensures enhanced factual accuracy and coherence in the generated output while maintaining efficiency. Experimental results demonstrate that MD consistently outperforms self-consistency-based approaches in both effectiveness and efficiency, achieving higher factual accuracy while significantly reducing computational overhead.
pdf
bib
abs
LLM Critics Help Catch Bugs in Mathematics: Towards a Better Mathematical Verifier with Natural Language Feedback
Bofei Gao
|
Zefan Cai
|
Runxin Xu
|
Peiyi Wang
|
Ce Zheng
|
Runji Lin
|
Keming Lu
|
Dayiheng Liu
|
Chang Zhou
|
Wen Xiao
|
Tianyu Liu
|
Baobao Chang
Mathematical verifiers have recently achieved success in mathematical reasoning tasks by validating the correctness of solutions generated by policy models. However, existing verifiers are trained with binary classification labels, which are not informative enough for the model to accurately assess the solutions. To mitigate this insufficiency of binary labels, we introduce step-wise natural language feedback as rationale labels, that is, the correctness of each step and the detailed explanations. In this paper, we propose Math-Minos, a natural language feedback-enhanced verifier built by constructing automatically generated training data and a two-stage training paradigm for effective training and efficient inference. Our experiments reveal that a small set of natural language feedback can significantly boost the performance of the verifier in both verification and reinforcement learning and also significantly alleviate the data-demanding problems of the reward model, with an over 700% data efficiency improvement.
pdf
bib
abs
EvoBench: Towards Real-world LLM-Generated Text Detection Benchmarking for Evolving Large Language Models
Xiao Yu
|
Yi Yu
|
Dongrui Liu
|
Kejiang Chen
|
Weiming Zhang
|
Nenghai Yu
|
Jing Shao
With the widespread adoption of Large Language Models (LLMs), there has been an increasing need to detect LLM-generated texts, prompting extensive research in this area. However, existing detection methods are mainly evaluated on static benchmarks, which neglect the evolving nature of LLMs. Relying on existing static benchmarks could create a misleading sense of security, overestimating the real-world effectiveness of detection methods. To bridge this gap, we introduce EvoBench, a dynamic benchmark considering a new dimension of generalization across continuously evolving LLMs. EvoBench categorizes the evolving LLMs into (1) updates over time and (2) developments like finetuning and pruning, covering 7 LLM families and their 29 evolving versions. To measure the generalization across evolving LLMs, we introduce a new EMG (Evolving Model Generalization) metric. Our evaluation of 14 detection methods on EvoBench reveals that they all struggle to maintain generalization when confronted with evolving LLMs. To mitigate the generalization problems, we further propose improvement strategies. For zero-shot detectors, we propose pruning the scoring model to extract shared features. For supervised detectors, we also propose a practical training strategy. Our research sheds light on critical challenges in real-world LLM-generated text detection and represents a significant step toward practical applications.
pdf
bib
abs
MMSciBench: Benchmarking Language Models on Chinese Multimodal Scientific Problems
Xinwu Ye
|
Chengfan Li
|
Siming Chen
|
Wei Wei
|
Robert Tang
Recent advances in large language models (LLMs) and vision-language models (LVLMs) have shown promise across many tasks, yet their scientific reasoning capabilities remain untested, particularly in multimodal settings. We present MMSciBench, a benchmark for evaluating mathematical and physical reasoning through text-only and text-image formats, with human-annotated difficulty levels, solutions with detailed explanations, and taxonomic mappings. Evaluation of state-of-the-art models reveals significant limitations, with even the best model achieving only 63.77% accuracy and particularly struggling with visual reasoning tasks. Our analysis exposes critical gaps in complex reasoning and visual-textual integration, establishing MMSciBench as a rigorous standard for measuring progress in multimodal scientific understanding. The code for MMSciBench is open-sourced at GitHub, and the dataset is available at Hugging Face.
pdf
bib
abs
Lightweight Query Checkpoint: Classifying Faulty User Queries to Mitigate Hallucinations in Large Language Model Question Answering
Minjoo Son
|
Jonghak Jang
|
Misuk Kim
Question Answering (QA) with large language models has shown impressive performance, yet hallucinations still persist, particularly when user queries carry incorrect premises, insufficient context, or linguistic ambiguity. To address this issue, we propose Lightweight Query Checkpoint (LQC), a small classification model that detects verification-required queries before the LLM generates a potentially faulty answer. LQC leverages hidden states extracted from intermediate layers of a smaller-scale, non-instruct-tuned LLM to effectively distinguish queries requiring verification from clear queries. We first systematically define categories of queries that need verification, construct a dataset comprising both defective and clear queries, and train a binary contrastive learning model. Through extensive experiments on various QA datasets, we demonstrate that incorporating LQC into QA pipelines reduces hallucinations while preserving strong answer quality.
pdf
bib
abs
Exploring LLM Annotation for Adaptation of Clinical Information Extraction Models under Data-sharing Restrictions
Seiji Shimizu
|
Hisada Shohei
|
Yutaka Uno
|
Shuntaro Yada
|
Shoko Wakamiya
|
Eiji Aramaki
In-hospital text data contains valuable clinical information, yet deploying fine-tuned small language models (SLMs) for information extraction remains challenging due to differences in formatting and vocabulary across institutions. Since access to the original in-hospital data (source domain) is often restricted, annotated data from the target hospital (target domain) is crucial for domain adaptation. However, clinical annotation is notoriously expensive and time-consuming, as it demands clinical and linguistic expertise. To address this issue, we leverage large language models (LLMs) to annotate the target domain data for the adaptation. We conduct experiments on four clinical information extraction tasks, including eight target-domain datasets. Experimental results show that LLM-annotated data consistently enhances SLM performance and, with a larger amount of annotated data, outperforms manual annotation in three out of four tasks.
pdf
bib
abs
Enhancing the Comprehensibility of Text Explanations via Unsupervised Concept Discovery
Yifan Sun
|
Danding Wang
|
Qiang Sheng
|
Juan Cao
|
Jintao Li
Concept-based explainable approaches have emerged as a promising method in explainable AI because they can interpret models in a way that aligns with human reasoning. However, their adoption in the text domain remains limited. Most existing methods rely on predefined concept annotations and cannot discover unseen concepts, while other methods that extract concepts without supervision often produce explanations that are not intuitively comprehensible to humans, potentially diminishing user trust. These methods fall short of discovering comprehensible concepts automatically. To address this issue, we propose ECO-Concept, an intrinsically interpretable framework to discover comprehensible concepts with no concept annotations. ECO-Concept first utilizes an object-centric architecture to extract semantic concepts automatically. Then the comprehensibility of the extracted concepts is evaluated by large language models. Finally, the evaluation result guides the subsequent model fine-tuning to obtain more understandable explanations using relatively comprehensible concepts. Experiments show that our method achieves superior performance across diverse tasks. Further concept evaluations validate that the concepts learned by ECO-Concept surpass current counterparts in comprehensibility.
pdf
bib
abs
RecordTwin: Towards Creating Safe Synthetic Clinical Corpora
Seiji Shimizu
|
Ibrahim Baroud
|
Lisa Raithel
|
Shuntaro Yada
|
Shoko Wakamiya
|
Eiji Aramaki
The scarcity of publicly available clinical corpora hinders developing and applying NLP tools in clinical research. While existing work tackles this issue by utilizing generative models to create high-quality synthetic corpora, their methods require learning from the original in-hospital clinical documents, making them infeasible in practice. To address this problem, we introduce RecordTwin, a novel synthetic corpus creation method designed to generate synthetic documents from anonymized clinical entities. In this method, we first extract and anonymize entities from in-hospital documents to ensure the information contained in the synthetic corpus is restricted. Then, we use a large language model to fill the context between anonymized entities. To do so, we use a small, privacy-preserving subset of the original documents to mimic their formatting and writing style. This approach only requires anonymized entities and a small subset of original documents in the generation process, making it more feasible in practice. To evaluate the synthetic corpus created with our method, we conduct a proof-of-concept study using a publicly available clinical database. Our results demonstrate that the synthetic corpus has a utility comparable to the original data and a safety advantage over baselines, highlighting the potential of RecordTwin for privacy-preserving synthetic corpus creation.
pdf
bib
abs
Beyond Surface-Level Patterns: An Essence-Driven Defense Framework Against Jailbreak Attacks in LLMs
Shiyu Xiang
|
Ansen Zhang
|
Yanfei Cao
|
Fan Yang
|
Ronghao Chen
Although Aligned Large Language Models (LLMs) are trained to reject harmful requests, they remain vulnerable to jailbreak attacks. Unfortunately, existing methods often focus on surface-level patterns, overlooking the deeper attack essences. As a result, defenses fail when attack prompts change, even though the underlying “attack essences” remain the same. To address this issue, we introduce EDDF, an Essence-Driven Defense Framework Against Jailbreak Attacks in LLMs. EDDF is a plug-and-play input-filtering method and operates in two stages: 1) offline essence database construction, and 2) online adversarial query detection. The key idea behind EDDF is to extract the “attack essence” from a diverse set of known attack instances and store it in an offline vector database. Experimental results demonstrate that EDDF significantly outperforms existing methods by reducing the Attack Success Rate by at least 20%, underscoring its superior robustness against jailbreak attacks.
pdf
bib
abs
Multimodal Invariant Sentiment Representation Learning
Aoqiang Zhu
|
Min Hu
|
Xiaohua Wang
|
Jiaoyun Yang
|
Yiming Tang
|
Ning An
Multimodal Sentiment Analysis (MSA) integrates diverse modalities to overcome the limitations of unimodal data. However, existing MSA datasets commonly exhibit significant sentiment distribution imbalances and cross-modal sentiment conflicts, which hinder performance improvement. This paper shows that distributional discrepancies and sentiment conflicts can be incorporated into the model training to learn stable multimodal invariant sentiment representation. To this end, we propose a Multimodal Invariant Sentiment Representation Learning (MISR) method. Specifically, we first learn a stable and consistent multimodal joint representation in the latent space of Gaussian distribution based on distributional constraints. Then, under an invariance constraint, we further learn multimodal invariant sentiment representations from multiple distributional environments constructed by the joint representation and unimodal data, achieving robust and efficient MSA performance. Extensive experiments demonstrate that MISR significantly enhances MSA performance and achieves new state-of-the-art results.
pdf
bib
abs
ChuLo: Chunk-Level Key Information Representation for Long Document Understanding
Yan Li
|
Caren Han
|
Yue Dai
|
Feiqi Cao
Transformer-based models have achieved remarkable success in various Natural Language Processing (NLP) tasks, yet their ability to handle long documents is constrained by computational limitations. Traditional approaches, such as truncating inputs, sparse self-attention, and chunking, attempt to mitigate these issues, but they often lead to information loss and hinder the model’s ability to capture long-range dependencies. In this paper, we introduce ChuLo, a novel chunk representation method for long document understanding that addresses these limitations. ChuLo groups input tokens using unsupervised keyphrase extraction, emphasizing semantically important keyphrase-based chunks to retain core document content while reducing input length. This approach minimizes information loss and improves the efficiency of Transformer-based models. Preserving all tokens in long document understanding, especially token classification tasks, is important to ensure that fine-grained annotations, which depend on the entire sequence context, are not lost. We evaluate our method on multiple long document classification tasks and long document token classification tasks, demonstrating its effectiveness through comprehensive qualitative and quantitative analysis.
pdf
bib
abs
REVS: Unlearning Sensitive Information in Language Models via Rank Editing in the Vocabulary Space
Tomer Ashuach
|
Martin Tutek
|
Yonatan Belinkov
Language models (LMs) risk inadvertently memorizing and divulging sensitive or personally identifiable information (PII) seen in training data, causing privacy concerns. Current approaches to address this issue involve costly dataset scrubbing, or model filtering through unlearning and model editing, which can be bypassed through extraction attacks. We propose REVS, a novel non-gradient-based method for unlearning sensitive information from LMs. REVS identifies and modifies a small subset of neurons relevant for the constituent tokens that form sensitive information. To adequately evaluate our method on truly sensitive information, we curate three datasets: email and URL datasets naturally memorized by the models, and a synthetic social security number dataset that we tune the models to memorize. Compared to other methods, REVS demonstrates superior performance in unlearning sensitive information and robustness to extraction attacks, while retaining underlying model integrity.
pdf
bib
abs
Is External Information Useful for Stance Detection with LLMs?
Quang Minh Nguyen
|
Taegyoon Kim
In the stance detection task, a text is classified as either favorable, opposing, or neutral towards a target. Prior work suggests that the use of external information, e.g., excerpts from Wikipedia, improves stance detection performance. However, whether or not such information can benefit large language models (LLMs) remains an unanswered question, despite their wide adoption in many reasoning tasks. In this study, we conduct a systematic evaluation on how Wikipedia and web search external information can affect stance detection across eight LLMs and in three datasets with 12 targets. Surprisingly, we find that such information degrades performance in most cases, with macro F1 scores dropping by up to 27.9%. We explain this through experiments showing LLMs’ tendency to align their predictions with the stance and sentiment of the provided information rather than the ground truth stance of the given text. We also find that performance degradation persists with chain-of-thought prompting, while fine-tuning mitigates but does not fully eliminate it. Our findings, in contrast to previous literature on BERT-based systems which suggests that external information enhances performance, highlight the risks of information biases in LLM-based stance classifiers.
pdf
bib
abs
Benchmarking Query-Conditioned Natural Language Inference
Marc E. Canby
|
Xinchi Chen
|
Xing Niu
|
Jifan Chen
|
Bonan Min
|
Sergul Aydore
|
Vittorio Castelli
The growing excitement around the ability of large language models (LLMs) to tackle various tasks has been tempered by their propensity for generating unsubstantiated information (hallucination) and by their inability to effectively handle inconsistent inputs. To detect such issues, we propose the novel task of Query-Conditioned Natural Language Inference (QC-NLI), where the goal is to determine the semantic relationship (e.g. entailment or not entailment) between two documents conditioned on a query; we demonstrate that many common tasks regarding inconsistency detection can be formulated as QC-NLI problems. We focus on three applications in particular: fact verification, intrinsic hallucination detection, and document inconsistency detection. We convert existing datasets for these tasks into the QC-NLI format, and manual annotation confirms their high quality. Finally, we employ zero- and few-shot prompting methods to solve the QC-NLI prediction problem for each task, showing the critical importance of conditioning on the query.
pdf
bib
abs
Flowchart-Based Decision Making with Large Language Models
Yuuki Yamanaka
|
Hiroshi Takahashi
|
Tomoya Yamashita
Large language models (LLMs) are widely used for conversational systems, but they face significant challenges in interpretability of dialogue flow and reproducibility of expert knowledge. To address this, we propose a novel method that extracts flowcharts from dialogue data and incorporates them into LLMs. This approach not only makes the decision-making process more interpretable through visual representation, but also ensures the reproducibility of expert knowledge by explicitly modeling structured reasoning flows. By evaluating on dialogue datasets, we demonstrate that our method effectively reconstructs expert decision-making paths with high precision and recall scores. These findings underscore the potential of flowchart-based decision making to bridge the gap between flexibility and structured reasoning, making chatbot systems more interpretable for developers and end-users.
pdf
bib
abs
NarGINA: Towards Accurate and Interpretable Children’s Narrative Ability Assessment via Narrative Graphs
Jun Zhong
|
Longwei Xu
|
Li Kong
|
Xianzhuo Li
|
Dandan Liang
|
Junsheng Zhou
The assessment of children’s narrative ability is crucial for diagnosing language disorders and planning interventions. Distinct from typical automated essay scoring, this task focuses primarily on evaluating the completeness of narrative content and the coherence of expression, as well as the interpretability of assessment results. To address these issues, we propose a novel computational assessment framework, NarGINA, under which the narrative graph is introduced to provide a concise and structured summary representation of narrative text, allowing for explicit narrative measurement. To this end, we construct the first Chinese children’s narrative assessment corpus based on real children’s narrative samples, and we then design a narrative graph construction model and a narrative graph-assisted scoring model to yield accurate narrative ability assessment. In particular, to enable the scoring model to understand narrative graphs, we propose a multi-view graph contrastive learning strategy to pre-train the graph encoder and apply instruction-tuned large language models to generate scores. The extensive experimental results show that NarGINA achieves significant performance improvement over the baselines while also offering good interpretability. Our findings reveal that the utilization of structured narrative graphs beyond flat text is well suited for narrative ability assessment. The model and data are publicly available at https://github.com/JlexZzz/NarGINA.
pdf
bib
abs
Improving Efficiency in Large Language Models via Extendable Block Floating Point Representation
Dongyang Li
|
Zeyang Li
|
Bosheng Liu
|
Jigang Wu
Large language models (LLMs) have revolutionized natural language processing (NLP) tasks, yet their increasing size poses substantial challenges in terms of computational and memory resources. Block floating-point (BFP) arithmetic offers an effective solution by leveraging the strengths of both floating-point and fixed-point representations, leading to reductions in both storage and computational overhead. However, current low-bit BFP quantization approaches often struggle to handle extreme outliers, leading to significant accuracy degradation. To overcome this limitation, we introduce Extendable Exponent Sharing (EES), a novel BFP representation that extends the exponent bit width to capture a wider dynamic range. EES achieves this by embedding extendable exponent bits into the least significant mantissa bits, thereby increasing the shared exponent’s bit width without incurring additional storage costs. To optimize the trade-off between accuracy and energy efficiency, EES employs a design space exploration strategy to optimize the configuration of extendable exponent bit widths. Experimental results show that EES outperforms representative baselines in both accuracy and computational efficiency.
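The outlier problem that this abstract attributes to low-bit BFP can be illustrated with a minimal sketch of plain block floating-point quantization. This is not the EES format from the paper; the function names, the 4-bit mantissa width, and the sample block are assumptions chosen only for illustration.

```python
import math


def bfp_quantize(block, mantissa_bits):
    """Quantize floats to one shared exponent plus signed integer mantissas.

    Hypothetical helper: plain BFP, not the paper's EES representation.
    """
    max_abs = max(abs(x) for x in block)
    if max_abs == 0.0:
        return 0, [0] * len(block)
    # frexp gives max_abs = m * 2**e with 0.5 <= m < 1; e is the shared exponent.
    _, shared_exp = math.frexp(max_abs)
    # One bit is reserved for the sign, the rest hold the magnitude.
    scale = 2.0 ** (shared_exp - (mantissa_bits - 1))
    limit = 2 ** (mantissa_bits - 1) - 1
    mantissas = [max(-limit, min(limit, round(x / scale))) for x in block]
    return shared_exp, mantissas


def bfp_dequantize(shared_exp, mantissas, mantissa_bits):
    """Map integer mantissas back to floats using the shared exponent."""
    scale = 2.0 ** (shared_exp - (mantissa_bits - 1))
    return [m * scale for m in mantissas]


# A single outlier (8.0) sets the shared exponent for the whole block.
block = [0.5, -0.25, 0.125, 8.0]
exp, mants = bfp_quantize(block, mantissa_bits=4)
restored = bfp_dequantize(exp, mants, mantissa_bits=4)
```

In this toy block the outlier fixes the shared exponent, so the three small values round to a zero mantissa and are lost on dequantization, which is the kind of accuracy degradation a wider shared exponent range is meant to address.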
pdf
bib
abs
EpiCoDe: Boosting Model Performance Beyond Training with Extrapolation and Contrastive Decoding
Mingxu Tao
|
Jie Hu
|
Mingchuan Yang
|
Yunhuai Liu
|
Dongyan Zhao
|
Yansong Feng
The remarkable performance of large language models (LLMs) relies heavily on the availability of abundant high-quality training data. However, the high cost of acquiring annotated data often prevents models from obtaining the capabilities needed to tackle downstream tasks. In this paper, we introduce a novel method, EpiCoDe, that boosts model performance in data-scarcity scenarios without extra training. We first employ model extrapolation to enhance a finetuned model with its inferior version, and then adopt contrastive decoding to further reduce prediction errors by comparing the logit scores given by the extrapolated and the vanilla finetuned model. Experiments across three domains over four different LLMs show that EpiCoDe consistently outperforms existing methods with significant and robust improvement. We also propose a new theoretical framework to reveal the mechanism behind contrastive decoding in data-scarcity scenarios, which further helps explain the effectiveness of EpiCoDe.
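The two steps named in this abstract can be sketched on toy numbers. The abstract does not give the formulas, so the linear extrapolation rule, the logit-difference scoring rule, and the names `alpha` and `beta` below are assumptions based on common formulations of these techniques, not the paper's exact method.

```python
def extrapolate(finetuned, weaker, alpha):
    """Move weights further along the direction from the weaker to the finetuned model.

    Assumed rule: theta = theta_ft + alpha * (theta_ft - theta_weak).
    """
    return [f + alpha * (f - w) for f, w in zip(finetuned, weaker)]


def contrastive_scores(strong_logits, base_logits, beta):
    """Boost tokens that the stronger model prefers more than the base model does.

    Assumed rule: score = strong + beta * (strong - base).
    """
    return [s + beta * (s - b) for s, b in zip(strong_logits, base_logits)]


def argmax(scores):
    """Index of the highest score."""
    return max(range(len(scores)), key=scores.__getitem__)


# Toy "weights" for a finetuned checkpoint and its weaker version.
extrapolated = extrapolate([1.0, 2.0], [0.5, 2.5], alpha=0.5)

# Toy next-token logits from the extrapolated and the vanilla finetuned model.
strong = [2.0, 1.5, 0.0]
base = [2.0, 0.5, 0.0]
scores = contrastive_scores(strong, base, beta=1.0)
chosen = argmax(scores)
```

Here greedy decoding on the stronger model's logits alone would pick token 0, while the contrastive scores pick token 1, the token whose logit rose most between the two models.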
pdf
bib
abs
NativQA: Multilingual Culturally-Aligned Natural Query for LLMs
Md. Arid Hasan
|
Maram Hasanain
|
Fatema Ahmad
|
Sahinur Rahman Laskar
|
Sunaya Upadhyay
|
Vrunda N Sukhadia
|
Mucahid Kutlu
|
Shammur Absar Chowdhury
|
Firoj Alam
Natural Question Answering (QA) datasets play a crucial role in evaluating the capabilities of large language models (LLMs), ensuring their effectiveness in real-world applications. Despite the numerous QA datasets that have been developed and some work done in parallel, there is a notable lack of a framework and large-scale region-specific datasets queried by native users in their own languages. This gap hinders effective benchmarking and the development of fine-tuned models for regional and cultural specificities. In this study, we propose a scalable, language-independent framework, NativQA, to seamlessly construct culturally and regionally aligned QA datasets in native languages for LLM evaluation and tuning. We demonstrate the efficacy of the proposed framework by designing a multilingual natural QA dataset, MultiNativQA, consisting of approximately 64K manually annotated QA pairs in seven languages, ranging from high- to extremely low-resource, based on queries from native speakers from 9 regions covering 18 topics. We benchmark both open- and closed-source LLMs using the MultiNativQA dataset. The dataset and related experimental scripts are publicly available for the community at https://huggingface.co/datasets/QCRI/MultiNativQA and https://gitlab.com/nativqa/multinativqa.
pdf
bib
abs
DoCIA: An Online Document-Level Context Incorporation Agent for Speech Translation
Xinglin Lyu
|
Wei Tang
|
Yuang Li
|
Xiaofeng Zhao
|
Ming Zhu
|
Junhui Li
|
Yunfei Lu
|
Min Zhang
|
Daimeng Wei
|
Hao Yang
|
Min Zhang
Document-level context is crucial for handling discourse challenges in text-to-text document-level machine translation (MT). Despite the increased discourse challenges introduced by noise from automatic speech recognition (ASR), the integration of document-level context in speech translation (ST) remains insufficiently explored. In this paper, we develop DoCIA, an online framework that enhances ST performance by incorporating document-level context. DoCIA decomposes the ST pipeline into four stages. Document-level context is integrated into the ASR refinement, MT, and MT refinement stages through auxiliary LLM (large language model)-based modules. Furthermore, DoCIA leverages document-level information in a multi-level manner while minimizing computational overhead. Additionally, a simple yet effective determination mechanism is introduced to prevent hallucinations from excessive refinement, ensuring the reliability of the final results. Experimental results show that DoCIA significantly outperforms traditional ST baselines in both sentence and discourse metrics across four LLMs, demonstrating its effectiveness in improving ST performance.
pdf
bib
abs
RISE: Reasoning Enhancement via Iterative Self-Exploration in Multi-hop Question Answering
Bolei He
|
Xinran He
|
Mengke Chen
|
Xianwei Xue
|
Ying Zhu
|
Zhen-Hua Ling
Large Language Models (LLMs) excel in many areas but continue to face challenges with complex reasoning tasks, such as Multi-Hop Question Answering (MHQA). MHQA requires integrating evidence from diverse sources while managing intricate logical dependencies, which often leads to errors in reasoning. Retrieval-Augmented Generation (RAG), widely employed in MHQA tasks, faces challenges in effectively filtering noisy data and retrieving all necessary evidence, thereby limiting its effectiveness in addressing MHQA challenges. To address these challenges, we propose RISE: Reasoning Enhancement via Iterative Self-Exploration, a novel framework designed to enhance models’ reasoning capability through iterative self-exploration. Specifically, RISE involves three key steps in addressing MHQA tasks: question decomposition, retrieve-then-read, and self-critique. By leveraging continuous self-exploration, RISE identifies accurate reasoning paths, iteratively self-improving the model’s capability to integrate evidence, maintain logical consistency, and enhance performance in MHQA tasks. Extensive experiments on multiple MHQA benchmarks demonstrate that RISE significantly improves reasoning accuracy and task performance.
pdf
bib
abs
VADE: Visual Attention Guided Hallucination Detection and Elimination
Vishnu Prabhakaran
|
Purav Aggarwal
|
Vinay Kumar Verma
|
Gokul Swamy
|
Anoop Saladi
Vision Language Models (VLMs) have achieved significant advancements in complex visual understanding tasks. However, VLMs are prone to hallucinations, that is, generating outputs that lack alignment with visual content. This paper addresses hallucination detection in VLMs by leveraging the visual grounding information encoded in transformer attention maps. We identify three primary challenges in this approach: the elective nature of visual grounding for certain tokens, the high-dimensional and noisy nature of attention maps, and the dynamic sequence length of attention on previous tokens. To address these, we propose VADE, a novel sequence modelling approach to effectively learn complex sequential patterns from high-dimensional and noisy attention maps for fine-grained hallucination detection and mitigation. VADE achieves an average PR-AUC of 80% in hallucination detection on M-HalDetect across four different model architectures and a 5% improvement in hallucination mitigation on MSCOCO.
pdf
bib
abs
PGPO: Enhancing Agent Reasoning via Pseudocode-style Planning Guided Preference Optimization
Zouying Cao
|
Runze Wang
|
Yifei Yang
|
Xinbei Ma
|
Xiaoyong Zhu
|
Bo Zheng
|
Hai Zhao
Large Language Model (LLM) agents have demonstrated impressive capabilities in handling complex interactive problems. Existing LLM agents mainly generate natural language (NL) plans to guide reasoning, which are verbose and inefficient. NL plans are also tailored to specific tasks and restrict agents’ ability to generalize across similar tasks. To this end, we explore pseudocode-style plans (P-code Plan) to capture the structural logic of reasoning. We find that P-code Plan empowers LLM agents with stronger generalization ability and more efficiency. Inspired by this finding, we propose a pseudocode-style Planning Guided Preference Optimization method called PGPO for effective agent learning. With two planning-oriented rewards, PGPO further enhances LLM agents’ ability to generate high-quality P-code Plans and subsequent reasoning. Experiments show that PGPO achieves superior performance on representative agent benchmarks and outperforms the current leading baselines. Analyses reveal the advantage of PGPO in reducing action errors and omissions during reasoning.
pdf
bib
abs
The Effectiveness of Uncased Tokenization for Clinical Notes
Cory Paik
|
Katharina Von Der Wense
The impact of case-sensitive tokenization on clinical notes is not well understood. While clinical notes share similarities with biomedical text in terminology, they often lack the proper casing found in polished publications. Language models, unlike humans, require a fixed vocabulary and case sensitivity is a trade-off that must be considered carefully. Improper casing can lead to sub-optimal tokenization and increased sequence length, degrading downstream performance and increasing computational costs. While most recent open-domain encoder language models use uncased tokenization for all tasks, there is no clear trend in biomedical and clinical models. In this work we (1) show that uncased models exceed the performance of cased models on clinical notes, even on traditionally case-sensitive tasks such as named entity recognition and (2) introduce independent case encoding to better balance model performance on case-sensitive and improperly-cased tasks.
pdf
bib
abs
AMXFP4: Taming Activation Outliers with Asymmetric Microscaling Floating-Point for 4-bit LLM Inference
Janghwan Lee
|
Jiwoong Park
|
Jinseok Kim
|
Yongjik Kim
|
Jungju Oh
|
Jinwook Oh
|
Jungwook Choi
As large language models (LLMs) grow in parameter size and context length, computation precision has been reduced from 16-bit to 4-bit to improve inference efficiency. However, this reduction causes accuracy degradation due to activation outliers. Rotation-based INT4 methods address this via matrix calibration, but they introduce multi-hour overheads and leave key computations in full precision. Microscaling (MX) floating-point (FP) formats offer fine-grained representation with a shared scale, enabling fully quantized matrix multiplications through direct casting without calibration. However, existing research shows unsatisfactory empirical results for MXFP4 inference, and the robustness of MX formats remains largely unexplored. In this work, we uncover the fundamental tradeoffs of the MX format: while it effectively suppresses activation outliers, it does so at the cost of increased group-wise asymmetry. To address this, we propose AMXFP4, a 4-bit asymmetric FP format that handles both issues using asymmetric shared scales, without requiring calibration. Our custom MAC engine adds negligible hardware cost while improving accuracy: AMXFP4 outperforms MXFP4 by 3% on VQA and exceeds rotation-based methods by 1.6% on CSQA. It also surpasses recently deployed commercial MXFP4 variants. Code: https://github.com/aiha-lab/MX-QLLM
pdf
bib
abs
Improving Continual Pre-training Through Seamless Data Packing
Ruicheng Yin
|
Xuan Gao
|
Changze Lv
|
Xiaohua Wang
|
Xiaoqing Zheng
|
Xuanjing Huang
Continual pre-training has demonstrated significant potential in enhancing model performance, particularly in domain-specific scenarios. The most common approach for packing data before continual pre-training involves concatenating input texts and splitting them into fixed-length sequences. While straightforward and efficient, this method often leads to excessive truncation and context discontinuity, which can hinder model performance. To address these issues, we explore the potential of data engineering to enhance continual pre-training, particularly its impact on model performance and efficiency. We propose Seamless Packing (SP), a novel data packing strategy aimed at preserving contextual information and enhancing model performance. Our approach employs a sliding window technique in the first stage that synchronizes overlapping tokens across consecutive sequences, ensuring better continuity and contextual coherence. In the second stage, we adopt a First-Fit-Decreasing algorithm to pack shorter texts into bins slightly larger than the target sequence length, thereby minimizing padding and truncation. Empirical evaluations across various model architectures and corpus domains demonstrate the effectiveness of our method, outperforming baselines in 99% of all settings. Code is available at https://github.com/Infernus-WIND/Seamless-Packing.
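The First-Fit-Decreasing step mentioned in this abstract is a classic bin-packing heuristic, sketched below. This covers only the generic algorithm on text lengths; the sliding-window stage, the choice of bin capacity relative to the target sequence length, and the function name are not taken from the paper and are assumptions for illustration.

```python
def first_fit_decreasing(lengths, capacity):
    """Pack item lengths into bins of the given capacity.

    Items are sorted longest first; each goes into the first bin with room,
    and a new bin is opened only when none fits.
    """
    bins = []  # each bin is a list of item lengths
    free = []  # remaining capacity of each bin, parallel to `bins`
    for length in sorted(lengths, reverse=True):
        if length > capacity:
            raise ValueError(f"item of length {length} exceeds capacity {capacity}")
        for i, room in enumerate(free):
            if length <= room:
                bins[i].append(length)
                free[i] -= length
                break
        else:
            # No existing bin had room, so open a new one.
            bins.append([length])
            free.append(capacity - length)
    return bins


# Hypothetical token counts for six short texts, packed into bins of 8 tokens.
packed = first_fit_decreasing([5, 3, 7, 2, 4, 1], capacity=8)
```

For this input the heuristic fills three bins with no unused capacity, so no padding would be needed; real corpora rarely pack this cleanly.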
pdf
bib
abs
The Impact of Name Age Perception on Job Recommendations in LLMs
Mahammed Kamruzzaman
|
Gene Louis Kim
Names often carry generational connotations, with certain names stereotypically associated with younger or older age groups. This study examines implicit age-related name bias in LLMs used for job recommendations. Analyzing six LLMs and 117 American names categorized by perceived age across 30 occupations, we find systematic bias: older-sounding names are favored for senior roles, while younger-sounding names are linked to youth-dominant jobs, reinforcing generational stereotypes. We also find that this bias is based on perceived rather than real ages associated with the names.
pdf
bib
abs
DAPI: Domain Adaptive Toxicity Probe Vector Intervention, for Fine-Grained Detoxification
Cho Hyeonsu
|
Dooyoung Kim
|
Youngjoong Ko
There have been attempts to utilize linear probes for detoxification, with existing studies relying on a single toxicity probe vector to reduce toxicity. However, toxicity can be divided into various fine-grained subcategories, making it difficult to remove certain types of toxicity with a single toxicity probe vector. To address this limitation, we propose a category-specific toxicity probe vector approach. First, we train multiple toxicity probe vectors for different toxicity categories. During generation, we dynamically select the most relevant toxicity probe vector based on the current context. Finally, the selected vector is dynamically scaled and subtracted from the model. Our method successfully mitigates toxicity in categories that the single probe vector approach fails to detoxify. Experiments demonstrate that our approach achieves up to a 78.52% reduction in toxicity on the evaluation dataset, while fluency remains nearly unchanged, with only a 0.052% drop compared to the unsteered model.
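The select, scale, and subtract steps can be sketched roughly as below. The abstract does not specify the selection or scaling rule, so this sketch assumes unit-norm probe vectors, picks the category with the largest projection of the hidden state, and scales by that projection; the names `steer`, `probes`, and `strength` are hypothetical.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def steer(hidden, probes, strength=1.0):
    """Subtract the most relevant (assumed unit-norm) probe vector from a hidden state."""
    # Assumption: relevance is the projection of the hidden state onto each probe.
    category = max(probes, key=lambda name: dot(hidden, probes[name]))
    probe = probes[category]
    # Assumption: scale by the projection, and never push toward toxicity.
    coeff = strength * max(dot(hidden, probe), 0.0)
    steered = [h - coeff * p for h, p in zip(hidden, probe)]
    return category, steered


# Hypothetical two-dimensional example with two toxicity categories.
probes = {"insult": [1.0, 0.0], "threat": [0.0, 1.0]}
category, steered = steer([0.5, 2.0], probes)
```

With `strength=1.0` and unit-norm probes, this removes the hidden state's component along the selected direction and leaves the rest untouched.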
pdf
bib
abs
Task Knowledge Injection via Interpolations and Reinstatement for Large Language Model Generalization
Yukun Zhao
|
Lingyong Yan
|
Zhenyang Li
|
Shuaiqiang Wang
|
Zhumin Chen
|
Zhaochun Ren
|
Dawei Yin
Large language models have shown tremendous potential across various NLP tasks, and instruction tuning has been widely adopted to elicit their superior performance. However, instruction tuning may overly tailor the models to task-specific formats, potentially compromising their generalization on unseen tasks. We attribute the issue to the spurious correlations learned between inputs and targets. We propose explicit task knowledge injection to mitigate these shortcuts with latent task adaptation and knowledge reinstatement. Latent tasks serve as interpolations between new tasks and facilitate knowledge sharing through joint adaptation, enabling the model to build task knowledge more smoothly. Knowledge reinstatement helps the model build new knowledge on top of prior knowledge. Specifically, we retrieve input-relevant latent tasks and jointly learn the task and the relevant latent tasks. Moreover, we prompt the model to recall the forms of inputs corresponding to the target and build the task knowledge through the reinstatement of prior knowledge while learning the new task. We conduct extensive experiments on state-of-the-art large language models including Llama3.1-8B and Vicuna-13B across 1000+ instruction-following tasks to demonstrate the effectiveness of our method. The results demonstrate that our method improves generalization on both in-domain and out-of-domain unseen tasks.
pdf
bib
abs
STARS: A Unified Framework for Singing Transcription, Alignment, and Refined Style Annotation
Wenxiang Guo
|
Yu Zhang
|
Changhao Pan
|
Zhiyuan Zhu
|
Ruiqi Li
|
ZheTao Chen
|
Wenhao Xu
|
Fei Wu
|
Zhou Zhao
Recent breakthroughs in singing voice synthesis (SVS) have heightened the demand for high-quality annotated datasets, yet manual annotation remains prohibitively labor-intensive and resource-intensive. Existing automatic singing annotation (ASA) methods, however, primarily tackle isolated aspects of the annotation pipeline. To address this fundamental challenge, we present STARS, which is, to our knowledge, the first unified framework that simultaneously addresses singing transcription, alignment, and refined style annotation. Our framework delivers comprehensive multi-level annotations encompassing: (1) precise phoneme-audio alignment, (2) robust note transcription and temporal localization, (3) expressive vocal technique identification, and (4) global stylistic characterization including emotion and pace. The proposed architecture employs hierarchical acoustic feature processing across frame, word, phoneme, note, and sentence levels. The novel non-autoregressive local acoustic encoders enable structured hierarchical representation learning. Experimental validation confirms the framework’s superior performance across multiple evaluation dimensions compared to existing annotation approaches. Furthermore, applications in SVS training demonstrate that models utilizing STARS-annotated data achieve significantly enhanced perceptual naturalness and precise style control. This work not only overcomes critical scalability challenges in the creation of singing datasets but also pioneers new methodologies for controllable singing voice synthesis.
pdf
bib
abs
Unveiling the Key Factors for Distilling Chain-of-Thought Reasoning
Xinghao Chen
|
Zhijing Sun
|
Guo Wenjin
|
Miaoran Zhang
|
Yanjun Chen
|
Yirong Sun
|
Hui Su
|
Yijie Pan
|
Dietrich Klakow
|
Wenjie Li
|
Xiaoyu Shen
Large Language Models (LLMs) excel in reasoning tasks through Chain-of-Thought (CoT) prompting. However, CoT prompting greatly increases computational demands, which has prompted growing interest in distilling CoT capabilities into Small Language Models (SLMs). This study systematically examines the factors influencing CoT distillation, including the choice of granularity, format and teacher model. Through experiments involving four teacher models and seven student models across seven mathematical and commonsense reasoning datasets, we uncover three key findings: (1) Unlike LLMs, SLMs exhibit a *non-monotonic* relationship with granularity, with stronger models benefiting from finer-grained reasoning and weaker models performing better with simpler CoT supervision; (2) CoT format significantly impacts LLMs but has *minimal* effect on SLMs, likely due to their reliance on supervised fine-tuning rather than pretraining preferences; (3) Stronger teacher models do *NOT* always produce better student models, as diversity and complexity in CoT supervision can outweigh accuracy alone. These findings emphasize the need to tailor CoT strategies to the specific student model, offering actionable insights for optimizing CoT distillation in SLMs.
pdf
bib
abs
INT: Establishing Information Transfer for Multilingual Intent Detection and Slot Filling
Di Wu
|
Liting Jiang
|
Bohui Mao
|
Hongyan Xie
|
Haoxiang Su
|
Zhongjiang He
|
Ruiyu Fang
|
Shuangyong Song
|
Hao Huang
|
Xuelong Li
Multilingual spoken language understanding (SLU) involves intent detection (ID) and slot filling (SF) across multiple languages. The inherent linguistic diversity presents significant challenges in achieving performance comparable to traditional SLU. Recent studies have attempted to improve multilingual SLU performance by sharing multilingual encoders. However, these approaches have not directly established information flow between languages. To address this, we first demonstrate the feasibility of such information transfer and pinpoint the key challenges: prediction error mitigation and multilingual slot alignment. We then propose the INformation Transfer network (INT) to tackle these challenges. The gate unit in INT controls the information flow between languages, reducing the adverse impact of prediction errors on both ID and SF. Additionally, we reformulate SF as a span prediction problem and introduce a slot-matching attention mechanism to achieve slot alignment across languages. Experimental results on the MASSIVE and MASSIVE-UG datasets show that our model outperforms all baselines in overall accuracy across all languages, and demonstrates robust performance when different languages are used as the source.
pdf
bib
abs
Enhancing LLM Agent Safety via Causal Influence Prompting
Dongyoon Hahm
|
Woogyeol Jin
|
June Suk Choi
|
Sungsoo Ahn
|
Kimin Lee
As autonomous agents powered by large language models (LLMs) continue to demonstrate potential across various assistive tasks, ensuring their safe and reliable behavior is crucial for preventing unintended consequences. In this work, we introduce CIP, a novel technique that leverages causal influence diagrams (CIDs) to identify and mitigate risks arising from agent decision-making. CIDs provide a structured representation of cause-and-effect relationships, enabling agents to anticipate harmful outcomes and make safer decisions. Our approach consists of three key steps: (1) initializing a CID based on task specifications to outline the decision-making process, (2) guiding agent interactions with the environment using the CID, and (3) iteratively refining the CID based on observed behaviors and outcomes. Experimental results demonstrate that our method effectively enhances safety in both code execution and mobile device control tasks.
pdf
bib
abs
Position Paper: MeMo: Towards Language Models with Associative Memory Mechanisms
Fabio Massimo Zanzotto
|
Elena Sofia Ruzzetti
|
Giancarlo A. Xompero
|
Leonardo Ranaldi
|
Davide Venditti
|
Federico Ranaldi
|
Cristina Giannone
|
Andrea Favalli
|
Raniero Romagnoli
Memorization is a fundamental ability of Transformer-based Large Language Models, achieved through learning. In this position/theory paper, we propose a paradigm shift by designing an architecture to memorize text directly, bearing in mind the principle that memorization precedes learning. We introduce MeMo, a novel architecture for language modeling that explicitly memorizes sequences of tokens in layered associative memories. By design, MeMo offers transparency and the possibility of model editing, including forgetting texts. We experimented with the MeMo architecture, showing the memorization power of the one-layer and the multi-layer configurations.
pdf
bib
abs
DeRAGEC: Denoising Named Entity Candidates with Synthetic Rationale for ASR Error Correction
Solee Im
|
Wonjun Lee
|
JinMyeong An
|
Yunsu Kim
|
Jungseul Ok
|
Gary Lee
We present DeRAGEC, a method for improving Named Entity (NE) correction in Automatic Speech Recognition (ASR) systems. By extending the Retrieval-Augmented Generative Error Correction (RAGEC) framework, DeRAGEC employs synthetic denoising rationales to filter out noisy NE candidates before correction. By leveraging phonetic similarity and augmented definitions, it refines noisy retrieved NEs using in-context learning, requiring no additional training. Experimental results on CommonVoice and STOP datasets show significant improvements in Word Error Rate (WER) and NE hit ratio, outperforming baseline ASR and RAGEC methods. Specifically, we achieved a 28% relative reduction in WER compared to ASR without postprocessing.
pdf
bib
abs
Rehearse With User: Personalized Opinion Summarization via Role-Playing based on Large Language Models
Yanyue Zhang
|
Yulan He
|
Deyu Zhou
Personalized opinion summarization is crucial as it considers individual user interests while generating product summaries. Recent studies show that although large language models demonstrate powerful text summarization and evaluation capabilities without the need for training data, they face difficulties in personalized tasks involving long texts. To address this, Rehearsal, a personalized opinion summarization framework via LLM-based role-playing, is proposed. By acting as the user, the model can better understand the user’s personalized needs. Additionally, a role-playing supervisor and practice process are introduced to improve the role-playing ability of the LLMs, leading to a better expression of user needs. Furthermore, the summary generation process is guided by suggestions from virtual users, ensuring that the generated summary includes the user’s interests, thus achieving personalized summary generation. Experimental results demonstrate that our method can effectively improve the level of personalization in large model-generated summaries.
pdf
bib
abs
AdParaphrase v2.0: Generating Attractive Ad Texts Using a Preference-Annotated Paraphrase Dataset
Soichiro Murakami
|
Peinan Zhang
|
Hidetaka Kamigaito
|
Hiroya Takamura
|
Manabu Okumura
Identifying factors that make ad text attractive is essential for advertising success. This study proposes AdParaphrase v2.0, a dataset for ad text paraphrasing, containing human preference data, to enable the analysis of the linguistic factors and to support the development of methods for generating attractive ad texts. Compared with v1.0, this dataset is 20 times larger, comprising 16,460 ad text paraphrase pairs, each annotated with preference data from ten evaluators, thereby enabling a more comprehensive and reliable analysis. Through the experiments, we identified multiple linguistic features of engaging ad texts that were not observed in v1.0 and explored various methods for generating attractive ad texts. Furthermore, our analysis demonstrated the relationships between human preference and ad performance, and highlighted the potential of reference-free metrics based on large language models for evaluating ad text attractiveness. The dataset is publicly available at: https://github.com/CyberAgentAILab/AdParaphrase-v2.0.
pdf
bib
abs
Beyond the Average Reader: the Reader Embedding Approach
Calogero Jerik Scozzaro
|
Matteo Delsanto
|
Daniele P. Radicioni
The focus of this work is the prediction of reading times as the task is customarily dealt with in the literature: that is, by collecting eye-tracking data that are averaged and employed to train learning models. We start by observing that systems trained on average values are ill-suited for predicting the reading times of specific subjects, as they fail to account for individual variability, to accurately analyze the reading gestures of specific reader groups, or to target specific user needs. To overcome this limitation, that is, to predict the reading times for a specific subject, we propose a novel approach based on creating an embedding that compactly describes their fixations. Embeddings are used to identify readers in a reference corpus who share the same or similar reading behavior. Models are then trained on values averaged over this subset of similar readers. Experimental results indicate that the proposed approach consistently outperforms its corresponding variants, in which predictions of reading times for specific readers are based on data from all subjects rather than from the most similar ones.
pdf
bib
abs
PredictaBoard: Benchmarking LLM Score Predictability
Lorenzo Pacchiardi
|
Konstantinos Voudouris
|
Ben Slater
|
Fernando Martínez-Plumed
|
Jose Hernandez-Orallo
|
Lexin Zhou
|
Wout Schellaert
Despite possessing impressive skills, Large Language Models (LLMs) often fail unpredictably, demonstrating inconsistent success in even basic common sense reasoning tasks. This unpredictability poses a significant challenge to ensuring their safe deployment, as identifying and operating within a reliable “safe zone” is essential for mitigating risks. To address this, we present PredictaBoard, a novel collaborative benchmarking framework designed to evaluate the ability of score predictors (referred to as assessors) to anticipate LLM errors on specific task instances (i.e., prompts) from existing datasets. PredictaBoard evaluates pairs of LLMs and assessors by considering the rejection rate at different tolerance errors. As such, PredictaBoard stimulates research into developing better assessors and making LLMs more predictable, not only with a higher average performance. We conduct illustrative experiments using baseline assessors and state-of-the-art LLMs. PredictaBoard highlights the critical need to evaluate predictability alongside performance, paving the way for safer AI systems where errors are not only minimised but also anticipated and effectively mitigated. Code for our benchmark can be found at https://github.com/Kinds-of-Intelligence-CFI/PredictaBoard
pdf
bib
abs
FedDQC: Data Quality Control in Federated Instruction-tuning of Large Language Models
Yaxin Du
|
Rui Ye
|
Fengting Yuchi
|
Wanru Zhao
|
Jingjing Qu
|
Yanfeng Wang
|
Siheng Chen
Federated Learning (FL) enables privacy-preserving collaborative instruction tuning of large language models (LLMs) by leveraging massively distributed data. However, the decentralized nature of FL exacerbates data quality challenges, as local clients lack global visibility to filter noisy or low-quality samples before training. To resolve this issue, we propose FedDQC, a novel federated instruction tuning framework with dynamic data quality control. Our approach introduces two key innovations. First, we propose instruction-response alignment (IRA)—an efficient client-side metric for quality evaluation requiring only low-cost inference. We validate that higher-IRA data corresponds to more relevant and easier-to-learn question-answer pairs. Second, mirroring the human easy-to-hard knowledge acquisition process, we design a quality-aware hierarchical FL training framework, where the LLM is progressively fine-tuned from high- to low-IRA data in a collaborative manner. The framework also supports adaptive data quality assessment at each hierarchy, enabling dynamic adjustments throughout the training process. Extensive experiments on synthetic and real-world datasets show that our method significantly improves LLM performance on mixed-quality data in FL.
pdf
bib
abs
Weed Out, Then Harvest: Dual Low-Rank Adaptation is an Effective Noisy Label Detector for Noise-Robust Learning
Bo Yuan
|
Yulin Chen
|
Yin Zhang
Parameter-efficient fine-tuning (PEFT) of large language models (LLMs) has shown impressive performance in various downstream tasks. However, in many real-world scenarios, the collected training data inevitably contains noisy labels. To learn from noisy labels, most solutions select samples with small losses for model training. However, the selected samples, in turn, impact the loss computation in the next iteration. An inaccurate initial selection can create a vicious cycle, leading to suboptimal performance. To break this cycle, we propose Delora, a novel framework that decouples the sample selection from model training. For sample selection, Delora establishes a noisy label detector by introducing clean and noisy LoRA. Benefiting from the memory effect, the clean LoRA is encouraged to memorize clean data, while the noisy LoRA is constrained to memorize mislabeled data, which serves as a learnable threshold for selecting clean and noisy samples. For model training, Delora can use carefully selected samples to fine-tune language models seamlessly. Experimental results on synthetic and real-world noisy datasets demonstrate the effectiveness of Delora in noisy label detection and text classification.
pdf
bib
abs
“I understand your perspective”: LLM Persuasion through the Lens of Communicative Action Theory
Esra Dönmez
|
Agnieszka Falenska
Large Language Models (LLMs) can generate high-quality arguments, yet their ability to engage in *nuanced and persuasive communicative actions* remains largely unexplored. This work explores the persuasive potential of LLMs through the framework of Jürgen Habermas’ Theory of Communicative Action. It examines whether LLMs express illocutionary intent (i.e., pragmatic functions of language such as conveying knowledge, building trust, or signaling similarity) in ways that are comparable to human communication. We simulate online discussions between opinion holders and LLMs using conversations from the persuasive subreddit *ChangeMyView*. We then compare the likelihood of illocutionary intents in human-written and LLM-generated counter-arguments, specifically those that successfully changed the original poster’s view. We find that all three LLMs effectively convey illocutionary intent, often more so than humans, potentially increasing their anthropomorphism. Further, LLMs craft responses that closely align with the opinion holder’s intent, a strategy strongly associated with opinion change. Finally, crowd-sourced workers find LLM-generated counter-arguments more *agreeable* and consistently prefer them over human-written ones. These findings suggest that LLMs’ persuasive power extends beyond merely generating high-quality arguments. On the contrary, training LLMs with human preferences effectively tunes them to mirror human communication patterns, particularly nuanced communicative actions, potentially increasing individuals’ susceptibility to their influence.
pdf
bib
abs
Nunchi-Bench: Benchmarking Language Models on Cultural Reasoning with a Focus on Korean Superstition
Kyuhee Kim
|
Sangah Lee
As large language models (LLMs) become key advisors in various domains, their cultural sensitivity and reasoning skills are crucial in multicultural environments. We introduce Nunchi-Bench, a benchmark designed to evaluate LLMs’ cultural understanding, with a focus on Korean superstitions. The benchmark consists of 247 questions spanning 31 topics, assessing factual knowledge, culturally appropriate advice, and situational interpretation. We evaluate multilingual LLMs in both Korean and English to analyze their ability to reason about Korean cultural contexts and how language variations affect performance. To systematically assess cultural reasoning, we propose a novel verification strategy with customized scoring metrics that capture the extent to which models recognize cultural nuances and respond appropriately. Our findings highlight significant challenges in LLMs’ cultural reasoning. While models generally recognize factual information, they struggle to apply it in practical scenarios. Furthermore, explicit cultural framing enhances performance more effectively than relying solely on the language of the prompt. To support further research, we publicly release Nunchi-Bench alongside a leaderboard.
pdf
bib
abs
Let’s Be Self-generated via Step by Step: A Curriculum Learning Approach to Automated Reasoning with Large Language Models
Kangyang Luo
|
Zichen Ding
|
Zhenmin Weng
|
Lingfeng Qiao
|
Meng Zhao
|
Xiang Li
|
Di Yin
|
Jinlong Shu
While Chain of Thought (CoT) prompting approaches have significantly strengthened the reasoning capabilities of large language models (LLMs), they still face limitations: they either require extensive human effort or leave room for performance improvement. Existing endeavors have focused on bridging these gaps; however, these approaches either hinge on external data and cannot completely eliminate manual effort, or they fall short in effectively directing LLMs to generate high-quality exemplary prompts. To address these pitfalls, we propose a novel prompting approach for automatic reasoning named LBS3, inspired by curriculum learning, which better reflects human learning habits. Specifically, LBS3 initially steers LLMs to recall easy-to-hard proxy queries that are pertinent to the target query. Following this, it invokes a progressive strategy that utilizes exemplary prompts derived from easy-proxy queries to direct LLMs in solving hard-proxy queries, ensuring high-quality proxy solutions. Finally, our extensive experiments on various reasoning-intensive tasks with a range of open- and closed-source LLMs show that LBS3 achieves strongly competitive performance compared to the SOTA baselines.
pdf
bib
abs
daDPO: Distribution-Aware DPO for Distilling Conversational Abilities
Zhengze Zhang
|
Shiqi Wang
|
Yiqun Shen
|
Simin Guo
|
Dahua Lin
|
Xiaoliang Wang
|
Nguyen Cam-Tu
|
Fei Tan
Large language models (LLMs) have demonstrated exceptional performance across various applications, but their conversational abilities decline sharply as model size decreases, presenting a barrier to their deployment in resource-constrained environments. Knowledge distillation (KD) with Direct Preference Optimization (DPO) has emerged as a promising approach to enhance the conversational abilities of smaller models using a larger teacher model. However, current methods primarily focus on “black-box” KD, which only uses the teacher’s responses, overlooking the rich distributional information within the teacher’s probability distribution. This paper addresses this gap by introducing daDPO (Distribution-Aware DPO), a novel framework that integrates the teacher’s distributional information into DPO distillation while preserving theoretical guarantees. Our framework offers a unified objective that enhances both preference optimization and distribution-based distillation. We provide rigorous theoretical analysis and empirical validation, showing that daDPO outperforms existing methods in restoring performance for pruned models and enhancing smaller models within the same LLM family. Notably, in in-domain evaluation, our method enables a 20% pruned Vicuna1.5-7B to achieve near-teacher performance (-7.3% preference rate), and allows Qwen2.5-1.5B to occasionally outperform its 7B teacher model (14.0% win rate).
pdf
bib
abs
Consultant Decoding: Yet Another Synergistic Mechanism
Chuanghao Ding
|
Jiaping Wang
|
Ziqing Yang
|
Xiaoliang Wang
|
Dahua Lin
|
Nguyen Cam-Tu
|
Fei Tan
The synergistic mechanism based on Speculative Decoding (SD) has garnered considerable attention as a simple yet effective approach for accelerating the inference of large language models (LLMs). Nonetheless, the high rejection rates require repeated LLM calls to validate draft tokens, undermining the overall efficiency gain of SD. In this work, we revisit existing verification mechanisms and propose a novel synergistic mechanism, Consultant Decoding (CD). CD achieves up to a 2.5-fold increase in inference speed compared to the target model, while maintaining comparable generation quality (~100% of the target model’s performance). Interestingly, this is achieved by combining models whose parameter sizes differ by two orders of magnitude. In addition, CD reduces the call frequency of the large target model to below 10%, particularly in more demanding tasks. CD’s performance was even found to surpass that of the large target model, which theoretically represents the upper bound for speculative decoding.
pdf
bib
abs
IntelliCockpitBench: A Comprehensive Benchmark to Evaluate VLMs for Intelligent Cockpit
Liang Lin
|
Siyuan Chai
|
Jiahao Wu
|
Hongbing Hu
|
Xiaotao Gu
|
Hao Hu
|
Fan Zhang
|
Wei Wang
|
Dan Zhang
The integration of sophisticated Vision-Language Models (VLMs) in vehicular systems is revolutionizing vehicle interaction and safety, performing tasks such as Visual Question Answering (VQA). However, a critical gap persists due to the lack of a comprehensive benchmark for multimodal VQA models in vehicular scenarios. To address this, we propose IntelliCockpitBench, a benchmark that encompasses diverse automotive scenarios. It includes images from front, side, and rear cameras, various road types, weather conditions, and interior views, integrating data from both moving and stationary states. Notably, all images and queries in the benchmark are verified for high levels of authenticity, ensuring the data accurately reflects real-world conditions. A sophisticated scoring methodology combining human and model-generated assessments enhances reliability and consistency. Our contributions include a diverse and authentic dataset for automotive VQA and a robust evaluation metric aligning human and machine assessments. All code and data can be found at https://github.com/Lane315/IntelliCockpitBench.
pdf
bib
abs
Analyzing Political Bias in LLMs via Target-Oriented Sentiment Classification
Akram Elbouanani
|
Evan Dufraisse
|
Adrian Popescu
Political biases encoded by LLMs might have detrimental effects on downstream applications. Existing bias analysis methods rely on small-size intermediate tasks (questionnaire answering or political content generation) and rely on the LLMs themselves for analysis, thus propagating bias. We propose a new approach leveraging the observation that LLM sentiment predictions vary with the target entity in the same sentence. We define an entropy-based inconsistency metric to encode this prediction variability. We insert 1319 demographically and politically diverse politician names in 450 political sentences and predict target-oriented sentiment using seven models in six widely spoken languages. We observe inconsistencies in all tested combinations and aggregate them in a statistically robust analysis at different granularity levels. We observe positive and negative bias toward left and far-right politicians and positive correlations between politicians with similar alignment. Bias intensity is higher for Western languages than for others. Larger models exhibit stronger and more consistent biases and reduce discrepancies between similar languages. We partially mitigate LLM unreliability in target-oriented sentiment classification (TSC) by replacing politician names with fictional but plausible counterparts. The complete code, the data, and all analyses will be made public to enable reproducibility.
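One way to picture an entropy-based inconsistency measure is to take a single sentence, swap in different politician names, and compute the Shannon entropy of the resulting sentiment labels. The sketch below assumes exactly that reading; the paper's actual metric may be defined differently, and the function name and label set are hypothetical.

```python
import math
from collections import Counter


def prediction_entropy(labels):
    """Shannon entropy (in bits) of sentiment labels predicted for one sentence."""
    if not labels:
        raise ValueError("at least one prediction is required")
    total = len(labels)
    counts = Counter(labels)
    # 0 bits means every inserted name received the same label (fully consistent).
    return sum(-(c / total) * math.log2(c / total) for c in counts.values())


# Hypothetical predictions for one sentence with four different names inserted.
consistent = prediction_entropy(["positive"] * 4)
split = prediction_entropy(["positive", "positive", "negative", "negative"])
```

A value of zero means the prediction does not depend on the inserted name, while larger values indicate that the model's sentiment shifts with the target entity.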
pdf
bib
abs
PISCO: Pretty Simple Compression for Retrieval-Augmented Generation
Maxime Louis
|
Hervé Déjean
|
Stéphane Clinchant
Retrieval-Augmented Generation (RAG) pipelines enhance Large Language Models (LLMs) by retrieving relevant documents, but they face scalability issues due to high inference costs and limited context size. Document compression is a practical solution, but current soft compression methods often suffer from accuracy losses and require extensive pretraining. In this paper, we introduce PISCO, a novel method that achieves a 16x compression rate with minimal accuracy loss (0-3%) across diverse RAG-based question-answering (QA) tasks. Unlike existing approaches, PISCO requires no pretraining or annotated data, relying solely on sequence-level knowledge distillation from document-based questions. With the ability to fine-tune a 7-10B LLM in 24 hours on a single A100 GPU, PISCO offers a highly efficient and scalable solution. We present comprehensive experiments showing that PISCO outperforms existing compression models by 8% in accuracy.
pdf
bib
abs
AnchorCoT: Anchors Pave the Way for Multi-hop Reasoning
Tianshi Ming
|
Xian Wu
|
Yingying Zhang
|
Zichuan Fu
|
Dawei Cheng
Large Language Models (LLMs) have made substantial strides in a broad array of natural language tasks. Recently, LLMs have demonstrated potential reasoning capabilities through prompt design, such as the Chain of Thought (CoT). Despite their superiority in question answering, LLMs still face challenges in answering questions that require multi-hop reasoning, often generating unreliable reasoning chains during answer generation. To improve LLMs’ performance in multi-hop reasoning, we introduce a novel reasoning approach, AnchorCoT, designed to assist LLMs in answering questions involving complex logical reasoning steps. AnchorCoT first predicts key entities which work as important “anchors” to guide the reasoning process and then employs a novel ranking algorithm to ensure the logical sequence of the predicted answers. We implement AnchorCoT on Qwen2.5-7B/14B and GPT-4o and evaluate our method on widely used multi-hop reasoning datasets, including HotpotQA, 2WikiMultiHopQA, and MuSiQue-Ans. The experimental results show that AnchorCoT outperforms existing methods in multi-hop question reasoning and provides more accurate reasoning results in multi-hop question answering tasks.
pdf
bib
abs
Token Pruning in Multimodal Large Language Models: Are We Solving the Right Problem?
Zichen Wen
|
Yifeng Gao
|
Weijia Li
|
Conghui He
|
Linfeng Zhang
Multimodal large language models (MLLMs) have shown remarkable performance for cross-modal understanding and generation, yet still suffer from severe inference costs. Recently, many works have been proposed to solve this problem with token pruning, which identifies the redundant tokens in MLLMs and then prunes them to reduce the computation and KV storage costs, leading to significant acceleration without training. While these methods claim efficiency gains, critical questions about their fundamental design and evaluation remain unanswered: Why do many existing approaches underperform even compared to naive random token selection? Is attention-based scoring sufficient for reliably identifying redundant tokens? Is language information really helpful during token pruning? What makes a good trade-off between token importance and duplication? Are current evaluation protocols comprehensive and unbiased? The neglect of these problems in previous research hinders the long-term development of token pruning. In this paper, we answer these questions one by one, providing insights into the design of future token pruning methods. Code is available in the supplementary materials.
pdf
bib
abs
Federated Data-Efficient Instruction Tuning for Large Language Models
Zhen Qin
|
Zhaomin Wu
|
Bingsheng He
|
Shuiguang Deng
Instruction tuning is a crucial step in improving the responsiveness of pretrained large language models (LLMs) to human instructions. Federated learning (FL) helps to exploit the use of vast private instruction data from clients, becoming popular for LLM tuning by improving data diversity. Existing federated tuning simply consumes all local data, causing excessive computational overhead and overfitting to local data, while centralized data-efficient solutions are not suitable for FL due to privacy concerns. This work presents FedHDS, a federated data-efficient instruction tuning approach, which tunes LLMs with a representative subset of edge-side data. It reduces the data redundancy at both intra- and inter-client levels without sharing raw data. Experiments with various LLMs, datasets and partitions show that FedHDS improves Rouge-L on unseen tasks by an average of 10.72% over the SOTA full-data federated instruction tuning methods, while using less than 1.5% of the data samples, improving training efficiency by up to tens of times.
pdf
bib
abs
They want to pretend not to understand: The Limits of Current LLMs in Interpreting Implicit Content of Political Discourse
Walter Paci
|
Alessandro Panunzi
|
Sandro Pezzelle
Implicit content plays a crucial role in political discourse, where speakers systematically employ pragmatic strategies such as implicatures and presuppositions to influence their audiences. Large Language Models (LLMs) have demonstrated strong performance in tasks requiring complex semantic and pragmatic understanding, highlighting their potential for detecting and explaining the meaning of implicit content. However, their ability to do this within political discourse remains largely underexplored. Leveraging, for the very first time, the large IMPAQTS corpus comprising transcribed Italian political speeches with expert annotations of various types of implicit content, we propose methods to test the effectiveness of LLMs in this challenging problem. Through a multiple-choice task and an open-ended generation task, we demonstrate that all tested models struggle to interpret presuppositions and implicatures. To illustrate, the best-performing model provides a fully correct explanation in only one-fourth of cases in the open-ended generation setup. We conclude that current LLMs lack the key pragmatic capabilities necessary for accurately interpreting highly implicit language, such as that found in political discourse. At the same time, we highlight promising trends and future directions for enhancing model performance. We release our data and code at: https://github.com/WalterPaci/IMPAQTS-PID
pdf
bib
abs
ZeroNER: Fueling Zero-Shot Named Entity Recognition via Entity Type Descriptions
Alessio Cocchieri
|
Marcos Martínez Galindo
|
Giacomo Frisoni
|
Gianluca Moro
|
Claudio Sartori
|
Giuseppe Tagliavini
What happens when a named entity recognition (NER) system encounters entities it has never seen before? In practical applications, models must generalize to unseen entity types where labeled training data is either unavailable or severely limited—a challenge that demands zero-shot learning capabilities. While large language models (LLMs) offer extensive parametric knowledge, they fall short in cost-effectiveness compared to specialized small encoders. Existing zero-shot methods predominantly adopt a relaxed definition of the term with potential leakage issues and rely on entity type names for generalization, overlooking the value of richer descriptions for disambiguation. In this work, we introduce ZeroNER, a description-driven framework that enhances hard zero-shot NER in low-resource settings. By leveraging general-domain annotations and entity type descriptions with LLM supervision, ZeroNER enables a BERT-based student model to successfully identify unseen entity types. Evaluated on three real-world benchmarks, ZeroNER consistently outperforms LLMs by up to 16% in F1 score, and surpasses lightweight baselines that use type names alone. Our analysis further reveals that LLMs derive significant benefits from incorporating type descriptions in the prompts.
pdf
bib
abs
Do Large Language Models Have “Emotion Neurons”? Investigating the Existence and Role
Jaewook Lee
|
Woojin Lee
|
Oh-Woog Kwon
|
Harksoo Kim
This study comprehensively explores whether there actually exist “emotion neurons” within large language models (LLMs) that selectively process and express certain emotions, and what functional role they play. Drawing on the representative emotion theory of the six basic emotions, we focus on six core emotions. Using synthetic dialogue data labeled with emotions, we identified sets of neurons that exhibit consistent activation patterns for each emotion. As a result, we confirmed that principal neurons handling emotion information do indeed exist within the model, forming distinct groups for each emotion, and that their distribution varies with model size and architectural depth. We then validated the functional significance of these emotion neurons by analyzing whether the prediction accuracy for a specific emotion significantly decreases when those neurons are artificially removed. We observed that in some emotions, the accuracy drops sharply upon neuron removal, while in others, the model’s performance largely remains intact or even improves, presumably due to overlapping and complementary mechanisms among neurons. Furthermore, by examining how prediction accuracy changes depending on which layer range and at what proportion the emotion neurons are masked, we revealed that emotion information is processed in a multilayered and complex manner within the model.
pdf
bib
abs
Grammar-Based Code Representation: Is It a Worthy Pursuit for LLMs?
Qingyuan Liang
|
Zhao Zhang
|
Zeyu Sun
|
Zheng Lin
|
Qi Luo
|
Xiao Yueyi
|
Yizhou Chen
|
Yuqun Zhang
|
Haotian Zhang
|
Lu Zhang
|
Chenbin Chenbin
|
Yingfei Xiong
Grammar serves as a cornerstone in programming languages and software engineering, providing frameworks to define the syntactic space and program structure. Existing research demonstrates the effectiveness of grammar-based code representations in small-scale models, showing their ability to reduce syntax errors and enhance performance. However, as language models scale to the billion level or beyond, syntax-level errors become rare, making it unclear whether grammar information still provides performance benefits. To explore this, we develop a series of billion-scale GrammarCoder models, incorporating grammar rules in the code generation process. Experiments on HumanEval (+) and MBPP (+) demonstrate a notable improvement in code generation accuracy. Further analysis shows that grammar-based representations enhance LLMs’ ability to discern subtle code differences, reducing semantic errors caused by minor variations. These findings suggest that grammar-based code representations remain valuable even in billion-scale models, not only by maintaining syntax correctness but also by improving semantic differentiation.
pdf
bib
abs
Investigating Inference-time Scaling for Chain of Multi-modal Thought: A Preliminary Study
Yujie Lin
|
Ante Wang
|
Moye Chen
|
Jingyao Liu
|
Hao Liu
|
Jinsong Su
|
Xinyan Xiao
Recently, inference-time scaling of chain-of-thought (CoT) has been demonstrated as a promising approach for addressing multi-modal reasoning tasks. While existing studies have predominantly centered on text-based thinking, the integration of both visual and textual modalities within the reasoning process remains unexplored. In this study, we pioneer the exploration of inference-time scaling with multi-modal thought, aiming to bridge this gap. To provide a comprehensive analysis, we systematically investigate popular sampling-based and tree search-based inference-time scaling methods on 10 challenging tasks spanning various domains. Besides, we uniformly adopt a consistency-enhanced verifier to ensure effective guidance for both methods across different thought paradigms. Results show that multi-modal thought promotes better performance against conventional text-only thought, and blending the two types of thought fosters more diverse thinking. Despite these advantages, multi-modal thoughts necessitate higher token consumption for processing richer visual inputs, which raises concerns in practical applications. We hope that our findings on the merits and drawbacks of this research line will inspire future works in the field. The code will be released upon acceptance.
pdf
bib
abs
UI-E2I-Synth: Advancing GUI Grounding with Large-Scale Instruction Synthesis
Xinyi Liu
|
Xiaoyi Zhang
|
Ziyun Zhang
|
Yan Lu
Recent advancements in Large Vision-Language Models are accelerating the development of Graphical User Interface (GUI) agents that utilize human-like vision perception capabilities to enhance productivity on digital devices. Compared to approaches predicated on GUI metadata, which are platform-dependent and vulnerable to implementation variations, vision-based approaches offer broader applicability. In this vision-based paradigm, GUI instruction grounding, which maps a user instruction to the location of the corresponding element on a given screenshot, remains a critical challenge, particularly due to limited public training datasets and resource-intensive manual annotation of instruction data. In this paper, we delve into unexplored challenges in this task, including element-to-screen ratio, unbalanced element types, and implicit instructions. To address these challenges, we introduce a large-scale data synthesis pipeline, UI-E2I-Synth, for generating instruction datasets of varying complexity using GPT-4o instead of human annotators. Furthermore, we propose a new GUI instruction grounding benchmark, UI-I2E-Bench, which is designed to address the limitations of existing benchmarks by incorporating diverse annotation aspects. Our model, trained on the synthesized data, achieves superior performance in GUI instruction grounding, demonstrating the advances offered by the proposed data synthesis pipeline. The proposed benchmark, accompanied by extensive analyses, provides practical insights for future research in this domain. We will release our dataset and benchmark to facilitate further development of the GUI instruction grounding community.
pdf
bib
abs
A Study into Investigating Temporal Robustness of LLMs
Jonas Wallat
|
Abdelrahman Abdallah
|
Adam Jatowt
|
Avishek Anand
Large Language Models (LLMs) encapsulate a surprising amount of factual world knowledge. However, their performance on temporal questions and historical knowledge is limited because they often cannot understand temporal scope and orientation or neglect the temporal aspect altogether. In this study, we aim to measure precisely how robust LLMs are for question answering based on their ability to process temporal information and perform tasks requiring temporal reasoning and temporal factual knowledge. Specifically, we design eight time-sensitive robustness tests for factual information to check the sensitivity of six popular LLMs in the zero-shot setting. Overall, we find LLMs lacking temporal robustness, especially to temporal reformulations and the use of different granularities of temporal references. We show how a selection of these eight tests can be used automatically to judge a model’s temporal robustness for user questions on the fly. Finally, we apply the findings of this study to improve the temporal QA performance by up to 55%.
pdf
bib
abs
ToolExpNet: Optimizing Multi-Tool Selection in LLMs with Similarity and Dependency-Aware Experience Networks
Zijing Zhang
|
Zhanpeng Chen
|
He Zhu
|
Ziyang Chen
|
Nan Du
|
Xiaolong Li
Tool learning enhances Large Language Models’ (LLMs) dynamic interaction with external tools, improving their ability to solve complex problems. However, current empirical methods, which primarily focus on learning tools in isolation, still struggle with accurate multi-tool selection due to issues like confusing similar tools and neglecting dependencies. To address these challenges, we propose the Tool Experience Network (ToolExpNet), which integrates tools and trial-and-error experiences into a network characterized by semantic similarity and dependency relationships. ToolExpNet iteratively conducts simulated experiments using adaptive sampling to explore subtle differences and connections between tools, and summarizes these experiences to provide insightful guidance for LLM tool selection. Our experiments demonstrate that learning the relationships between tools helps achieve more comprehensive tool learning. Evaluations on multiple real-world API datasets show that ToolExpNet effectively addresses common challenges in multi-tool selection, significantly outperforming existing baselines across different foundation LLMs.
pdf
bib
abs
SPILL: Domain-Adaptive Intent Clustering based on Selection and Pooling with Large Language Models
I-Fan Lin
|
Faegheh Hasibi
|
Suzan Verberne
In this paper, we propose Selection and Pooling with Large Language Models (SPILL), an intuitive, domain-adaptive method for intent clustering without fine-tuning. Existing embeddings-based clustering methods rely on a few labeled examples or unsupervised fine-tuning to optimize results for each new dataset, which makes them less generalizable to multiple datasets. Our goal is to make these existing embedders more generalizable to new domain datasets without further fine-tuning. Inspired by our theoretical derivation and simulation results on the effectiveness of sampling and pooling techniques, we view the clustering task as a small-scale selection problem. A good solution to this problem is associated with better clustering performance. Accordingly, we propose a two-stage approach: First, for each utterance (referred to as the seed), we derive its embedding using an existing embedder. Then, we apply a distance metric to select a pool of candidates close to the seed. Because the embedder is not optimized for new datasets, in the second stage, we use an LLM to further select utterances from these candidates that share the same intent as the seed. Finally, we pool these selected candidates with the seed to derive a refined embedding for the seed. We found that our method generally outperforms directly using an embedder, and it achieves comparable results to other state-of-the-art studies, even those that use much larger models and require fine-tuning, showing its strength and efficiency. Our results indicate that our method enables existing embedders to be further improved without additional fine-tuning, making them more adaptable to new domain datasets. Additionally, viewing the clustering task as a small-scale selection problem gives the potential of using LLMs to customize clustering tasks according to the user’s goals.
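The two-stage procedure described in the abstract above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `spill_refine`, the `pool_size` parameter, the use of cosine similarity as the distance metric, and mean pooling are all choices made here for illustration, and the `same_intent` callable is a hypothetical stand-in for the LLM judgment.

```python
import numpy as np


def spill_refine(embeddings, same_intent, pool_size=3):
    """Refine each utterance embedding by selecting and pooling neighbours.

    embeddings: (n, d) array produced by any existing embedder.
    same_intent: callable (seed_index, candidate_index) -> bool; a
        hypothetical stand-in for asking an LLM whether two utterances
        share the same intent.
    pool_size: number of nearest candidates considered per seed (assumed).
    """
    emb = np.asarray(embeddings, dtype=float)
    # Cosine similarity between all pairs of utterances.
    norms = np.linalg.norm(emb, axis=1, keepdims=True)
    unit = emb / np.clip(norms, 1e-12, None)
    sim = unit @ unit.T

    refined = np.empty_like(emb)
    for seed in range(len(emb)):
        # Stage 1: pick the candidates closest to the seed.
        order = np.argsort(-sim[seed])
        candidates = [int(i) for i in order if i != seed][:pool_size]
        # Stage 2: keep only candidates judged to share the seed's intent.
        kept = [c for c in candidates if same_intent(seed, c)]
        # Pool the seed with the kept candidates (mean pooling assumed).
        refined[seed] = emb[[seed] + kept].mean(axis=0)
    return refined
```

The refined embeddings can then be passed to any standard clustering routine; in practice `same_intent` would wrap an LLM prompt, which is left out here because the abstract does not specify it.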
pdf
bib
abs
How Far are LLMs from Being Our Digital Twins? A Benchmark for Persona-Based Behavior Chain Simulation
Rui Li
|
Heming Xia
|
Xinfeng Yuan
|
Qingxiu Dong
|
Lei Sha
|
Wenjie Li
|
Zhifang Sui
Recently, LLMs have garnered increasing attention across academic disciplines for their potential as human digital twins, virtual proxies designed to replicate individuals and autonomously perform tasks such as decision-making, problem-solving, and reasoning on their behalf. However, current evaluations of LLMs primarily emphasize dialogue simulation while overlooking human behavior simulation, which is crucial for digital twins. To address this gap, we introduce BehaviorChain, the first benchmark for evaluating LLMs’ ability to simulate continuous human behavior. BehaviorChain comprises diverse, high-quality, persona-based behavior chains, totaling 15,846 distinct behaviors across 1,001 unique personas, each with detailed history and profile metadata. For evaluation, we integrate persona metadata into LLMs and employ them to iteratively infer contextually appropriate behaviors within dynamic scenarios provided by BehaviorChain. Comprehensive evaluation results demonstrated that even state-of-the-art models struggle with accurately simulating continuous human behavior.
pdf
bib
abs
GRI-QA: a Comprehensive Benchmark for Table Question Answering over Environmental Data
Michele Luca Contalbo
|
Sara Pederzoli
|
Francesco Del Buono
|
Venturelli Valeria
|
Francesco Guerra
|
Matteo Paganelli
Assessing corporate environmental sustainability with Table Question Answering systems is challenging due to complex tables, specialized terminology, and the variety of questions they must handle. In this paper, we introduce GRI-QA, a test benchmark designed to evaluate Table QA approaches in the environmental domain. Using GRI standards, we extract and annotate tables from non-financial corporate reports, generating question-answer pairs through a hybrid LLM-human approach. The benchmark includes eight datasets, categorized by the types of operations required, including operations on multiple tables from multiple documents. Our evaluation reveals a significant gap between human and model performance, particularly in multi-step reasoning, highlighting the relevance of the benchmark and the need for further research in domain-specific Table QA. Code and benchmark datasets are available at https://github.com/softlab-unimore/gri_qa.
pdf
bib
abs
WebUIBench: A Comprehensive Benchmark for Evaluating Multimodal Large Language Models in WebUI-to-Code
Zhiyu Lin
|
Zhengda Zhou
|
Zhiyuan Zhao
|
Tianrui Wan
|
Yilun Ma
|
Junyu Gao
|
Xuelong Li
With the rapid advancement of Generative AI technology, Multimodal Large Language Models (MLLMs) have the potential to act as AI software engineers capable of executing complex web application development. Considering that the model requires a confluence of multidimensional sub-capabilities to address the challenges of various development phases, constructing a multi-view evaluation framework is crucial for accurately guiding the enhancement of development efficiency. However, existing benchmarks usually fail to provide an assessment of sub-capabilities and focus solely on webpage generation outcomes. In this work, we draw inspiration from the principles of software engineering and further propose WebUIBench, a benchmark systematically designed to evaluate MLLMs in four key areas: WebUI Perception, HTML Programming, WebUI-HTML Understanding, and WebUI-to-Code. WebUIBench comprises 21K high-quality question-answer pairs derived from over 0.7K real-world websites. The extensive evaluation of 29 mainstream MLLMs uncovers the skill characteristics and various weaknesses that models encounter during the development process.
pdf
bib
abs
Optimizing Multi-Hop Document Retrieval Through Intermediate Representations
Linjiaen Linjiaen
|
Jingyu Liu
|
Yingbo Liu
Retrieval-augmented generation (RAG) encounters challenges when addressing complex queries, particularly multi-hop questions. While several methods tackle multi-hop queries by iteratively generating internal queries and retrieving external documents, these approaches are computationally expensive. In this paper, we identify a three-stage information processing pattern in LLMs during layer-by-layer reasoning, consisting of extraction, processing, and subsequent extraction steps. This observation suggests that the representations in intermediate layers contain richer information compared to those in other layers. Building on this insight, we propose Layer-wise RAG (L-RAG). Unlike prior methods that focus on generating new internal queries, L-RAG leverages intermediate representations from the middle layers, which capture next-hop information, to retrieve external knowledge. L-RAG achieves performance comparable to multi-step approaches while maintaining inference overhead similar to that of standard RAG. Experimental results show that L-RAG outperforms existing RAG methods on open-domain multi-hop question-answering datasets, including MuSiQue, HotpotQA, and 2WikiMultiHopQA. The code is available at https://anonymous.4open.science/r/L-RAG-ADD5/.
pdf
bib
abs
Towards Better Understanding of Program-of-Thought Reasoning in Cross-Lingual and Multilingual Environments
Patomporn Payoungkhamdee
|
Pume Tuchinda
|
Jinheon Baek
|
Samuel Cahyawijaya
|
Can Udomcharoenchaikit
|
Potsawee Manakul
|
Peerat Limkonchotiwat
|
Ekapol Chuangsuwanich
|
Sarana Nutanong
Multi-step reasoning is essential for large language models (LLMs), yet multilingual performance remains challenging. While Chain-of-Thought (CoT) prompting improves reasoning, it struggles with non-English languages due to the entanglement of reasoning and execution. Program-of-Thought (PoT) prompting separates reasoning from execution, offering a promising alternative but shifting the challenge to generating programs from non-English questions. We propose a framework to evaluate PoT by separating multilingual reasoning from code execution to examine (i) the impact of fine-tuning on question-reasoning alignment and (ii) how reasoning quality affects answer correctness. Our findings demonstrate that PoT fine-tuning substantially enhances multilingual reasoning, outperforming CoT fine-tuned models. We further demonstrate a strong correlation between reasoning quality (measured through code quality) and answer accuracy, highlighting its potential as a test-time performance improvement heuristic.
pdf
bib
abs
A Fully Automated Pipeline for Conversational Discourse Annotation: Tree Scheme Generation and Labeling with Large Language Models
Kseniia Petukhova
|
Ekaterina Kochmar
Recent advances in Large Language Models (LLMs) have shown promise in automating discourse annotation for conversations. While manually designing tree annotation schemes significantly improves annotation quality for humans and models, their creation remains time-consuming and requires expert knowledge. We propose a fully automated pipeline that uses LLMs to construct such schemes and perform annotation. We evaluate our approach on speech functions (SFs) and the Switchboard-DAMSL (SWBD-DAMSL) taxonomies. Our experiments compare various design choices, and we show that frequency-guided decision trees, paired with an advanced LLM for annotation, can outperform previously manually designed trees and even match or surpass human annotators while significantly reducing the time required for annotation. We release all code and resultant schemes and annotations to facilitate future research on discourse annotation.
pdf
bib
abs
Can Language Models Serve as Analogy Annotators?
Xiaojing Zhang
|
Bochen Lyu
Conceptual abstraction and analogy-making are crucial for human learning, reasoning, and adapting to unfamiliar domains. Recently, large language models (LLMs) have made the synthesis of analogical data possible; annotating such data, however, still relies heavily on extensive human effort. This paper empirically examines LLMs’ capability to annotate story-level analogical data. Specifically, we propose a novel multi-stage progressive reasoning prompt framework, A3E (Automated Analogy Annotation Expert), which is based on the structure mapping theory from cognitive psychology and efficiently annotates candidate story pairs across six fine-grained categories. We use A3E to evaluate how well state-of-the-art LLMs can serve as analogy annotators. Experimental results demonstrate that our proposed A3E achieves an average performance gain of +73% across a range of prompting baselines and base LLMs. The code and data are available at https://github.com/zhangxjohn/A3E.
pdf
bib
abs
Reward Generalization in RLHF: A Topological Perspective
Tianyi Alex Qiu
|
Fanzhi Zeng
|
Jiaming Ji
|
Dong Yan
|
Kaile Wang
|
Jiayi Zhou
|
Yang Han
|
Josef Dai
|
Xuehai Pan
|
Yaodong Yang
Existing alignment methods share a common topology of information flow, where reward information is collected from humans, modeled with preference learning, and used to tune language models. However, this shared topology has not been systematically characterized, nor have its alternatives been thoroughly explored, leaving the problems of low data efficiency and unreliable generalization unaddressed. As a solution, we introduce a theory of **reward generalization** in reinforcement learning from human feedback (RLHF), focusing on the **topology of information flow** at both macro and micro levels. At the macro level, we portray the RLHF information flow as an autoencoding process over behavior distributions, formalizing the RLHF objective of distributional consistency between human preference and model behavior. At the micro level, we present *induced Bayesian networks* to model the impact of dataset topologies on reward generalization. Combining analysis on both levels, we propose **reward modeling from tree-structured preference information**. It is shown to reduce reward uncertainty by up to 𝛩(log n/loglog n) times compared to baselines, where n is the dataset size. Validation on three NLP tasks shows that it achieves an average win rate of 65% against baselines, thus improving reward generalization *for free* via topology design, while *reducing* the amount of data requiring annotation.
pdf
bib
abs
Enhanced Data Synthesis for LLM through Reasoning Structures Generated by Hierarchical GFlowNet
Tianpeng Bu
|
Minying Zhang
|
Hongtao Duan
|
Shurui Li
|
Lulu Hu
|
Yu Li
Large language models (LLMs) excel in problem-solving but require training data with diverse reasoning processes. Existing methods mainly optimize instruction-response pairs but lack a systematic design for the underlying reasoning structure. This paper proposes RSS: a Reasoning Structure driven data Synthesis method. We first proactively develop a hierarchical GFlowNet to construct reasoning structures efficiently through a coarse-to-fine directed acyclic graph (DAG) growth process. Then reasoning DAGs are leveraged to actively guide the instruction generation via an iterative suggester-editor workflow and enhance response quality using a structure-aware strategy. Experiments show that LLMs trained on our synthetic datasets achieve 48.50%, 84.00%, 79.90% for AlpacaEval2, GSM8K and HumanEval, outperforming existing data synthesis methods.
pdf
bib
abs
Capturing Nuanced Preferences: Preference-Aligned Distillation for Small Language Models
Yanggan Gu
|
Junzhuo Li
|
Sirui Huang
|
Xin Zou
|
Zhenghua Li
|
Xuming Hu
Aligning small language models (SLMs) with human values typically involves distilling preference knowledge from large language models (LLMs). However, existing distillation methods model preference knowledge in teacher LLMs by comparing pairwise responses, overlooking the extent of difference between responses. This limitation hinders student SLMs from capturing the nuanced preferences for multiple responses. In this paper, we propose a Preference-Aligned Distillation (PAD) framework, which models the teacher’s preference knowledge as a probability distribution over all potential preferences, thereby providing more nuanced supervisory signals. Our insight in developing PAD is rooted in the demonstration that language models can serve as reward functions, reflecting their intrinsic preferences. Based on this, PAD comprises three key steps: (1) sampling diverse responses using high-temperature sampling; (2) computing rewards for both teacher and student to construct their intrinsic preferences; and (3) training the student’s intrinsic preference distribution to align with the teacher’s. Experiments on four mainstream alignment benchmarks demonstrate that PAD consistently and significantly outperforms existing approaches, achieving over 20% improvement on AlpacaEval 2 and Arena-Hard, indicating superior alignment with human preferences. Notably, on MT-Bench, using the Gemma model family, the student trained by PAD surpasses its teacher, further validating the effectiveness of PAD.
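Steps (2) and (3) of the abstract above can be illustrated with a simplified sketch. Note the assumptions: the abstract describes a distribution over all potential preferences, whereas this sketch uses a plain softmax over per-response rewards and a KL divergence as the alignment loss, which is a simplification chosen here and not necessarily the formulation used in PAD. The function names and the `temperature` parameter are hypothetical.

```python
import numpy as np


def preference_distribution(rewards, temperature=1.0):
    """Turn per-response rewards into a probability distribution (softmax).

    Simplifying assumption: one probability per sampled response, rather
    than a distribution over full preference orderings.
    """
    r = np.asarray(rewards, dtype=float) / temperature
    r = r - r.max()  # subtract the max for numerical stability
    p = np.exp(r)
    return p / p.sum()


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) between two discrete distributions of equal length."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return float(np.sum(p * (np.log(p + eps) - np.log(q + eps))))
```

Under these assumptions, a training loop would compute `preference_distribution` from the teacher's and the student's rewards on the same sampled responses and minimize `kl_divergence(teacher, student)` with respect to the student's parameters.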
pdf
bib
abs
Token-level Preference Self-Alignment Optimization for Multi-style Outline Controllable Generation
Zihao Li
|
Xuekong Xu
|
Ziyao Chen
|
Lixin Zou
|
Ethanhjwu Ethanhjwu
|
Qiang Chen
|
Chenliang Li
Multi-style outline controllable generation is crucial for multiple applications, including document semantic structuring and retrieval-augmented generation. The great success of preference alignment approaches encourages their application in controllable generation tasks. However, these attempts encounter several limitations: (1) response pair requirements, (2) substantial computation costs, and (3) insufficient exploitation of fine-grained preference signals. To address these problems, we propose a token-level preference self-alignment optimization, named TKPO, for outline controllable generation. TKPO extends the Bradley-Terry model from pair-wise to list-wise comparison, which is further applied at the token level for fine-grained preference signal utilization. In comparison to representative methods, e.g., DPO, TKPO does not require response pairs; instead, we propose a controllable attributes-driven method to construct reject samples for self-alignment. Additionally, TKPO optimizes only the base model, thereby avoiding additional memory usage and substantial computational costs. We curate two outline controllable generation datasets with regard to language style and level-of-detail. Extensive experiments demonstrate that TKPO outperforms DPO by up to 19.28% in performance while requiring only 56.25% of the training time. We release the code and datasets at https://github.com/WHUIR/TKPO.
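For readers unfamiliar with the pair-wise to list-wise step mentioned in the abstract above, the standard Bradley-Terry model and one common list-wise generalization are written out below. This is background only: the symbols r, x, y_k, and K are notation introduced here, and the abstract does not state that TKPO's token-level objective takes exactly this form.

```latex
% Pair-wise Bradley-Terry: probability that response y_i is preferred to y_j
P(y_i \succ y_j \mid x)
  = \frac{\exp\bigl(r(x, y_i)\bigr)}
         {\exp\bigl(r(x, y_i)\bigr) + \exp\bigl(r(x, y_j)\bigr)}

% A common list-wise generalization: probability that y_1 is preferred
% over every other item in a list of K candidates
P\bigl(y_1 \succ \{y_2, \dots, y_K\} \mid x\bigr)
  = \frac{\exp\bigl(r(x, y_1)\bigr)}
         {\sum_{k=1}^{K} \exp\bigl(r(x, y_k)\bigr)}
```

Setting K = 2 in the second expression recovers the pair-wise form, which is why it is usually described as an extension of Bradley-Terry.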
pdf
bib
abs
HatePRISM: Policies, Platforms, and Research Integration. Advancing NLP for Hate Speech Proactive Mitigation
Naquee Rizwan
|
Seid Muhie Yimam
|
Daryna Dementieva
|
Dr. Florian Skupin
|
Tim Fischer
|
Daniil Moskovskiy
|
Aarushi Ajay Borkar
|
Robert Geislinger
|
Punyajoy Saha
|
Sarthak Roy
|
Martin Semmann
|
Alexander Panchenko
|
Chris Biemann
|
Animesh Mukherjee
Despite regulations imposed by nations and social media platforms, e.g. (Government of India, 2021; European Parliament and Council of the European Union, 2022), inter alia, hateful content persists as a significant challenge. Existing approaches primarily rely on reactive measures such as blocking or suspending offensive messages, with emerging strategies focusing on proactive measures like detoxification and counterspeech. In our work, which we call HATEPRISM, we conduct a comprehensive examination of hate speech regulations and strategies from three perspectives: country regulations, social platform policies, and NLP research datasets. Our findings reveal significant inconsistencies in hate speech definitions and moderation practices across jurisdictions and platforms, alongside a lack of alignment with research efforts. Based on these insights, we suggest ideas and research directions for further exploration of a unified framework for automated hate speech moderation incorporating diverse strategies.
pdf
bib
abs
Local Look-Ahead Guidance via Verifier-in-the-Loop for Automated Theorem Proving
Sara Rajaee
|
Kumar Pratik
|
Gabriele Cesa
|
Arash Behboodi
The most promising recent methods for AI reasoning require applying variants of reinforcement learning (RL) either on rolled out trajectories from the LLMs, even for the step-wise rewards, or large quantities of human-annotated trajectory data. The reliance on the rolled-out trajectory renders the compute cost and time prohibitively high. In particular, the correctness of a reasoning trajectory can typically only be judged at its completion, leading to sparse rewards in RL or requiring expensive synthetic data generation in expert iteration-like methods. In this work, we focus on the Automatic Theorem Proving (ATP) task and propose a novel verifier-in-the-loop design, which, unlike existing approaches that leverage feedback on the entire reasoning trajectory, employs an automated verifier to give intermediate feedback at each step of the reasoning process. Using Lean as the verifier, we empirically show that the step-by-step local verification produces a global improvement in the model’s reasoning accuracy and efficiency.
pdf
bib
abs
Generalizable Cross-Lingual Cognitive Distortion Detection with Standardized Annotations and Multi-Task Learning
Hongzhi Qi
|
Nan Bai
|
Jianqiang Li
|
Wei Zhai
|
Qing Zhao
|
Qi Gao
|
Bing Xiang Yang
|
Guanghui Fu
Cognitive distortion is a critical issue in psychology, with most existing studies based on Burns’ cognitive distortion theory. However, differences in annotation standards lead to variations in building analysis tools, resulting in inconsistent analyses and limiting the generalizability of findings, especially in large-scale and cross-linguistic contexts. To address this issue, we collected all publicly available datasets (four in total) and conducted a series of experiments to evaluate the generalizability of various cross-linguistic models. The results indicate that models exhibit significant performance differences across datasets, highlighting the generalization problem. To mitigate this issue, we propose two solutions. First, we propose a multi-task learning model based on a teacher-student architecture, which demonstrates improved generalization performance in our experiments. Second, we introduce a new dataset (~5,000 samples) derived from reannotating existing open datasets to ensure standardized alignment. The annotation process we provide is interpretable and grounded in psychological principles. Based on this, we constructed large language models with cognitive reasoning chains, enhancing both generalizability and interpretability. This study identifies the generalization challenge in cognitive distortion research, and our experiments show that the proposed solutions significantly improve model performance. The dataset and code are publicly available at: https://github.com/HongzhiQ/CrossLinCD.
pdf
bib
abs
How Do Multilingual Language Models Remember Facts?
Constanza Fierro
|
Negar Foroutan
|
Desmond Elliott
|
Anders Søgaard
Large Language Models (LLMs) store and retrieve vast amounts of factual knowledge acquired during pre-training. Prior research has localized and identified mechanisms behind knowledge recall; however, it has only focused on English monolingual models. The question of how these mechanisms generalize to non-English languages and multilingual LLMs remains unexplored. In this paper, we address this gap by conducting a comprehensive analysis of three multilingual LLMs. First, we show that previously identified recall mechanisms in English largely apply to multilingual contexts, with nuances based on language and architecture. Next, through patching intermediate representations, we localize the role of language during recall, finding that subject enrichment is language-independent, while object extraction is language-dependent. Additionally, we discover that the last token representation acts as a Function Vector (FV), encoding both the language of the query and the content to be extracted from the subject. Furthermore, in decoder-only LLMs, FVs compose these two pieces of information in two separate stages. These insights reveal unique mechanisms in multilingual LLMs for recalling information, highlighting the need for new methodologies—such as knowledge evaluation, fact editing, and knowledge acquisition—that are specifically tailored for multilingual LLMs.
pdf
bib
abs
SeqPO-SiMT: Sequential Policy Optimization for Simultaneous Machine Translation
Ting Xu
|
Zhichao Huang
|
Jiankai Sun
|
Shanbo Cheng
|
Wai Lam
We present Sequential Policy Optimization for Simultaneous Machine Translation (SeqPO-SiMT), a new policy optimization framework that defines the simultaneous machine translation (SiMT) task as a sequential decision making problem, incorporating a tailored reward to enhance translation quality while reducing latency. In contrast to popular Reinforcement Learning from Human Feedback (RLHF) methods, such as PPO and DPO, which are typically applied in single-step tasks, SeqPO-SiMT effectively tackles the multi-step SiMT task. This intuitive framework allows the SiMT LLMs to simulate and refine the SiMT process using a tailored reward. We conduct experiments on six datasets from diverse domains for En → Zh and Zh → En SiMT tasks, demonstrating that SeqPO-SiMT consistently achieves significantly higher translation quality with lower latency. In particular, SeqPO-SiMT outperforms the supervised fine-tuning (SFT) model by 1.13 points in COMET, while reducing the Average Lagging by 6.17 in the NEWSTEST2021 En → Zh dataset. While SiMT operates with far less context than offline translation, the SiMT results of SeqPO-SiMT on 7B LLM surprisingly rival the offline translation of high-performing LLMs, including Qwen-2.5-7B-Instruct and LLaMA-3-8B-Instruct.
pdf
bib
abs
Do Large Language Models Know Folktales? A Case Study of Yokai in Japanese Folktales
Ayuto Tsutsumi
|
Yuu Jinnai
Although Large Language Models (LLMs) have demonstrated strong language understanding and generation abilities across various languages, their cultural knowledge is often limited to English-speaking communities, which can marginalize the cultures of non-English communities. To address the problem, the evaluation of the cultural awareness of the LLMs and the methods to develop culturally aware LLMs have been investigated. In this study, we focus on evaluating knowledge of folktales, a key medium for conveying and circulating culture. In particular, we focus on Japanese folktales, specifically on knowledge of Yokai. Yokai are supernatural creatures originating from Japanese folktales that continue to be popular motifs in art and entertainment today. Yokai have long served as a medium for cultural expression, making them an ideal subject for assessing the cultural awareness of LLMs. We introduce YokaiEval, a benchmark dataset consisting of 809 multiple-choice questions (each with four options) designed to probe knowledge about yokai. We evaluate the performance of 31 Japanese and multilingual LLMs on this dataset. The results show that models trained with Japanese language resources achieve higher accuracy than English-centric models, with those that underwent continued pretraining in Japanese, particularly those based on Llama-3, performing especially well.
pdf
bib
abs
BOSE: A Systematic Evaluation Method Optimized for Base Models
Hongzhi Luan
|
Changxin Tian
|
Zhaoxin Huan
|
Xiaolu Zhang
|
Kunlong Chen
|
Zhiqiang Zhang
|
Jun Zhou
This paper identifies two critical issues in evaluating base models (without post-training): (1) Unstable evaluation during training: in the early stages of pre-training, the models lack the capability to answer questions as required, leading to unstable evaluation results. This instability makes it difficult to provide solid conclusions to guide the training, especially for key experiments such as data ablation and scaling laws. (2) Inconsistency between base and instruct models: base models generally exhibit poorer evaluation performance compared to corresponding instruct models. This gap poses a challenge for assessing whether a base model with better evaluation can truly lead to a better instruct model. To address these issues, we propose **B**ase model **O**riented **S**ystematic **E**valuation (**BOSE**), a method specifically designed to optimize the evaluation of base models. Specifically, BOSE introduces two key innovations: In-Context Light-instruction Prompt (**ICLiP**) for open-ended tasks and **Blank-ppl** for multi-choice tasks with candidate options, which transforms the standard perplexity (ppl) metric into a fill-in-the-blank format to mitigate early-stage evaluation fluctuations. Furthermore, we are the first to propose Kendall’s rank correlation to quantitatively measure the evaluation stability and consistency. Experimental results demonstrate that BOSE significantly enhances both the stability of evaluations during pre-training and the consistency between base and instruct models, thereby providing more reliable guidance for the LLMs’ training.
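As a rough illustration of how a rank correlation can quantify consistency between two sets of evaluation results, the sketch below computes Kendall's tau (the simple tau-a variant, which ignores ties) over paired scores. The abstract does not say which variant BOSE uses or how the scores are paired, so the function name `kendall_tau` and the example scores are hypothetical and chosen only for illustration.

```python
from itertools import combinations


def kendall_tau(xs, ys):
    """Kendall's tau-a: (concordant - discordant) / total number of pairs.

    Assumption: ties are counted as neither concordant nor discordant.
    """
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    concordant = discordant = 0
    for i, j in combinations(range(len(xs)), 2):
        # The sign of the product tells whether both lists order items i and j the same way.
        sign = (xs[i] - xs[j]) * (ys[i] - ys[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(xs) * (len(xs) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical scores for four checkpoints, evaluated as base and as instruct models.
base_scores = [41.2, 55.0, 48.7, 60.3]
instruct_scores = [50.1, 61.0, 63.4, 70.2]
print(round(kendall_tau(base_scores, instruct_scores), 3))
```

A value near 1 would mean the two evaluations rank the checkpoints almost identically, while a value near 0 would mean the rankings are essentially unrelated.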
pdf
bib
abs
DPGA-TextSyn: Differentially Private Genetic Algorithm for Synthetic Text Generation
Zhonghao Sun
|
Zhiliang Tian
|
Yiping Song
|
Yuyi Si
|
Juhua Zhang
|
Minlie Huang
|
Kai Lu
|
Zeyu Xiong
|
Xinwang Liu
|
Dongsheng Li
Using large language models (LLMs) carries a potential risk of privacy leakage, since data with sensitive information may be used for fine-tuning the LLMs. Differential privacy (DP) provides theoretical guarantees of privacy protection, but its practical application in LLMs still suffers from the privacy-utility trade-off. Researchers have synthesized data under DP using closed-source LLMs with strong generation capabilities (i.e., GPT-4) to alleviate this problem, but this method is not flexible enough to fit the given privacy distributions without fine-tuning. Besides, such methods can hardly balance the diversity of synthetic data and its relevance to target privacy data without accessing large amounts of private data. To this end, this paper proposes DPGA-TextSyn, combining general LLMs with a genetic algorithm (GA) to produce relevant and diverse synthetic text under DP constraints. First, we integrate the privacy gene (i.e., metadata) to generate better initial samples. Then, to achieve survival of the fittest and avoid homogeneity, we use privacy nearest neighbor voting and similarity suppression to select elite samples. In addition, we expand elite samples via genetic strategies such as mutation, crossover, and generation to expand the search scope of GA. Experiments show that this method significantly improves the performance of the model in downstream tasks while ensuring privacy.
pdf
bib
abs
Semantic Aware Linear Transfer by Recycling Pre-trained Language Models for Cross-lingual Transfer
Seungyoon Lee
|
Seongtae Hong
|
Hyeonseok Moon
|
Heuiseok Lim
Large Language Models (LLMs) are increasingly incorporating multilingual capabilities, fueling the demand to transfer them into target language-specific models. However, most approaches, which blend the source model’s embedding by replacing the source vocabulary with the target language-specific vocabulary, may constrain expressive capacity in the target language since the source model is predominantly trained on English data. In this paper, we propose Semantic Aware Linear Transfer (SALT), a novel cross-lingual transfer technique that recycles embeddings from target language Pre-trained Language Models (PLMs) to transmit the deep representational strengths of PLM-derived embeddings to LLMs. SALT derives unique regression lines based on the similarity in the overlap of the source and target vocabularies to handle each non-overlapping token’s embedding space. Our extensive experiments show that SALT significantly outperforms other transfer methods, achieving lower loss and faster convergence during language adaptation. Notably, SALT achieves remarkable performance in cross-lingual understanding setups compared to other methods. Furthermore, we highlight the scalable use of PLMs to enhance the functionality of contemporary LLMs by conducting experiments with varying architectures.
pdf
bib
abs
Boost, Disentangle, and Customize: A Robust System2-to-System1 Pipeline for Code Generation
Kounianhua Du
|
Hanjing Wang
|
Jianxing Liu
|
Jizheng Chen
|
Xinyi Dai
|
Yasheng Wang
|
Ruiming Tang
|
Yong Yu
|
Jun Wang
|
Weinan Zhang
To address these limitations, we propose BDC, a novel framework that Boosts reasoning exploration via multi-agent collaboration, Disentangles heterogeneous data into specialized experts, and Customizes solutions through dynamic model composition. BDC integrates a Monte Carlo Tree-of-Agents algorithm, where multiple LLMs mutually verify and refine reasoning paths through reflection-guided pruning, enabling efficient exploration of high-quality solutions. To handle data diversity, we cluster problems by latent semantics, train composable LoRA experts on each cluster, and deploy an input-aware hypernetwork to dynamically merge these experts into tailored solvers. Experiments on APPS and CodeContest benchmarks demonstrate BDC’s superiority: it achieves up to 73.8% accuracy on hard problems, outperforming state-of-the-art methods like LATS and RethinkMCTS by 9–15%. This work lays the groundwork for advancing LLM capabilities in complex reasoning tasks, offering a novel System2-to-System1 solution.
pdf
bib
abs
On the Consistency of Commonsense in Large Language Models
Guozheng Li
|
Peng Wang
|
Wenjun Ke
|
Zijie Xu
|
Jiajun Liu
|
Ziyu Shang
Commonsense, humans’ implicit understanding of everyday situations, is crucial for large language models (LLMs). Existing commonsense evaluations for LLMs primarily focus on downstream knowledge tasks, failing to probe whether LLMs truly understand and utilize knowledge or merely memorize it. They also rely heavily on human annotation and lack automated large-scale data generation. To address this, we propose to automatically construct a large benchmark named CoCo (Consistency of Commonsense) comprising 39K samples derived from commonsense knowledge graphs (CSKGs), paired with symbolic questions and ground-truth answers, which systematically assesses LLMs’ knowledge memorization, comprehension, and application and examines the consistency between these tasks. To enhance our evaluation, we also propose novel metrics and prompting strategies. Experimental results on multiple LLMs reveal that CoCo presents significant challenges, and our detailed analysis provides deeper insights into the strengths and limitations of LLMs’ commonsense abilities.
pdf
bib
abs
Statement-Tuning Enables Efficient Cross-lingual Generalization in Encoder-only Models
Ahmed Elshabrawy
|
Thanh-Nhi Nguyen
|
Yeeun Kang
|
Lihan Feng
|
Annant Jain
|
Faadil Abdullah Shaikh
|
Jonibek Mansurov
|
Mohamed Fazli Mohamed Imam
|
Jesus-German Ortiz-Barajas
|
Rendi Chevi
|
Alham Fikri Aji
Large Language Models (LLMs) excel in zero-shot and few-shot tasks, but achieving similar performance with encoder-only models like BERT and RoBERTa has been challenging due to their architecture. However, encoders offer advantages such as lower computational and memory costs. Recent work adapts them for zero-shot generalization using Statement Tuning, which reformulates tasks into finite templates. We extend this approach to multilingual NLP, exploring whether encoders can achieve zero-shot cross-lingual generalization and serve as efficient alternatives to memory-intensive LLMs for low-resource languages. Our results show that state-of-the-art encoder models generalize well across languages, rivaling multilingual LLMs while being more efficient. We also analyze multilingual Statement Tuning dataset design, efficiency gains, and language-specific generalization, contributing to more inclusive and resource-efficient NLP models. We release our code and models.
pdf
bib
abs
Evaluating Large Language Models for Confidence-based Check Set Selection
Jane Arleth Dela Cruz
|
Iris Hendrickx
|
Martha Larson
Large Language Models (LLMs) have shown promise in automating high-labor data tasks, but the adoption of LLMs in high-stakes scenarios faces two key challenges: their tendency to answer despite uncertainty and their difficulty handling long input contexts robustly. We investigate commonly used off-the-shelf LLMs’ ability to identify low-confidence outputs for human review through “check set selection”, a process where LLMs prioritize information needing human judgment. Using a case study on social media monitoring for disaster risk management, we define the “check set” as a list of tweets escalated to the disaster manager when the LLM has the least confidence, enabling human oversight within budgeted effort. We test two strategies for LLM check set selection: *individual confidence elicitation*, in which the LLM assesses confidence for each tweet classification individually, requiring more prompts with shorter contexts, and *direct set confidence elicitation*, in which the LLM evaluates confidence for a list of tweet classifications at once, using fewer prompts but longer contexts. Our results reveal that set selection via individual probabilities is more reliable but that direct set confidence merits further investigation. Direct set selection challenges include inconsistent outputs, incorrect check set size, and low inter-annotator agreement. Despite these challenges, our approach improves collaborative disaster tweet classification by outperforming random-sample check set selection, demonstrating the potential of human-LLM collaboration.
pdf
bib
abs
Training Multi-Modal LLMs through Dialogue Planning for HRI
Claudiu Daniel Hromei
|
Federico Borazio
|
Andrea Sensi
|
Elisa Passone
|
Danilo Croce
|
Roberto Basili
Grounded natural language understanding in Human-Robot Interaction (HRI) requires integrating linguistic, visual, and world knowledge to ensure effective task execution. We propose an approach that enhances Multi-Modal Large Language Models (MLLMs) with a novel explicit dialogue planning phase, allowing robotic agents to systematically refine their understanding of ambiguous commands through structured clarification steps. This reduces hallucinations and improves task feasibility. To evaluate this approach, we introduce a novel dataset of over 1,100 annotated dialogues in English and Italian, designed for fine-tuning and assessing Multi-Modal models in HRI scenarios. Experimental results show that dialogue planning improves response accuracy and quality, and contributes to cross-lingual generalisation, enabling models trained in one language to transfer effectively to another. To the best of our knowledge, this is the first application of structured, goal-driven, and explicit dialogue planning in Multi-Modal LLMs for grounded interaction.
pdf
bib
abs
MVL-SIB: A Massively Multilingual Vision-Language Benchmark for Cross-Modal Topical Matching
Fabian David Schmidt
|
Florian Schneider
|
Chris Biemann
|
Goran Glavaš
Existing multilingual vision-language (VL) benchmarks often only cover a handful of languages. Consequently, evaluations of large vision-language models (LVLMs) predominantly target high-resource languages, underscoring the need for evaluation data for low-resource languages. To address this limitation, we introduce MVL-SIB, a massively multilingual vision-language benchmark that evaluates both cross-modal and text-only topical matching across 205 languages – over 100 more than the most multilingual existing VL benchmarks encompass. We then benchmark a range of open-weight LVLMs together with GPT-4o(-mini) on MVL-SIB. Our results reveal that LVLMs struggle in cross-modal topic matching in lower-resource languages, performing no better than chance on languages like N’Koo. Our analysis further reveals that VL support in LVLMs declines disproportionately relative to textual support for lower-resource languages, as evidenced by comparison of cross-modal and text-only topical matching performance. We further observe that open-weight LVLMs do not benefit from representing a topic with more than one image, suggesting that these models are not yet fully effective at handling multi-image tasks. By correlating performance on MVL-SIB with other multilingual VL benchmarks, we highlight that MVL-SIB serves as a comprehensive probe of multilingual VL understanding in LVLMs.
pdf
bib
abs
The Rise of Darkness: Safety-Utility Trade-Offs in Role-Playing Dialogue Agents
Yihong Tang
|
Kehai Chen
|
Xuefeng Bai
|
Zheng-Yu Niu
|
Bo Wang
|
Jie Liu
|
Min Zhang
Large Language Models (LLMs) have made remarkable advances in role-playing dialogue agents, demonstrating their utility in character simulations. However, it remains challenging for these agents to balance character portrayal utility with content safety because this essential character simulation often comes with the risk of generating unsafe content. To address this issue, we first conduct a systematic exploration of the safety-utility trade-off across multiple LLMs. Our analysis reveals that risk scenarios created by villain characters and user queries (referred to as risk coupling) contribute to this trade-off. Building on this, we propose a novel Adaptive Dynamic Multi-Preference (ADMP) method, which dynamically adjusts safety-utility preferences based on the degree of risk coupling and guides the model to generate responses biased toward utility or safety. We further introduce Coupling Margin Sampling (CMS) into coupling detection to enhance the model’s ability to handle high-risk scenarios. Experimental results demonstrate that our approach improves safety metrics while maintaining utility.
pdf
bib
abs
SynGraph: A Dynamic Graph-LLM Synthesis Framework for Sparse Streaming User Sentiment Modeling
Xin Zhang
|
Qiyu Wei
|
Yingjie Zhu
|
Linhai Zhang
|
Deyu Zhou
|
Sophia Ananiadou
User reviews on e-commerce platforms exhibit dynamic sentiment patterns driven by temporal and contextual factors. Traditional sentiment analysis methods focus on static reviews, failing to capture the evolving temporal relationship between user sentiment rating and textual content. Sentiment analysis on streaming reviews addresses this limitation by modeling and predicting the temporal evolution of user sentiments. However, it suffers from data sparsity, manifesting in temporal, spatial, and combined forms. In this paper, we introduce SynGraph, a novel framework designed to address data sparsity in sentiment analysis on streaming reviews. SynGraph alleviates data sparsity by categorizing users into mid-tail, long-tail, and extreme scenarios and incorporating LLM-augmented enhancements within a dynamic graph-based structure. Experiments on real-world datasets demonstrate its effectiveness in addressing sparsity and improving sentiment modeling in streaming reviews.
pdf
bib
abs
Enhancing Tool Learning in Large Language Models with Hierarchical Error Checklists
Yue Cui
|
Liuyi Yao
|
Shuchang Tao
|
Weijie Shi
|
Yaliang Li
|
Bolin Ding
|
Xiaofang Zhou
Large language models (LLMs) have significantly advanced natural language processing, particularly through the integration of external tools and APIs. However, their effectiveness is frequently hampered by parameter mis-filling during tool calling. In this paper, we propose the Hierarchical Tool Error Checklist (HiTEC) framework to systematically diagnose and mitigate tool-calling errors without relying on extensive real-world interactions. HiTEC introduces a two-tiered approach: a global error checklist that identifies common, cross-tool issues, and a local error checklist that targets tool-specific and contextual failures. Building on this structure, we propose two deployments: HiTEC-In Context Learning (HiTEC-ICL) and HiTEC-Kahneman-Tversky Optimization (HiTEC-KTO). HiTEC-ICL embeds the global checklist in the initial prompts and leverages a two-round conversational interaction to dynamically refine parameter handling, while HiTEC-KTO generates high-quality negative examples to drive fine-tuning via preference-based optimization. Extensive experiments across five public datasets demonstrate that our framework significantly improves parameter-filling accuracy and tool-calling success rates compared to baseline methods.
pdf
bib
abs
A Large and Balanced Corpus for Fine-grained Arabic Readability Assessment
Khalid Elmadani
|
Nizar Habash
|
Hanada Taha
This paper introduces the Balanced Arabic Readability Evaluation Corpus (BAREC), a large-scale, fine-grained dataset for Arabic readability assessment. BAREC consists of 69,441 sentences spanning 1+ million words, carefully curated to cover 19 readability levels, from kindergarten to postgraduate comprehension. The corpus balances genre diversity, topical coverage, and target audiences, offering a comprehensive resource for evaluating Arabic text complexity. The corpus was fully manually annotated by a large team of annotators. The average pairwise inter-annotator agreement, measured by Quadratic Weighted Kappa, is 81.8%, reflecting a high level of substantial agreement. Beyond presenting the corpus, we benchmark automatic readability assessment across different granularity levels, comparing a range of techniques. Our results highlight the challenges and opportunities in Arabic readability modeling, demonstrating competitive performance across various methods. To support research and education, we make BAREC openly available, along with detailed annotation guidelines and benchmark results: http://barec.camel-lab.com.
pdf
bib
abs
Can Medical Vision-Language Pre-training Succeed with Purely Synthetic Data?
Che Liu
|
Zhongwei Wan
|
Haozhe Wang
|
Yinda Chen
|
Talha Qaiser
|
Chen Jin
|
Nikolay Burlutskiy
|
Fariba Yousefi
|
Rossella Arcucci
Medical Vision-Language Pre-training (MedVLP) has made significant progress in enabling zero-shot tasks for medical image understanding. However, training MedVLP models typically requires large-scale datasets with paired, high-quality image-text data, which are scarce in the medical domain. Recent advancements in Large Language Models (LLMs) and diffusion models have made it possible to generate large-scale synthetic image-text pairs. This raises the question: Can MedVLP succeed using purely synthetic data? To address this, we use off-the-shelf generative models to create synthetic radiology reports and paired Chest X-ray (CXR) images, and propose an automated pipeline to build a diverse, high-quality synthetic dataset, enabling a rigorous study that isolates model and training settings, focusing entirely on the data perspective. Our results show that MedVLP models trained exclusively on synthetic data outperform those trained on real data by 3.8% in averaged AUC on zero-shot classification. Moreover, using a combination of synthetic and real data leads to a further improvement of 9.07%. Additionally, MedVLP models trained on synthetic or mixed data consistently outperform those trained on real data in zero-shot grounding, as well as in fine-tuned classification and segmentation tasks. Our analysis suggests MedVLP trained on well-designed synthetic data can outperform models trained on real datasets, which may be limited by low-quality samples and long-tailed distributions. All data and code will be released upon acceptance.
pdf
bib
abs
See the World, Discover Knowledge: A Chinese Factuality Evaluation for Large Vision Language Models
Jihao Gu
|
Yingyao Wang
|
Pi Bu
|
Chen Wang
|
Ziming Wang
|
Tengtao Song
|
Donglai Wei
|
Jiale Yuan
|
Yingxiu Zhao
|
Yancheng He
|
Shilong Li
|
Jiaheng Liu
|
Meng Cao
|
Jun Song
|
Yingshui Tan
|
Xiang Li
|
Wenbo Su
|
Xiaoyong Zhu
|
Bo Zheng
The evaluation of factual accuracy in large vision language models (LVLMs) has lagged behind their rapid development, making it challenging to fully reflect these models’ knowledge capacity and reliability. In this paper, we introduce the first factuality-based visual question-answering benchmark in Chinese, named ChineseSimpleVQA, aimed at assessing the visual factuality of LVLMs across 8 major topics and 56 subtopics. The key features of this benchmark include a focus on the Chinese language, diverse knowledge types, multi-hop question construction, high-quality data, static consistency, and easy evaluation through short answers. Moreover, we contribute a rigorous data construction pipeline and decouple the visual factuality into two parts: seeing the world (i.e., object recognition) and discovering knowledge. This decoupling allows us to analyze the capability boundaries and execution mechanisms of LVLMs. Subsequently, we evaluate 34 advanced open-source and closed-source models, revealing critical performance gaps within this field.
pdf
bib
abs
Argus: Benchmarking and Enhancing Vision-Language Models for 3D Radiology Report Generation
Che Liu
|
Zhongwei Wan
|
Yuqi Wang
|
Hui Shen
|
Haozhe Wang
|
Kangyu Zheng
|
Mi Zhang
|
Rossella Arcucci
Automatic radiology report generation holds significant potential to streamline the labor-intensive process of report writing by radiologists, particularly for 3D radiographs such as CT scans. While CT scans are critical for clinical diagnostics, they remain less explored compared to 2D radiographs. To date, there has been no comprehensive benchmark for 3D radiograph report generation (3DRRG), nor sufficient investigation into the optimal training strategies for Vision Language Models (VLMs) in this context, particularly with respect to vision encoder choices, visual token compression, and model scaling. In this work, we make three contributions. We curate CT-3DRRG, the largest publicly available 3D CT-report dataset, establishing a robust and diverse benchmark for evaluating VLM performance on 3DRRG. Furthermore, we propose a comprehensive training recipe for building high-performing VLMs for 3DRRG, exploring key factors such as vision encoder pretraining strategies, visual token compression, and the impact of data & model scale. Guided by these findings, we introduce Argus, a state-of-the-art family of VLMs that achieve superior performance across different model sizes and input 3D medical image resolutions, efficiently processing high-resolution 3D images up to 512 × 512 × 256.
pdf
bib
abs
Resource-Friendly Dynamic Enhancement Chain for Multi-Hop Question Answering
Binquan Ji
|
Haibo Luo
|
YifeiLu YifeiLu
|
Lei Hei
|
Jiaqi Wang
|
Tingjing Liao
|
Wang Lingyu
|
Shichao Wang
|
Feiliang Ren
Knowledge-intensive multi-hop question answering (QA) tasks, which require integrating evidence from multiple sources to address complex queries, often necessitate multiple rounds of retrieval and iterative generation by large language models (LLMs). However, incorporating many documents and extended contexts poses challenges—such as hallucinations and semantic drift—for lightweight LLMs with fewer parameters. This work proposes a novel framework called DEC (Dynamic Enhancement Chain). DEC first decomposes complex questions into logically coherent subquestions to form a hallucination-free reasoning chain. It then iteratively refines these subquestions through context-aware rewriting to generate effective query formulations. For retrieval, we introduce a lightweight discriminative keyword extraction module that leverages extracted keywords to achieve targeted, precise document recall with relatively low computational overhead. Extensive experiments on three multi-hop QA datasets demonstrate that DEC performs on par with or surpasses state-of-the-art benchmarks while significantly reducing token consumption. Notably, our approach attains state-of-the-art results on models with 8B parameters, showcasing its effectiveness in various scenarios, particularly in resource-constrained environments.
pdf
bib
abs
Evaluating LLMs’ Assessment of Mixed-Context Hallucination Through the Lens of Summarization
Siya Qi
|
Rui Cao
|
Yulan He
|
Zheng Yuan
With the rapid development of large language models (LLMs), LLM-as-a-judge has emerged as a widely adopted approach for text quality evaluation, including hallucination evaluation. While previous studies have focused exclusively on single-context evaluation (e.g., discourse faithfulness or world factuality), real-world hallucinations typically involve mixed contexts, which remain inadequately evaluated. In this study, we use summarization as a representative task to comprehensively evaluate LLMs’ capability in detecting mixed-context hallucinations, specifically distinguishing between factual and non-factual hallucinations. Through extensive experiments across direct generation and retrieval-based models of varying scales, our main observations are: (1) LLMs’ intrinsic knowledge introduces inherent biases in hallucination evaluation; (2) these biases particularly impact the detection of factual hallucinations, yielding a significant performance bottleneck; and (3) the fundamental challenge lies in effective knowledge utilization, balancing between LLMs’ intrinsic knowledge and external context for accurate mixed-context hallucination evaluation.
pdf
bib
abs
TUBA: Cross-Lingual Transferability of Backdoor Attacks in LLMs with Instruction Tuning
Xuanli He
|
Jun Wang
|
Qiongkai Xu
|
Pasquale Minervini
|
Pontus Stenetorp
|
Benjamin I. P. Rubinstein
|
Trevor Cohn
The implications of backdoor attacks on English-centric large language models (LLMs) have been widely examined: such attacks can be achieved by embedding malicious behaviors during training and activating them under specific conditions that trigger malicious outputs. Despite the increasing support for multilingual capabilities in open-source and proprietary LLMs, the impact of backdoor attacks on these systems remains largely under-explored. Our research focuses on cross-lingual backdoor attacks against multilingual LLMs, particularly investigating how poisoning the instruction-tuning data for one or two languages can affect the outputs for languages whose instruction-tuning data were not poisoned. Despite its simplicity, our empirical analysis reveals that our method exhibits remarkable efficacy in models like BLOOM and GPT-4o, with high attack success rates, surpassing 90% in more than 7 out of 12 languages across various scenarios. Our findings also indicate that more powerful models show increased susceptibility to transferable cross-lingual backdoor attacks, which also applies to LLMs predominantly pre-trained on English/Chinese data, such as Llama2, Llama3, Qwen2.5, and Gemma. Moreover, our experiments demonstrate 1) High Transferability: the backdoor mechanism operates successfully in cross-lingual response scenarios across 26 languages, achieving an average attack success rate of 99%, and 2) Robustness: the proposed attack remains effective even after defenses are applied. These findings expose critical security vulnerabilities in multilingual LLMs and highlight the urgent need for more robust, targeted defense strategies to address the unique challenges posed by cross-lingual backdoor transfer.
pdf
bib
abs
Eliciting Textual Descriptions from Representations of Continuous Prompts
Daniela Gottesman
|
Mor Geva
|
Dana Ramati
Continuous prompts, or “soft prompts”, are a widely-adopted parameter-efficient tuning strategy for large language models, but are often less favorable due to their opaque nature. Prior attempts to interpret continuous prompts relied on projecting individual prompt tokens onto the vocabulary space. However, this approach is problematic as performant prompts can yield arbitrary or contradictory text, and it individually interprets each prompt token. In this work, we propose a new approach to interpret continuous prompts that elicits textual descriptions from their representations during model inference. Using a Patchscopes variant (Ghandeharioun et al., 2024) called InSPEcT over various tasks, we show our method often yields accurate task descriptions which become more faithful as task performance increases. Moreover, an elaborated version of InSPEcT reveals biased features in continuous prompts, whose presence correlates with biased model predictions. Providing an effective interpretability solution, InSPEcT can be leveraged to debug unwanted properties in continuous prompts and inform developers on ways to mitigate them.
pdf
bib
abs
Mitigating Hallucination in Multimodal Large Language Model via Hallucination-targeted Direct Preference Optimization
Yuhan Fu
|
Ruobing Xie
|
Xingwu Sun
|
Zhanhui Kang
|
Xirong Li
Multimodal Large Language Models (MLLMs) are known to hallucinate, which limits their practical applications. Recent works have attempted to apply Direct Preference Optimization (DPO) to enhance the performance of MLLMs, but have shown inconsistent improvements in mitigating hallucinations. To address this issue more effectively, we introduce Hallucination-targeted Direct Preference Optimization (HDPO) to reduce hallucinations in MLLMs. Unlike previous approaches, our method tackles hallucinations from their diverse forms and causes. Specifically, we develop three types of preference pair data targeting the following causes of MLLM hallucinations: (1) insufficient visual capabilities, (2) long context generation, and (3) multimodal conflicts. Experimental results demonstrate that our method achieves superior performance across multiple hallucination evaluation datasets, surpassing most state-of-the-art (SOTA) methods and highlighting the potential of our approach. Ablation studies and in-depth analyses further confirm the effectiveness of our method and suggest the potential for further improvements through scaling up.
pdf
bib
abs
Review-Instruct: A Review-Driven Multi-Turn Conversations Generation Method for Large Language Models
Jiangxu Wu
|
Cong Wang
|
Tianhuang Su
|
Lin Haozhi
|
JunYang JunYang
|
Zhangchao Zhangchao
|
Binqiang Pan
|
SongpanYang SongpanYang
|
Mingpeng Mingpeng
|
Kai Shi
|
Zixian Li
The effectiveness of large language models (LLMs) in conversational AI is hindered by their reliance on single-turn supervised fine-tuning (SFT) data, which limits contextual coherence in multi-turn dialogues. Existing methods for generating multi-turn dialogue data struggle to ensure both diversity and quality in instructions. To address this, we propose Review-Instruct, a novel framework that synthesizes multi-turn conversations through an iterative “Ask-Respond-Review” process involving three agent roles: a Candidate, multiple Reviewers, and a Chairman. The framework iteratively refines instructions by incorporating Reviewer feedback, enhancing dialogue diversity and difficulty. We construct a multi-turn dataset using the Alpaca dataset and fine-tune the LLaMA2-13B model. Evaluations on MT-Bench, MMLU-Pro, and Auto-Arena demonstrate significant improvements, achieving absolute gains of 2.9% on MMLU-Pro and 2% on MT-Bench compared to prior state-of-the-art models based on LLaMA2-13B. Ablation studies confirm the critical role of the Review stage and the use of multiple Reviewers in boosting instruction diversity and difficulty. Our work highlights the potential of review-driven, multi-agent frameworks for generating high-quality conversational data at scale.
pdf
bib
abs
Why Uncertainty Estimation Methods Fall Short in RAG: An Axiomatic Analysis
Heydar Soudani
|
Evangelos Kanoulas
|
Faegheh Hasibi
Large Language Models (LLMs) are valued for their strong performance across various tasks, but they also produce inaccurate or misleading outputs. Uncertainty Estimation (UE) quantifies the model’s confidence and helps users assess response reliability. However, existing UE methods have not been thoroughly examined in scenarios like Retrieval-Augmented Generation (RAG), where the input prompt includes non-parametric knowledge. This paper shows that current UE methods cannot reliably estimate the correctness of LLM responses in the RAG setting. We propose an axiomatic framework to identify deficiencies in existing UE methods. Our framework introduces five constraints that an effective UE method should meet after incorporating retrieved documents into the LLM’s prompt. Experimental results reveal that no existing UE method fully satisfies all the axioms, explaining their suboptimal performance in RAG. We further introduce a simple yet effective calibration function based on our framework, which not only satisfies more axioms than baseline methods but also improves the correlation between uncertainty estimates and correctness.
pdf
bib
abs
EuroVerdict: A Multilingual Dataset for Verdict Generation Against Misinformation
Daniel Russo
|
Fariba Sadeghi
|
Stefano Menini
|
Marco Guerini
Misinformation is a global issue that shapes public discourse, influencing opinions and decision-making across various domains. While automated fact-checking (AFC) has become essential in combating misinformation, most work in multilingual settings has focused on claim verification rather than generating explanatory verdicts (i.e. short texts discussing the veracity of the claim), leaving a gap in AFC resources beyond English. To this end, we introduce EuroVerdict, a multilingual dataset designed for verdict generation, covering eight European languages. Developed in collaboration with professional fact-checkers, the dataset comprises claims, manually written verdicts, and supporting evidence, including fact-checking articles and additional secondary sources. We evaluate EuroVerdict with Llama-3.1-8B-Instruct on verdict generation under different settings, varying the prompt language, input article language, and training approach. Our results show that fine-tuning consistently improves performance, with models fine-tuned on original-language articles achieving the highest scores in both automatic and human evaluations. Using articles in a different language from the claim slightly lowers performance; however, pairing them with language-specific prompts improves results. Zero-shot and Chain-of-Thought setups perform worse, reinforcing the benefits of fine-tuning for multilingual verdict generation.
pdf
bib
abs
LoFTI: Localization and Factuality Transfer to Indian Locales
Sona Elza Simon
|
Soumen Kumar Mondal
|
Abhishek Singhania
|
Sayambhu Sen
|
Preethi Jyothi
Large language models (LLMs) encode vast amounts of world knowledge acquired via training on large web-scale datasets crawled from the internet. However, the datasets used to train the LLMs typically exhibit a geographical bias towards English-speaking Western countries. This results in LLMs producing biased or hallucinated responses to queries that require answers localized to other geographical regions. In this work, we introduce a new benchmark named LoFTI (Localization and Factuality Transfer to Indian Locales) that can be used to evaluate an LLM’s contextual localization and factual text transfer capabilities. LoFTI consists of factual statements about entities in source and target locations; the source locations are spread across the globe and the target locations are all within India with varying degrees of hyperlocality (country, states, cities). The entities span a wide variety of categories. We use LoFTI to evaluate Mixtral, Llama3.3-70B, GPT-4 and two other Mixtral-based approaches well-suited to the task of localized factual transfer. We demonstrate that LoFTI is a high-quality evaluation benchmark and all the models, including GPT-4, produce skewed results across varying levels of hyperlocality.
pdf
bib
abs
Hierarchical Retrieval with Evidence Curation for Open-Domain Financial Question Answering on Standardized Documents
Jaeyoung Choe
|
Jihoon Kim
|
Woohwan Jung
Retrieval-augmented generation (RAG) based large language models (LLMs) are widely used in finance for their excellent performance on knowledge-intensive tasks. However, standardized documents (e.g., SEC filings) share similar formats, such as repetitive boilerplate texts and similar table structures. This similarity causes traditional RAG methods to misidentify near-duplicate text, leading to duplicate retrieval that undermines accuracy and completeness. To address these issues, we propose the Hierarchical Retrieval with Evidence Curation (HiREC) framework. Our approach first performs hierarchical retrieval to reduce confusion among similar texts: it retrieves related documents and then selects the most relevant passages from those documents. The evidence curation process removes irrelevant passages. When necessary, it automatically generates complementary queries to collect missing information. To evaluate our approach, we construct and release a Large-scale Open-domain Financial (LOFin) question answering benchmark that includes 145,897 SEC documents and 1,595 question-answer pairs. Our code and data are available at https://github.com/deep-over/LOFin-bench-HiREC.
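The two-stage idea in this abstract (pick documents first, then pick passages only inside them) can be illustrated with a minimal sketch. This is not the HiREC implementation: the token-overlap scorer, the `corpus` layout, and the function names are all assumptions made for illustration, and the evidence-curation step is omitted.

```python
def overlap(query, text):
    """Hypothetical scorer: number of query tokens that also occur in the text."""
    return len(set(query.lower().split()) & set(text.lower().split()))

def hierarchical_retrieve(query, corpus, top_docs=1, top_passages=1):
    """Stage 1 ranks documents by title; stage 2 ranks passages of the kept documents only."""
    docs = sorted(corpus, key=lambda d: overlap(query, d["title"]), reverse=True)[:top_docs]
    passages = [(d["title"], p) for d in docs for p in d["passages"]]
    passages.sort(key=lambda tp: overlap(query, tp[1]), reverse=True)
    return passages[:top_passages]

# Toy corpus (made-up data) with near-duplicate passages across two filings.
corpus = [
    {"title": "acme 10-k 2023",
     "passages": ["total revenue was 5 million", "risk factors include competition"]},
    {"title": "globex 10-k 2023",
     "passages": ["total revenue was 9 million", "risk factors include regulation"]},
]

result = hierarchical_retrieve("acme 2023 total revenue", corpus)
print(result)
```

In this toy corpus the two revenue passages are nearly identical, so a flat passage ranker could not tell them apart; filtering by document first removes the ambiguity.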
pdf
bib
abs
GNN-RAG: Graph Neural Retrieval for Efficient Large Language Model Reasoning on Knowledge Graphs
Costas Mavromatis
|
George Karypis
Retrieval-augmented generation (RAG) in Knowledge Graph Question Answering (KGQA) enhances the context of Large Language Models (LLMs) by incorporating information retrieved from the Knowledge Graph (KG). Most recent approaches rely on costly LLM calls to generate executable relation paths or traverse the KG, which is inefficient in complex KGQA tasks, such as those involving multi-hop or multi-entity questions. We introduce the GNN-RAG framework, which utilizes lightweight Graph Neural Networks (GNNs) for effective and efficient graph retrieval. The GNN learns to assign importance weights to nodes based on their relevance to the question, as well as the relevance of their neighboring nodes. This enables the framework to effectively handle context from deeper parts of the graph, improving retrieval performance. GNN-RAG retrieves the shortest paths connecting question entities to GNN answer candidates, providing this information as context for the LLM. Experimental results show that GNN-RAG achieves effective retrieval on two widely used KGQA benchmarks (WebQSP and CWQ), outperforming or matching GPT-4 performance with a 7B tuned LLM. Additionally, GNN-RAG excels on multi-hop and multi-entity questions, outperforming LLM-based retrieval approaches by 8.9–15.5% points at answer F1. Furthermore, it surpasses long-context inference while using 9× fewer KG tokens. The code is provided at https://github.com/cmavro/GNN-RAG.
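The path-retrieval step described in this abstract (shortest paths from question entities to answer candidates, passed to the LLM as context) can be sketched as follows. The toy triples, the candidate list standing in for GNN output, and the arrow-based verbalization are assumptions for illustration; the authors' code is at the repository linked above.

```python
from collections import deque

# Toy knowledge graph as (head, relation, tail) triples; all facts are illustrative.
TRIPLES = [
    ("Jamaica", "official_language", "English"),
    ("Jamaica", "language_spoken", "Jamaican Patois"),
    ("Jamaica", "located_in", "Caribbean"),
    ("Caribbean", "part_of", "Americas"),
]

def build_adjacency(triples):
    """Map each head entity to its outgoing (relation, tail) edges (directed)."""
    adj = {}
    for head, rel, tail in triples:
        adj.setdefault(head, []).append((rel, tail))
    return adj

def shortest_path(adj, source, target):
    """Breadth-first search; returns the triples on one shortest path, or None."""
    queue = deque([(source, [])])
    seen = {source}
    while queue:
        node, path = queue.popleft()
        if node == target:
            return path
        for rel, nxt in adj.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, path + [(node, rel, nxt)]))
    return None

def verbalize(path):
    """Render a path as 'A -> rel -> B -> rel -> C' for inclusion in a prompt."""
    if not path:
        return ""
    parts = [path[0][0]]
    for _, rel, tail in path:
        parts += [rel, tail]
    return " -> ".join(parts)

adj = build_adjacency(TRIPLES)
candidates = ["English", "Americas"]  # stand-in for nodes a GNN scored as likely answers
context = [verbalize(shortest_path(adj, "Jamaica", c)) for c in candidates]
print("\n".join(context))
```

Only the paths are placed in the prompt rather than the whole neighborhood of the question entity, which is one plausible reason the retrieved context stays compact.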
pdf
bib
abs
ASTRID - An Automated and Scalable TRIaD for the Evaluation of RAG-based Clinical Question Answering Systems
Yajie Vera He
|
Mohita Chowdhury
|
Jared Joselowitz
|
Aisling Higham
|
Ernest Lim
Large Language Models (LLMs) have shown impressive potential in clinical question answering (QA), with Retrieval Augmented Generation (RAG) emerging as a leading approach for ensuring the factual accuracy of model responses. However, current automated RAG metrics perform poorly in clinical and conversational use cases. Using clinical human evaluations of responses is expensive, unscalable, and not conducive to the continuous iterative development of RAG systems. To address these challenges, we introduce ASTRID - an Automated and Scalable TRIaD for evaluating clinical QA systems leveraging RAG - consisting of three metrics: Context Relevance (CR), Refusal Accuracy (RA), and Conversational Faithfulness (CF). Our novel evaluation metric, CF, is designed to better capture the faithfulness of a model’s response to the knowledge base without penalising conversational elements. Additionally, our metric RA captures the refusal to address questions outside of the system’s scope of practice. To validate our triad, we curate a dataset of over 200 real-world patient questions posed to an LLM-based QA agent during surgical follow-up for cataract surgery - the highest volume operation in the world - augmented with clinician-selected questions for emergency, and clinical and non-clinical out-of-domain scenarios. We demonstrate that CF predicts human ratings of faithfulness more accurately than existing definitions in conversational settings. Furthermore, using eight different LLMs, we demonstrate that the three metrics can closely agree with human evaluations, highlighting the potential of these metrics for use in LLM-driven automated evaluation pipelines. Finally, we show that evaluation using our triad of CF, RA, and CR exhibits alignment with clinician assessment for inappropriate, harmful, or unhelpful responses. We also publish the prompts and datasets for these experiments, providing valuable resources for further research and development.
pdf
bib
abs
On Entity Identification in Language Models
Masaki Sakata
|
Benjamin Heinzerling
|
Sho Yokoi
|
Takumi Ito
|
Kentaro Inui
We analyze the extent to which internal representations of language models (LMs) identify and distinguish mentions of named entities, focusing on the many-to-many correspondence between entities and their mentions. We first formulate two problems of entity mentions, ambiguity and variability, and propose a framework analogous to clustering quality metrics. Specifically, we quantify through cluster analysis of LM internal representations the extent to which mentions of the same entity cluster together and mentions of different entities remain separated. Our experiments examine five Transformer-based autoregressive models, showing that they effectively identify and distinguish entities with metrics analogous to precision and recall ranging from 0.66 to 0.9. Further analysis reveals that entity-related information is compactly represented in a low-dimensional linear subspace at early LM layers. Additionally, we clarify how the characteristics of entity representations influence word prediction performance. These findings are interpreted through the lens of isomorphism between LM representations and entity-centric knowledge structures in the real world, providing insights into how LMs internally organize and use entity information.
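The abstract does not define its precision- and recall-like metrics, so the sketch below uses standard pairwise clustering precision and recall as an assumed stand-in. Here precision asks how many mention pairs placed in the same cluster really share an entity (ambiguity), and recall asks how many same-entity pairs end up in the same cluster (variability). The labels and cluster IDs are made-up toy data.

```python
from itertools import combinations

def pairwise_precision_recall(gold_entities, predicted_clusters):
    """Pairwise clustering precision and recall over all mention pairs."""
    same_cluster_pairs = 0
    same_entity_pairs = 0
    both = 0
    for i, j in combinations(range(len(gold_entities)), 2):
        same_entity = gold_entities[i] == gold_entities[j]
        same_cluster = predicted_clusters[i] == predicted_clusters[j]
        same_entity_pairs += same_entity
        same_cluster_pairs += same_cluster
        both += same_entity and same_cluster
    precision = both / same_cluster_pairs if same_cluster_pairs else 0.0
    recall = both / same_entity_pairs if same_entity_pairs else 0.0
    return precision, recall

# Four mentions of "Paris": two refer to the French city, two to the Texan one.
gold = ["Paris_FR", "Paris_FR", "Paris_TX", "Paris_TX"]
clusters = [0, 0, 0, 1]  # hypothetical clustering of the mentions' representations
precision, recall = pairwise_precision_recall(gold, clusters)
print(precision, recall)
```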
pdf
bib
abs
RAPID: Efficient Retrieval-Augmented Long Text Generation with Writing Planning and Information Discovery
Hongchao Gu
|
Dexun Li
|
Kuicai Dong
|
Hao Zhang
|
Hang Lv
|
Hao Wang
|
Defu Lian
|
Yong Liu
|
Enhong Chen
Generating knowledge-intensive and comprehensive long texts, such as encyclopedia articles, remains a significant challenge for Large Language Models. It requires not only the precise integration of facts but also the maintenance of thematic coherence throughout the article. Existing methods, such as multi-agent discussion, often struggle with issues like hallucinations, topic incoherence, and significant latency. To address these challenges, we propose RAPID, an efficient **R**etrieval-**A**ugmented long text generation framework with writing **P**lanning and **I**nformation **D**iscovery. RAPID consists of three main modules: (1) Retrieval-augmented preliminary outline generation to reduce hallucinations, (2) Attribute-constrained search for efficient information discovery, and (3) Plan-guided article generation for enhanced coherence. Extensive experiments on our newly compiled benchmark dataset, FreshWiki-2024, demonstrate that RAPID significantly outperforms state-of-the-art methods across a wide range of evaluation metrics (long-text generation, outline quality, latency, etc.). Our work provides a robust and efficient solution to the challenges of automated long-text generation.
pdf
bib
abs
CHARPEVAL: Benchmarking Large Language Models’ Contextual Reasoning in Knowledge-Grounded Dialogue
Abbas Ghaddar
|
David Alfonso-Hermelo
|
Philippe Langlais
|
Boxing Chen
|
Prasanna Parthasarathi
This paper presents CHARPEVAL, a challenging benchmark specifically designed to evaluate the ability of Large Language Models (LLMs) to perform contextualized reasoning in knowledge-grounded dialogue scenarios. The task involves selecting the correct response from 6 options, including 5 manually crafted distractors, given the conversation history. Extensive benchmarking experiments with a diverse set of state-of-the-art open-weight LLMs show poor performance on CHARPEVAL due to their inability to effectively reason over discontinuous chunks of text across the input. Our analysis reveals systematic error patterns across models with different properties, highlighting the need to improve LLMs beyond simply scaling-up data and compute. CHARPEVAL is publicly available at https://huggingface.co/datasets/huawei-noah/CHARP.
pdf
bib
abs
Ask in Any Modality: A Comprehensive Survey on Multimodal Retrieval-Augmented Generation
Mohammad Mahdi Abootorabi
|
Amirhosein Zobeiri
|
Mahdi Dehghani
|
Mohammadali Mohammadkhani
|
Bardia Mohammadi
|
Omid Ghahroodi
|
Mahdieh Soleymani Baghshah
|
Ehsaneddin Asgari
Large Language Models (LLMs) suffer from hallucinations and outdated knowledge due to their reliance on static training data. Retrieval-Augmented Generation (RAG) mitigates these issues by integrating external dynamic information for improved factual grounding. With advances in multimodal learning, Multimodal RAG extends this approach by incorporating multiple modalities such as text, images, audio, and video to enhance the generated outputs. However, cross-modal alignment and reasoning introduce unique challenges beyond those in unimodal RAG. This survey offers a structured and comprehensive analysis of Multimodal RAG systems, covering datasets, benchmarks, metrics, evaluation, methodologies, and innovations in retrieval, fusion, augmentation, and generation. We review training strategies, robustness enhancements, loss functions, and agent-based approaches, while also exploring the diverse Multimodal RAG scenarios. In addition, we outline open challenges and future directions to guide research in this evolving field. This survey lays the foundation for developing more capable and reliable AI systems that effectively leverage multimodal dynamic external knowledge bases. All resources are publicly available at https://github.com/llm-lab-org/Multimodal-RAG-Survey.
pdf
bib
abs
Debate4MATH: Multi-Agent Debate for Fine-Grained Reasoning in Math
Shaowei Zhang
|
Deyi Xiong
Large language models (LLMs) have demonstrated impressive performance in reasoning. However, existing data annotation methods usually suffer from high annotation cost and the lack of effective automatic validation. To address these issues, we propose a Fine-grained Multi-Agent Debate framework (FMAD) and MMATH-Data, a dataset created by FMAD, which consists of 46K reasoning steps. By prompting multiple agents to debate, FMAD assesses the contribution of each reasoning step to the final solution, with labels based on the judge’s confidence score and the winner’s position. To facilitate reasoning in math and examine FMAD and MMATH-Data, we further propose two key components: a Multi-Agent Debate Reward Model (MRM) trained on MMATH-Data, which serves as a reward model to provide robust feedback during the optimization process, and MMATH-LLM, a model designed specifically for mathematical reasoning. MMATH-LLM is fine-tuned using reinforcement learning with supervised feedback from MRM, aiming at improving its mathematical reasoning capabilities. Extensive experiments demonstrate that our model achieves 83.4% accuracy on the GSM8K dataset and 45.1% on the MATH dataset, outperforming the state-of-the-art methods by 1.2% and 3.5%, respectively. All data and code will be available soon at GitHub.
pdf
bib
abs
Disambiguate First, Parse Later: Generating Interpretations for Ambiguity Resolution in Semantic Parsing
Irina Saparina
|
Mirella Lapata
Handling ambiguity and underspecification is an important challenge in natural language interfaces, particularly for tasks like text-to-SQL semantic parsing. We propose a modular approach that resolves ambiguity using natural language interpretations before mapping these to logical forms (e.g., SQL queries). Although LLMs excel at parsing unambiguous utterances, they show strong biases for ambiguous ones, typically predicting only preferred interpretations. We constructively exploit this bias to generate an initial set of preferred disambiguations and then apply a specialized infilling model to identify and generate missing interpretations. To train the infilling model, we introduce an annotation method that uses SQL execution to validate different meanings. Our approach improves interpretation coverage and generalizes across datasets with different annotation styles, database structures, and ambiguity types.
pdf
bib
abs
The Anatomy of Evidence: An Investigation Into Explainable ICD Coding
Katharina Beckh
|
Elisa Studeny
|
Sujan Sai Gannamaneni
|
Dario Antweiler
|
Stefan Rueping
Automatic medical coding has the potential to ease documentation and billing processes. For this task, transparency plays an important role for medical coders and regulatory bodies, which can be achieved using explainability methods. However, the evaluation of these approaches has been mostly limited to short text and binary settings due to a scarcity of annotated data. Recent efforts by Cheng et al. (2023) have introduced the MDACE dataset, which provides a valuable resource containing code evidence in clinical records. In this work, we conduct an in-depth analysis of the MDACE dataset and perform plausibility evaluation of current explainable medical coding systems from an applied perspective. With this, we contribute to a deeper understanding of automatic medical coding and evidence extraction. Our findings reveal that ground truth evidence aligns with code descriptions to a certain degree. An investigation into state-of-the-art approaches shows a high overlap with ground truth evidence. We propose match measures and highlight success and failure cases. Based on our findings, we provide recommendations for developing and evaluating explainable medical coding systems.
pdf
bib
abs
AVG-LLaVA: An Efficient Large Multimodal Model with Adaptive Visual Granularity
Zhibin Lan
|
Liqiang Niu
|
Fandong Meng
|
Wenbo Li
|
Jie Zhou
|
Jinsong Su
Recently, large multimodal models (LMMs) have achieved significant advancements. When dealing with high-resolution images, dominant LMMs typically divide them into multiple local images and a global image, leading to a large number of visual tokens. In this work, we introduce AVG-LLaVA, an LMM that can adaptively select the appropriate visual granularity based on the input image and instruction. Specifically, we first apply multiple pooling layers to obtain visual tokens at different granularities. Then we propose a visual granularity router, which includes a Transformer layer, an MLP layer, and a voter layer, used to select the appropriate visual granularity based on the image and instruction. Furthermore, we put forward RGLF, a novel training paradigm that aims at aligning the granularity predicted by the router with the preferences of the LMM, without the need for additional manually annotated data. Extensive experiments and analysis show that AVG-LLaVA achieves superior performance across 11 benchmarks while significantly reducing the number of visual tokens and speeding up inference (e.g., an 85.3% reduction in visual tokens and a 2.53× increase in inference speed on the AI2D benchmark).
pdf
bib
abs
Word Form Matters: LLMs’ Semantic Reconstruction under Typoglycemia
Chenxi Wang
|
Tianle Gu
|
Zhongyu Wei
|
Lang Gao
|
Zirui Song
|
Xiuying Chen
Human readers can efficiently comprehend scrambled words, a phenomenon known as Typoglycemia, primarily by relying on word form; if word form alone is insufficient, they further utilize contextual cues for interpretation. While advanced large language models (LLMs) exhibit similar abilities, the underlying mechanisms remain unclear. To investigate this, we conduct controlled experiments to analyze the roles of word form and contextual information in semantic reconstruction and examine LLM attention patterns. Specifically, we first propose SemRecScore, a reliable metric to quantify the degree of semantic reconstruction, and validate its effectiveness. Using this metric, we study how word form and contextual information influence LLMs’ semantic reconstruction ability, identifying word form as the core factor in this process. Furthermore, we analyze how LLMs utilize word form and find that they rely on specialized attention heads to extract and process word form information, with this mechanism remaining stable across varying levels of word scrambling. This distinction between LLMs’ fixed attention patterns primarily focused on word form and human readers’ adaptive strategy in balancing word form and contextual information provides insights into enhancing LLM performance by incorporating human-like, context-aware mechanisms. Code is available at https://github.com/Aurora-cx/TypoLLM.
pdf
bib
abs
LLM-based Translation Inference with Iterative Bilingual Understanding
Andong Chen
|
Kehai Chen
|
Yang Xiang
|
Xuefeng Bai
|
Muyun Yang
|
Yang Feng
|
Tiejun Zhao
|
Min Zhang
The remarkable understanding and generation capabilities of large language models (LLMs) have greatly improved translation performance. However, incorrect understanding of the sentence to be translated can degrade translation quality. To address this issue, we propose a novel Iterative Bilingual Understanding Translation (IBUT) method based on the cross-lingual capabilities of LLMs and the dual characteristics of translation tasks. The cross-lingual capability of LLMs enables the generation of contextual understanding for both the source and target languages separately. Furthermore, the dual characteristics allow IBUT to generate effective cross-lingual feedback, iteratively refining contextual understanding, thereby reducing errors and improving translation performance. Experimental results show that the proposed IBUT outperforms several strong comparison methods and generalizes to multiple domains (e.g., news, commonsense, and cultural translation benchmarks).
pdf
bib
abs
Vulnerability of Text-to-Image Models to Prompt Template Stealing: A Differential Evolution Approach
Yurong Wu
|
Fangwen Mu
|
Qiuhong Zhang
|
Jinjing Zhao
|
Xinrun Xu
|
Lingrui Mei
|
Yang Wu
|
Lin Shi
|
Junjie Wang
|
Zhiming Ding
|
Yiwei Wang
Prompt trading has emerged as a significant intellectual property concern in recent years, where vendors entice users by showcasing sample images before selling prompt templates that can generate similar images. This work investigates a critical security vulnerability: attackers can steal prompt templates using only a limited number of sample images. To investigate this threat, we introduce Prism, a prompt-stealing benchmark consisting of 50 templates and 450 images, organized into Easy and Hard difficulty levels. To identify the vulnerability of VLMs to prompt stealing, we propose EvoStealer, a novel template stealing method that operates without model fine-tuning by leveraging differential evolution algorithms. The system first initializes population sets using multimodal large language models (MLLMs) based on predefined patterns, then iteratively generates enhanced offspring through MLLMs. During evolution, EvoStealer identifies common features across offspring to derive generalized templates. Our comprehensive evaluation conducted across open-source (InternVL2-26B) and closed-source models (GPT-4o and GPT-4o-mini) demonstrates that EvoStealer’s stolen templates can reproduce images highly similar to originals and effectively generalize to other subjects, significantly outperforming baseline methods with an average improvement of over 10%. Moreover, our cost analysis reveals that EvoStealer achieves template stealing with negligible computational expenses. Our code and dataset are available at https://whitepagewu.github.io/evostealer-site.
pdf
bib
abs
mStyleDistance: Multilingual Style Embeddings and their Evaluation
Justin Qiu
|
Jiacheng Zhu
|
Ajay Patel
|
Marianna Apidianaki
|
Chris Callison-Burch
Style embeddings are useful for stylistic analysis and style transfer, yet they only exist for English. We introduce Multilingual StyleDistance (mStyleDistance), a method that can generate style embeddings in new languages using synthetic data and a contrastive loss. We create style embeddings in nine languages and a multilingual STEL-or-Content benchmark (Wegmann et al., 2022) that serves to assess their quality. We also employ our embeddings in an authorship verification task involving different languages. Our results show that mStyleDistance embeddings outperform existing style embeddings on these benchmarks and generalize well to unseen features and languages. We make our models and datasets publicly available.
pdf
bib
abs
SeqMMR: Sequential Model Merging and LLM Routing for Enhanced Batched Sequential Knowledge Editing
Shanbao Qiao
|
Xuebing Liu
|
Akshat Gupta
|
Seung-Hoon Na
Model knowledge editing enables the efficient correction of erroneous information and the continuous updating of outdated knowledge within language models. While existing research has demonstrated strong performance in single-instance or few-instance sequential editing and one-time massive editing scenarios, the batched sequential editing paradigm remains a significant challenge. The primary issue lies in the model’s tendency to gradually forget previously edited knowledge and become increasingly unstable after multiple iterations of batched editing. To address these challenges, we propose **SeqMMR**, an enhanced framework for batched sequential knowledge editing that leverages **Seq**uential **M**odel **M**erging and a model **R**outer. Our approach iteratively merges parameters from current batch-edited models with those of their predecessors, ensuring that newly emerging knowledge is integrated while mitigating the forgetting of previously edited knowledge. Furthermore, the model router directs queries unrelated to the edited knowledge to an unedited model backup, preventing unintended alterations in model predictions. Extensive experiments across various datasets demonstrate that our approach effectively mitigates knowledge forgetting, improves performance across all previous batches, and better preserves the model’s general capabilities.
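The two mechanisms named in this abstract, sequential merging of batch-edited parameters with their predecessors and a router that sends unrelated queries to an unedited backup, can be sketched as below. The abstract does not specify the merge rule or the router, so linear interpolation with weight `alpha` and substring-based routing are assumptions, and the scalar "parameters" are toy stand-ins for model weights.

```python
def merge_params(previous, current, alpha=0.5):
    """Assumed merge rule: interpolate each parameter between predecessor and new edit."""
    return {name: (1 - alpha) * previous[name] + alpha * current[name] for name in previous}

def route(query, edited_subjects):
    """Assumed router: use the edited model only if the query mentions an edited subject."""
    q = query.lower()
    return "edited" if any(s.lower() in q for s in edited_subjects) else "backup"

base = {"w": 1.0, "b": 0.0}                                  # unedited model, kept as backup
batch_edits = [{"w": 2.0, "b": 0.5}, {"w": 4.0, "b": 1.5}]   # models after each edit batch

merged = dict(base)
for edited in batch_edits:
    # Fold each newly edited model into the running merge instead of replacing it,
    # so earlier batches keep some influence on the final parameters.
    merged = merge_params(merged, edited)

print(merged)
print(route("Who leads Acme Corp?", ["Acme Corp"]), route("What is 2 + 2?", ["Acme Corp"]))
```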
pdf
bib
abs
ReflectEvo: Improving Meta Introspection of Small LLMs by Learning Self-Reflection
Jiaqi Li
|
Xinyi Dong
|
Yang Liu
|
Zhizhuo Yang
|
Quansen Wang
|
Xiaobo Wang
|
Song-Chun Zhu
|
Zixia Jia
|
Zilong Zheng
We present a novel pipeline, ReflectEvo, to demonstrate that small language models (SLMs) can enhance meta introspection through reflection learning. This process iteratively generates self-reflection for self-training, fostering a continuous and self-evolving process. Leveraging this pipeline, we construct ReflectEvo-460k, a large-scale, comprehensive, self-generated reflection dataset with broadened instructions and diverse multi-domain tasks. Building upon this dataset, we demonstrate the effectiveness of reflection learning to improve SLMs’ reasoning abilities using SFT and DPO with remarkable performance, substantially boosting Llama-3 from 52.4% to 71.2% and Mistral from 44.4% to 71.1%. It validates that ReflectEvo can rival or even surpass the reasoning capability of the three prominent open-sourced models on BIG-bench without distillation from superior models or fine-grained human annotation. We further conduct a deeper analysis of the high quality of self-generated reflections and their impact on error localization and correction. Our work highlights the potential of continuously enhancing the reasoning performance of SLMs through iterative reflection learning in the long run.
pdf
bib
abs
MAGIC-VQA: Multimodal And Grounded Inference with Commonsense Knowledge for Visual Question Answering
Shuo Yang
|
Caren Han
|
Siwen Luo
|
Eduard Hovy
Visual Question Answering (VQA) requires models to reason effectively across visual and textual modalities. However, existing Large Vision-Language Models (LVLMs) often fall short in achieving human-like reasoning due to a lack of integrated commonsense knowledge, limiting their robustness and accuracy in real-world scenarios where both explicit facts and implicit understanding are crucial. To address this challenge, we present MAGIC-VQA: Multimodal And Grounded Inference with Commonsense Knowledge, a novel framework designed to enhance multimodal inference by integrating commonsense reasoning. MAGIC-VQA introduces a three-stage process: (1) Explicit Commonsense Knowledge Retrieval from external knowledge graphs, (2) By-Type Commonsense Knowledge Post-Processing to refine contextual relevance, and (3) Implicit Commonsense Knowledge Augmentation using a heterogeneous graph processed by a Graph Neural Network (GNN). These stages collectively enable nuanced, context-aware reasoning without extensive pre-training or intricate prompt tuning. MAGIC-VQA significantly improves performance on comprehensive benchmark datasets, surpassing existing models in tasks requiring advanced commonsense reasoning. MAGIC-VQA establishes a robust pathway for integrating commonsense knowledge into VQA, bridging the gap between vision-language inputs and high-level reasoning for improved reliability and contextual accuracy.
pdf
bib
abs
Automatic Transmission for LLM Tiers: Optimizing Cost and Accuracy in Large Language Models
Injae Na
|
Keonwoong Noh
|
Woohwan Jung
LLM providers typically offer multiple LLM tiers, varying in performance and price. As NLP tasks become more complex and modularized, selecting the suitable LLM tier for each subtask is a key challenge in balancing cost and performance. To address this problem, we introduce the LLM Automatic Transmission (LLM-AT) framework, which automatically selects LLM tiers without training. LLM-AT consists of a Starter, a Generator, and a Judge. The Starter selects the initial LLM tier expected to solve the given question, the Generator produces a response using the LLM of the selected tier, and the Judge evaluates the validity of the response. If the response is invalid, LLM-AT iteratively upgrades to a higher-tier model, generates a new response, and re-evaluates until a valid response is obtained. Additionally, we propose an accuracy estimator, which enables suitable initial LLM tier selection without training. Given an input question, the accuracy estimator estimates the expected accuracy of each LLM tier by computing the valid response rate across the top-k similar queries from past inference records. Experiments demonstrate that LLM-AT achieves superior performance while reducing costs, making it a practical solution for real-world applications.
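The Starter, Generator, and Judge loop described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the cosine similarity, the `threshold` rule for choosing the starting tier, and every function name below are assumptions made for illustration, and the generators and judge are stand-in callables rather than real LLM calls.

```python
from typing import Callable, List, Optional, Sequence, Tuple

# Hypothetical past-inference record: (query embedding, per-tier validity flags).
Record = Tuple[Sequence[float], Sequence[bool]]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sum(x * x for x in a) ** 0.5
    norm_b = sum(y * y for y in b) ** 0.5
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def estimate_accuracy(query_vec: Sequence[float],
                      history: List[Record], k: int) -> List[float]:
    """Valid-response rate of each tier over the k most similar past queries."""
    if not history:
        raise ValueError("history must contain at least one record")
    top = sorted(history, key=lambda r: cosine(query_vec, r[0]), reverse=True)[:k]
    n_tiers = len(top[0][1])
    return [sum(r[1][t] for r in top) / len(top) for t in range(n_tiers)]


def pick_start_tier(acc: Sequence[float], threshold: float) -> int:
    """Assumed rule: lowest tier whose estimate meets the threshold, else the top tier."""
    for tier, value in enumerate(acc):
        if value >= threshold:
            return tier
    return len(acc) - 1


def answer(question: str, start_tier: int,
           generators: Sequence[Callable[[str], str]],
           judge: Callable[[str, str], bool]) -> Tuple[int, Optional[str]]:
    """Generate at the start tier and upgrade while the judge rejects the response."""
    response: Optional[str] = None
    for tier in range(start_tier, len(generators)):
        response = generators[tier](question)
        if judge(question, response):
            return tier, response
    # No tier produced a valid response: return the top tier's last attempt.
    return len(generators) - 1, response
```

In this sketch, tiers are ordered from cheapest (index 0) to most capable, so escalation is simply a walk up the list that stops at the first response the judge accepts.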
pdf
bib
abs
Low-Rank Interconnected Adaptation across Layers
Yibo Zhong
|
Jinman Zhao
|
Yao Zhou
Low-rank adaptation (LoRA) is a widely used parameter-efficient fine-tuning (PEFT) method that learns weight updates ΔW = AB for pretrained weights W through low-rank adapters A and B. While LoRA ensures hardware efficiency, its low-rank weight updates limit adaptation performance. In this paper, we propose low-rank interconnected adaptation across layers (Lily), a novel PEFT method that introduces an interconnected framework with locally shared A and globally shared B experts. This structure eliminates redundant per-layer AB pairs, enabling higher-rank ΔW with equal or fewer parameters. To enhance expressiveness, we use data-dependent routers to determine A-B interconnections, preventing B experts from converging to the same behavior and improving representational power across domains. Experiments across modalities, architectures, and model sizes demonstrate Lily’s superior performance and efficiency.
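As background for the LoRA update that this abstract builds on, the sketch below shows the standard low-rank update only and does not attempt to reproduce Lily's shared experts or routers. The shapes are assumptions for illustration: W is d × k, A is d × r, and B is r × k, so the product AB has rank at most r and the adapter stores r(d + k) numbers instead of d·k.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product of a (m x r) and b (r x n)."""
    return [[sum(a[i][t] * b[t][j] for t in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def lora_update(w: Matrix, a: Matrix, b: Matrix) -> Matrix:
    """Return W + AB, the adapted weight under a standard LoRA update."""
    delta = matmul(a, b)
    return [[w[i][j] + delta[i][j] for j in range(len(w[0]))]
            for i in range(len(w))]


def lora_param_count(d: int, k: int, r: int) -> int:
    """Trainable parameters in one adapter pair: A is d x r, B is r x k."""
    return r * (d + k)
```

For a 4096 × 4096 weight with r = 8, `lora_param_count` gives 65,536 adapter parameters against 16,777,216 in the full matrix, which is the efficiency the abstract refers to; the rank bound of r on AB is the limitation it sets out to relax.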
pdf
bib
abs
GaRAGe: A Benchmark with Grounding Annotations for RAG Evaluation
Ionut Teodor Sorodoc
|
Leonardo F. R. Ribeiro
|
Rexhina Blloshmi
|
Christopher Davis
|
Adrià de Gispert
We present GaRAGe, a large RAG benchmark with human-curated long-form answers and annotations of each grounding passage, allowing a fine-grained evaluation of whether LLMs can identify relevant grounding when generating RAG answers. Our benchmark contains 2366 questions of diverse complexity, dynamism, and topics, and includes over 35K annotated passages retrieved from both private document sets and the Web, to reflect real-world RAG use cases. This makes it an ideal test bed to evaluate an LLM’s ability to identify only the relevant information necessary to compose a response, or provide a deflective response when there is insufficient information. Evaluations of multiple state-of-the-art LLMs on GaRAGe show that the models tend to over-summarise rather than (a) ground their answers strictly on the annotated relevant passages (reaching at most a Relevance-Aware Factuality Score of 60%), or (b) deflect when no relevant grounding is available (reaching at most 31% true positive rate in deflections). The F1 in attribution to relevant sources is at most 58.9%, and we show that performance is particularly reduced when answering time-sensitive questions and when having to draw knowledge from sparser private grounding sources.
pdf
bib
abs
Change Entity-guided Heterogeneous Representation Disentangling for Change Captioning
Yi Li
|
Yunbin Tu
|
Liang Li
|
Li Su
|
Qingming Huang
Change captioning aims to describe differences between a pair of images using natural language. However, learning effective difference representations is highly challenging due to distractors such as illumination and viewpoint changes. To address this, we propose a change-entity-guided disentanglement network that explicitly learns difference representations while mitigating the impact of distractors. Specifically, we first design a change entity retrieval module to identify key objects involved in the change from a textual perspective. Then, we introduce a difference representation enhancement module that strengthens the learned features, disentangling genuine differences from background variations. To further refine the generation process, we incorporate a gated Transformer decoder, which dynamically integrates both visual difference and textual change-entity information. Extensive experiments on CLEVR-Change, CLEVR-DC and Spot-the-Diff datasets demonstrate that our method outperforms existing approaches, achieving state-of-the-art performance. The code is available at https://github.com/yili-19/CHEER
pdf
bib
abs
RAG-RewardBench: Benchmarking Reward Models in Retrieval Augmented Generation for Preference Alignment
Zhuoran Jin
|
Hongbang Yuan
|
Tianyi Men
|
Pengfei Cao
|
Yubo Chen
|
Jiexin Xu
|
Huaijun Li
|
Xiaojian Jiang
|
Kang Liu
|
Jun Zhao
Despite the significant progress made by existing retrieval augmented language models (RALMs) in providing trustworthy responses and grounding in reliable sources, they often overlook effective alignment with human preferences. In the alignment process, reward models (RMs) act as a crucial proxy for human values to guide optimization. However, it remains unclear how to evaluate and select a reliable RM for preference alignment in RALMs. To this end, we propose RAG-RewardBench, the first benchmark for evaluating RMs in RAG settings. First, we design four crucial and challenging RAG-specific scenarios to assess RMs, including multi-hop reasoning, fine-grained citation, appropriate abstain, and conflict robustness. Then, we incorporate 18 RAG subsets, six retrievers, and 24 RALMs to increase the diversity of data sources. Finally, we adopt an LLM-as-a-judge approach to improve preference annotation efficiency and effectiveness, exhibiting a strong correlation with human annotations. Based on RAG-RewardBench, we conduct a comprehensive evaluation of 45 RMs and uncover their limitations in RAG scenarios. We also reveal that existing trained RALMs show almost no improvement in preference alignment, highlighting the need for a shift towards preference-aligned training.
pdf
bib
abs
Generate, Discriminate, Evolve: Enhancing Context Faithfulness via Fine-Grained Sentence-Level Self-Evolution
Kun Li
|
Tianhua Zhang
|
Yunxiang Li
|
Hongyin Luo
|
Abdalla Mohamed Salama Sayed Moustafa
|
Xixin Wu
|
James R. Glass
|
Helen M. Meng
Improving context faithfulness in large language models is essential for developing trustworthy retrieval augmented generation systems and mitigating hallucinations, especially in long-form question answering (LFQA) tasks or scenarios involving knowledge conflicts. Existing methods either intervene in LLMs only at inference without addressing their inherent limitations or overlook the potential for self-improvement. In this paper, we introduce GenDiE (Generate, Discriminate, Evolve), a novel self-evolving framework that enhances context faithfulness through fine-grained sentence-level optimization. GenDiE combines both generative and discriminative training, equipping LLMs with self-generation and self-scoring capabilities to facilitate iterative self-evolution. This supports both data construction for model alignment and score-guided search during inference. Furthermore, by treating each sentence in a response as an independent optimization unit, GenDiE effectively addresses the limitations of previous approaches that optimize at the holistic answer level, which may miss unfaithful details. Experiments on ASQA (in-domain LFQA) and ConFiQA (out-of-domain counterfactual QA) datasets demonstrate that GenDiE surpasses various baselines in both faithfulness and correctness, and exhibits robust performance for domain adaptation.
pdf
bib
abs
PAM: Paraphrase AMR-Centric Evaluation Metric
Afonso Sousa
|
Henrique Lopes Cardoso
Paraphrasing is rooted in semantics, which makes evaluating paraphrase generation systems hard. Current paraphrase generators are typically evaluated using metrics borrowed from adjacent text-to-text tasks, like machine translation or text summarization. These metrics tend to be tied to the surface form of the reference text. This is not ideal for paraphrases, as we typically want variation in the lexicon while preserving semantics. To address this problem, and inspired by learned similarity evaluation on plain text, we propose PAM, a Paraphrase AMR-Centric Evaluation Metric. This metric uses AMR graphs extracted from the input text, which consist of semantic structures agnostic to the text surface form, making the resulting evaluation metric more robust to variations in syntax or lexicon. Additionally, we evaluated PAM on different semantic textual similarity datasets and found that it improves the correlations with human semantic scores when compared to other AMR-based metrics.
pdf
bib
abs
VP-MEL: Visual Prompts Guided Multimodal Entity Linking
Hongze Mi
|
Jinyuan Li
|
Zhangxuying Zhangxuying
|
Haoran Cheng
|
Jiahao Wang
|
Di Sun
|
Gang Pan
Multimodal entity linking (MEL), a task aimed at linking mentions within multimodal contexts to their corresponding entities in a knowledge base (KB), has attracted much attention due to its wide applications in recent years. However, existing MEL methods often rely on mention words as retrieval cues, which limits their ability to effectively utilize information from both images and text. This reliance causes MEL to struggle with accurately retrieving entities in certain scenarios, especially when the focus is on image objects or mention words are missing from the text. To solve these issues, we introduce a Visual Prompts guided Multimodal Entity Linking (VP-MEL) task. Given a text-image pair, VP-MEL aims to link a marked region (i.e., visual prompt) in an image to its corresponding entities in the knowledge base. To facilitate this task, we present a new dataset, VPWiki, specifically designed for VP-MEL. Furthermore, we propose a framework named IIER, which enhances visual feature extraction using visual prompts and leverages the pre-trained Detective-VLM model to capture latent information. Experimental results on the VPWiki dataset demonstrate that IIER outperforms baseline methods across multiple benchmarks for the VP-MEL task.
pdf
bib
abs
FADE: Why Bad Descriptions Happen to Good Features
Bruno Puri
|
Aakriti Jain
|
Elena Golimblevskaia
|
Patrick Kahardipraja
|
Thomas Wiegand
|
Wojciech Samek
|
Sebastian Lapuschkin
Recent advances in mechanistic interpretability have highlighted the potential of automating interpretability pipelines in analyzing the latent representations within LLMs. While this may enhance our understanding of internal mechanisms, the field lacks standardized evaluation methods for assessing the validity of discovered features. We attempt to bridge this gap by introducing **FADE**: Feature Alignment to Description Evaluation, a scalable model-agnostic framework for automatically evaluating feature-to-description alignment. **FADE** evaluates alignment across four key metrics – *Clarity, Responsiveness, Purity, and Faithfulness* – and systematically quantifies the causes of the misalignment between features and their descriptions. We apply **FADE** to analyze existing open-source feature descriptions and assess key components of automated interpretability pipelines, aiming to enhance the quality of descriptions. Our findings highlight fundamental challenges in generating feature descriptions, particularly for SAEs compared to MLP neurons, providing insights into the limitations and future directions of automated interpretability. We release **FADE** as an open-source package at: [github.com/brunibrun/FADE](https://github.com/brunibrun/FADE).
pdf
bib
abs
In the LLM era, Word Sense Induction remains unsolved
Anna Mosolova
|
Marie Candito
|
Carlos Ramisch
In the absence of sense-annotated data, word sense induction (WSI) is a compelling alternative to word sense disambiguation, particularly in low-resource or domain-specific settings. In this paper, we emphasize methodological problems in current WSI evaluation. We propose an evaluation on a SemCor-derived dataset, respecting the original corpus polysemy and frequency distributions. We assess pre-trained embeddings and clustering algorithms across parts of speech, and propose and evaluate an LLM-based WSI method for English. We evaluate data augmentation sources (LLM-generated, corpus and lexicon), and semi-supervised scenarios using Wiktionary for data augmentation, must-link constraints, and number of clusters per lemma. We find that no unsupervised method (whether ours or previous) surpasses the strong “one cluster per lemma” heuristic (1cpl). We also show that (i) results and best systems may vary across POS, (ii) LLMs have trouble performing this task, (iii) data augmentation is beneficial and (iv) capitalizing on Wiktionary does help. It surpasses the previous SOTA system on our test set by 3.3%. WSI is not solved, and calls for a better articulation of lexicons and LLMs’ lexical semantics capabilities.
pdf
bib
abs
Navigating the Political Compass: Evaluating Multilingual LLMs across Languages and Nationalities
Chadi Helwe
|
Oana Balalau
|
Davide Ceolin
Large Language Models (LLMs) have become ubiquitous in today’s technological landscape, boasting a plethora of applications, and even endangering human jobs in complex and creative fields. One such field is journalism: LLMs are being used for summarization, generation and even fact-checking. However, in today’s political landscape, LLMs could accentuate tensions if they exhibit political bias. In this work, we evaluate the political bias of the 15 most used multilingual LLMs via the Political Compass Test. We test different scenarios, where we vary the language of the prompt, while also assigning a nationality to the model. We evaluate models on the 50 most populous countries and their official languages. Our results indicate that language has a strong influence on the political ideology displayed by a model. In addition, smaller models tend to display a more stable political ideology, i.e. ideology that is less affected by variations in the prompt.
pdf
bib
abs
Who Can Withstand Chat-Audio Attacks? An Evaluation Benchmark for Large Audio-Language Models
Wanqi Yang
|
Yanda Li
|
Meng Fang
|
Yunchao Wei
|
Ling Chen
Adversarial audio attacks pose a significant threat to the growing use of large audio-language models (LALMs) in voice-based human-machine interactions. While existing research has focused on model-specific adversarial methods, real-world applications demand a more generalizable and universal approach to audio adversarial attacks. In this paper, we introduce the Chat-Audio Attacks (CAA) benchmark, which includes four distinct types of audio attacks and aims to explore the vulnerabilities of LALMs to these attacks in conversational scenarios. To evaluate the robustness of LALMs, we propose three evaluation strategies: Standard Evaluation, utilizing traditional metrics to quantify model performance under attacks; GPT-4o-Based Evaluation, which simulates real-world conversational complexities; and Human Evaluation, offering insights into user perception and trust. We evaluate six state-of-the-art LALMs with voice interaction capabilities, including Gemini-1.5-Pro, GPT-4o, and others, using three distinct evaluation methods on the CAA benchmark. Our comprehensive analysis reveals the impact of four types of audio attacks on the performance of these models, demonstrating that GPT-4o exhibits the highest level of resilience. Our data can be accessed via the following link: CAA.
pdf
bib
abs
Beyond the Tip of Efficiency: Uncovering the Submerged Threats of Jailbreak Attacks in Small Language Models
Sibo Yi
|
Tianshuo Cong
|
Xinlei He
|
Qi Li
|
Jiaxing Song
Small language models (SLMs) have become increasingly prominent in the deployment on edge devices due to their high efficiency and low computational cost. While researchers continue to advance the capabilities of SLMs through innovative training strategies and model compression techniques, the security risks of SLMs have received considerably less attention compared to large language models (LLMs). To fill this gap, we provide a comprehensive empirical study to evaluate the security performance of 13 state-of-the-art SLMs under various jailbreak attacks. Our experiments demonstrate that most SLMs are quite susceptible to existing jailbreak attacks, while some of them are even vulnerable to direct harmful prompts. To address the safety concerns, we evaluate several representative defense methods and demonstrate their effectiveness in enhancing the security of SLMs. We further analyze the potential security degradation caused by different SLM techniques including architecture compression, quantization, knowledge distillation, and so on. We expect that our research can highlight the security challenges of SLMs and provide valuable insights to future work in developing more robust and secure SLMs.
pdf
bib
abs
EMRs2CSP : Mining Clinical Status Pathway from Electronic Medical Records
Yifei Chen
|
Ruihui Hou
|
Jingping Liu
|
Tong Ruan
Many current studies focus on extracting tests or treatments when constructing clinical pathways, often neglecting the patient’s symptoms and diagnosis, leading to incomplete diagnostic and therapeutic logic. Therefore, this paper aims to extract clinical pathways from electronic medical records that encompass complete diagnostic and therapeutic logic, including temporal information, patient symptoms, diagnosis, and tests or treatments. To achieve this objective, we propose a novel clinical pathway representation: the clinical status pathway. We also design an LLM-based pipeline framework for extracting clinical status pathways from electronic medical records, with the core concept being to improve extraction accuracy by modeling the diagnostic and treatment processes. In our experiments, we apply this framework to construct a comprehensive breast cancer-specific clinical status pathway and evaluate its performance on medical question-answering and decision-support tasks, demonstrating significant improvements over traditional clinical pathways. The code is publicly available at https://github.com/finnchen11/EMRs2CSP.
pdf
bib
abs
A Law Reasoning Benchmark for LLM with Tree-Organized Structures including Factum Probandum, Evidence and Experiences
Jiaxin Shen
|
Jinan Xu
|
Huiqi Hu
|
Luyi Lin
|
Guoyang Ma
|
Fei Zheng
|
Fandong Meng
|
Jie Zhou
|
Wenjuan Han
While progress has been made in legal applications, law reasoning, crucial for fair adjudication, remains unexplored. We propose a transparent law reasoning schema enriched with hierarchical factum probandum, evidence, and implicit experience, enabling public scrutiny and preventing bias. Inspired by this schema, we introduce a challenging task, which takes a textual case description and outputs a hierarchical structure justifying the final decision. We also create the first crowd-sourced dataset for this task, enabling comprehensive evaluation. In addition, we propose the TL agent, which employs a comprehensive suite of legal analysis tools to address this challenging task. This benchmark paves the way for transparent and accountable AI-assisted law reasoning in the “Intelligent Court”.
pdf
bib
abs
Libra: Leveraging Temporal Images for Biomedical Radiology Analysis
Xi Zhang
|
Zaiqiao Meng
|
Jake Lever
|
Edmond S. L. Ho
Radiology report generation (RRG) requires advanced medical image analysis, effective temporal reasoning, and accurate text generation. While multimodal large language models (MLLMs) align with pre-trained vision encoders to enhance visual-language understanding, most existing methods rely on single-image analysis or rule-based heuristics to process multiple images, failing to fully leverage temporal information in multi-modal medical datasets. In this paper, we introduce **Libra**, a temporal-aware MLLM tailored for chest X-ray report generation. Libra combines a radiology-specific image encoder with a novel Temporal Alignment Connector (**TAC**), designed to accurately capture and integrate temporal differences between paired current and prior images. Extensive experiments on the MIMIC-CXR dataset demonstrate that Libra establishes a new state-of-the-art benchmark among similarly scaled MLLMs, setting new standards in both clinical relevance and lexical accuracy. All source code and data are publicly available at: https://github.com/X-iZhang/Libra.
pdf
bib
abs
Stereotype Detection as a Catalyst for Enhanced Bias Detection: A Multi-Task Learning Approach
Aditya Tomar
|
Rudra Murthy
|
Pushpak Bhattacharyya
Bias and stereotypes in language models can cause harm, especially in sensitive areas like content moderation and decision-making. This paper addresses bias and stereotype detection by exploring how jointly learning these tasks enhances model performance. We introduce StereoBias, a unique dataset labeled for bias and stereotype detection across five categories: religion, gender, socio-economic status, race, profession, and others, enabling a deeper study of their relationship. Our experiments compare encoder-only models and fine-tuned decoder-only models using QLoRA. While encoder-only models perform well, decoder-only models also show competitive results. Crucially, joint training on bias and stereotype detection significantly improves bias detection compared to training them separately. Additional experiments with sentiment analysis confirm that the improvements stem from the connection between bias and stereotypes, not multi-task learning alone. These findings highlight the value of leveraging stereotype information to build fairer and more effective AI systems.
pdf
bib
abs
Filling the Temporal Void: Recovering Missing Publication Years in the Project Gutenberg Corpus Using LLMs
Omar Momen
|
Manuel Schaaf
|
Alexander Mehler
Analysing texts spanning long periods of time is critical for researchers in historical linguistics and related disciplines. However, publicly available corpora suitable for such analyses are scarce. The Project Gutenberg (PG) corpus presents a significant opportunity in this context, yet it remains underutilized due to the absence of accurate temporal metadata. We take advantage of language models and information retrieval to explore four sources of information – Open Web, Wikipedia, Open Library API, and PG book texts – to add missing temporal metadata to the PG corpus. Through 20 experiments employing state-of-the-art Large Language Models (LLMs) and Retrieval-Augmented Generation (RAG) methods, we estimate the production years of all PG books. We curate an enriched metadata repository for the PG corpus and propose a refined version of it, which includes 53,774 books with a total of 3.8 billion tokens in 11 languages, produced between 1600 and 2000. This work provides a new resource for computational linguistics and humanities studies focusing on diachronic analyses. The final dataset and all experiment data are publicly available (https://github.com/OmarMomen14/pg-dates).
pdf
bib
abs
ExpliCa: Evaluating Explicit Causal Reasoning in Large Language Models
Martina Miliani
|
Serena Auriemma
|
Alessandro Bondielli
|
Emmanuele Chersoni
|
Lucia Passaro
|
Irene Sucameli
|
Alessandro Lenci
Large Language Models (LLMs) are increasingly used in tasks requiring interpretive and inferential accuracy. In this paper, we introduce ExpliCa, a new dataset for evaluating LLMs in explicit causal reasoning. ExpliCa uniquely integrates both causal and temporal relations presented in different linguistic orders and explicitly expressed by linguistic connectives. The dataset is enriched with crowdsourced human acceptability ratings. We tested LLMs on ExpliCa through prompting and perplexity-based metrics. We assessed seven commercial and open-source LLMs, revealing that even top models struggle to reach 0.80 accuracy. Interestingly, models tend to confound temporal relations with causal ones, and their performance is also strongly influenced by the linguistic order of the events. Finally, perplexity-based scores and prompting performance are differently affected by model size.
pdf
bib
abs
Are Dialects Better Prompters? A Case Study on Arabic Subjective Text Classification
Leila Moudjari
|
Farah Benamara
This paper investigates the effect of dialectal prompting, variations in prompting script, and model fine-tuning on subjective classification in Arabic dialects. To this end, we evaluate the performance of 12 widely used open LLMs across four tasks and eight benchmark datasets. Our results reveal that specialized fine-tuned models with dialectal prompts in Arabic and Arabizi scripts achieve the best results, which constitutes a novel state of the art in the field.
pdf
bib
abs
Natural Logic at the Core: Dynamic Rewards for Entailment Tree Generation
Jihao Shi
|
Xiao Ding
|
Kai Xiong
|
Hengwei Zhao
|
Bing Qin
|
Ting Liu
Entailment trees are essential for enhancing interpretability and transparency in tasks like question answering and natural language understanding. However, existing approaches often lack logical consistency, as they rely on static reward structures or ignore the intricate dependencies within multi-step reasoning. To address these limitations, we propose a method that integrates natural logic principles into reinforcement learning, enabling dynamic reward computation to guide entailment tree generation. Our approach ensures logical consistency across reasoning steps while improving interpretability and generalization. Experiments on EntailmentBank demonstrate significant improvements over state-of-the-art methods, highlighting the effectiveness of natural logic in structured reasoning.
pdf
bib
abs
R.R.: Unveiling LLM Training Privacy through Recollection and Ranking
Wenlong Meng
|
Guo Zhenyuan
|
Lenan Wu
|
Chen Gong
|
Wenyan Liu
|
Weixian Li
|
Chengkun Wei
|
Wenzhi Chen
Large Language Models (LLMs) pose significant privacy risks, potentially leaking training data due to implicit memorization. Existing privacy attacks primarily focus on membership inference attacks (MIAs) or data extraction attacks, but reconstructing specific personally identifiable information (PII) in LLMs’ training data remains challenging. In this paper, we propose R.R. (Recollect and Rank), a novel two-step privacy stealing attack that enables attackers to reconstruct PII entities from scrubbed training data where the PII entities have been masked. In the first stage, we introduce a prompt paradigm named recollection, which instructs the LLM to repeat a masked text but fill in the masks. We can then use PII identifiers to extract recollected PII candidates. In the second stage, we design a new criterion to score each PII candidate and rank them. Motivated by membership inference, we leverage a reference model to calibrate our criterion. Experiments across three popular PII datasets demonstrate that R.R. achieves better PII identification performance than baselines. These results highlight the vulnerability of LLMs to PII leakage even when training data has been scrubbed. We release our code and datasets at GitHub.
pdf
bib
abs
Nested-Refinement Metamorphosis: Reflective Evolution for Efficient Optimization of Networking Problems
Shuhan Guo
|
Nan Yin
|
James Kwok
|
Quanming Yao
Large Language Models (LLMs) excel in network algorithm design but suffer from inefficient iterative coding and high computational costs. Drawing inspiration from butterfly metamorphosis—where structured developmental phases (Phase I: larval nutrient accumulation → Phase II: pupal transformation) enable adaptive evolution—we propose Nested-Refinement Metamorphosis (NeRM). Building on this principle, we introduce Metamorphosis on Prompts (MoP) to iteratively refine task descriptions (e.g. latency / bandwidth constraints) and Metamorphosis on Algorithms (MoA) to generate more effective solutions (e.g. appropriate network processing architecture). Their nested refinement ensures task-algorithm alignment, systematically improving both task descriptions and algorithmic solutions for more efficient algorithm design. To further enhance efficiency, we incorporate predictor-assisted code evaluation, mimicking natural selection by filtering out weak candidates early and reducing computational costs. Experimental results on TSP (routing), MKP (resource allocation), and CVRP (service-network coordination) demonstrate that NeRM consistently outperforms state-of-the-art approaches in both performance and efficiency.
pdf
bib
abs
MC-MKE: A Fine-Grained Multimodal Knowledge Editing Benchmark Emphasizing Modality Consistency
Junzhe Zhang
|
Huixuan Zhang
|
Xunjian Yin
|
Baizhou Huang
|
Xu Zhang
|
Xinyu Hu
|
Xiaojun Wan
Multimodal large language models (MLLMs) are prone to non-factual or outdated knowledge issues, highlighting the importance of knowledge editing. Many benchmarks have been proposed for research on multimodal knowledge editing. However, previous benchmarks focus on limited scenarios due to the lack of a rigorous definition of multimodal knowledge. To better evaluate multimodal knowledge editing, we propose a decomposed definition of multimodal knowledge. Following this definition, we introduce three scenarios and a novel requirement, modality consistency. We construct MC-MKE, a fine-grained **M**ultimodal **K**nowledge **E**diting benchmark emphasizing **M**odality **C**onsistency through strict data selection. We evaluate four multimodal knowledge editing methods on MC-MKE, revealing their limitations, particularly in terms of modality consistency. Our work highlights the challenges posed by multimodal knowledge editing and motivates further research in developing effective techniques for this task.
pdf
bib
abs
Visualising Policy-Reward Interplay to Inform Zeroth-Order Preference Optimisation of Large Language Models
Alessio Galatolo
|
Zhenbang Dai
|
Katie Winkle
|
Meriem Beloucif
Fine-tuning Large Language Models (LLMs) with first-order methods like back-propagation is computationally intensive. Zeroth-Order (ZO) optimisation uses function evaluations instead of gradients, reducing memory usage, but suffers from slow convergence in high-dimensional models. As a result, ZO research in LLMs has mostly focused on classification, overlooking more complex generative tasks. In this paper, we introduce ZOPrO, a novel ZO algorithm designed for *Preference Optimisation* in LLMs. We begin by analysing the interplay between policy and reward models during traditional (first-order) Preference Optimisation, uncovering patterns in their relative updates. Guided by these insights, we adapt Simultaneous Perturbation Stochastic Approximation (SPSA) with a targeted sampling strategy to accelerate convergence. Through experiments on summarisation, machine translation, and conversational assistants, we demonstrate that our method consistently enhances reward signals while achieving convergence times comparable to first-order methods. While it falls short of some state-of-the-art methods, our work is the first to apply Zeroth-Order methods to Preference Optimisation in LLMs, going beyond classification tasks and paving the way for a largely unexplored research direction. Code and visualisations are available at https://github.com/alessioGalatolo/VisZOPrO.
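As a rough illustration of the SPSA estimator mentioned in the abstract above, the following minimal sketch estimates a gradient from two function evaluations and uses it for plain descent. It shows generic SPSA only, not ZOPrO's targeted sampling strategy; the function names, the toy `quadratic` loss, and all hyperparameters are assumptions made for the example.

```python
import random


def spsa_gradient(loss, theta, c=1e-2, rng=random):
    """Estimate the gradient of `loss` at `theta` from two function evaluations."""
    # Rademacher perturbation: each coordinate is +1 or -1 with equal probability.
    delta = [rng.choice((-1.0, 1.0)) for _ in theta]
    plus = [t + c * d for t, d in zip(theta, delta)]
    minus = [t - c * d for t, d in zip(theta, delta)]
    diff = (loss(plus) - loss(minus)) / (2.0 * c)
    # For +/-1 entries, dividing by delta_i is the same as multiplying by delta_i.
    return [diff * d for d in delta]


def spsa_minimise(loss, theta, steps=200, lr=0.05, c=1e-2, seed=0):
    """Run plain gradient descent using SPSA estimates (no backpropagation)."""
    rng = random.Random(seed)
    theta = list(theta)
    for _ in range(steps):
        grad = spsa_gradient(loss, theta, c=c, rng=rng)
        theta = [t - lr * g for t, g in zip(theta, grad)]
    return theta


def quadratic(x):
    """Toy loss (hypothetical stand-in for a reward-based objective)."""
    return sum((xi - 1.0) ** 2 for xi in x)
```

Each step costs two loss evaluations regardless of the number of parameters, which is where the memory saving over first-order methods comes from; the price is a noisy estimate, hence the slow convergence the abstract refers to.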
pdf
bib
abs
Metaphor and Large Language Models: When Surface Features Matter More than Deep Understanding
Elisa Sanchez-Bayona
|
Rodrigo Agerri
This paper presents a comprehensive evaluation of the capabilities of Large Language Models (LLMs) in metaphor interpretation across multiple datasets, tasks, and prompt configurations. Although metaphor processing has gained significant attention in Natural Language Processing (NLP), previous research has been limited to single-dataset evaluations and specific task settings, often using artificially constructed data through lexical replacement. We address these limitations by conducting extensive experiments using diverse publicly available datasets with inference and metaphor annotations, focusing on Natural Language Inference (NLI) and Question Answering (QA) tasks. The results indicate that LLMs’ performance is more influenced by features like lexical overlap and sentence length than by metaphorical content, demonstrating that any alleged emergent abilities of LLMs to understand metaphorical language are the result of a combination of surface-level features, in-context learning, and linguistic knowledge. This work provides critical insights into the current capabilities and limitations of LLMs in processing figurative language, highlighting the need for more realistic evaluation frameworks in metaphor interpretation tasks. Data and code are publicly available at https://github.com/elisanchez-beep/metaphorLLM
pdf
bib
abs
AskQE: Question Answering as Automatic Evaluation for Machine Translation
Dayeon Ki
|
Kevin Duh
|
Marine Carpuat
How can a monolingual English speaker determine whether an automatic translation in French is good enough to be shared? Existing MT error detection and quality estimation (QE) techniques do not address this practical scenario. We introduce AskQE, a question generation and answering framework designed to detect critical MT errors and provide actionable feedback, helping users decide whether to accept or reject MT outputs even without the knowledge of the target language. Using ContraTICO, a dataset of contrastive synthetic MT errors in the COVID-19 domain, we explore design choices for AskQE and develop an optimized version relying on LLaMA-3 70B and entailed facts to guide question generation. We evaluate the resulting system on the BioMQM dataset of naturally occurring MT errors, where AskQE has higher Kendall’s Tau correlation and decision accuracy with human ratings compared to other QE metrics.
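For readers unfamiliar with the correlation measure named above, here is a minimal sketch of Kendall's Tau in its simplest form (tau-a, which ignores tie corrections). Whether the paper uses this variant or a tie-corrected one is not stated in the abstract, so the choice of tau-a and the function name are assumptions.

```python
from itertools import combinations


def kendall_tau_a(xs, ys):
    """Kendall's tau-a: (concordant - discordant) / total number of pairs."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equally long sequences with at least 2 items")
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        sign = (x1 - x2) * (y1 - y2)
        if sign > 0:
            concordant += 1  # both rankings order this pair the same way
        elif sign < 0:
            discordant += 1  # the rankings disagree on this pair
        # sign == 0 means a tie in at least one ranking; tau-a counts it as neither
    n_pairs = len(xs) * (len(xs) - 1) // 2
    return (concordant - discordant) / n_pairs
```

Here `xs` could be metric scores and `ys` human ratings for the same translations; a value near 1 means the metric ranks outputs much as humans do.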
pdf
bib
abs
ExPerT: Effective and Explainable Evaluation of Personalized Long-Form Text Generation
Alireza Salemi
|
Julian Killingback
|
Hamed Zamani
Evaluating personalized text generated by large language models (LLMs) is challenging, as only the LLM user, i.e. prompt author, can reliably assess the output, but re-engaging the same individuals across studies is infeasible. This paper addresses the challenge of evaluating personalized text generation by introducing ExPerT, an explainable reference-based evaluation framework. ExPerT leverages an LLM to extract atomic aspects and their evidences from the generated and reference texts, match the aspects, and evaluate their alignment based on content and writing style—two key attributes in personalized text generation. Additionally, ExPerT generates detailed, fine-grained explanations for every step of the evaluation process, enhancing transparency and interpretability. Our experiments demonstrate that ExPerT achieves a 7.2% relative improvement in alignment with human judgments compared to the state-of-the-art text generation evaluation methods. Furthermore, human evaluators rated the usability of ExPerT’s explanations at 4.7 out of 5, highlighting its effectiveness in making evaluation decisions more interpretable.
pdf
bib
abs
Bridging Intuitive Associations and Deliberate Recall: Empowering LLM Personal Assistant with Graph-Structured Long-term Memory
Yujie Zhang
|
Weikang Yuan
|
Zhuoren Jiang
Personal assistants based on large language models (LLMs) may struggle to effectively utilize long-term conversational histories. Despite advances in long-term memory systems and dense retrieval methods, these assistants still fail to capture entity relationships and handle multiple intents effectively. To tackle the above limitations, we propose **Associa**, a graph-structured memory framework that mimics human cognitive processes. Associa comprises an event-centric memory graph and two collaborative components: **Intuitive Association**, which extracts evidence-rich subgraphs through Prize-Collecting Steiner Tree optimization, and **Deliberating Recall**, which iteratively refines queries for comprehensive evidence collection. Experiments show that Associa significantly outperforms existing methods in retrieval and question answering (QA) tasks across long-term dialogue benchmarks, advancing the development of more human-like AI memory systems.
pdf
bib
abs
Each graph is a new language: Graph Learning with LLMs
Huachi Zhou
|
Jiahe Du
|
Chuang Zhou
|
Chang Yang
|
Yilin Xiao
|
Yuxuan Xie
|
Xiao Huang
Natural language has been extensively used for modeling text-attributed graphs with LLMs. It is used either to describe the graph for LLMs to understand or to serve as a component of the graph, e.g., textual attributes for embedding generation. However, natural language is inherently redundant and unstructured, making it unsuitable for modeling high-order neighbors with LLMs. Specifically, (i) graph descriptions become verbose, overwhelming LLMs, and (ii) relying only on attribute embeddings limits LLMs’ ability to capture adequate graph structural information. These limitations make it difficult to model graphs both concisely and adequately with LLMs using natural language alone. Inspired by the observation that LLMs pre-trained on one language can achieve exceptional performance on another with minimal additional training, we propose Graph-Defined Language for Large Language Model (GDL4LLM). This novel framework enables LLMs to transfer their powerful language understanding capabilities to graph-structured data. GDL4LLM translates the graph into a graph language corpus instead of graph descriptions and pre-trains LLMs on this corpus to adequately understand the graph. This corpus represents the subgraph centered around target nodes concisely with only a few tokens during fine-tuning on downstream tasks. By treating the graph as a new language, GDL4LLM enables LLMs to model text-attributed graphs adequately and concisely. Extensive experiments on five datasets demonstrate that GDL4LLM outperforms description-based and embedding-based baselines by efficiently modeling different orders of neighbors.
pdf
bib
abs
100-LongBench: Are de facto Long-Context Benchmarks Literally Evaluating Long-Context Ability?
Van Yang
|
Hongye Jin
|
Shaochen Zhong
|
Song Jiang
|
Qifan Wang
|
Vipin Chaudhary
|
Xiaotian Han
Long-context capability is considered one of the most important abilities of LLMs, as a truly long-context-capable LLM enables its users to effortlessly process many originally exhausting tasks, e.g., digesting a long-form document to find answers vs. directly asking an LLM about it. However, existing long-context evaluation benchmarks have a few major shortcomings. For instance, some Needle-in-a-Haystack-like benchmarks are too synthetic, and therefore do not represent the real-world usage of LLMs. While some real-task-based benchmarks like LongBench avoid this problem, such benchmarks are often formed in a way where each data sample has a fixed sequence length, which not only makes them solely suitable for models with a certain range of context windows, but also lacks a proxy to know at what length the model/method-of-interest would fail. Last, most benchmarks tend to not provide proper metrics to separate long-context performance from the model’s baseline ability, so when conducting a cross-model/recipe comparison, such conflation makes the user unable to understand how exactly one model or recipe excels at the long-context task in relation to its baseline ability. To address these issues, we introduce a length-controllable, real-life reflective benchmark with a novel metric that disentangles baseline knowledge from long-context capabilities. Experiments demonstrate the superiority of our datasets in effectively evaluating LLMs. All assets are available at https://github.com/uservan/100-LongBench.git.
pdf
bib
abs
Multimodal Fusion and Coherence Modeling for Video Topic Segmentation
Hai Yu
|
Chong Deng
|
Qinglin Zhang
|
Jiaqing Liu
|
Qian Chen
|
Wen Wang
The video topic segmentation (VTS) task segments videos into intelligible, non-overlapping topics, facilitating efficient comprehension of video content and quick access to specific content. VTS is also critical to various downstream video understanding tasks. Traditional VTS methods using shallow features or unsupervised approaches struggle to accurately discern the nuances of topical transitions. Recently, supervised approaches have achieved superior performance on video action or scene segmentation over unsupervised approaches. In this work, we improve supervised VTS by thoroughly exploring **multimodal fusion** and **multimodal coherence modeling**. Specifically, (1) we enhance multimodal fusion by exploring different architectures using Cross-Attention and Mixture of Experts. (2) To generally strengthen multimodality alignment and fusion, we pre-train and fine-tune the model with multimodal contrastive learning. (3) We propose a new pre-training task tailored for the VTS task, and a novel fine-tuning task for enhancing multimodal coherence modeling for VTS. We evaluate our proposed approaches on educational videos, in the form of lectures, due to the vital role of topic segmentation of educational videos in boosting learning experiences. Additionally, to promote research in VTS, we introduce a large-scale Chinese lecture video dataset to augment the existing English lecture video datasets. Experiments on both English and Chinese lecture datasets demonstrate that our model achieves superior VTS performance compared to competitive unsupervised and supervised baselines.
pdf
bib
abs
Are Your LLMs Capable of Stable Reasoning?
Junnan Liu
|
Hongwei Liu
|
Linchen Xiao
|
Ziyi Wang
|
Kuikun Liu
|
Songyang Gao
|
Wenwei Zhang
|
Songyang Zhang
|
Kai Chen
The rapid advancement of large language models (LLMs) has shown remarkable progress in complex reasoning tasks. However, a significant disparity exists between benchmark performances and real-world applications. We attribute this gap primarily to current evaluation protocols and metrics, which inadequately capture the full spectrum of LLM capabilities, especially in complex reasoning tasks where both accuracy and consistency are essential. In this paper, we introduce **G-Pass@k**, a novel evaluation metric that continuously assesses model performance across multiple sampling attempts, quantifying both the model’s performance potential and its stability. Through extensive experiments on various public and newly constructed benchmarks, we employ G-Pass@k in conjunction with state-of-the-art large language models to provide comprehensive insights into their potential capabilities and operational consistency. Our findings reveal a significant opportunity to enhance the realistic reasoning abilities of LLMs, underscoring the necessity for more robust evaluation metrics.
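The abstract does not define the metric, so the following is only a sketch of one plausible way to combine potential and stability: given `n` sampled generations of which `c` are correct, compute the probability that at least a fraction `tau` of `k` samples drawn without replacement are correct. With a threshold of a single correct sample this reduces to the familiar pass@k. The function names and the threshold parameter `tau` are assumptions for illustration and may differ from the paper's definition.

```python
from math import ceil, comb


def prob_at_least(n, c, k, m):
    """P(at least m correct among k samples drawn without replacement
    from n generations, c of which are correct). Hypergeometric tail."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    hits = sum(comb(c, j) * comb(n - c, k - j) for j in range(m, min(c, k) + 1))
    return hits / comb(n, k)


def pass_at_k(n, c, k):
    """Standard pass@k: at least one of the k samples is correct."""
    return prob_at_least(n, c, k, 1)


def stable_pass_at_k(n, c, k, tau):
    """Stricter variant (assumed form): at least ceil(tau * k) of k are correct."""
    return prob_at_least(n, c, k, max(1, ceil(tau * k)))
```

Raising `tau` toward 1 rewards models that are correct on nearly every attempt rather than on at least one, which is the accuracy-versus-consistency distinction the abstract draws.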
pdf
bib
abs
FANNO: Augmenting High-Quality Instruction Data with Open-Sourced LLMs Only
He Zhu
|
Yifan Ding
|
Yicheng Tao
|
Zhiwen Ruan
|
Yixia Li
|
Wenjia Zhang
|
Yun Chen
|
Guanhua Chen
Instruction tuning stands as a crucial advancement in leveraging large language models (LLMs) for enhanced task performance. However, the annotation of instruction datasets has traditionally been expensive and laborious, often relying on manual annotations or costly proprietary LLMs. Recent works explore approaches to synthesize data with open-sourced LLMs but require high-quality human-crafted seed data. In this work, we introduce FANNO, an end-to-end framework to synthesize high-quality instruction data with open-sourced LLMs and sampled unlabeled documents, eliminating the necessity for seed data. Starting from diverse pre-screened documents, the framework synthesizes complex and diverse high-quality instruction and response pairs in different stages. We propose a tagging-based prompt method to generate diverse and complex seed data and a UCB-based approach to augment more instruction data with the seed data. A novel Think Different prompt is proposed to address the distributional limitations of the seeds, further boosting the data diversity. Experiments show that FANNO can generate diverse and complex high-quality data even with a small open-source teacher model. The synthesized instruction data demonstrates performance that is comparable to, or even surpasses, baseline annotation methods with proprietary LLMs or open-sourced LLMs while requiring fewer instruction data samples.
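The abstract mentions a UCB-based approach without giving details. As background, a minimal sketch of the classic UCB1 selection rule is shown below; treating each seed as an "arm" and some quality score as the reward is an assumption made for illustration, not a description of the paper's procedure.

```python
import math


def ucb1_select(counts, rewards):
    """Return the index of the arm with the highest UCB1 score.

    counts[i]  -- how many times arm i has been tried so far
    rewards[i] -- sum of rewards observed for arm i
    """
    # Try every arm once before relying on the confidence bound.
    for arm, n in enumerate(counts):
        if n == 0:
            return arm
    total = sum(counts)

    def score(arm):
        mean = rewards[arm] / counts[arm]
        bonus = math.sqrt(2.0 * math.log(total) / counts[arm])
        return mean + bonus  # exploitation term plus exploration bonus

    return max(range(len(counts)), key=score)
```

The bonus term shrinks as an arm is tried more often, so rarely used seeds keep getting revisited while consistently good ones are favoured.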
pdf
bib
abs
JEBS: A Fine-grained Biomedical Lexical Simplification Task
William Xia
|
Ishita Unde
|
Brian David Ondov
|
Dina Demner-Fushman
Though online medical literature has made health information more available than ever, the barrier of complex medical jargon prevents the general public from understanding it. Though parallel and comparable corpora for Biomedical Text Simplification have been introduced, these conflate the many syntactic and lexical operations involved in simplification. To enable more targeted development and evaluation, we present a fine-grained lexical simplification task and dataset, Jargon Explanations for Biomedical Simplification (JEBS). The JEBS task involves identifying complex terms, classifying how to replace them, and generating replacement text. The JEBS dataset contains 21,595 replacements for 10,314 terms across 400 biomedical abstracts and their manually simplified versions. Additionally, we provide baseline results for a variety of rule-based and transformer-based systems for the three subtasks. The JEBS task, data, and baseline results pave the way for development and rigorous evaluation of systems for replacing or explaining complex biomedical terms.
pdf
bib
abs
Multi-Hop Reasoning for Question Answering with Hyperbolic Representations
Simon Welz
|
Lucie Flek
|
Akbar Karimi
Hyperbolic representations are effective in modeling knowledge graph data which is prevalently used to facilitate multi-hop reasoning. However, a rigorous and detailed comparison of the two spaces for this task is lacking. In this paper, through a simple integration of hyperbolic representations with an encoder-decoder model, we perform a controlled and comprehensive set of experiments to compare the capacity of hyperbolic space versus Euclidean space in multi-hop reasoning. Our results show that the former consistently outperforms the latter across a diverse set of datasets. In addition, through an ablation study, we show that a learnable curvature initialized with the delta hyperbolicity of the utilized data yields superior results to random initializations. Furthermore, our findings suggest that hyperbolic representations can be significantly more advantageous when the datasets exhibit a more hierarchical structure.
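To make the contrast between the two spaces concrete, here is a minimal sketch of the distance function in the Poincaré ball, one common model of hyperbolic space. The abstract does not say which model or curvature the paper uses, so the Poincaré ball with fixed curvature -1 is an assumption.

```python
import math


def poincare_distance(u, v):
    """Geodesic distance between two points inside the unit Poincare ball
    (curvature -1): arcosh(1 + 2|u-v|^2 / ((1-|u|^2)(1-|v|^2)))."""
    norm_u = sum(a * a for a in u)
    norm_v = sum(b * b for b in v)
    if norm_u >= 1.0 or norm_v >= 1.0:
        raise ValueError("points must lie strictly inside the unit ball")
    sq_diff = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.acosh(1.0 + 2.0 * sq_diff / ((1.0 - norm_u) * (1.0 - norm_v)))
```

Distances grow rapidly as points approach the boundary of the ball, which leaves exponentially more room for the leaves of a tree than Euclidean space does; this is the usual intuition for why hierarchical data can benefit from hyperbolic embeddings.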
pdf
bib
abs
Look & Mark: Leveraging Radiologist Eye Fixations and Bounding boxes in Multimodal Large Language Models for Chest X-ray Report Generation
Yunsoo Kim
|
Jinge Wu
|
Su Hwan Kim
|
Pardeep Vasudev
|
Jiashu Shen
|
Honghan Wu
Recent advancements in multimodal Large Language Models (LLMs) have significantly enhanced the automation of medical image analysis, particularly in generating radiology reports from chest X-rays (CXR). However, these models still suffer from hallucinations and clinically significant errors, limiting their reliability in real-world applications. In this study, we propose Look & Mark (L&M), a novel grounding fixation strategy that integrates radiologist eye fixations (Look) and bounding box annotations (Mark) into the LLM prompting framework. Unlike conventional fine-tuning, L&M leverages in-context learning to achieve substantial performance gains without retraining. When evaluated across multiple domain-specific and general-purpose models, L&M demonstrates significant gains, including a 1.2% improvement in overall metrics (A.AVG) for CXR-LLaVA compared to baseline prompting and a remarkable 9.2% boost for LLaVA-Med. General-purpose models also benefit from L&M combined with in-context learning, with LLaVA-OV achieving an 87.3% clinical average performance (C.AVG)—the highest among all models, even surpassing those explicitly trained for CXR report generation. Expert evaluations further confirm that L&M reduces clinically significant errors (by 0.43 average errors per report), such as false predictions and omissions, enhancing both accuracy and reliability. These findings highlight L&M’s potential as a scalable and efficient solution for AI-assisted radiology, paving the way for improved diagnostic workflows in low-resource clinical settings.
pdf
bib
abs
Hatevolution: What Static Benchmarks Don’t Tell Us
Chiara Di Bonaventura
|
Barbara McGillivray
|
Yulan He
|
Albert Meroño-Peñuela
Language changes over time, including in the hate speech domain, which evolves quickly following social dynamics and cultural shifts. While NLP research has investigated the impact of language evolution on model training and has proposed several solutions for it, its impact on model benchmarking remains under-explored. Yet, hate speech benchmarks play a crucial role to ensure model safety. In this paper, we empirically evaluate the robustness of 20 language models across two evolving hate speech experiments, and we show the temporal misalignment between static and time-sensitive evaluations. Our findings call for time-sensitive linguistic benchmarks in order to correctly and reliably evaluate language models in the hate speech domain.
pdf
bib
abs
Tag-Instruct: Controlled Instruction Complexity Enhancement through Structure-based Augmentation
He Zhu
|
Zhiwen Ruan
|
Junyou Su
|
Xingwei He
|
Yun Chen
|
Wenjia Zhang
|
Guanhua Chen
High-quality instruction data is crucial for developing large language models (LLMs), yet existing approaches struggle to effectively control instruction complexity. We present Tag-Instruct, a novel framework that enhances instruction complexity through structured semantic compression and controlled difficulty augmentation. Unlike previous prompt-based methods operating on raw text, Tag-Instruct compresses instructions into a compact tag space and systematically enhances complexity through RL-guided tag expansion. Through extensive experiments, we show that Tag-Instruct outperforms existing instruction complexity augmentation approaches. Our analysis reveals that operating in tag space provides superior controllability and stability across different instruction synthesis frameworks.
pdf
bib
abs
Code-SPA: Style Preference Alignment to Large Language Models for Effective and Robust Code Debugging
Tengfei Wen
|
Xuanang Chen
|
Ben He
|
Le Sun
Large language models (LLMs) have demonstrated impressive capabilities in coding tasks like code generation and debugging. However, code from real-world users is often poorly styled, containing various types of noise, such as structural inconsistencies, stylistic deviations and flawed test cases. To investigate this, we first simulate poorly styled code using eight types of code perturbations, and then demonstrate that the debugging performance of existing LLM-based methods significantly declines on such inputs. Furthermore, to address this, we propose a novel debugging method called Code-SPA, which aligns noisy code with the well-structured style familiar to LLMs, mitigating the impact of stylistic inconsistencies. Specifically, Code-SPA extracts the model’s preferred coding style from a reference snippet, then adjusts the input code by Concrete Syntax Tree (CST)-based transformations and LLM-assisted refinements before debugging. By aligning the code style preference, Code-SPA enhances the debugging performance of both code-specific and general-purpose LLMs on both poorly and well-styled code across the HumanEval, MBPP and EvalPlus datasets.
pdf
bib
abs
Open-World Authorship Attribution
Xinhao Tan
|
Songhua Liu
|
Xia Cong
|
Kunjun Li
|
Xinchao Wang
Recent years have witnessed rapid advancements in Large Language Models (LLMs). Nevertheless, it remains unclear whether state-of-the-art LLMs can infer the author of an anonymous research paper solely from the text, without any additional information. To investigate this novel challenge, which we define as Open-World Authorship Attribution, we introduce a benchmark comprising thousands of research papers across various fields to quantitatively assess model capabilities. Then, at the core of this paper, we tailor a two-stage framework to tackle this problem: candidate selection and authorship decision. Specifically, in the first stage, LLMs are prompted to generate multi-level key information, which are then used to identify potential candidates through Internet searches. In the second stage, we introduce key perspectives to guide LLMs in determining the most likely author from these candidates. Extensive experiments on our benchmark demonstrate the effectiveness of the proposed approach, achieving 60.7% and 44.3% accuracy in the two stages, respectively. We will release our benchmark and source codes to facilitate future research in this field.
pdf
bib
abs
What is in a name? Mitigating Name Bias in Text Embedding Similarity via Anonymization
Sahil Manchanda
|
Pannaga Shivaswamy
Text-embedding models often exhibit biases arising from the data on which they are trained. In this paper, we examine a hitherto unexplored bias in text-embeddings: bias arising from the presence of names of persons, locations, organizations, etc. in the text. Our study shows how the presence of name bias in text-embedding models can potentially lead to erroneous conclusions in the assessment of thematic similarity. Text-embeddings can mistakenly indicate similarity between texts based on names in the text, even when their actual semantic content has no similarity, or indicate dissimilarity simply because of the names in the text, even when the texts match semantically. We first demonstrate the presence of name bias in different text-embedding models and then propose text-anonymization during inference, which involves removing references to names while preserving the core theme of the text. The efficacy of the anonymization approach is demonstrated on three downstream NLP tasks involving embedding similarities, achieving significant performance gains. Our simple and training-optimization-free approach offers a practical and easily implementable solution to mitigate name bias.
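The effect described above can be illustrated with a toy sketch. A bag-of-words cosine similarity stands in for a real text-embedding model, and the name list is supplied by hand; both are assumptions for illustration, since the abstract does not specify how names are detected or which embedding models are used.

```python
import math
import re
from collections import Counter


def tokens(text):
    """Lowercase alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def anonymize(text, names):
    """Drop every token that appears in the (hypothetical) name list."""
    lowered = {n.lower() for n in names}
    return " ".join(t for t in tokens(text) if t not in lowered)


def cosine(a, b):
    """Bag-of-words cosine similarity, a stand-in for embedding similarity."""
    ca, cb = Counter(tokens(a)), Counter(tokens(b))
    dot = sum(ca[t] * cb[t] for t in ca)
    norm = math.sqrt(sum(v * v for v in ca.values())) * math.sqrt(
        sum(v * v for v in cb.values())
    )
    return dot / norm if norm else 0.0


NAMES = ["Maria", "Lopez", "Berlin"]  # hypothetical, hand-picked name list
TEXT_A = "Maria Lopez reviewed the budget in Berlin"
TEXT_B = "Maria Lopez played football in Berlin"
```

In this toy case the two texts share names but not a theme, and `cosine(anonymize(TEXT_A, NAMES), anonymize(TEXT_B, NAMES))` comes out lower than `cosine(TEXT_A, TEXT_B)`, mirroring the spurious similarity the abstract attributes to shared names.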
pdf
bib
abs
BenNumEval: A Benchmark to Assess LLMs’ Numerical Reasoning Capabilities in Bengali
Kawsar Ahmed
|
Md Osama
|
Omar Sharif
|
Eftekhar Hossain
|
Mohammed Moshiul Hoque
Large Language Models (LLMs) demonstrate exceptional proficiency in general-purpose tasks but struggle with numerical reasoning, particularly in low-resource languages like Bengali. Despite advancements, limited research has explored their numerical reasoning capabilities in these languages. To address this gap, we present BenNumEval (Bengali Numerical Evaluation), a benchmark designed to assess LLMs on numerical reasoning tasks in Bengali. It comprises six diverse tasks and a total of 3.2k samples curated from real-world problem-solving scenarios. Our extensive evaluations reveal that even with advanced prompting techniques such as Cross-Lingual Prompting (XLP) and Cross-Lingual Chain-of-Thought Prompting (XCoT), LLMs fall notably short of human-level performance, particularly when using Bengali Native Prompting (BNaP). These findings underscore the substantial gap between current LLM capabilities and human expertise in numerical reasoning, highlighting the need for more robust and linguistically inclusive AI models to advance Bengali Language Processing and equitable AI development. The source code for the system and evaluation pipeline is publicly available on GitHub.
pdf
bib
abs
LLM Agents for Coordinating Multi-User Information Gathering
Harsh Jhamtani
|
Jacob Andreas
|
Benjamin Van Durme
This paper introduces PeopleJoin, a benchmark for evaluating LM-mediated collaborative problem solving. Given a user request, PeopleJoin agents must identify teammates who might be able to assist, converse with these teammates to gather information, and finally compile a useful answer or summary for the original user. PeopleJoin comprises two evaluation domains: PeopleJoin-QA, focused on questions about tabular data, and PeopleJoin-DocCreation, focused on document creation tasks. The two domains are adapted from existing NLP benchmarks for database question answering and multi-document summarization; here, however, the information needed to complete these tasks is distributed across synthetic “organizations” of 2–20 users, simulating natural multi-user collaboration scenarios. We implemented several popular LM agent architectures, evaluating their accuracy and efficiency at completing tasks, and highlight new research questions that can be studied using PeopleJoin.
pdf
bib
abs
C2KD: Cross-layer and Cross-head Knowledge Distillation for Small Language Model-based Recommendation
Xiao Chen
|
Changyi Ma
|
Wenqi Fan
|
Zhaoxiang Zhang
|
Li Qing
Sequential recommenders predict users’ next interactions based on historical behavior and are essential in modern recommendation systems. While Large Language Models (LLMs) show promise, their size and high inference costs limit deployment on resource-constrained devices. Small Language Models (SLMs) provide a more efficient alternative for edge devices, but bridging the recommendation performance gap between LLMs and SLMs remains challenging. Typical approaches like supervised fine-tuning or vanilla knowledge distillation (KD) often lead to suboptimal performance or even negative transfer. Our motivational experiments reveal key issues with vanilla KD methods: feature imitation suffers from redundancy and uneven recommendation ability across layers, while prediction mimicking faces conflicts caused by differing weight distributions of prediction heads. To address these challenges, we propose a simple yet effective framework, C2KD, to transfer task-relevant knowledge from two complementary dimensions. Specifically, our method incorporates: (1) cross-layer feature imitation, which uses a dynamic router to select the most relevant teacher layers and assimilate task-relevant knowledge from the teacher’s late layers, allowing the student to concentrate on the teacher’s specialized knowledge; and (2) cross-head logit distillation, which maps the intermediate features of the student to the teacher’s output head, thereby minimizing prediction discrepancies between the teacher and the student. Extensive experiments across diverse model families demonstrate that our approach enables 1B-parameter SLMs to achieve competitive performance compared to LLMs (e.g., Llama3-8B), offering a practical solution for real-world on-device sequential recommendations.
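For context, the vanilla logit distillation that the abstract contrasts with can be sketched as a temperature-scaled KL divergence between teacher and student output distributions. This is the standard baseline formulation, not C2KD's cross-head variant; the function names and the default temperature are assumptions.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def vanilla_kd_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2."""
    p = softmax(teacher_logits, temperature)  # teacher distribution
    q = softmax(student_logits, temperature)  # student distribution
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)
    return kl * temperature ** 2
```

This loss compares the student's own prediction head against the teacher's, which is where the head-mismatch conflict described in the abstract can arise.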
pdf
bib
abs
Sign2Vis: Automated Data Visualization from Sign Language
Yao Wan
|
Yang Wu
|
Zhen Li
|
Guobiao Zhang
|
Hongyu Zhang
|
Zhou Zhao
|
Hai Jin
|
April Wang
Data visualizations, such as bar charts and histograms, are essential for analyzing and exploring data, enabling the effective communication of insights. While existing methods have been proposed to translate natural language descriptions into visualization queries, they focus solely on spoken languages, overlooking sign languages, which comprise about 200 variants used by 70 million Deaf and Hard-of-Hearing (DHH) individuals. To fill this gap, this paper proposes Sign2Vis, a sign language interface that enables the DHH community to engage more fully with data analysis. We first construct a paired dataset that includes sign language pose videos and their corresponding visualization queries. Using this dataset, we evaluate a variety of models, including both pipeline-based and end-to-end approaches. Extensive experiments, along with a user study involving 15 participants, demonstrate the effectiveness of Sign2Vis. Finally, we share key insights from our evaluation and highlight the need for more accessible and user-centered tools to support the DHH community in interactive data analytics.
pdf
bib
abs
Transparentize the Internal and External Knowledge Utilization in LLMs with Trustworthy Citation
Jiajun Shen
|
Tong Zhou
|
Yubo Chen
|
Delai Qiu
|
Shengping Liu
|
Kang Liu
|
Jun Zhao
While hallucinations of large language models could be alleviated through retrieval-augmented generation and citation generation, how the model utilizes internal knowledge is still opaque, and the trustworthiness of its generated answers remains questionable. In this work, we introduce the Context-Prior Augmented Citation Generation task, which requires models to generate citations considering both external and internal knowledge while providing trustworthy references, with 5 evaluation metrics focusing on 3 aspects: answer helpfulness, citation faithfulness, and trustworthiness. We introduce RAEL, the paradigm for our task, and also design INTRALIGN, an integrated method containing customary data generation and an alignment algorithm. Our experimental results show that our method achieves better cross-scenario performance than other baselines. Our extended experiments further reveal that retrieval quality, question types, and model knowledge have considerable influence on trustworthiness in citation generation.
pdf
bib
abs
JARVIS-VLA: Post-Training Large-Scale Vision Language Models to Play Visual Games with Keyboards and Mouse
Muyao Li
|
Zihao Wang
|
Kaichen He
|
Xiaojian Ma
|
Yitao Liang
Recently, action-based decision-making in open-world environments has gained significant attention. Visual Language Action (VLA) models, pretrained on large-scale web datasets, have shown promise in decision-making tasks. However, previous work has primarily focused on action post-training, often neglecting enhancements to the foundation model itself. In response, we introduce Act from Visual Language Post-Training (ActVLP), a novel training paradigm. ActVLP distinctively enhances the foundation model prior to action-specific tuning by first post-training it on a curated set of environment-specific visual and linguistic tasks using self-supervised learning. This initial stage significantly improves the model’s capabilities in world knowledge, visual recognition, and spatial grounding. Subsequently, this strengthened VLM undergoes action post-training via imitation learning on trajectory datasets. Following this paradigm, we develop JARVIS-VLA, the first VLA model in Minecraft that can follow human instructions on over 1k different atomic tasks, including crafting, smelting, cooking, mining, and killing. Our experiments demonstrate that our ActVLP paradigm leads to a significant 40% improvement over the best agent baseline on a diverse set of atomic tasks. Furthermore, JARVIS-VLA surpasses traditional imitation learning-based policies in Minecraft, achieving state-of-the-art performance. We have open-sourced the code, models, and datasets to foster further research. The project page can be found at
https://craftjarvis.github.io/JarvisVLA.
pdf
bib
abs
Generative Frame Sampler for Long Video Understanding
Linli Yao
|
Haoning Wu
|
Kun Ouyang
|
Yuanxing Zhang
|
Caiming Xiong
|
Bei Chen
|
Xu Sun
|
Junnan Li
Despite recent advances in Video Large Language Models (VideoLLMs), effectively understanding long-form videos remains a significant challenge. Perceiving lengthy videos containing thousands of frames poses a substantial computational burden. To mitigate this issue, this paper introduces Generative Frame Sampler (GenS), a plug-and-play module integrated with VideoLLMs to facilitate efficient lengthy video perception. Built upon a lightweight VideoLLM, GenS leverages its inherent vision-language capabilities to identify question-relevant frames. To facilitate effective retrieval, we construct GenS-Video-150K, a large-scale video instruction dataset with dense frame relevance annotations. Extensive experiments demonstrate that GenS consistently boosts the performance of various VideoLLMs, including open-source models (Qwen2-VL-7B, Aria-25B, LLaVA-Video-7B/72B) and proprietary assistants (GPT-4o, Gemini). When equipped with GenS, open-source VideoLLMs achieve impressive state-of-the-art results on long-form video benchmarks: LLaVA-Video-72B reaches 66.8 (+4.3) on LongVideoBench and 77.0 (+2.7) on MLVU, while Aria obtains 39.2 on HourVideo, surpassing Gemini-1.5-pro by 1.9 points.
pdf
bib
abs
Annotating the Annotators: Analysis, Insights and Modelling from an Annotation Campaign on Persuasion Techniques Detection
Davide Bassi
|
Dimitar Iliyanov Dimitrov
|
Bernardo D’Auria
|
Firoj Alam
|
Maram Hasanain
|
Christian Moro
|
Luisa Orrù
|
Gian Piero Turchi
|
Preslav Nakov
|
Giovanni Da San Martino
Persuasion (or propaganda) techniques detection is a relatively novel task in Natural Language Processing (NLP). While there have already been a number of annotation campaigns, they have been based on heuristic guidelines, which have never been thoroughly discussed. Here, we present the first systematic analysis of a complex annotation task (detecting 22 persuasion techniques in memes), for which we provided continuous expert oversight. The presence of an expert allowed us to critically analyze specific aspects of the annotation process. Among our findings, we show that inter-annotator agreement alone inadequately assessed annotation correctness. We thus define and track different error types, revealing that expert feedback shows varying effectiveness across error categories. This pattern suggests that distinct mechanisms underlie different kinds of misannotations. Based on our findings, we advocate for expert oversight in annotation tasks and periodic quality audits. As an attempt to reduce the costs for this, we introduce a probabilistic model for optimizing intervention scheduling.
pdf
bib
abs
On the Generalization vs Fidelity Paradox in Knowledge Distillation
Suhas Kamasetty Ramesh
|
Ayan Sengupta
|
Tanmoy Chakraborty
Knowledge distillation (KD) is a key technique for compressing large language models into smaller ones while preserving performance. Despite the recent traction of KD research, its effectiveness for smaller language models (LMs) and the mechanisms driving knowledge transfer remain underexplored. In this work, we present the first large-scale empirical and statistical analysis of KD across models ranging from 0.5B to 7B parameters on 14 complex reasoning tasks in a zero-shot setting. Our findings reveal that KD can improve the average performance of smaller models by up to 10%, with a peak task-specific gain of 22%, while providing only marginal benefits (∼1.3%) for larger models. Surprisingly, teacher performance has a minimal impact on student outcomes, while teacher task expertise impacts KD effectiveness. A correlation study indicates that smaller LMs benefit more from KD, whereas larger LMs show diminished gains. Additionally, we uncover a misalignment between improvements in student performance and reasoning fidelity, suggesting that while KD enhances accuracy, it does not always maintain the structured decision-making processes of the teacher. Our ablation study further highlights the importance of teacher signals and logit smoothing in influencing students’ performance after distillation. Overall, our study offers a comprehensive empirical and statistical assessment of KD, highlighting both its benefits and trade-offs when distilling knowledge from larger to smaller LMs.
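The abstract above does not state which distillation objective was used. As an illustration only, the following minimal sketch shows the classic temperature-softened formulation of KD, in which the student is trained to match the teacher's smoothed output distribution. The function names, the default temperature, and the example logits are assumptions made for this sketch and are not taken from the paper.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2.

    A higher temperature flattens both distributions (one form of logit
    smoothing). The T^2 factor is the conventional gradient rescaling.
    """
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = sum(
        pt * math.log(pt / ps)
        for pt, ps in zip(p_teacher, p_student)
        if pt > 0.0
    )
    return kl * temperature ** 2


# Hypothetical logits over a three-way answer choice.
teacher = [4.0, 1.0, 0.5]
print(distillation_loss(teacher, teacher))          # identical outputs: no loss
print(distillation_loss([1.0, 3.0, 0.5], teacher))  # disagreement: positive loss
```

The loss is zero only when the softened student distribution equals the teacher's, so a student can reach the same top-1 answer while still incurring loss.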
pdf
bib
abs
BEDAA: Bayesian Enhanced DeBERTa for Uncertainty-Aware Authorship Attribution
Iqra Zahid
|
Youcheng Sun
|
Riza Batista-Navarro
Authorship Attribution (AA) seeks to identify the author of a given text, yet existing methods often struggle with trustworthiness and interpretability, particularly across different domains, languages, and stylistic variations. These challenges arise from the absence of uncertainty quantification and the inability of current models to adapt to diverse authorship tasks. To address these limitations, we introduce BEDAA, a Bayesian-Enhanced DeBERTa framework that integrates Bayesian reasoning with transformer-based language models to enable uncertainty-aware and interpretable authorship attribution. BEDAA achieves up to 19.69% improvement in F1-score across multiple authorship attribution tasks, including binary, multiclass, and dynamic authorship detection. By incorporating confidence ranking, uncertainty decomposition, and probabilistic reasoning, BEDAA improves robustness while offering transparent decision-making processes. Furthermore, BEDAA extends beyond traditional AA by demonstrating its effectiveness in human vs. machine-generated text classification, code authorship detection, and cross-lingual attribution. These advances establish BEDAA as a generalised, interpretable, and adaptable framework for modern authorship attribution challenges.
pdf
bib
abs
Benchmarking the Benchmarks: Reproducing Climate-Related NLP Tasks
Tom Calamai
|
Oana Balalau
|
Fabian M. Suchanek
Significant efforts have been made in the NLP community to facilitate the automatic analysis of climate-related corpora by tasks such as climate-related topic detection, climate risk classification, question answering over climate topics, and many more. In this work, we perform a reproducibility study on 8 tasks and 29 datasets, testing 6 models. We find that many tasks rely heavily on surface-level keyword patterns rather than deeper semantic or contextual understanding. Moreover, we find that 96% of the datasets contain annotation issues, with 16.6% of the sampled wrong predictions of a zero-shot classifier being actually clear annotation mistakes, and 38.8% being ambiguous examples. These results call into question the reliability of current benchmarks to meaningfully compare models and highlight the need for improved annotation practices. We conclude by outlining actionable recommendations to enhance dataset quality and evaluation robustness.
pdf
bib
abs
Exploring Supervised Approaches to the Detection of Anthropomorphic Language in the Reporting of NLP Venues
Matthew Shardlow
|
Ashley Williams
|
Charlie Roadhouse
|
Filippos Ventirozos
|
Piotr Przybyła
We investigate the prevalence of anthropomorphic language in the reporting of AI technology, focussed on NLP and LLMs. We undertake a corpus annotation focussing on one year of ACL long-paper abstracts and news articles from the same period. We find that 74% of ACL abstracts and 88% of news articles contain some form of anthropomorphic description of AI technology. Further, we train a regression classifier based on BERT, demonstrating that we can automatically label abstracts for their degree of anthropomorphism based on our corpus. We conclude by applying this labelling process to abstracts available in the entire history of the ACL Anthology and reporting on diachronic and inter-venue findings, showing that the degree of anthropomorphism is increasing at all examined venues over time.
pdf
bib
abs
PersonaLens: A Benchmark for Personalization Evaluation in Conversational AI Assistants
Zheng Zhao
|
Clara Vania
|
Subhradeep Kayal
|
Naila Khan
|
Shay B Cohen
|
Emine Yilmaz
Large language models (LLMs) have advanced conversational AI assistants. However, systematically evaluating how well these assistants apply personalization—adapting to individual user preferences while completing tasks—remains challenging. Existing personalization benchmarks focus on chit-chat, non-conversational tasks, or narrow domains, failing to capture the complexities of personalized task-oriented assistance. To address this, we introduce PersonaLens, a comprehensive benchmark for evaluating personalization in task-oriented AI assistants. Our benchmark features diverse user profiles equipped with rich preferences and interaction histories, along with two specialized LLM-based agents: a user agent that engages in realistic task-oriented dialogues with AI assistants, and a judge agent that employs the LLM-as-a-Judge paradigm to assess personalization, response quality, and task success. Through extensive experiments with current LLM assistants across diverse tasks, we reveal significant variability in their personalization capabilities, providing crucial insights for advancing conversational AI systems.
pdf
bib
abs
iAgent: LLM Agent as a Shield between User and Recommender Systems
Wujiang Xu
|
Yunxiao Shi
|
Zujie Liang
|
Xuying Ning
|
Kai Mei
|
Kun Wang
|
Xi Zhu
|
Min Xu
|
Yongfeng Zhang
Traditional recommender systems usually take the user-platform paradigm, where users are directly exposed under the control of the platform’s recommendation algorithms. However, defects in recommendation algorithms may put users in very vulnerable positions under this paradigm. First, many sophisticated models are often designed with commercial objectives in mind, focusing on the platform’s benefits, which may hinder their ability to protect and capture users’ true interests. Second, these models are typically optimized using data from all users, which may overlook individual users’ preferences. Due to these shortcomings, users may experience several disadvantages under the traditional user-platform direct exposure paradigm, such as lack of control over the recommender system, potential manipulation by the platform, echo chamber effects, or lack of personalization for less active users due to the dominance of active users during collaborative learning. Therefore, there is an urgent need to develop a new paradigm to protect user interests and alleviate these issues. Recently, some researchers have introduced LLM agents to simulate user behaviors, but these approaches primarily aim to optimize platform-side performance, leaving core issues in recommender systems unresolved. To address these limitations, we propose a new user-agent-platform paradigm, where the agent serves as a protective shield between the user and the recommender system, enabling indirect exposure. To this end, we first construct four recommendation datasets, denoted as InstructRec, along with user instructions for each record. To understand users’ intentions, we design an Instruction-aware Agent capable of using tools to acquire knowledge from external environments. Moreover, we introduce an Individual Instruction-aware Agent, which incorporates a dynamic memory mechanism to optimize from individual feedback.
Results on four datasets demonstrate that the proposed agent consistently achieves an average improvement of 16.6% over SOTA baselines across ranking metrics. Moreover, iAgent mitigates echo chamber effects and effectively alleviates model bias against disadvantaged (less-active) users, serving as a shield between user and recommender systems.
pdf
bib
abs
FactLens: Benchmarking Fine-Grained Fact Verification
Kushan Mitra
|
Dan Zhang
|
Sajjadur Rahman
|
Estevam Hruschka
Large Language Models (LLMs) have shown impressive capability in language generation and understanding, but their tendency to hallucinate and produce factually incorrect information remains a key limitation. To verify LLM-generated content and claims from other sources, traditional verification approaches often rely on holistic models that assign a single factuality label to complex claims, potentially obscuring nuanced errors. In this paper, we advocate for a shift towards fine-grained verification, where complex claims are broken down into smaller sub-claims for individual verification, allowing for more precise identification of inaccuracies, improved transparency, and reduced ambiguity in evidence retrieval. However, generating sub-claims poses challenges, such as maintaining context and ensuring semantic equivalence with respect to the original claim. We introduce FactLens, a benchmark for evaluating fine-grained fact verification, with metrics and automated evaluators of sub-claim quality. The benchmark data is manually curated to ensure high-quality ground truth. Our results show alignment between automated FactLens evaluators and human judgments, and we discuss the impact of sub-claim characteristics on the overall verification performance.
pdf
bib
abs
Process-based Self-Rewarding Language Models
Shimao Zhang
|
Xiao Liu
|
Xin Zhang
|
Junxiao Liu
|
Zheheng Luo
|
Shujian Huang
|
Yeyun Gong
Large Language Models have demonstrated outstanding performance across various downstream tasks and have been widely applied in multiple scenarios. Human-annotated preference data is used for training to further improve LLMs’ performance, which is constrained by the upper limit of human performance. Therefore, the Self-Rewarding method has been proposed, where LLMs generate training data by rewarding their own outputs. However, the existing self-rewarding paradigm is not effective in mathematical reasoning scenarios and may even lead to a decline in performance. In this work, we propose the Process-based Self-Rewarding pipeline for language models, which introduces long-thought reasoning, step-wise LLM-as-a-Judge, and step-wise preference optimization within the self-rewarding paradigm. Our new paradigm successfully enhances the performance of LLMs on multiple mathematical reasoning benchmarks through iterative Process-based Self-Rewarding, demonstrating the immense potential of process-based self-rewarding to achieve LLM reasoning that may surpass human capabilities.
pdf
bib
abs
The Devil Is in the Word Alignment Details: On Translation-Based Cross-Lingual Transfer for Token Classification Tasks
Benedikt Ebing
|
Goran Glavaš
Translation-based strategies for cross-lingual transfer (XLT), such as translate-train (training on noisy target language data translated from the source language) and translate-test (evaluating on noisy source language data translated from the target language), are competitive XLT baselines. In XLT for token classification tasks, however, these strategies include label projection, the challenging step of mapping the labels from each token in the original sentence to its counterpart(s) in the translation. Although word aligners (WAs) are commonly used for label projection, the low-level design decisions for applying them to translation-based XLT have not been systematically investigated. Moreover, recent marker-based methods, which project labeled spans by inserting tags around them before (or after) translation, claim to outperform WAs in label projection for XLT. In this work, we revisit WAs for label projection, systematically investigating the effects of low-level design decisions on token-level XLT: (i) the algorithm for projecting labels between (multi-)token spans, (ii) filtering strategies to reduce the number of noisily mapped labels, and (iii) the pre-tokenization of the translated sentences. We find that all of these substantially impact translation-based XLT performance and show that, with optimized choices, XLT with WA offers performance at least comparable to that of marker-based methods. We then introduce a new projection strategy that ensembles translate-train and translate-test predictions and demonstrate that it substantially outperforms the marker-based projection. Crucially, we show that our proposed ensembling also reduces sensitivity to low-level WA design choices, resulting in more robust XLT for token classification tasks.
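To make the label projection step concrete, here is a minimal sketch of one possible span-projection heuristic: each labeled source span is mapped to the contiguous range between the first and last target tokens aligned to it. This is only one of many design choices and is not claimed to be the algorithm the authors recommend. The function name `project_spans`, the alignment pairs, and the example sentence are hypothetical.

```python
def project_spans(source_spans, alignments, target_length):
    """Project labeled source spans onto target tokens via word alignments.

    source_spans: list of (start, end_exclusive, label) over source tokens.
    alignments:   set of (source_index, target_index) pairs from a word aligner.
    Returns BIO tags for the target sentence.
    """
    target_labels = ["O"] * target_length
    for start, end, label in source_spans:
        # Collect every target token aligned to any token inside the span.
        aligned = sorted(t for s, t in alignments if start <= s < end)
        if not aligned:
            continue  # no alignment found: the span is dropped
        first, last = aligned[0], aligned[-1]
        # Fill the whole range, including unaligned tokens in between.
        target_labels[first] = f"B-{label}"
        for t in range(first + 1, last + 1):
            target_labels[t] = f"I-{label}"
    return target_labels


# Source: "Barack Obama visited Paris"; target: "Barack Obama hat Paris besucht".
spans = [(0, 2, "PER"), (3, 4, "LOC")]
alignments = {(0, 0), (1, 1), (2, 2), (2, 4), (3, 3)}
print(project_spans(spans, alignments, target_length=5))
# ['B-PER', 'I-PER', 'O', 'B-LOC', 'O']
```

Filling the range between the first and last aligned token keeps projected spans contiguous, at the cost of absorbing unaligned tokens. Dropping unaligned spans is the simplest possible filtering strategy.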
pdf
bib
abs
ShieldHead: Decoding-time Safeguard for Large Language Models
Zitao Xuan
|
Xiaofeng Mao
|
Da Chen
|
Xin Zhang
|
Yuhan Dong
|
Jun Zhou
In light of the widespread deployment of Large Language Models (LLMs), the responsibility for safeguarding and regulating LLM-generated content has taken on heightened significance. Recent advancements in LLM-based moderation methods, e.g., LlamaGuard, have demonstrated remarkable promise in identifying safety risks associated with both inputs and outputs in human-AI interactions. However, integrating LLM-based safeguards into a chatbot system requires an additional inference stage involving a moderation LLM with billions of parameters, which significantly increases computational costs and reduces overall efficiency. In this paper, we demonstrate that simply learning a classification head on the last-layer hidden states of the dialogue model provides a strong capability to identify harmful content. The classification head, referred to as ShieldHead, serves as an auxiliary branch paralleled with the next-token-prediction LM head, enabling the detection of potential risks in past text sequences. Additionally, a label disambiguation technique is employed to supervise ShieldHead with both token-level and sentence-level labels, which further enhances its performance. ShieldHead exhibits remarkable efficiency during inference, providing real-time moderation results alongside token-wise streaming output during the chatbot system’s decoding phase. Extensive experimental results demonstrate the superiority of the proposed framework: state-of-the-art performance on the XSTest and SafeRLHF datasets while running about 300× faster (<1ms) than previous LLM-based moderation models, with 99% fewer parameters than LlamaGuard.
pdf
bib
abs
A Survey on Proactive Defense Strategies Against Misinformation in Large Language Models
Shuliang Liu
|
Hongyi Liu
|
Aiwei Liu
|
Duan Bingchen
|
Zheng Qi
|
Yibo Yan
|
He Geng
|
Peijie Jiang
|
Jia Liu
|
Xuming Hu
The widespread deployment of large language models (LLMs) across critical domains has amplified the societal risks posed by algorithmically generated misinformation. Unlike traditional false content, LLM-generated misinformation can be self-reinforcing, highly plausible, and capable of rapid propagation across multiple languages, which traditional detection methods fail to mitigate effectively. This paper introduces a proactive defense paradigm, shifting from passive post hoc detection to anticipatory mitigation strategies. We propose a Three Pillars framework: (1) Knowledge Credibility, fortifying the integrity of training and deployed data; (2) Inference Reliability, embedding self-corrective mechanisms during reasoning; and (3) Input Robustness, enhancing the resilience of model interfaces against adversarial attacks. Through a comprehensive survey of existing techniques and a comparative meta-analysis, we demonstrate that proactive defense strategies offer up to 63% improvement over conventional methods in misinformation prevention, despite non-trivial computational overhead and generalization challenges. We argue that future research should focus on co-designing robust knowledge foundations, reasoning certification, and attack-resistant interfaces to ensure LLMs can effectively counter misinformation across varied domains.
pdf
bib
abs
Smotrom tvoja på ander drogoj verden! Resurrecting Dead Pidgin with Generative Models: Russenorsk Case Study
Alexey Tikhonov
|
Sergei Shteiner
|
Anna Bykova
|
Ivan P. Yamshchikov
Russenorsk, a pidgin language historically used in trade interactions between Russian and Norwegian speakers, represents a unique linguistic phenomenon. In this paper, we attempt to analyze its lexicon using modern large language models (LLMs), based on surviving literary sources. We construct a structured dictionary of the language, grouped by synonyms and word origins. Subsequently, we use this dictionary to formulate hypotheses about the core principles of word formation and grammatical structure in Russenorsk and show which hypotheses generated by large language models correspond to hypotheses previously proposed in the academic literature. We also develop a “reconstruction” translation agent that generates hypothetical Russenorsk renderings of contemporary Russian and Norwegian texts.
pdf
bib
abs
PromptCoT: Synthesizing Olympiad-level Problems for Mathematical Reasoning in Large Language Models
Xueliang Zhao
|
Wei Wu
|
Jian Guan
|
Lingpeng Kong
The ability of large language models to solve complex mathematical problems has progressed significantly, particularly for tasks requiring advanced reasoning. However, the scarcity of sufficiently challenging problems, particularly at the Olympiad level, hinders further advancements. In this work, we introduce PromptCoT, a novel approach for automatically generating high-quality Olympiad-level math problems. The proposed method synthesizes complex problems based on mathematical concepts and the rationale behind problem construction, emulating the thought processes of experienced problem designers. We provide a theoretical analysis demonstrating that an optimal rationale should maximize both the likelihood of rationale generation given the associated concepts and the likelihood of problem generation conditioned on both the rationale and the concepts. Our method is evaluated on standard benchmarks including GSM8K, MATH-500, and AIME2024, where it consistently outperforms existing problem generation methods. Furthermore, we demonstrate that PromptCoT exhibits superior scalability, consistently maintaining high performance as the dataset size increases, outperforming the baselines.
pdf
bib
abs
Speculative Sampling via Exponential Races
Szymon Kobus
|
Deniz Gunduz
Speculative decoding accelerates large language model inference using a smaller draft model. In this paper, we establish a surprising connection between speculative sampling and the concept of channel simulation from information theory, which aims at simulating a noisy channel using as few bits as possible. This connection allows us to provide an information-theoretic analysis of the speed-up that can be achieved by speculative sampling. Leveraging this link, we derive an explicit relation between generation speed-up and the number of tokens k generated by the draft model for large k, which serves as an upper bound for all k. We also propose a novel speculative sampling method via exponential races called ERSS that matches state-of-the-art performance.
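For readers unfamiliar with exponential races, the following minimal sketch illustrates the underlying sampling primitive only, not the ERSS method itself. Each outcome i draws an arrival time distributed as Exp(1) divided by p_i, and the earliest arrival wins with probability p_i. The function names and the example distribution are assumptions made for this illustration.

```python
import random


def exponential_race(probs, rng):
    """Sample an index from `probs` by racing independent exponential clocks.

    Outcome i arrives at time E_i / p_i with E_i ~ Exp(1), i.e. at rate p_i.
    The first arrival is index i with probability p_i / sum(probs).
    """
    best_index, best_time = None, float("inf")
    for i, p in enumerate(probs):
        if p <= 0.0:
            continue  # zero-probability outcomes never win the race
        arrival = rng.expovariate(1.0) / p
        if arrival < best_time:
            best_index, best_time = i, arrival
    return best_index


def empirical_frequencies(probs, trials, seed=0):
    """Run many races and report how often each outcome wins."""
    rng = random.Random(seed)
    counts = [0] * len(probs)
    for _ in range(trials):
        counts[exponential_race(probs, rng)] += 1
    return [c / trials for c in counts]


# Hypothetical next-token distribution over a four-token vocabulary.
print(empirical_frequencies([0.5, 0.3, 0.2, 0.0], trials=20000))
```

With enough trials the printed frequencies should approach the input probabilities, which is the property that makes the race a valid sampler.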
pdf
bib
abs
Going Beyond Your Expectations in Latency Metrics for Simultaneous Speech Translation
Jorge Iranzo-Sánchez
|
Javier Iranzo-Sánchez
|
Adrià Giménez
|
Jorge Civera
Current evaluation practices in Simultaneous Speech Translation (SimulST) systems typically involve segmenting the input audio and corresponding translations, calculating quality and latency metrics for each segment, and averaging the results. Although this approach may provide a reliable estimation of translation quality, it can lead to misleading values of latency metrics due to an inherent assumption that average latency values are good enough estimators of SimulST systems’ response time. However, our detailed analysis of latency evaluations for state-of-the-art SimulST systems demonstrates that latency distributions are often skewed and subject to extreme variations. As a result, the mean in latency metrics fails to capture these anomalies, potentially masking the lack of robustness in some systems and metrics. In this paper, a thorough analysis of the results of systems submitted to recent editions of the IWSLT simultaneous track is provided to support our hypothesis and alternative ways to report latency metrics are proposed in order to provide a better understanding of SimulST systems’ latency.
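The gap between the mean and the rest of a skewed latency distribution can be seen in a few lines. The sketch below uses made-up per-segment latencies and reports the median, the 95th percentile and the maximum alongside the mean. The choice of these particular statistics is an assumption for illustration and is not necessarily the set of alternatives the paper proposes.

```python
import statistics


def latency_summary(latencies_ms):
    """Summarize per-segment latencies with mean, median, p95 and max."""
    ordered = sorted(latencies_ms)
    # With n=100, quantiles() returns the 99 percentile cut points.
    cuts = statistics.quantiles(ordered, n=100, method="inclusive")
    return {
        "mean": statistics.fmean(ordered),
        "median": statistics.median(ordered),
        "p95": cuts[94],
        "max": ordered[-1],
    }


# Hypothetical system: 18 segments answered in 1 s, 2 segments stalled for 11 s.
latencies = [1000] * 18 + [11000] * 2
summary = latency_summary(latencies)
print(summary["mean"], summary["median"], summary["max"])
```

In this toy case the mean is twice the median and matches no segment's actual latency, while the tail statistics expose the two stalled segments directly.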
pdf
bib
abs
Towards a Design Guideline for RPA Evaluation: A Survey of Large Language Model-Based Role-Playing Agents
Chaoran Chen
|
Bingsheng Yao
|
Ruishi Zou
|
Wenyue Hua
|
Weimin Lyu
|
Toby Jia-Jun Li
|
Dakuo Wang
Role-Playing Agent (RPA) is an increasingly popular type of LLM Agent that simulates human-like behaviors in a variety of tasks. However, evaluating RPAs is challenging due to diverse task requirements and agent designs. This paper proposes an evidence-based, actionable, and generalizable evaluation design guideline for LLM-based RPA by systematically reviewing 1,676 papers published between Jan. 2021 and Dec. 2024. Our analysis identifies six agent attributes, seven task attributes, and seven evaluation metrics from existing literature. Based on these findings, we present an RPA evaluation design guideline to help researchers develop more systematic and consistent evaluation methods.
pdf
bib
abs
Recursive Question Understanding for Complex Question Answering over Heterogeneous Personal Data
Philipp Christmann
|
Gerhard Weikum
Question answering over mixed sources, like text and tables, has been advanced by verbalizing all contents and encoding them with a language model. A prominent case of such heterogeneous data is personal information: user devices log vast amounts of data every day, such as calendar entries, workout statistics, shopping records, streaming history, and more. Information needs range from simple look-ups to queries of analytical nature. The challenge is to provide humans with convenient access with a small footprint, so that all personal data stays on the user devices. We present ReQAP, a novel method that creates an executable operator tree for a given question, via recursive decomposition. Operators are designed to enable seamless integration of structured and unstructured sources, and the execution of the operator tree yields a traceable answer. We further release the PerQA benchmark, with persona-based data and questions, covering a diverse spectrum of realistic user needs.
pdf
bib
abs
PreSumm: Predicting Summarization Performance Without Summarizing
Steven Koniaev
|
Ori Ernst
|
Jackie CK Cheung
Despite recent advancements in automatic summarization, state-of-the-art models do not summarize all documents equally well, raising the question: why? While prior research has extensively analyzed summarization models, little attention has been given to the role of document characteristics in influencing summarization performance. In this work, we explore two key research questions. First, do documents exhibit consistent summarization quality across multiple systems? If so, can we predict a document’s summarization performance without generating a summary? We answer both questions affirmatively and introduce PreSumm, a novel task in which a system predicts summarization performance based solely on the source document. Our analysis sheds light on common properties of documents with low PreSumm scores, revealing that they often suffer from coherence issues, complex content, or a lack of a clear main theme. In addition, we demonstrate PreSumm’s practical utility in two key applications: improving hybrid summarization workflows by identifying documents that require manual summarization and enhancing dataset quality by filtering outliers and noisy documents. Overall, our findings highlight the critical role of document properties in summarization performance and offer insights into the limitations of current systems that could serve as the basis for future improvements.
pdf
bib
abs
Mixture of Structural-and-Textual Retrieval over Text-rich Graph Knowledge Bases
Yongjia Lei
|
Haoyu Han
|
Ryan A. Rossi
|
Franck Dernoncourt
|
Nedim Lipka
|
Mahantesh M Halappanavar
|
Jiliang Tang
|
Yu Wang
Text-rich Graph Knowledge Bases (TG-KBs) have become increasingly crucial for answering queries by providing textual and structural knowledge. However, current retrieval methods often retrieve these two types of knowledge in isolation without considering their mutual reinforcement, and existing hybrid methods even bypass structural retrieval entirely. To fill this gap, we propose a Mixture of Structural-and-Textual Retrieval (MoR) to retrieve these two types of knowledge via a Planning-Reasoning-Organizing framework. In the Planning stage, MoR generates textual planning graphs delineating the logic for answering queries. Following planning graphs, in the Reasoning stage, MoR interweaves structural traversal and textual matching to obtain candidates from TG-KBs. In the Organizing stage, MoR further reranks fetched candidates based on their structural trajectory. Extensive experiments demonstrate the superiority of MoR in harmonizing structural and textual retrieval with inspiring insights, including imbalanced retrieving performance across different query logics and the benefits of integrating structural trajectories for candidate reranking.
pdf
bib
abs
Fact Recall, Heuristics or Pure Guesswork? Precise Interpretations of Language Models for Fact Completion
Denitsa Saynova
|
Lovisa Hagström
|
Moa Johansson
|
Richard Johansson
|
Marco Kuhlmann
Language models (LMs) can make a correct prediction based on many possible signals in a prompt, not all corresponding to recall of factual associations. However, current interpretations of LMs fail to take this into account. For example, given the query “Astrid Lindgren was born in” with the corresponding completion “Sweden”, no difference is made between whether the prediction was based on knowing where the author was born or assuming that a person with a Swedish-sounding name was born in Sweden. In this paper, we present a model-specific recipe, PrISM, for constructing datasets with examples of four different prediction scenarios: generic language modeling, guesswork, heuristics recall and exact fact recall. We apply two popular interpretability methods to the scenarios: causal tracing (CT) and information flow analysis. We find that both yield distinct results for each scenario. Results for exact fact recall and generic language modeling scenarios confirm previous conclusions about the importance of mid-range MLP sublayers for fact recall, while results for guesswork and heuristics indicate a critical role of late last token position MLP sublayers. In summary, we contribute resources for a more extensive and granular study of fact completion in LMs, together with analyses that provide a more nuanced understanding of how LMs process fact-related queries.
pdf
bib
abs
FPE2M2: Approaching Lossless and Efficient Quantization with Native Floating Point
Ke Yi
|
Jianwei Zhang
|
Zhiying Xu
|
Xinlong Yang
|
Yang Zhou
|
Minmin Sun
|
Zengke Liu
|
Tong Zhang
|
Junyang Lin
|
Jingren Zhou
Auto-regressive decoding is a memory-bound job, meaning decoding inference performance is limited by the bandwidth rather than the computational capabilities of the GPU. Weight-only quantization is a promising method to address the memory-bound limitations. Previous studies have followed one of two approaches. Some have exclusively studied integer quantization while ignoring the Gaussian distribution nature of LLMs’ weights. Others have proposed non-uniform quantization but incurred additional I/O overhead due to lookup tables, e.g. NF4. In this work, we extend the IEEE 754 floating-point standard to the ExMy quantization schema, which allocates x bits for the exponent and y bits for the mantissa to represent a number. In terms of runtime efficiency, we demonstrate that the conversion from ExMy to FP16 can be realized through register-level operations, which can achieve almost the same performance as INT5. In terms of quantization loss, we analyze that of different ExMy settings, where the E2M2 schema achieves an optimal balance, offering the highest efficiency with lossless accuracy. We further propose the FPE2M2 framework, which supports lossless weight-only quantization inference, and validate it on Qwen and LLaMA models across various modalities, such as text, image, and audio tasks, where it achieves faster inference while maintaining nearly lossless accuracy.
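As an illustration of the ExMy idea, the sketch below decodes a 5-bit code (1 sign bit, 2 exponent bits, 2 mantissa bits) into a real value. It assumes IEEE-754-style conventions: an exponent bias of 1, subnormals when the exponent field is zero, and no codes reserved for infinity or NaN. The abstract does not state these details, so the bias, the bit layout, and the name `decode_exmy` are hypothetical, and the authors' actual encoding may differ.

```python
def decode_exmy(code: int, x: int = 2, y: int = 2, bias: int = 1) -> float:
    """Decode a (1 + x + y)-bit sign/exponent/mantissa code to a float.

    Assumed layout (not taken from the paper): [sign | x exponent bits | y mantissa bits].
    """
    sign = -1.0 if (code >> (x + y)) & 1 else 1.0
    exponent = (code >> y) & ((1 << x) - 1)
    mantissa = code & ((1 << y) - 1)
    if exponent == 0:
        # Subnormal: no implicit leading 1, exponent fixed at (1 - bias).
        return sign * (mantissa / (1 << y)) * 2.0 ** (1 - bias)
    # Normal: implicit leading 1.
    return sign * (1.0 + mantissa / (1 << y)) * 2.0 ** (exponent - bias)


# Enumerate the non-negative values representable under these assumptions.
positive_values = sorted(decode_exmy(c) for c in range(16))
```

Under these assumptions the grid is denser near zero (steps of 0.25) and coarser at larger magnitudes (steps of 1), which is the kind of non-uniform spacing that suits roughly Gaussian weights without needing a lookup table.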
pdf
bib
abs
Asymmetric Conflict and Synergy in Post-training for LLM-based Multilingual Machine Translation
Tong Zheng
|
Yan Wen
|
Huiwen Bao
|
Junfeng Guo
|
Heng Huang
The emergence of Large Language Models (LLMs) has advanced multilingual machine translation (MMT), yet the Curse of Multilinguality (CoM) remains a major challenge. Existing work in LLM-based MMT typically mitigates this issue by scaling up the training and computation budget, which raises a critical question: Is scaling up the training and computation budget truly necessary for high-quality MMT, or can a deeper understanding of CoM provide a more efficient solution? To explore this problem, we analyze linguistic conflicts and synergy, the underlying mechanism of CoM during the post-training phase. We identify an asymmetric phenomenon in linguistic conflicts and synergy: the dominance of conflicts and synergy varies across translation directions, leading to sub-optimal adaptation in existing post-training methods. We further find that a significant bottleneck in MMT appears to lie in post-training rather than multilingual pre-training, suggesting the need for more effective adaptation strategies. Building on these new insights, we propose a direction-aware training approach, combined with group-wise model merging, to address asymmetry in linguistic conflicts and synergy explicitly. Leveraging this strategy, our method fine-tunes X-ALMA-13B-Pretrain (trained only with multilingual pre-training), achieving comparable performance to X-ALMA-13B (only SFT) while using only 20B pretraining tokens and 17B parameters (5.5× fewer pretraining tokens and a 1.7× smaller model), with just a 0.85 COMET drop on the Flores-200 test sets of 50 languages.
pdf
bib
abs
VISIAR: Empower MLLM for Visual Story Ideation
Zhaoyang Xia
|
Somdeb Sarkhel
|
Mehrab Tanjim
|
Stefano Petrangeli
|
Ishita Dasgupta
|
Yuxiao Chen
|
Jinxuan Xu
|
Di Liu
|
Saayan Mitra
|
Dimitris N. Metaxas
Ideation, the process of forming ideas from concepts, is a big part of the content creation process. However, the noble goal of helping visual content creators by suggesting meaningful sequences of visual assets from a limited collection is challenging. It requires a nuanced understanding of visual assets and the integration of open-world knowledge to support creative exploration. Despite its importance, this task has yet to be explored fully in existing literature. To fill this gap, we propose Visual Story Ideation, a novel and underexplored task focused on the automated selection and arrangement of visual assets into coherent sequences that convey expressive storylines. We also present VISIAR, Visual Ideation through Sequence Integration and Asset Rearrangement, a robust framework leveraging Multimodal Large Language Models (MLLMs) and a novel Story Graph mechanism. Our framework operates in three key stages: visual content understanding, candidate asset selection, and asset rearrangement via MLLMs. In addition, we curated a new benchmark dataset, called VTravel, to evaluate our methods both qualitatively and quantitatively. User studies and GPT-as-the-judge evaluation show that our approach surpasses a GPT-4o-based baseline by an average of 33.5% and 18.5% across three different metrics, demonstrating the effectiveness of our framework for generating compelling visual stories.
pdf
bib
abs
Same Company, Same Signal: The Role of Identity in Earnings Call Transcripts
Ding Yu
|
Zhuo Liu
|
Hangfeng He
Post-earnings volatility prediction is critical for investors, with previous works often leveraging earnings call transcripts under the assumption that their rich semantics contribute significantly. To further investigate how transcripts impact volatility, we introduce DEC, a dataset featuring accurate volatility calculations enabled by the previously overlooked beforeAfterMarket attribute and dense ticker coverage. Unlike established benchmarks, where each ticker has only around two earnings, DEC provides 20 earnings records per ticker. Using DEC, we reveal that post-earnings volatility undergoes significant shifts, with each ticker displaying a distinct volatility distribution. To leverage historical post-earnings volatility and capture ticker-specific patterns, we propose two training-free baselines: Post-earnings Volatility (PEV) and Same-ticker Post-earnings Volatility (STPEV). These baselines surpass all transcripts-based models on DEC as well as on established benchmarks. Additionally, we demonstrate that current transcript representations predominantly capture ticker identity rather than offering financially meaningful insights specific to each earnings. This is evidenced by two key observations: earnings representations from the same ticker exhibit significantly higher similarity compared to those from different tickers, and predictions from transcript-based models show strong correlations with prior post-earnings volatility.
pdf
bib
abs
Understanding and Meeting Practitioner Needs When Measuring Representational Harms Caused by LLM-Based Systems
Emma Harvey
|
Emily Sheng
|
Su Lin Blodgett
|
Alexandra Chouldechova
|
Jean Garcia-Gathright
|
Alexandra Olteanu
|
Hanna Wallach
The NLP research community has made publicly available numerous instruments for measuring representational harms caused by large language model (LLM)-based systems. These instruments have taken the form of datasets, metrics, tools, and more. In this paper, we examine the extent to which such instruments meet the needs of practitioners tasked with evaluating LLM-based systems. Via semi-structured interviews with 12 such practitioners, we find that practitioners are often unable to use publicly available instruments for measuring representational harms. We identify two types of challenges. In some cases, instruments are not useful because they do not meaningfully measure what practitioners seek to measure or are otherwise misaligned with practitioner needs. In other cases, instruments, even useful ones, are not used by practitioners due to practical and institutional barriers impeding their uptake. Drawing on measurement theory and pragmatic measurement, we provide recommendations for addressing these challenges to better meet practitioner needs.
pdf
bib
abs
Mind the (Belief) Gap: Group Identity in the World of LLMs
Angana Borah
|
Marwa Houalla
|
Rada Mihalcea
Social biases and belief-driven behaviors can significantly impact Large Language Models’ (LLMs’) decisions on several tasks. As LLMs are increasingly used in multi-agent systems for societal simulations, their ability to model fundamental group psychological characteristics remains critical yet under-explored. In this study, we present a multi-agent framework that simulates belief congruence, a classical group psychology theory that plays a crucial role in shaping societal interactions and preferences. Our findings reveal that LLMs exhibit amplified belief congruence compared to humans, across diverse contexts. We further investigate the implications of this behavior on two downstream tasks: (1) misinformation dissemination and (2) LLM learning, finding that belief congruence in LLMs increases misinformation dissemination and impedes learning. To mitigate these negative impacts, we propose strategies inspired by: (1) contact hypothesis, (2) accuracy nudges, and (3) global citizenship framework. Our results show that the best strategies reduce misinformation dissemination by up to 37% and enhance learning by 11%. Bridging social psychology and AI, our work provides insights to navigate real-world interactions using LLMs while addressing belief-driven biases.
pdf
bib
abs
A General Framework to Enhance Fine-tuning-based LLM Unlearning
Jie Ren
|
Zhenwei Dai
|
Xianfeng Tang
|
Hui Liu
|
Jingying Zeng
|
Zhen Li
|
Rahul Goutam
|
Suhang Wang
|
Yue Xing
|
Qi He
|
Hui Liu
Unlearning has been proposed to remove copyrighted and privacy-sensitive data from Large Language Models (LLMs). Existing approaches primarily rely on fine-tuning-based methods, which can be categorized into gradient ascent-based (GA-based) and suppression-based methods. However, they often degrade model utility (the ability to respond to normal prompts). In this work, we aim to develop a general framework that enhances the utility of fine-tuning-based unlearning methods. To achieve this goal, we first investigate the common property between GA-based and suppression-based methods. We unveil that GA-based methods unlearn by distinguishing the target data (i.e., the data to be removed) and suppressing related generations, which is essentially the same strategy employed by suppression-based methods. Inspired by this finding, we introduce Gated Representation UNlearning (GRUN), which has two components: a soft gate function for distinguishing target data and a suppression module using Representation Fine-tuning (ReFT) to adjust representations rather than model parameters. Experiments show that GRUN significantly improves both unlearning and utility. Meanwhile, it is general for fine-tuning-based methods, efficient, and promising for sequential unlearning.
pdf
bib
abs
Right Answer, Wrong Score: Uncovering the Inconsistencies of LLM Evaluation in Multiple-Choice Question Answering
Francesco Maria Molfese
|
Luca Moroni
|
Luca Gioffré
|
Alessandro Scirè
|
Simone Conia
|
Roberto Navigli
One of the most widely used tasks for evaluating Large Language Models (LLMs) is Multiple-Choice Question Answering (MCQA). While open-ended question answering tasks are more challenging to evaluate, MCQA tasks are, in principle, easier to assess, as the model’s answer is thought to be simple to extract and is compared directly to a set of predefined choices. However, recent studies have started to question the reliability of MCQA evaluation, showing that multiple factors can significantly impact the reported performance of LLMs, especially when the model generates free-form text before selecting one of the answer choices. In this work, we shed light on the inconsistencies of MCQA evaluation strategies, which can lead to inaccurate and misleading model comparisons. We systematically analyze whether existing answer extraction methods are aligned with human judgment, and how they are influenced by answer constraints in the prompt across different domains. Our experiments demonstrate that traditional evaluation strategies often underestimate LLM capabilities, while LLM-based answer extractors are prone to systematic errors. Moreover, we reveal a fundamental trade-off between including format constraints in the prompt to simplify answer extraction and allowing models to generate free-form text to improve reasoning. Our findings call for standardized evaluation methodologies and highlight the need for more reliable and consistent MCQA evaluation practices.
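To make the extraction issue concrete, the following is a minimal sketch of two answer extractors applied to the same free-form output. Both functions, their regular expressions, and the sample output are illustrative assumptions, not the extraction methods evaluated in the paper.

```python
import re
from typing import Optional


def extract_strict(output: str) -> Optional[str]:
    """Accept only a bare choice letter such as 'B', '(B)' or 'B.' (hypothetical rule)."""
    match = re.match(r"\s*\(?([A-D])\)?[\s.:]*$", output)
    return match.group(1) if match else None


def extract_lenient(output: str) -> Optional[str]:
    """Take the last 'answer is X' pattern, falling back to the strict rule."""
    found = re.findall(r"answer is\s*\(?([A-D])\)?", output, flags=re.IGNORECASE)
    return found[-1].upper() if found else extract_strict(output)


free_form = "Let's think step by step. 2 + 2 = 4, so the answer is (B)."
strict_result = extract_strict(free_form)    # the reasoning text defeats the strict rule
lenient_result = extract_lenient(free_form)  # the final choice is recovered
```

With the strict rule, a correct free-form response is scored as unanswered, which is one way an evaluation can underestimate a model. The lenient rule has its own failure modes (for example, it would misread "the answer is a matter of debate"), so neither should be taken as reliable without checking against human judgment.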
pdf
bib
abs
Machine Theory of Mind Needs Machine Validation
Adil Soubki
|
Owen Rambow
In the last couple of years, there has been a flood of interest in studying the extent to which language models (LMs) have a theory of mind (ToM), the ability to ascribe mental states to themselves and others. The results provide an unclear picture of the current state of the art, with some finding near-human performance and others near-zero. To make sense of this landscape, we perform a survey of 16 recent studies aimed at measuring ToM in LMs and find that, while almost all perform checks for human-identifiable issues, fewer than half do so for patterns only a machine might exploit. Among those that do perform such validation, which we call machine validation, none find that LMs exceed human performance. We conclude that the datasets that show high LM performance on ToM tasks are easier than their peers, likely due to the presence of spurious patterns in the data, and we caution against building ToM benchmarks relying solely on human validation of the data.
pdf
bib
abs
MiniKV: Pushing the Limits of 2-Bit KV Cache via Compression and System Co-Design for Efficient Long Context Inference
Akshat Sharma
|
Hangliang Ding
|
Jianping Li
|
Neel Dani
|
Minjia Zhang
State-of-the-art 2-bit KV cache quantization techniques achieve excellent results in accelerating LLM inference while retaining accuracy on long context tasks. However, further pushing the compression ratio fails to deliver performance gains. In this work, we revisit these approaches by considering, additionally, adaptive KV methods that retain LLM accuracy with only a subset of KV states. This leads us to propose a method based on 2-bit KV cache quantization with adaptive KV policies. In addition, we take an algorithm and system co-design approach by developing hardware-friendly kernels to accelerate LLM inference while making MiniKV compatible with existing memory-efficient attention techniques such as FlashAttention, effectively translating algorithmic improvements into system performance gains. Experiments on a wide range of long context tasks show that MiniKV effectively achieves >80% KV cache compression while retaining accuracy, outperforming state-of-the-art methods while achieving excellent latency, throughput, and memory consumption improvements in long context inference.
pdf
bib
abs
Sci-LoRA: Mixture of Scientific LoRAs for Cross-Domain Lay Paraphrasing
Ming Cheng
|
Jiaying Gong
|
Hoda Eldardiry
Lay paraphrasing aims to make scientific information accessible to audiences without technical backgrounds. However, most existing studies focus on a single domain, such as biomedicine. With the rise of interdisciplinary research, it is increasingly necessary to comprehend knowledge spanning multiple technical fields. To address this, we propose Sci-LoRA, a model that leverages a mixture of LoRAs fine-tuned on multiple scientific domains. In particular, Sci-LoRA dynamically generates and applies weights for each LoRA, enabling it to adjust the impact of different domains based on the input text, without requiring explicit domain labels. To balance domain-specific knowledge and generalization across various domains, Sci-LoRA integrates information at both the data and model levels. This dynamic fusion enhances the adaptability and performance across various domains. Experimental results across twelve domains on five public datasets show that Sci-LoRA significantly outperforms state-of-the-art large language models and demonstrates flexible generalization and adaptability in cross-domain lay paraphrasing.
pdf
bib
abs
Trick or Neat: Adversarial Ambiguity and Language Model Evaluation
Antonia Karamolegkou
|
Oliver Eberle
|
Phillip Rust
|
Carina Kauf
|
Anders Søgaard
Detecting ambiguity is important for language understanding, including uncertainty estimation, humour detection, and processing garden path sentences. We assess language models’ sensitivity to ambiguity by introducing an adversarial ambiguity dataset that includes syntactic, lexical, and phonological ambiguities along with adversarial variations (e.g., word-order changes, synonym replacements, and random-based alterations). Our findings show that direct prompting fails to robustly identify ambiguity, while linear probes trained on model representations can decode ambiguity with high accuracy, sometimes exceeding 90%. Our results offer insights into the prompting paradigm and how language models encode ambiguity at different layers.
pdf
bib
abs
Biases Propagate in Encoder-based Vision-Language Models: A Systematic Analysis From Intrinsic Measures to Zero-shot Retrieval Outcomes
Kshitish Ghate
|
Tessa Charlesworth
|
Mona T. Diab
|
Aylin Caliskan
To build fair AI systems we need to understand how social-group biases intrinsic to foundational encoder-based vision-language models (VLMs) manifest in biases in downstream tasks. In this study, we demonstrate that intrinsic biases in VLM representations systematically “carry over” or propagate into zero-shot retrieval tasks, revealing how deeply rooted biases shape a model’s outputs. We introduce a controlled framework to measure this propagation by correlating (a) intrinsic measures of bias in the representational space with (b) extrinsic measures of bias in zero-shot text-to-image (TTI) and image-to-text (ITT) retrieval. Results show substantial correlations between intrinsic and extrinsic bias, with an average 𝜌 = 0.83 ± 0.10. This pattern is consistent across 114 analyses, both retrieval directions, six social groups, and three distinct VLMs. Notably, we find that larger/better-performing models exhibit greater bias propagation, a finding that raises concerns given the trend towards increasingly complex AI models. Our framework introduces baseline evaluation tasks to measure the propagation of group and valence signals. Investigations reveal that underrepresented groups experience less robust propagation, further skewing their model-related outcomes.
pdf
bib
abs
Stepwise Perplexity-Guided Refinement for Efficient Chain-of-Thought Reasoning in Large Language Models
Yingqian Cui
|
Pengfei He
|
Jingying Zeng
|
Hui Liu
|
Xianfeng Tang
|
Zhenwei Dai
|
Yan Han
|
Chen Luo
|
Jing Huang
|
Zhen Li
|
Suhang Wang
|
Yue Xing
|
Jiliang Tang
|
Qi He
Chain-of-Thought (CoT) reasoning, which breaks down complex tasks into intermediate reasoning steps, has significantly enhanced the performance of large language models (LLMs) on challenging tasks. However, the detailed reasoning process in CoT often incurs long generation times and high computational costs, partly due to the inclusion of unnecessary steps. To address this, we propose a method to identify critical reasoning steps using perplexity as a measure of their importance: a step is deemed critical if its removal causes a significant increase in perplexity. Our method enables models to focus solely on generating these critical steps. This can be achieved through two approaches: refining demonstration examples in few-shot CoT or fine-tuning the model using selected examples that include only critical steps. Comprehensive experiments validate the effectiveness of our method, which achieves a better balance between the reasoning accuracy and efficiency of CoT.
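The criterion described above can be sketched as a leave-one-out loop. This is a minimal illustration under stated assumptions: `critical_steps`, the relative threshold of 20%, and the toy scorer `toy_ppl` are hypothetical, and a real setup would obtain perplexity from a language model's token log-probabilities rather than from a hand-written function.

```python
import math
from typing import Callable, List, Sequence


def perplexity(token_logprobs: Sequence[float]) -> float:
    """Standard perplexity: exp of the negative mean token log-probability."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def critical_steps(
    steps: Sequence[str],
    ppl_fn: Callable[[List[str]], float],
    rel_threshold: float = 0.2,  # assumed cutoff, not a value from the paper
) -> List[str]:
    """Keep the steps whose removal raises perplexity by more than rel_threshold."""
    base = ppl_fn(list(steps))
    kept = []
    for i, step in enumerate(steps):
        without = [s for j, s in enumerate(steps) if j != i]
        if ppl_fn(without) > base * (1 + rel_threshold):
            kept.append(step)
    return kept


def toy_ppl(steps: List[str]) -> float:
    """Stand-in scorer: perplexity grows with each missing arithmetic step."""
    needed = {"compute 3*4=12", "add 12+5=17"}
    return 2.0 * (1 + len(needed - set(steps)))


chain = ["restate the question", "compute 3*4=12", "add 12+5=17"]
kept = critical_steps(chain, toy_ppl)
```

In this toy run the restatement step is dropped because removing it leaves the score unchanged, while both arithmetic steps are kept. The pruned chains could then serve as few-shot demonstrations or fine-tuning targets, as the abstract describes.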
pdf
bib
abs
Can Multimodal Foundation Models Understand Schematic Diagrams? An Empirical Study on Information-Seeking QA over Scientific Papers
Yilun Zhao
|
Chengye Wang
|
Chuhan Li
|
Arman Cohan
This paper introduces MISS-QA, the first benchmark specifically designed to evaluate the ability of models to interpret schematic diagrams within scientific literature. MISS-QA comprises 3,000 expert-annotated examples over 983 scientific papers. In this benchmark, models are tasked with interpreting schematic diagrams that illustrate research overviews and answering corresponding information-seeking questions based on the broader context of the paper. To ensure reliable and consistent evaluation, we propose an automated evaluating protocol powered by open-source LLMs trained on human-scored data. We assess the performance of 18 frontier multimodal foundation models, including o1, Claude-3.5, Llama-3.2-Vision, and Qwen2-VL. We reveal a significant performance gap between these models and human experts on MISS-QA. Our analysis of model performance on unanswerable questions and our detailed error analysis further highlight the strengths and limitations of current models, offering key insights to enhance models in comprehending multimodal scientific literature.
pdf
bib
abs
MultiChallenge: A Realistic Multi-Turn Conversation Evaluation Benchmark Challenging to Frontier LLMs
Kaustubh Deshpande
|
Ved Sirdeshmukh
|
Johannes Baptist Mols
|
Lifeng Jin
|
Ed-Yeremai Hernandez-Cardona
|
Dean Lee
|
Jeremy Kritz
|
Willow E. Primack
|
Summer Yue
|
Chen Xing
We present MultiChallenge, a pioneering benchmark evaluating large language models (LLMs) on conducting multi-turn conversations with human users, a crucial yet underexamined capability for their applications. MultiChallenge identifies four categories of challenges in multi-turn conversations that are not only common and realistic among current human-LLM interactions, but are also challenging to all current frontier LLMs. All four challenges require accurate instruction-following, context allocation, and in-context reasoning at the same time. We also develop an LLM-as-judge with instance-level rubrics to facilitate an automatic evaluation method with fair agreement with experienced human raters. Despite achieving near-perfect scores on existing multi-turn evaluation benchmarks, all frontier models have less than 50% accuracy on MultiChallenge, with the top-performing Claude 3.5 Sonnet (October 2024) achieving just a 41.4% average accuracy.
pdf
bib
abs
Privacy Ripple Effects from Adding or Removing Personal Information in Language Model Training
Jaydeep Borkar
|
Matthew Jagielski
|
Katherine Lee
|
Niloofar Mireshghallah
|
David A. Smith
|
Christopher A. Choquette-Choo
Due to the sensitive nature of personally identifiable information (PII), its owners may have the authority to control its inclusion or request its removal from large language model (LLM) training. Beyond this, PII may be added to or removed from training datasets due to evolving dataset curation techniques, because it was newly scraped for retraining, or because it was included in a new downstream fine-tuning stage. We find that the amount and ease of PII memorization is a dynamic property of a model that evolves throughout training pipelines and depends on commonly altered design choices. We characterize three such novel phenomena: (1) similar-appearing PII seen later in training can elicit memorization of earlier-seen sequences in what we call assisted memorization, and this is a significant factor (in our settings, up to 1/3); (2) adding PII can increase memorization of other PII; and (3) removing PII can lead to other PII being memorized.
pdf
bib
abs
Safety is Not Only About Refusal: Reasoning-Enhanced Fine-tuning for Interpretable LLM Safety
Yuyou Zhang
|
Miao Li
|
William Han
|
Yihang Yao
|
Zhepeng Cen
|
Ding Zhao
Large Language Models (LLMs) are vulnerable to jailbreak attacks that exploit weaknesses in traditional safety alignment, which often relies on rigid refusal heuristics or representation engineering to block harmful outputs. While these techniques are effective against direct adversarial attacks, they fall short on broader safety challenges requiring nuanced, context-aware decision-making. To address this, we propose Reasoning-enhanced Fine-Tuning for interpretable LLM Safety (RATIONAL), a novel framework that trains models to engage in explicit safe reasoning before responding. Fine-tuned models leverage the extensive pretraining knowledge in self-generated reasoning to bootstrap their own safety through structured reasoning, internalizing context-sensitive decision-making. Our findings suggest that safety extends beyond refusal, requiring context awareness for more robust, interpretable, and adaptive responses. Reasoning is not only a core capability of LLMs but also a fundamental mechanism for LLM safety. RATIONAL employs reasoning-enhanced fine-tuning, allowing it to reject harmful prompts while providing meaningful and context-aware responses in complex scenarios.
pdf
bib
abs
Is a cute puyfred cute? Context-dependent form-meaning systematicity in LLMs
Jaïr A. Waal
|
Giovanni Cassani
We investigate static and contextualized embeddings for English pseudowords across a variety of Large Language Models (LLMs), to study (i) how these models represent semantic attributes of strings they encounter for the very first time and (ii) how these representations interact with sentence context. We zoom in on a key semantic attribute, valence, which plays an important role in theories of language processing, acquisition, and evolution. Across three experiments, we show that pseudoword valence is encoded in meaningful ways both in isolation and in context, and that, in some LLMs, pseudowords affect the representation of whole sentences similarly to words. This highlights how, at least for most LLMs we surveyed, pseudowords and words are not qualitatively different constructs. Our study confirms that LLMs capture systematic mappings between form and valence, and shows how different LLMs handle the contextualisation of pseudowords differently. Our findings provide a first computational exploration of how sub-lexical distributional patterns influence the valence of novel strings in context, offering useful insights for theories on the form-meaning interface and how it affects language learning and processing.
pdf
bib
abs
MetaSynth: Meta-Prompting-Driven Agentic Scaffolds for Diverse Synthetic Data Generation
Haris Riaz
|
Sourav Sanjukta Bhabesh
|
Vinayak Arannil
|
Miguel Ballesteros
|
Graham Horwood
Recent smaller language models such as Phi-3.5 and Phi-4 rely on synthetic data generated using larger language models. Questions remain about leveraging synthetic data for other use cases, such as adapting LLMs to specific domains. A key limitation of synthetic data is low diversity, which negatively impacts its downstream applicability for improving other models. To address this, we propose MetaSynth, a method for generating synthetic data that enhances diversity through meta-prompting, where a language model orchestrates multiple “expert” LLM agents to collaboratively generate data. Using only 25 million tokens of synthetic data generated with MetaSynth, we successfully adapt a well-trained LLM (Mistral-7B) to two specialized domains, Finance and Biomedicine, without compromising the capabilities of the resulting model in general tasks. In addition, we evaluate the diversity of our synthetic data using seven automated metrics, and find that it approaches the diversity of LLM pre-training corpora. Continually pre-training Mistral-7B with MetaSynth notably outperforms the base LLM, showing improvements of up to 4.08% in Finance and 13.75% in Biomedicine. The same model shows degraded performance when trained on data generated using a template-based prompt, even when the template includes prior generations and varying in-context exemplars of real data. Our findings suggest that a few million tokens of diverse synthetic data, without mixing in any real data, are sufficient for effective domain adaptation when using MetaSynth.
pdf
bib
abs
MVTamperBench: Evaluating Robustness of Vision-Language Models
Amit Agarwal
|
Srikant Panda
|
Angeline Charles
|
Hitesh Laxmichand Patel
|
Bhargava Kumar
|
Priyaranjan Pattnayak
|
Taki Hasan Rafi
|
Tejaswini Kumar
|
Hansa Meghwani
|
Karan Gupta
|
Dong-Kyu Chae
Multimodal Large Language Models (MLLMs), a recent advancement of Vision-Language Models (VLMs), have driven major progress in video understanding. However, their vulnerability to adversarial tampering and manipulations remains underexplored. To address this gap, we introduce MVTamperBench, a benchmark that systematically evaluates MLLM robustness against five prevalent tampering techniques: rotation, masking, substitution, repetition, and dropping. These techniques are based on real-world visual tampering scenarios such as surveillance interference, social media content edits, and misinformation injection. MVTamperBench comprises ~3.4K original videos, expanded into ~17K tampered clips covering 19 distinct video manipulation tasks. This benchmark challenges models to detect manipulations in spatial and temporal coherence. We evaluate 45 recent MLLMs from 15+ model families. We reveal substantial variability in resilience across tampering types and show that larger parameter counts do not necessarily guarantee robustness. MVTamperBench sets a new benchmark for developing tamper-resilient MLLMs in safety-critical applications, including detecting clickbait, preventing harmful content distribution, and enforcing policies on media platforms. We release all code, data, and benchmark to foster open research in trustworthy video understanding.
pdf
bib
abs
Multimodal Inconsistency Reasoning (MMIR): A New Benchmark for Multimodal Reasoning Models
Qianqi Yan
|
Yue Fan
|
Hongquan Li
|
Shan Jiang
|
Yang Zhao
|
Xinze Guan
|
Ching-Chen Kuo
|
Xin Eric Wang
Existing Multimodal Large Language Models (MLLMs) are predominantly trained and tested on consistent visual-textual inputs, leaving open the question of whether they can handle inconsistencies in real-world, layout-rich content. To bridge this gap, we propose the Multimodal Inconsistency Reasoning (MMIR) benchmark to assess MLLMs’ ability to detect and reason about semantic mismatches in artifacts such as webpages, presentation slides, and posters. MMIR comprises 534 challenging samples, each containing synthetically injected errors across five reasoning-heavy categories: Factual Contradiction, Identity Misattribution, Contextual Mismatch, Quantitative Discrepancy, and Temporal/Spatial Incoherence. We evaluate eight state-of-the-art MLLMs, showing that models with dedicated multimodal reasoning capabilities, such as o1, substantially outperform their counterparts while open-source models remain particularly vulnerable to inconsistency errors. Detailed error analyses further show that models excel in detecting inconsistencies confined to a single modality, particularly in text, but struggle with cross-modal conflicts and complex layouts. Probing experiments reveal that single-modality prompting, including Chain-of-Thought (CoT) and Set-of-Mark (SoM) methods, yields marginal gains, revealing a key bottleneck in cross-modal reasoning. Our findings highlight the need for advanced multimodal reasoning and point to future research on multimodal inconsistency.
pdf
bib
abs
Vision-Language Models Struggle to Align Entities across Modalities
Iñigo Alonso
|
Gorka Azkune
|
Ander Salaberria
|
Jeremy Barnes
|
Oier Lopez De Lacalle
Cross-modal entity linking refers to the ability to align entities and their attributes across different modalities. While cross-modal entity linking is a fundamental skill needed for real-world applications such as multimodal code generation, fake news detection, or scene understanding, it has not been thoroughly studied in the literature. In this paper, we introduce a new task and benchmark to address this gap. Our benchmark, MATE, consists of 5.5k evaluation instances featuring visual scenes aligned with their textual representations. To evaluate cross-modal entity linking performance, we design a question-answering task that involves retrieving one attribute of an object in one modality based on a unique attribute of that object in another modality. We evaluate state-of-the-art Vision-Language Models (VLMs) and humans on this task, and find that VLMs struggle significantly compared to humans, particularly as the number of objects in the scene increases. Our analysis also shows that, while chain-of-thought prompting can improve VLM performance, models remain far from achieving human-level proficiency. These findings highlight the need for further research in cross-modal entity linking and show that MATE is a strong benchmark to support that progress.
pdf
bib
abs
A Multi-Labeled Dataset for Indonesian Discourse: Examining Toxicity, Polarization, and Demographics Information
Lucky Susanto
|
Musa Izzanardi Wijanarko
|
Prasetia Anugrah Pratama
|
Zilu Tang
|
Fariz Akyas
|
Traci Hong
|
Ika Karlina Idris
|
Alham Fikri Aji
|
Derry Tanti Wijaya
Online discourse is increasingly trapped in a vicious cycle where polarizing language fuels toxicity and vice versa. Identity, one of the most divisive issues in modern politics, often increases polarization. Yet, prior NLP research has mostly treated toxicity and polarization as separate problems. In Indonesia, the world’s third-largest democracy, this dynamic threatens democratic discourse, particularly in online spaces. We argue that polarization and toxicity must be studied in relation to each other. To this end, we present a novel multi-label Indonesian dataset annotated for toxicity, polarization, and annotator demographic information. Benchmarking with BERT-base models and large language models (LLMs) reveals that polarization cues improve toxicity classification and vice versa. Including demographic context further enhances polarization classification performance.
pdf
bib
abs
MedCite: Can Language Models Generate Verifiable Text for Medicine?
Xiao Wang
|
Mengjue Tan
|
Qiao Jin
|
Guangzhi Xiong
|
Yu Hu
|
Aidong Zhang
|
Zhiyong Lu
|
Minjia Zhang
Existing LLM-based medical question answering systems lack citation generation and evaluation capabilities, raising concerns about their adoption in practice. In this work, we introduce MedCite, the first end-to-end framework that facilitates the design and evaluation of LLM citations for medical tasks. Meanwhile, we introduce a novel multi-pass retrieval-citation method that generates high-quality citations. Our extensive evaluation highlights the challenges and opportunities of citation generation for medical tasks, while identifying important design choices that have a significant impact on the final citation quality. Our proposed method achieves superior citation precision and recall improvements compared to strong baseline methods, and we show that our evaluation results correlate well with annotation results from professional experts.
pdf
bib
abs
Let The Jury Decide: Fair Demonstration Selection for In-Context Learning through Incremental Greedy Evaluation
Sadaf Md Halim
|
Chen Zhao
|
Xintao Wu
|
Latifur Khan
|
Christan Grant
|
Fariha Ishrat Rahman
|
Feng Chen
Large Language Models (LLMs) are powerful in-context learners, achieving strong performance with just a few high-quality demonstrations. However, fairness concerns arise in many in-context classification tasks, especially when predictions involve sensitive attributes. To address this, we propose JUDGE—a simple yet effective framework for selecting fair and representative demonstrations that improve group fairness in In-Context Learning. JUDGE constructs the demonstration set iteratively using a greedy approach, guided by a small, carefully selected jury set. Our method remains robust across varying LLM architectures and datasets, ensuring consistent fairness improvements. We evaluate JUDGE on four datasets using four LLMs, comparing it against seven baselines. Results show that JUDGE consistently improves fairness metrics without compromising accuracy.
pdf
bib
abs
The Lies Characters Tell: Utilizing Large Language Models to Normalize Adversarial Unicode Perturbations
Portia Cooper
|
Eduardo Blanco
|
Mihai Surdeanu
Homoglyphs, Unicode characters that are visually homogeneous to Latin letters, are widely used to mask offensive content. Dynamic strategies are needed to combat homoglyphs as the Unicode library is ever-expanding and new substitution possibilities for Latin letters continuously emerge. The present study investigated two novel mitigation approaches that do not rely on strict mappings but instead harness the power of large language models to neutralize both known and unknown homoglyphs: (1) indirectly normalizing homoglyphs by replacing non-Latin characters with a delimiter and prompting large language models to “fill in the blanks” and (2) directly normalizing homoglyphs by using large language models to determine which characters should be replaced with Latin letters. We found that GPT-4o-mini constructed normalized text with an average cosine similarity score of 0.91 to the original tweets when applying our indirect method and 0.96 to the original tweets when applying our direct method. This study indicates that large language model-based normalization techniques can effectively unmask offensive content concealed by homoglyphs. Code and data are available in our GitHub repository: https://github.com/pcoopercoder/The-Lies-Characters-Tell.
pdf
bib
abs
Speech Act Patterns for Improving Generalizability of Explainable Politeness Detection Models
Ahmad Aljanaideh
The lack of explainability in state-of-the-art Natural Language Understanding (NLU) classification models has increased interest in developing techniques for improving explainable linear feature-based models (e.g., Logistic Regression/SVM). Politeness detection is a task that exemplifies this interest. While those techniques perform well on the task when applied to data from the same domain as the training data, they lack generalizability and thus fall short when applied to data from other domains. This is due to their reliance on discovering domain-specific word-level features. We introduce a method for improving the generalizability of explainable politeness models by relying on speech act patterns instead of words, leveraging speech act labels assigned by the GPT-4 model. This approach goes beyond the mere words and injects intent into politeness classification models, enhancing their generalizability. Results demonstrate that the proposed method achieves state-of-the-art accuracy in the cross-domain setting among explainable methods, while falling short in the in-domain setting. Our findings illustrate that explainable models can benefit from Large Language Models.
pdf
bib
abs
Systematic Evaluation of Auto-Encoding and Large Language Model Representations for Capturing Author States and Traits
Khushboo Singh
|
Vasudha Varadarajan
|
Adithya V Ganesan
|
August Håkan Nilsson
|
Nikita Soni
|
Syeda Mahwish
|
Pranav Chitale
|
Ryan L. Boyd
|
Lyle Ungar
|
Richard N Rosenthal
|
H. Schwartz
Large Language Models (LLMs) are increasingly used in human-centered applications, yet their ability to model diverse psychological constructs is not well understood. In this study, we systematically evaluate a range of Transformer-LMs to predict psychological variables across five major dimensions: affect, substance use, mental health, sociodemographics, and personality. Analyses span three temporal levels—short daily text responses about current affect, text aggregated over two weeks, and user-level text collected over two years—allowing us to examine how each model’s strengths align with the underlying stability of different constructs. The findings show that mental health signals emerge as the most accurately predicted dimension (r=0.6) across all temporal scales. At the daily scale, smaller models like DeBERTa and HaRT often performed better, whereas at longer scales or with greater context, larger models like Llama3-8B performed best. Also, aggregating text over the entire study period yielded stronger correlations for outcomes such as age and income. Overall, these results suggest the importance of selecting appropriate model architectures and temporal aggregation techniques based on the stability and nature of the target variable.
pdf
bib
abs
TReMu: Towards Neuro-Symbolic Temporal Reasoning for LLM-Agents with Memory in Multi-Session Dialogues
Yubin Ge
|
Salvatore Romeo
|
Jason Cai
|
Raphael Shu
|
Yassine Benajiba
|
Monica Sunkara
|
Yi Zhang
Temporal reasoning in multi-session dialogues presents a significant challenge which has been under-studied in previous temporal reasoning benchmarks. To bridge this gap, we propose a new evaluation task for temporal reasoning in multi-session dialogues and introduce an approach to construct a new benchmark by augmenting dialogues from LoCoMo and creating multi-choice QAs. Furthermore, we present TReMu, a new framework aimed at enhancing the temporal reasoning capabilities of LLM-agents in this context. Specifically, the framework employs time-aware memorization through timeline summarization, generating retrievable memory by summarizing events in each dialogue session with their inferred dates. Additionally, we integrate neuro-symbolic temporal reasoning, where LLMs generate Python code to perform temporal calculations and select answers. Experimental evaluations on popular LLMs demonstrate that our benchmark is challenging, and the proposed framework significantly improves temporal reasoning performance compared to baseline methods, raising it from 29.83 on GPT-4o via standard prompting to 77.67 via our approach and highlighting its effectiveness in addressing temporal reasoning in multi-session dialogues.
pdf
bib
abs
Conservative Bias in Large Language Models: Measuring Relation Predictions
Toyin Aguda
|
Erik Wilson
|
Allan Anzagira
|
Simerjot Kaur
|
Charese Smiley
Large language models (LLMs) exhibit a pronounced conservative bias in relation extraction tasks, frequently defaulting to the no_relation label when an appropriate option is unavailable. While this behavior helps prevent incorrect relation assignments, our analysis reveals that it also leads to significant information loss when reasoning is not explicitly included in the output. We systematically evaluate this trade-off across multiple prompts, datasets, and relation types, introducing the concept of Hobson’s choice to capture scenarios where models opt for safe but uninformative labels over hallucinated ones. Our findings suggest that conservative bias occurs twice as often as hallucination. To quantify this effect, we use SBERT and LLM prompts to capture the semantic similarity between conservative bias behaviors in constrained prompts and labels generated from semi-constrained and open-ended prompts.
pdf
bib
abs
Mitigating Bias in RAG: Controlling the Embedder
Taeyoun Kim
|
Jacob Mitchell Springer
|
Aditi Raghunathan
|
Maarten Sap
In retrieval augmented generation (RAG) systems, each individual component—the LLM, embedder, and corpus—could introduce biases in the form of skews towards certain genders or political leanings. In this work, we study the conflict between biases of each component and their relationship to the overall bias of the RAG system, which we call bias conflict. Examining both gender and political biases as case studies, we show that bias conflict can be characterized through a linear relationship among components despite its complexity. Through fine-tuning, we demonstrate how to control the bias of the embedder while maintaining utility and reveal the importance of reverse-biasing the embedder to mitigate bias in the overall system. Additionally, we find that LLMs and tasks exhibit varying sensitivities to bias, a crucial factor to consider for debiasing. Our results underscore that a fair RAG system can be better achieved by carefully controlling the bias of the embedder rather than increasing its fairness.
pdf
bib
abs
V-ALPHASOCIAL: Benchmark and Self-Reflective Chain-of-Thought Generation for Visual Social Commonsense Reasoning
Zongyu Lin
|
Zhikun Xu
|
Xiaohan Song
|
Yixin Wan
|
Xingcheng Yao
|
Tsung-Han Lin
|
Selina Song
|
Pranav Subbaraman
|
Ben Zhou
|
Kai-Wei Chang
|
Yizhou Sun
Social commonsense reasoning naturally involves both the verbal and non-verbal cues of a social interaction. It is important for Large Vision-Language Models (VLMs) to leverage both textual and visual information in performing tasks like social understanding and reasoning. However, while current LLMs have shown good social reasoning capabilities in textual context, whether they can effectively incorporate visual information in social comprehension remains under-explored. To narrow the gap, we first construct and propose a benchmark: V-Social, featuring well-aligned text and visual content, tailored to assess visual social commonsense for multimodal foundation models. Through experimenting with V-Social, we find that even the most advanced VLM, GPT-4o, often falls short in social commonsense reasoning. This highlights the critical need to enhance the social grounding of VLMs. One major obstacle to improving this is the lack of high-quality data with good reasoning processes. To overcome this obstacle, we introduce V-AlphaSocial, a novel method that generates high-quality chain-of-thought reasoning paths from unlabeled data. We design a visual reasoning reward model to improve the VLM, and then iteratively refine both the VLM and the reward model. Our extensive analysis showcases how our method enhances social commonsense reasoning, proposing an effective approach that facilitates deeper exploration into the field.
pdf
bib
abs
AfroBench: How Good are Large Language Models on African Languages?
Jessica Ojo
|
Odunayo Ogundepo
|
Akintunde Oladipo
|
Kelechi Ogueji
|
Jimmy Lin
|
Pontus Stenetorp
|
David Ifeoluwa Adelani
Large-scale multilingual evaluations, such as MEGA, often include only a handful of African languages due to the scarcity of high-quality evaluation data and the limited discoverability of existing African datasets. This lack of representation hinders comprehensive LLM evaluation across a diverse range of languages and tasks. To address these challenges, we introduce AFROBENCH—a multi-task benchmark for evaluating the performance of LLMs across 64 African languages, 15 tasks and 22 datasets. AFROBENCH consists of nine natural language understanding datasets, six text generation datasets, six knowledge and question answering tasks, and one mathematical reasoning task. We present results comparing the performance of prompting LLMs to fine-tuned baselines based on BERT and T5-style models. Our results suggest large gaps in performance between high-resource languages, such as English, and African languages across most tasks; but performance also varies based on the availability of monolingual data resources. Our findings confirm that performance on African languages continues to remain a hurdle for current LLMs, underscoring the need for additional efforts to close this gap.
pdf
bib
abs
Training Bilingual LMs with Data Constraints in the Targeted Language
Skyler Seto
|
Maartje Ter Hoeve
|
Richard He Bai
|
Natalie Schluter
|
David Grangier
Large language models are trained on massive scrapes of the web, as required by current scaling laws. Most progress is made for English, given its abundance of high-quality pretraining data. For most other languages, however, such high quality pretraining data is unavailable. In this work, we study how to boost pretrained model performance in a target language with insufficient pretraining data for training a high performing language model by enlisting data from an auxiliary language for which high quality data is available. We study this by quantifying the performance gap between training with data in a data-rich auxiliary language compared with training in the target language, exploring the benefits of translation systems, studying the limitations of model scaling when data is limited in the target languages, and proposing new methods for upsampling data from the auxiliary language. Our results show that stronger auxiliary datasets result in performance gains without modification to the model or training objective for close languages, and, in particular, that performance gains due to the development of more information-rich English pretraining datasets can extend to targeted language settings with limited data.
pdf
bib
abs
ChartQAPro: A More Diverse and Challenging Benchmark for Chart Question Answering
Ahmed Masry
|
Mohammed Saidul Islam
|
Mahir Ahmed
|
Aayush Bajaj
|
Firoz Kabir
|
Aaryaman Kartha
|
Md Tahmid Rahman Laskar
|
Mizanur Rahman
|
Shadikur Rahman
|
Mehrad Shahmohammadi
|
Megh Thakkar
|
Md Rizwan Parvez
|
Enamul Hoque
|
Shafiq Joty
Charts are ubiquitous, as people often use them to analyze data, answer questions, and discover critical insights. However, performing complex analytical tasks with charts requires significant perceptual and cognitive effort. Chart Question Answering (CQA) systems automate this process by enabling models to interpret and reason with visual representations of data. However, existing benchmarks like ChartQA lack real-world diversity and have recently shown performance saturation with modern large vision-language models (LVLMs). To address these limitations, we introduce ChartQAPro, a new benchmark that includes 1,341 charts from 99 diverse sources, spanning various chart types—including infographics and dashboards—and featuring 1,948 questions in various types, such as multiple-choice, conversational, hypothetical, and unanswerable questions, to better reflect real-world challenges. Our evaluations with 21 models show a substantial performance drop for LVLMs on ChartQAPro; e.g., Claude Sonnet 3.5 scores 90.5% on ChartQA but only 55.81% on ChartQAPro, underscoring the complexity of chart reasoning. We complement our findings with detailed error analyses and ablation studies, identifying key challenges and opportunities for advancing LVLMs in chart understanding and reasoning. We release ChartQAPro at https://github.com/vis-nlp/ChartQAPro.
pdf
bib
abs
From Observation to Understanding: Front-Door Adjustments with Uncertainty Calibration for Enhancing Egocentric Reasoning in LVLMs
Shenshen Li
|
Wenxin Meng
|
Lei Wang
|
Hao Yang
|
Chong Peng
|
Peng Yan
|
Fumin Shen
|
Jingkuan Song
|
Heng Tao Shen
|
Xing Xu
Recent progress in large vision-language models (LVLMs) has shown substantial potential across a broad spectrum of third-person tasks. However, adapting these LVLMs to egocentric scenarios remains challenging due to their third-person training bias. Existing methods that adapt LVLMs for first-person tasks often overlook critical agent-environment interactions, limiting their ability to perform egocentric reasoning. To address these challenges, we propose a novel zero-shot paradigm termed Front-Door Adjustments with Uncertainty Calibration (FRUIT) to enhance the egocentric reasoning abilities of LVLMs by simulating human causal reasoning. Specifically, FRUIT operates in two stages: observation and understanding. Unlike conventional prompting techniques, we formalize egocentric reasoning using a structural causal model. Then, we ground interaction regions and expand them into hierarchical visual cues, augmented with corresponding captions, to form the initial observations. To reduce noise in these observations, we employ uncertainty calibration to filter out unreliable information. These refined observations, acting as mediators, are then incorporated into the prompt template, guiding the model to understand semantics from a first-person perspective. Extensive experiments conducted on the EgoThink benchmark demonstrate that our FRUIT method consistently enhances the performance of existing LVLMs on six distinct tasks. Our code is available at https://github.com/Mrshenshen/FRUIT.
pdf
bib
abs
Hypothetical Documents or Knowledge Leakage? Rethinking LLM-based Query Expansion
Yejun Yoon
|
Jaeyoon Jung
|
Seunghyun Yoon
|
Kunwoo Park
Query expansion methods powered by large language models (LLMs) have demonstrated effectiveness in zero-shot retrieval tasks. These methods assume that LLMs can generate hypothetical documents that, when incorporated into a query vector, enhance the retrieval of real evidence. However, we challenge this assumption by investigating whether knowledge leakage in benchmarks contributes to the observed performance gains. Using fact verification as a testbed, we analyze whether the generated documents contain information entailed by ground-truth evidence and assess their impact on performance. Our findings indicate that, on average, performance improvements consistently occurred for claims whose generated documents included sentences entailed by gold evidence. This suggests that knowledge leakage may be present in fact-verification benchmarks, potentially inflating the perceived performance of LLM-based query expansion methods.
pdf
bib
abs
Worse than Random? An Embarrassingly Simple Probing Evaluation of Large Multimodal Models in Medical VQA
Qianqi Yan
|
Xuehai He
|
Xiang Yue
|
Xin Eric Wang
Large Multimodal Models (LMMs) have demonstrated impressive performance on existing medical Visual Question Answering (Med-VQA) benchmarks. However, high reported accuracy does not necessarily reflect their true diagnostic reliability in clinical settings. This study reveals that state-of-the-art models perform worse than random guessing on medical diagnosis questions when subjected to simple Probing Evaluation for Medical Diagnosis (ProbMed). ProbMed challenges models through probing evaluation and procedural diagnosis. Particularly, probing evaluation features pairing ground-truth questions with adversarial counterparts that feature negated and hallucinated attributes, while procedural diagnosis requires reasoning across various dimensions for each image, including modality recognition, organ identification, clinical findings, abnormalities, and positional grounding. Our evaluation reveals that even top-performing models like GPT-4o, GPT-4V, and Gemini Pro perform worse than random guessing on specialized diagnostic questions, indicating significant limitations in handling fine-grained medical inquiries. Furthermore, our ablation study on open-source models (e.g., LLaVA, LLaVA-Med, and Med-Flamingo) identifies poor visual understanding as a primary bottleneck—a limitation that can be partially mitigated by incorporating visual descriptions generated by GPT-4o, resulting in an average performance improvement of 9.44%. These findings underscore the urgent need for more robust evaluation methods and domain-specific expertise to ensure the reliability of LMMs in high-stakes medical applications.
pdf
bib
abs
Optimizing Reasoning for Text-to-SQL with Execution Feedback
Bohan Zhai
|
Canwen Xu
|
Yuxiong He
|
Zhewei Yao
Text-to-SQL demands precise reasoning to convert natural language questions into structured queries. While large language models (LLMs) excel in many reasoning tasks, their ability to leverage Chain-of-Thought (CoT) reasoning for text-to-SQL remains underexplored. We identify critical limitations: zero-shot CoT offers minimal gains, and Direct Preference Optimization (DPO) applied without CoT yields marginal improvements. We propose ExCoT-DPO, a novel framework that iteratively optimizes open-source LLMs by combining CoT reasoning with off-policy and on-policy DPO, relying solely on execution accuracy as feedback. This approach eliminates the need for reward models or human-annotated preferences. Our experimental results demonstrate significant performance gains: ExCoT-DPO improves execution accuracy on BIRD from 57.37% to 68.51% and on Spider from 78.81% to 86.59% for LLaMA-3 70B, with Qwen-2.5-Coder demonstrating similar improvements. Our best model achieves state-of-the-art performance in the single-model setting on both BIRD and Spider datasets.
pdf
bib
abs
Disentangling Logic: The Role of Context in Large Language Model Reasoning Capabilities
Wenyue Hua
|
Kaijie Zhu
|
Lingyao Li
|
Lizhou Fan
|
Mingyu Jin
|
Shuhang Lin
|
Haochen Xue
|
Zelong Li
|
Jindong Wang
|
Yongfeng Zhang
This study intends to systematically disentangle pure logic reasoning and text understanding by investigating the contrast across abstract and contextualized logical problems from a comprehensive set of domains. We explore whether LLMs demonstrate genuine reasoning capabilities across various domains when the underlying logical structure remains constant. We focus on two main questions: (1) Can abstract logical problems alone accurately benchmark LLMs’ reasoning ability in real-world scenarios, disentangled from contextual support in practical settings? (2) Does fine-tuning LLMs on abstract logic problems generalize to contextualized logic problems and vice versa? To investigate these questions, we focus on standard propositional logic, specifically propositional deductive and abductive logic reasoning. We construct datasets for both reasoning types with four difficulty levels across 12 distinct domains based on the Wikipedia categorization in addition to those with purely abstract variables. Our experiments aim to provide insights into disentangling context in logical reasoning, the genuine reasoning capabilities of LLMs, and their generalization potential. Code and data are available at https://anonymous.4open.science/r/ContextHub-957E.
pdf
bib
abs
Sens-Merging: Sensitivity-Guided Parameter Balancing for Merging Large Language Models
Shuqi Liu
|
Han Wu
|
Bowei He
|
Xiongwei Han
|
Mingxuan Yuan
|
Linqi Song
Recent advances in large language models have led to numerous task-specialized fine-tuned variants, creating a need for efficient model merging techniques that preserve specialized capabilities while avoiding costly retraining. While existing task vector-based merging methods show promise, they typically apply uniform coefficients across all parameters, overlooking varying parameter importance both within and across tasks. We present Sens-Merging, a sensitivity-guided coefficient adjustment method that enhances existing model merging techniques by operating at both task-specific and cross-task levels. Our method analyzes parameter sensitivity within individual tasks and evaluates cross-task transferability to determine optimal merging coefficients. Extensive experiments on Mistral 7B and LLaMA2 7B/13B models demonstrate that Sens-Merging significantly improves performance across general knowledge, mathematical reasoning, and code generation tasks. Notably, when combined with existing merging techniques, our method enables merged models to outperform specialized fine-tuned models, particularly in code generation tasks. Our findings reveal important trade-offs between task-specific and cross-task scalings, providing insights for future model merging strategies.
pdf
bib
abs
EgoNormia: Benchmarking Physical-Social Norm Understanding
MohammadHossein Rezaei
|
Yicheng Fu
|
Phil Cuvin
|
Caleb Ziems
|
Yanzhe Zhang
|
Hao Zhu
|
Diyi Yang
Human activity is moderated by norms; however, supervision for normative reasoning is sparse, particularly where norms are physically or socially grounded. We thus present EgoNormia ‖ε‖, comprising 1,853 (200 for EgoNormia-verified) multiple choice questions (MCQs) grounded within egocentric videos of human interactions, enabling the evaluation and improvement of normative reasoning in vision-language models (VLMs). EgoNormia spans seven norm categories: safety, privacy, proxemics, politeness, cooperation, coordination/proactivity, and communication/legibility. To compile this dataset at scale, we propose a novel pipeline to generate grounded MCQs from raw egocentric video. Our work demonstrates that current state-of-the-art VLMs lack robust grounded norm understanding, scoring a maximum of 54% on EgoNormia and 58% on EgoNormia-verified, with performance across norm categories indicating significant safety and privacy risks when VLMs are used in real-world agents. We additionally explore methods for improving normative understanding, demonstrating that a naive retrieval-based generation (RAG) method using EgoNormia can enhance normative reasoning in VLMs.
pdf
bib
abs
Large Language Models as Neurolinguistic Subjects: Discrepancy between Performance and Competence
Linyang He
|
Ercong Nie
|
Helmut Schmid
|
Hinrich Schuetze
|
Nima Mesgarani
|
Jonathan Brennan
This study investigates the linguistic understanding of Large Language Models (LLMs) regarding signifier (form) and signified (meaning) by distinguishing two LLM assessment paradigms: psycholinguistic and neurolinguistic. Traditional psycholinguistic evaluations often reflect statistical rules that may not accurately represent LLMs’ true linguistic competence. We introduce a neurolinguistic approach, utilizing a novel method that combines minimal pair and diagnostic probing to analyze activation patterns across model layers. This method allows for a detailed examination of how LLMs represent form and meaning, and whether these representations are consistent across languages. We found: (1) Psycholinguistic and neurolinguistic methods reveal that language performance and competence are distinct; (2) Direct probability measurement may not accurately assess linguistic competence; (3) Instruction tuning changes competence little but improves performance; (4) LLMs exhibit higher competence and performance in form compared to meaning. Additionally, we introduce new conceptual minimal pair datasets for Chinese (COMPS-ZH) and German (COMPS-DE), complementing existing English datasets.
pdf
bib
abs
The Impact of Large Language Models in Academia: from Writing to Speaking
Mingmeng Geng
|
Caixi Chen
|
Yanru Wu
|
Yao Wan
|
Pan Zhou
|
Dongping Chen
Large language models (LLMs) are increasingly impacting human society, particularly in textual information. Based on more than 30,000 papers and 1,000 presentations from machine learning conferences, we examined and compared the words used in writing and speaking, representing the first large-scale study of how LLMs influence the two main modes of verbal communication and expression within the same group of people. Our empirical results show that LLM-style words such as “significant” have been used more frequently in abstracts and oral presentations. The implicit impact on human expression, such as writing and speaking, is beginning to emerge and is likely to grow in the future. We take the first step in building an automated monitoring platform to record its longitudinal changes to call attention to the implicit influence and ripple effect of LLMs on human society.
pdf
bib
abs
X-WebAgentBench: A Multilingual Interactive Web Benchmark for Evaluating Global Agentic System
Peng Wang
|
Ruihan Tao
|
Qiguang Chen
|
Mengkang Hu
|
Libo Qin
Recently, large language model (LLM)-based agents have achieved significant success in interactive environments, attracting significant academic and industrial attention. Despite these advancements, current research predominantly focuses on English scenarios. In reality, there are over 7,000 languages worldwide, all of which demand access to comparable agentic services. Nevertheless, the development of language agents remains inadequate for meeting the diverse requirements of multilingual agentic applications. To fill this gap, we introduce X-WebAgentBench, a novel multilingual agent benchmark in an interactive web environment, which evaluates the planning and interaction performance of language agents across multiple languages, thereby contributing to the advancement of global agent intelligence. Additionally, we assess the performance of various LLMs and cross-lingual alignment methods, examining their effectiveness in enhancing agents. Our findings reveal that even advanced models like GPT-4o, when combined with cross-lingual techniques, fail to achieve satisfactory results. We hope that X-WebAgentBench can serve as a valuable benchmark for multilingual agent scenarios in real-world applications.
pdf
bib
abs
MemBench: Towards More Comprehensive Evaluation on the Memory of LLM-based Agents
Haoran Tan
|
Zeyu Zhang
|
Chen Ma
|
Xu Chen
|
Quanyu Dai
|
Zhenhua Dong
Recent works have highlighted the significance of memory mechanisms in LLM-based agents, which enable them to store observed information and adapt to dynamic environments. However, evaluating their memory capabilities remains challenging. Previous evaluations are commonly limited in the diversity of memory levels and interactive scenarios. They also lack comprehensive metrics to reflect the memory capabilities from multiple aspects. To address these problems, in this paper, we construct a more comprehensive dataset and benchmark to evaluate the memory capability of LLM-based agents. Our dataset incorporates factual memory and reflective memory as different levels, and proposes participation and observation as various interactive scenarios. Based on our dataset, we present a benchmark, named MemBench, to evaluate the memory capability of LLM-based agents from multiple aspects, including their effectiveness, efficiency, and capacity. To benefit the research community, we release our dataset and project at https://github.com/import-myself/Membench.
pdf
bib
abs
Adaptive LoRA Merge with Parameter Pruning for Low-Resource Generation
Ryota Miyano
|
Yuki Arase
This study proposes a simple yet effective LoRA merge method to achieve LLM adaptation for low-resource language generation tasks. The LoRA merge technique, which integrates multiple LoRA modules trained on different tasks, has gained attention as an effective and efficient approach for adapting LLMs to target tasks. However, previous methods are limited in adaptability as they keep the LoRA parameters frozen. Additionally, the low-resource problem has been out of their scope. We propose a LoRA merge method that updates and prunes LoRA parameters through fine-tuning with minimal target task data, which allows finer-grained adjustments of LoRA parameters and enhancement of task adaptability. Extensive experiments have been conducted taking summarization as a benchmark task. Our datasets cover various domains and two languages, English and Japanese. The results confirm that the proposed method achieves significant and consistent improvements in task adaptability over the previous methods.
pdf
bib
abs
LongAttn: Selecting Long-context Training Data via Token-level Attention
Longyun Wu
|
Dawei Zhu
|
Guangxiang Zhao
|
Zhuocheng Yu
|
Junfeng Ran
|
Xiangyu Wong
|
Lin Sun
|
Sujian Li
With the development of large language models (LLMs), there has been an increasing need for significant advancements in handling long contexts. To enhance long-context capabilities, constructing high-quality training data with **long-range dependencies** is crucial. Existing methods to select long-context data often rely on sentence-level analysis, which can be greatly optimized in both performance and efficiency. In this paper, we propose a novel token-level framework, **LongAttn**, which leverages the self-attention mechanism of LLMs to measure the **long-range dependencies** in the data. By calculating token-level dependency strength and distribution uniformity of token scores, LongAttn effectively quantifies **long-range dependencies**, enabling more accurate and efficient data selection. We filter **LongABC-32K** from open-source long-context datasets (ArXiv, Book, and Code). Through our comprehensive experiments, LongAttn has demonstrated its excellent **effectiveness**, **scalability**, and **efficiency**. We will release our code and the high-quality long-context dataset **LongABC-32K** in the future.
pdf
bib
abs
CoRE: Condition-based Reasoning for Identifying Outcome Variance in Complex Events
Sai P Vallurupalli
|
Francis Ferraro
Knowing which latent conditions lead to a particular outcome is useful for critically examining claims made about complex event outcomes. Identifying implied conditions and examining their influence on an outcome is challenging. We handle this by combining and augmenting annotations from two existing datasets consisting of goals and states, and explore the influence of conditions through our research questions and Condition-based Reasoning tasks. We examine open and closed LLMs of varying sizes and intent-alignment on our reasoning tasks and find that conditions are useful when not all context is available. Models differ widely in their ability to generate and identify outcome-variant conditions, which affects their performance on outcome validation when conditions are used to replace missing context. Larger models like GPT-4o are more cautious in such less constrained situations.
pdf
bib
abs
FaVe: Factored and Verified Search Rationale for Long-form Answer
Jihyuk Kim
|
Sungjin Lee
|
Seung-won Hwang
|
Yang Liu
Targeting long-form question-answering, chain-of-query (CoQ) has been studied, integrating chain-of-thought (CoT) with retrieval-augmented generation. CoQ answers the complex question step-by-step, through simpler subquestions (SQs) from which relevant knowledge is retrieved. By doing so, CoQ aims to improve the answer comprehensiveness and verifiability, at the expense of latency. Our first contribution is showing that the chaining often incurs harmful effects on both objectives, and SQs left unverified often fail to answer the given question. Second, we propose a better alternative to CoQ, union-of-query, which adopts a factored approach to break the harmful chain. Finally, we propose to verify SQs before answers, by fine-tuning the SQ generator using verified SQs and introducing a selector verifying SQs at test time. Employing vicuna-13b, our approach, denoted by FaVe (short for Factored and Verified search), even outperforms ChatGPT baselines while maintaining efficiency.
pdf
bib
abs
UnrealLLM: Towards Highly Controllable and Interactable 3D Scene Generation by LLM-powered Procedural Content Generation
SongTang SongTang
|
Kaiyong Zhao
|
Lei Wang
|
Yuliang Li
|
Xuebo Liu
|
Junyi Zou
|
Qiang Wang
|
Xiaowen Chu
The creation of high-quality 3D scenes is essential for applications like video games and simulations, yet automating this process while retaining the benefits of Procedural Content Generation (PCG) remains challenging. In this paper, we introduce UnrealLLM, a novel multi-agent framework that connects natural language descriptions with the professional PCG system (Unreal Engine 5) to automate scene generation. UnrealLLM constructs a comprehensive knowledge base to translate text into executable PCG blueprints and a diverse asset library that guarantees high-quality scene generation. It also introduces a text-based blueprint system with a spline-based control mechanism for geometric arrangement, enabling natural language interaction and enhancing interactivity in 3D environments using UE5’s advanced capabilities. Through extensive experiments, we show that UnrealLLM achieves competitive performance in technical metrics and aesthetic quality, offering unique advantages in generation scale and interactivity. This work makes a valuable contribution to automated 3D content creation, benefiting both novice users and professional designers.
pdf
bib
abs
Tree-of-Prompts: Abstracting Control-Flow for Prompt Optimization
Jihyuk Kim
|
Shubham Garg
|
Lahari Poddar
|
Seung-won Hwang
|
Chris Hench
Prompt optimization (PO) generates prompts to guide Large Language Models (LLMs) in performing tasks. Existing methods, such as PromptAgent, rely on a single static prompt, which struggles with disjoint cases in complex tasks. Although MoP uses multiple prompts, it fails to account for variations in task complexity. Inspired by programmatic control flow, we introduce a nested if-else structure to address both varying similarities and complexities across diverse cases. We propose Tree-of-Prompts (ToP), which implements this structure by recursively expanding child prompts from a parent prompt. Sibling prompts tackle disjoint cases while inheriting shared similarities from their parent, and handle cases more complex than the parent. Evaluated on Gorilla (understanding), MATH (reasoning), and a subset of BBH benchmarks, ToP outperforms PromptAgent and MoP, with improvements of 1.4% and 4.6% over PromptAgent and 3.2% and 4.5% over MoP, when tested with GPT-4o-mini and Llama 3.2-3B, respectively.
pdf
bib
abs
Outlier-weighed Layerwise Sampling for LLM Fine-tuning
Pengxiang Li
|
Lu Yin
|
Xiaowei Gao
|
Shiwei Liu
The rapid advancements in Large Language Models (LLMs) have revolutionized various natural language processing tasks. However, the substantial size of LLMs presents significant challenges in training or fine-tuning. While parameter-efficient approaches such as low-rank adaptation (LoRA) have gained popularity, they often compromise performance compared to full-rank fine-tuning. In this paper, we propose Outlier-weighed Layerwise Sampling (OWS), a new memory-efficient fine-tuning approach, inspired by the layerwise outlier distribution of LLMs. Unlike LoRA, which adds extra adapters to all layers, OWS strategically assigns higher sampling probabilities to layers with more outliers, selectively sampling only a few layers and fine-tuning their pre-trained weights. To further increase the number of fine-tuned layers without a proportional rise in memory costs, we incorporate gradient low-rank projection, further boosting the approach’s performance. Our extensive experiments across various architectures, including LLaMa2 and Mistral, demonstrate that OWS consistently outperforms baseline approaches, including full fine-tuning. Specifically, it achieves up to a 1.1% average accuracy gain on the Commonsense Reasoning benchmark, a 3.0% improvement on MMLU, and a notable 10% boost on MT-Bench, while being more memory efficient. OWS allows us to fine-tune 7B LLMs with only 21GB of memory. Our code is available at https://github.com/pixeli99/OWS.
pdf
bib
abs
KVPR: Efficient LLM Inference with I/O-Aware KV Cache Partial Recomputation
Chaoyi Jiang
|
Lei Gao
|
Hossein Entezari Zarch
|
Murali Annavaram
Inference for Large Language Models (LLMs) is computationally demanding. To reduce the cost of auto-regressive decoding, Key-Value (KV) cache is used to store intermediate activations, which significantly lowers the computational overhead for token generation. However, the memory required for the KV cache grows rapidly, often exceeding the capacity of GPU memory. A cost-effective alternative is to offload KV cache to CPU memory, which alleviates GPU memory pressure, but shifts the bottleneck to the limited bandwidth of the PCIe connection between the CPU and GPU. Existing methods attempt to address these issues by overlapping GPU computation with I/O or employing CPU-GPU heterogeneous execution, but they are hindered by excessive data movement and dependence on CPU capabilities. Fully overlapping PCIe communication latency gets challenging as the size of the KV cache grows and/or the GPU compute capabilities increase. In this paper, we introduce KVPR, an efficient I/O-aware LLM inference method where the CPU first transfers a partial set of activations, from which the GPU can start recomputing the KV cache values. While the GPU recomputes the partial KV cache, the remaining portion of the KV cache is transferred concurrently from the CPU. This approach overlaps GPU recomputation with KV cache transfer to minimize idle GPU time and maximize inference performance. KVPR is fully automated by integrating a profiler module that utilizes input characteristics and system hardware information, a scheduler module to optimize the distribution of computation and communication workloads, and a runtime module to efficiently execute the derived execution plan. Experimental results show that KVPR achieves up to 35.8% lower latency and 46.2% higher throughput during decoding compared to state-of-the-art approaches. The code is available at https://github.com/chaoyij/KVPR.
pdf
bib
abs
Direct Behavior Optimization: Unlocking the Potential of Lightweight LLMs
Hongming Yang
|
Shi Lin
|
Jun Shao
|
Changting Lin
|
Donghai Zhu
|
Meng Han
|
Qinglei Kong
Lightweight Large Language Models (LwLLMs) are reduced-parameter, optimized models designed to run efficiently on consumer-grade hardware, offering significant advantages in resource efficiency, cost-effectiveness, and data privacy. However, these models often struggle with limited inference and reasoning capabilities, which restrict their performance on complex tasks and limit their practical applicability. Moreover, existing prompt optimization methods typically rely on extensive manual effort or the meta-cognitive abilities of state-of-the-art LLMs, making them less effective for LwLLMs. To address these challenges, we introduce DeBoP, a new Direct Behavior Optimization Paradigm, originating from the Chain-of-Thought (CoT) prompting technique. Unlike CoT prompting, DeBoP is an automatic optimization method that directly optimizes the behavior of LwLLMs. In particular, DeBoP transforms the optimization of complex prompts into the optimization of discrete, quantifiable execution sequences using a gradient-free Monte Carlo Tree Search. We evaluate DeBoP on seven challenging tasks where state-of-the-art LLMs excel but LwLLMs generally underperform. Experimental results demonstrate that DeBoP significantly outperforms recent prompt optimization methods on most tasks. In particular, DeBoP-optimized LwLLMs surpass GPT-3.5 on most tasks while reducing computational time by approximately 60% compared to other automatic prompt optimization methods.
pdf
bib
abs
Whether LLMs Know If They Know: Identifying Knowledge Boundaries via Debiased Historical In-Context Learning
Bo Lv
|
Nayu Liu
|
Yang Shen
|
Xin Liu
|
Ping Luo
|
Yue Yu
In active retrieval (AR), large language models (LLMs) must first assess whether they possess the knowledge to answer a given query in order to decide whether to invoke a retrieval module. Existing methods primarily rely on training classification models or using the confidence of the model’s answer to determine knowledge boundaries. However, training-based methods may have limited generalization, and our analysis reveals that LLMs struggle to reliably assess whether they possess the required information based on their answers, often biased by prior cognitive tendencies (e.g., tokens’ semantic preferences). To address this, we propose Debiased Historical In-Context Learning (DH-ICL) to identify knowledge boundaries in AR. DH-ICL aims to reframe this self-awareness metacognitive task as a structured pattern-learning problem by retrieving similar historical queries as high-confidence in-context examples to guide LLMs to identify knowledge boundaries. Furthermore, we introduce a historical bias calibration strategy that leverages deviations in the model’s past response logits to mitigate cognitive biases in its current knowledge boundary assessment. Experiments on four QA benchmarks show that DH-ICL achieves performance comparable to full retrieval on LLaMA with only half the number of retrievals, without any additional training.
pdf
bib
abs
How do LLMs’ Preferences Affect Event Argument Extraction? CAT: Addressing Preference Traps in Unsupervised EAE
Yunhao Wei
|
Kai Shuang
|
Zhiyi Li
|
Chenrui Mao
Large Language Models (LLMs) have significantly improved the performance of unsupervised Event Argument Extraction (EAE) tasks. However, LLMs’ inherent preferences severely hinder their effectiveness in EAE, leading to what we term preference traps, namely, the Prior Knowledge Trap, the Sycophancy Hallucination Trap, and the Output Contradiction Trap. Existing approaches often fall into these traps due to misalignments between their prior knowledge, instructions, or output constraints and LLMs’ preferences, which significantly limits further performance gains. To address this issue, we propose Choose-After-Think (CAT), an unsupervised EAE framework designed to handle these preference traps through targeted measures. CAT innovatively divides the EAE task into two phases: identification of event information (argument roles) (Think Phase) and selection of the final answers from a candidate set (Choose Phase). This two-phase approach reduces the impact of individual token probability anomalies and ensures the integrity of EAE results. Experimental results demonstrate that CAT (based on the local 7B model, zero-shot setting) matches the performance of the best DeepSeek-R1 API model, with a significantly lower time cost.
pdf
bib
abs
Out-of-Distribution Detection via LLM-Guided Outlier Generation for Text-attributed Graph
Xiangwei Lv
|
Mengze Li
|
Jingyuan Chen
|
Zhiang Dong
|
Sirui Han
|
Beishui Liao
Text-Attributed Graphs (TAGs), which are characterized by text attributes, are widely used in the real world. Fully trained models designed for TAG predictions may perform unsatisfactorily on samples outside the In-Distribution (ID) data, which may raise serious security issues. To tackle this, Out-Of-Distribution (OOD) detection is introduced to the TAGs field, which aims to utilize a detector to classify OOD and ID samples. Recent studies attempt to introduce extra OOD datasets to regularize the detection model. However, due to the vastness of the OOD data space, high-quality OOD samples for training the detector are scarce and difficult to obtain in the real world. Thus, we utilize Large Language Models (LLMs) to generate high-quality OOD training samples. There are two issues in this process: (1) LLMs tend to generate OOD-node samples significantly different from ID ones, with a limited learning value for OOD and ID relations. (2) Due to the inherent structure of TAGs, obtained OOD nodes need to be integrated with existing nodes by generating edges using LLMs. However, the large number of nodes makes reasoning over each node pair computationally prohibitive. To address these issues, we introduce LLMGuard with challenging OOD-node generation and lightweight edge predictors. Extensive experiments prove the effectiveness of LLMGuard. The source code is available.
pdf
bib
abs
Document-Level Relation Extraction with Global Relations and Entity Pair Reasoning
Fu Zhang
|
Yi Yan
|
Jingwei Cheng
Document-level relation extraction (DocRE) aims to extract structured relational triples from unstructured text based on given entities. Existing methods are mainly categorized into transformer-based models and graph-based models. While transformer-based models capture global contextual information, they typically focus on individual entity pairs, making it challenging to capture complex interactions between multiple entity pairs. Graph-based models build document graphs using entities or sentences as nodes for reasoning but often lack explicit mechanisms to model fine-grained interactions between entity pairs, limiting their ability to handle complex relational reasoning tasks. Additionally, previous research has not considered predicting all possible relations in advance to assist with DocRE tasks. To address these issues, we propose a new framework, namely **GREP** (**g**lobal **r**elations and **e**ntity **p**air reasoning), for DocRE tasks. GREP leverages the global interdependencies between entity pairs to capture fine-grained interactions and perform multi reasoning at the entity pair level. In addition, GREP for the first time proposes an auxiliary task that predicts in advance all possible relations that exist in a document, which enables the model to filter out the most unlikely relations. Experimental results on widely-used datasets demonstrate that our model achieves state-of-the-art performance. Code is available at https://github.com/yanyi74/GREP.
pdf
bib
abs
Towards Storage-Efficient Visual Document Retrieval: An Empirical Study on Reducing Patch-Level Embeddings
Yubo Ma
|
Jinsong Li
|
Yuhang Zang
|
Xiaobao Wu
|
Xiaoyi Dong
|
Pan Zhang
|
Yuhang Cao
|
Haodong Duan
|
Jiaqi Wang
|
Yixin Cao
|
Aixin Sun
Despite the strong performance of ColPali/ColQwen2 in Visualized Document Retrieval (VDR), its patch-level embedding approach leads to excessive memory usage. This empirical study investigates methods to reduce patch embeddings per page while minimizing performance degradation. We evaluate two token-reduction strategies: token pruning and token merging. Regarding token pruning, we surprisingly observe that a simple random strategy outperforms other sophisticated pruning methods, though still far from satisfactory. Further analysis reveals that pruning is inherently unsuitable for VDR as it requires removing certain page embeddings without query-specific information. Turning to token merging (more suitable for VDR), we search for the optimal combinations of merging strategy across three dimensions and develop Light-ColPali/ColQwen2. It maintains 98.2% of retrieval performance with only 11.8% of original memory usage, and preserves 94.6% effectiveness at 2% memory footprint. We expect our empirical findings and the resulting Light-ColPali/ColQwen2 to offer valuable insights and establish a competitive baseline for future efficient-VDR research.
pdf
bib
abs
Step-by-Step Mastery: Enhancing Soft Constraint Following Ability of Large Language Models
Qingyu Ren
|
Jie Zeng
|
Qianyu He
|
Jiaqing Liang
|
Yanghua Xiao
|
Weikang Zhou
|
Zeye Sun
|
Fei Yu
It is crucial for large language models (LLMs) to follow instructions that involve multiple constraints. In real-world scenarios, user instructions often contain soft constraints, which are semantically related and cannot be verified by rules, posing challenges for LLMs. To enhance the soft constraint following ability of LLMs, we first design a pipeline to automatically construct datasets with high-quality outputs for instructions containing soft constraints. Additionally, to fully utilize the positive and negative samples generated during the data construction process, we choose Direct Preference Optimization (DPO) as the training method. Furthermore, taking into account the difficulty of soft constraints indicated by the number of constraints, we design a curriculum learning training paradigm based on the constraint quantity. We experimentally evaluate the effectiveness of our methods in improving LLMs’ soft constraint following ability and analyze the factors driving the improvements.
pdf
bib
abs
ZeroDL: Zero-shot Distribution Learning for Text Clustering via Large Language Models
Hwiyeol Jo
|
Hyunwoo Lee
|
Kang Min Yoo
|
Taiwoo Park
The advancements in large language models (LLMs) have brought significant progress in NLP tasks. However, if a task cannot be fully described in prompts, the models could fail to carry out the task. In this paper, we propose a simple yet effective method to contextualize a task toward an LLM. The method (1) performs open-ended zero-shot inference over the entire dataset, (2) aggregates the inference results, and (3) finally incorporates the aggregated meta-information for the actual task. We show the effectiveness in text clustering tasks, empowering LLMs to perform text-to-text-based clustering and leading to improvements on several datasets. Furthermore, we explore the generated class labels for clustering, showing how the LLM understands the task through data.
pdf
bib
abs
Patterns Over Principles: The Fragility of Inductive Reasoning in LLMs under Noisy Observations
Chunyang Li
|
Weiqi Wang
|
Tianshi Zheng
|
Yangqiu Song
Inductive reasoning, a cornerstone of human cognition, enables generalization from limited data but hasn’t yet been fully achieved by large language models (LLMs). While modern LLMs excel at reasoning tasks, their ability to maintain stable and consistent rule abstraction under imperfect observations remains underexplored. To fill this gap, in this work, we introduce **Robust Rule Induction**, a task that evaluates LLMs’ capability in inferring rules from data that are fused with noisy examples. To address this task, we further propose Sample-steered Rule Refinement (SRR), a method enhancing reasoning stability via observation diversification and execution-guided feedback. Experiments across arithmetic, cryptography, and list functions reveal: (1) SRR outperforms other methods with minimal performance degradation under noise; (2) Despite slight accuracy variation, LLMs exhibit instability under noise (e.g., 0 accuracy change with only 70 consistent score); (3) Counterfactual task gaps highlight LLMs’ reliance on memorized patterns over genuine abstraction. Our findings challenge LLMs’ reasoning robustness, revealing susceptibility to hypothesis drift and pattern overfitting, while providing empirical evidence critical for developing human-like inductive systems.
pdf
bib
abs
LLMTaxo: Leveraging Large Language Models for Constructing Taxonomy of Factual Claims from Social Media
Haiqi Zhang
|
Zhengyuan Zhu
|
Zeyu Zhang
|
Chengkai Li
With the rapid expansion of content on social media platforms, analyzing and comprehending online discourse has become increasingly complex. This paper introduces LLMTaxo, a novel framework leveraging large language models for the automated construction of taxonomies of factual claims from social media by generating topics at multiple levels of granularity. The resulting hierarchical structure significantly reduces redundancy and improves information accessibility. We also propose dedicated taxonomy evaluation metrics to enable comprehensive assessment. Evaluations conducted on three diverse datasets demonstrate LLMTaxo’s effectiveness in producing clear, coherent, and comprehensive taxonomies. Among the evaluated models, GPT-4o mini consistently outperforms others across most metrics. The framework’s flexibility and low reliance on manual intervention underscore its potential for broad applicability.
pdf
bib
abs
AnCast++: Document-Level Evaluation of Graph-based Meaning Representations
Haibo Sun
|
Jayeol Chun
|
Nianwen Xue
Uniform Meaning Representation (UMR) is a cross-lingual document-level graph-based representation that is based on Abstract Meaning Representation (AMR) but extends it to include document-level semantic annotations such as coreference, modal and temporal dependencies. With recent advancements in UMR annotation efforts, a reliable evaluation metric is essential for assessing annotation consistency and tracking progress in automatic parsing. In this paper, we present AnCast++, an aggregated metric that unifies the evaluation of four distinct sub-structures of UMR: (1) sentence-level graphs that represent word senses, named entities, semantic relations between events and their participants, aspectual attributes of events as well as person and number attributes of entities, (2) modal dependencies that represent the level of certainty that a source holds with respect to an event, (3) temporal dependencies between events and their reference times, and (4) coreference relations between entities and between events. In particular, we describe a unified method TC2 for evaluating temporal and coreference relations that captures their shared transitive properties, and present experimental results on English and Chinese UMR parsing based on the UMR v1.0 corpus to demonstrate the reliability of our metric. The tool will be made publicly available on GitHub.
pdf
bib
abs
MMEvol: Empowering Multimodal Large Language Models with Evol-Instruct
Run Luo
|
Haonan Zhang
|
Longze Chen
|
Ting-En Lin
|
Xiong Liu
|
Yuchuan Wu
|
Min Yang
|
Yongbin Li
|
Minzheng Wang
|
Pengpeng Zeng
|
Lianli Gao
|
Heng Tao Shen
|
Yunshui Li
|
Hamid Alinejad-Rokny
|
Xiaobo Xia
|
Jingkuan Song
|
Fei Huang
The development of Multimodal Large Language Models (MLLMs) has seen significant progress, driven by increasing demands across various fields (e.g., multimodal agents, embodied intelligence). While model-driven approaches aim to enhance MLLM capabilities through diverse architectures, their performance gains have become increasingly marginal. In contrast, data-driven methods, which scale up image-text instruction datasets, have proven more effective but face challenges related to limited data diversity and complexity. The absence of high-quality instruction data remains a major bottleneck in MLLM development. To address this issue, we propose MMEvol, a novel multimodal instruction data evolution framework. This framework iteratively enhances data quality through a refined combination of fine-grained perception, cognitive reasoning, and interaction evolution, generating a more complex and diverse image-text instruction dataset that significantly improves MLLM capabilities. Starting with an initial dataset, SEED-163K, we employ MMEvol to systematically expand instruction diversity, extend visual reasoning steps to improve cognitive abilities, and extract fine-grained visual details to enhance understanding and robustness. To rigorously evaluate our approach, we conduct extensive qualitative analysis and quantitative experiments across 13 vision-language tasks. Compared to baseline models trained on the original seed dataset, our method achieves an average accuracy improvement of 3.1 percentage points. Moreover, our approach attains state-of-the-art (SOTA) performance in nine tasks while using significantly less data than existing state-of-the-art models.
pdf
bib
abs
SciVerse: Unveiling the Knowledge Comprehension and Visual Reasoning of LMMs on Multi-modal Scientific Problems
Ziyu Guo
|
Renrui Zhang
|
Hao Chen
|
Jialin Gao
|
Dongzhi Jiang
|
Jiaze Wang
|
Pheng-Ann Heng
The rapid advancement of Large Multi-modal Models (LMMs) has enabled their application in scientific problem-solving, yet their fine-grained capabilities remain under-explored. In this paper, we introduce SciVerse, a multi-modal scientific evaluation benchmark to thoroughly assess LMMs across 5,735 test instances in five distinct versions. We aim to investigate three key dimensions of LMMs: scientific knowledge comprehension, multi-modal content interpretation, and Chain-of-Thought (CoT) reasoning. To unveil whether LMMs possess sufficient scientific expertise, we first transform each problem into three versions containing different levels of knowledge required for solving, i.e., Knowledge-free, -lite, and -rich. Then, to explore how LMMs interpret multi-modal scientific content, we annotate another two versions, i.e., Vision-rich and -only, marking more question information from texts to diagrams. Comparing the results of different versions, SciVerse systematically examines the professional knowledge stock and visual perception skills of LMMs in scientific domains. In addition, to rigorously assess CoT reasoning, we propose a new scientific CoT evaluation strategy, conducting a step-wise assessment on knowledge and logical errors in model outputs. Our extensive evaluation of different LMMs on SciVerse reveals critical limitations in their scientific proficiency and provides new insights into future developments. Project page: https://sciverse-cuhk.github.io
pdf
bib
abs
Exploring Layer-wise Representations of English and Chinese Homonymy in Pre-trained Language Models
Matthew King-Hang Ma
|
Xie Chenwei
|
Wenbo Wang
|
William Shiyuan Wang
Homonymy can easily raise lexical ambiguity due to the misunderstanding of its multiple senses. Correct recognition of homonym sense greatly relies on its surrounding context. This ambiguous nature makes homonyms an appropriate testbed for examining the contextualization capability of pre-trained language models (PLMs) and large language models (LLMs). Considering the impact of part of speech (POS) on homonym disambiguation and the prevalence of English-focused studies in word embedding research, this study extends to Chinese and provides a comprehensive layer-wise analysis of homonym representations in both languages, spanning same and different POS categories, across four families of PLMs/LLMs (BERT, GPT-2, Llama 3, Qwen 2.5). Through the creation of a synthetic dataset and computation of disambiguation score (D-Score), we found that: (1) no universal layer depth excels in differentiating homonym representations; (2) bidirectional models produce better contextualized homonym representations compared to much larger autoregressive models; (3) most importantly, POS affects homonym representations in models in ways that differ from human research findings. The individual differences between LLMs uncovered in our study challenge the simplistic understanding of their inner workings. This reveals a compelling research frontier: conducting controlled experiments with purposefully manipulated inputs to enhance the interpretability of LLMs. We have made our dataset and code publicly available at https://github.com/neurothew/exploring-homonym-rep-in-llm.
pdf
bib
abs
DocMEdit: Towards Document-Level Model Editing
Li Zeng
|
Zeming Liu
|
Chong Feng
|
Heyan Huang
|
Yuhang Guo
Model editing aims to correct errors and outdated knowledge in large language models (LLMs) with minimal cost. Prior research has proposed a variety of datasets to assess the effectiveness of these model editing methods. However, most existing datasets only require models to output short phrases or sentences, overlooking the widespread existence of document-level tasks in the real world and raising doubts about their practical usability. Aiming to address this limitation and promote the application of model editing in real-world scenarios, we propose the task of document-level model editing. To tackle such challenges and enhance model capabilities in practical settings, we introduce DocMEdit, a dataset focused on document-level model editing, characterized by document-level inputs and outputs, extrapolation, and multiple facts within a single edit. We propose a series of evaluation metrics and experiments. The results show that the difficulties in document-level model editing pose challenges for existing model editing methods.
pdf
bib
abs
Adaptive Detoxification: Safeguarding General Capabilities of LLMs through Toxicity-Aware Knowledge Editing
Yifan Lu
|
Jing Li
|
Yigeng Zhou
|
Yihui Zhang
|
Wenya Wang
|
Xiucheng Li
|
Meishan Zhang
|
Fangming Liu
|
Jun Yu
|
Min Zhang
Large language models (LLMs) exhibit impressive language capabilities but remain vulnerable to malicious prompts and jailbreaking attacks. Existing knowledge editing methods for LLM detoxification face two major challenges. First, they often rely on entity-specific localization, making them ineffective against adversarial inputs without explicit entities. Second, these methods suffer from over-editing, where detoxified models reject legitimate queries, compromising overall performance. In this paper, we propose ToxEdit, a toxicity-aware knowledge editing approach that dynamically detects toxic activation patterns during forward propagation. It then routes computations through adaptive inter-layer pathways to mitigate toxicity effectively. This design ensures precise toxicity mitigation while preserving LLMs’ general capabilities. To more accurately assess over-editing, we also enhance the SafeEdit benchmark by incorporating instruction-following evaluation tasks. Experimental results on multiple LLMs demonstrate that our ToxEdit outperforms previous state-of-the-art methods in both detoxification performance and safeguarding general capabilities of LLMs.
pdf
bib
abs
Evaluating the Long-Term Memory of Large Language Models
Zixi Jia
|
Qinghua Liu
|
Hexiao Li
|
Yuyan Chen
|
Jiqiang Liu
In applications such as dialogue systems, personalized recommendations, and personal assistants, large language models (LLMs) need to retain and utilize historical information over the long term to provide more accurate and consistent responses. Although long-term memory capability is crucial, recent studies have not thoroughly investigated the memory performance of large language models in long-term tasks. To address this gap, we introduce the Long-term Chronological Conversations (LOCCO) dataset and conduct a quantitative evaluation of the long-term memory capabilities of large language models. Experimental results demonstrate that large language models can retain past interaction information to a certain extent, but their memory decays over time. While rehearsal strategies can enhance memory persistence, excessive rehearsal is not an effective memory strategy for large models, unlike in smaller models. Additionally, the models exhibit memory preferences across different categories of information. Our study not only provides a new framework and dataset for evaluating the long-term memory capabilities of large language models but also offers important references for future enhancements of their memory persistence.
pdf
bib
abs
Explain-then-Process: Using Grammar Prompting to Enhance Grammatical Acceptability Judgments
Russell Scheinberg
|
Ameeta Agrawal
|
Amber Shore
|
So Young Lee
Large language models (LLMs) can explain grammatical rules, yet they often fail to apply those rules when judging sentence acceptability. We present grammar prompting, an explain-then-process paradigm: a large LLM first produces a concise explanation of the relevant syntactic phenomenon, then that explanation is fed back as additional context to the target model – either an LLM or a smaller language model (SLM) – before deciding which sentence of a minimal pair is grammatical. On the English BLiMP, Chinese SLING, and Russian RuBLiMP benchmarks, this simple prompt design yields substantial improvements over strong baselines across a wide range of syntactic phenomena. Feeding an LLM’s metalinguistic explanation back to the target model bridges the gap between knowing a rule and using it. On SLMs, grammar prompting alone trims the average LLM-SLM accuracy gap by 20%, and when paired with chain-of-thought, by 56% (13.0 pp → 5.8 pp), all at negligible cost. The lightweight, language-agnostic cue lets low-cost SLMs approach frontier-LLM performance in multilingual settings.
pdf
bib
abs
Data Interpreter: An LLM Agent for Data Science
Sirui Hong
|
Yizhang Lin
|
Bang Liu
|
Bangbang Liu
|
Binhao Wu
|
Ceyao Zhang
|
Danyang Li
|
Jiaqi Chen
|
Jiayi Zhang
|
Jinlin Wang
|
Li Zhang
|
Lingyao Zhang
|
Min Yang
|
Mingchen Zhuge
|
Taicheng Guo
|
Tuo Zhou
|
Wei Tao
|
Robert Tang
|
Xiangtao Lu
|
Xiawu Zheng
|
Xinbing Liang
|
Yaying Fei
|
Yuheng Cheng
|
Yongxin Ni
|
Zhibin Gou
|
Zongze Xu
|
Yuyu Luo
|
Chenglin Wu
Large Language Model (LLM)-based agents have excelled in various domains but face significant challenges when applied to data science workflows due to their complex, multi-stage nature. Current LLM-based agents struggle with non-linear relationships, recursive dependencies, implicit data- and logic-dependent reasoning, and managing extensive context. In this paper, we introduce Data Interpreter, an LLM-based agent that addresses these challenges through hierarchical graph-based modeling to represent the complexity and a progressive strategy for step-by-step verification, refinement, and consistent context management. Extensive experiments confirm the effectiveness of Data Interpreter. On InfiAgent-DABench, it boosts performance by 25% (from 75.9% to 94.9%), and on machine learning and open-ended tasks, it lifts accuracy from 88% to 95% and from 60% to 97%, respectively. Moreover, our method surpasses state-of-the-art baselines by 26% on the MATH dataset. We will release the code upon publication.
pdf
bib
abs
DReSD: Dense Retrieval for Speculative Decoding
Milan Gritta
|
Huiyin Xue
|
Gerasimos Lampouras
Speculative decoding (SD) accelerates Large Language Model (LLM) generation by using an efficient draft model to propose the next few tokens, which are verified by the LLM in a single forward call, reducing latency while preserving its outputs. We focus on retrieval-based SD where the draft model retrieves the next tokens from a non-parametric datastore. Sparse retrieval (REST), which operates on the surface form of strings, is currently the dominant paradigm due to its simplicity and scalability. However, its effectiveness is limited due to the usage of short contexts and exact string matching. Instead, we introduce Dense Retrieval for Speculative Decoding (DReSD), a novel framework that uses approximate nearest neighbour search with contextualised token embeddings to retrieve the most semantically relevant token sequences for SD. Extensive experiments show that DReSD achieves (on average) 87% higher acceptance rates, 65% longer accepted tokens and 19% faster generation speeds compared to sparse retrieval (REST).
pdf
bib
abs
Core: Robust Factual Precision with Informative Sub-Claim Identification
Zhengping Jiang
|
Jingyu Zhang
|
Nathaniel Weir
|
Seth Ebner
|
Miriam Wanner
|
Kate Sanders
|
Daniel Khashabi
|
Anqi Liu
|
Benjamin Van Durme
Hallucinations pose a challenge to the application of large language models (LLMs), thereby motivating the development of metrics to evaluate factual precision. We observe that popular metrics using the Decompose-Then-Verify framework, such as FActScore, can be manipulated by adding obvious or repetitive subclaims to artificially inflate scores. This observation motivates our new customizable plug-and-play subclaim selection component called Core, which filters down individual subclaims according to their uniqueness and informativeness. We show that many popular factual precision metrics augmented by Core are substantially more robust on a wide range of knowledge domains. We release an evaluation framework supporting easy and modular use of Core and various decomposition strategies, which we recommend the community adopt. We also release an expansion of the FActScore biography dataset to facilitate further studies of decomposition-based factual precision evaluation.
pdf
bib
abs
Rethinking Diverse Human Preference Learning through Principal Component Analysis
Feng Luo
|
Rui Yang
|
Hao Sun
|
Chunyuan Deng
|
Jiarui Yao
|
Jingyan Shen
|
Huan Zhang
|
Hanjie Chen
Understanding human preferences is crucial for improving foundation models and building personalized AI systems. However, preferences are inherently diverse and complex, making it difficult for traditional reward models to capture their full range. While fine-grained preference data can help, collecting it is expensive and hard to scale. In this paper, we introduce Decomposed Reward Models (DRMs), a novel approach that extracts diverse human preferences from binary comparisons without requiring fine-grained annotations. Our key insight is to represent human preferences as vectors and analyze them using Principal Component Analysis (PCA). By constructing a dataset of embedding differences between preferred and rejected responses, DRMs identify orthogonal basis vectors that capture distinct aspects of preference. These decomposed rewards can be flexibly combined to align with different user needs, offering an interpretable and scalable alternative to traditional reward models. We demonstrate that DRMs effectively extract meaningful preference dimensions (e.g., helpfulness, safety, humor) and adapt to new users without additional training. Our results highlight DRMs as a powerful framework for personalized and interpretable LLM alignment.
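The PCA step described in the abstract above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `top_principal_component` and `reward`, the mean-centering step, and the use of power iteration to find only the first component are all assumptions made for illustration. Each input row stands for a hypothetical embedding difference (preferred minus rejected response).

```python
import math

def top_principal_component(rows, iters=100):
    """Return the first principal direction of `rows` via power iteration.

    Assumption: rows are embedding differences (preferred - rejected).
    Centering is standard PCA practice, not something the abstract states.
    """
    n, d = len(rows), len(rows[0])
    mean = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - mean[j] for j in range(d)] for r in rows]
    # Covariance matrix of the centered differences.
    cov = [[sum(c[i] * c[j] for c in centered) / n for j in range(d)]
           for i in range(d)]
    # Power iteration from a uniform start vector (fails only if this start
    # happens to be orthogonal to the leading eigenvector).
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iters):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    return v

def reward(direction, embedding):
    """Score a response embedding along one decomposed preference direction."""
    return sum(a * b for a, b in zip(direction, embedding))

# Toy differences that vary almost entirely along the first axis.
diffs = [[2.0, 0.1], [-2.0, -0.1], [1.0, 0.0], [-1.0, 0.0]]
pc = top_principal_component(diffs)
```

In this toy case the recovered direction lies almost exactly along the first axis, so a response embedding pointing that way receives a higher reward than one pointing along the second axis. Further components would be obtained by repeating the procedure on the residual, which is omitted here.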
pdf
bib
abs
Improving Word Alignment Using Semi-Supervised Learning
Zhongtao Miao
|
Qiyu Wu
|
Masaaki Nagata
|
Yoshimasa Tsuruoka
Word alignment plays a crucial role in various natural language processing tasks, such as serving as cross-lingual signals for sentence embedding, reducing hallucination and omission in machine translation, and facilitating the construction of training data for simultaneous speech translation. Current state-of-the-art approaches usually rely on: (1) supervised data and large-scale weakly supervised data constructed from Wikipedia and (2) multilingual Transformer encoder-based models. However, we find that the current state-of-the-art encoder-based method, BinaryAlign, suffers from the issue of insufficient labeled data, and we further improve it with self-training with a small amount of parallel data. In addition, considering the impressive performance of multilingual large language models on many natural language processing tasks, we also explore the possibility of using these decoder-based large language models as word aligners. We observe that although fine-tuning large language models with labeled data produces acceptable results, augmenting the training with pseudo-labeled data further enhances model performance. Based on the findings, we propose a semi-supervised framework to improve the large language model-based word aligners. Experimental results demonstrate that the proposed method with a small amount of parallel data outperforms the current state-of-the-art method on various word alignment datasets.
pdf
bib
abs
How Do LLMs Acquire New Knowledge? A Knowledge Circuits Perspective on Continual Pre-Training
Yixin Ou
|
Yunzhi Yao
|
Ningyu Zhang
|
Hui Jin
|
Jiacheng Sun
|
Shumin Deng
|
Zhenguo Li
|
Huajun Chen
Despite exceptional capabilities in knowledge-intensive tasks, Large Language Models (LLMs) face a critical gap in understanding how they internalize new knowledge, particularly how acquired knowledge becomes structurally embedded in their neural computations. We address this issue through the lens of knowledge circuit evolution, identifying computational subgraphs that facilitate knowledge storage and processing. Our systematic analysis of circuit evolution throughout continual pre-training reveals several key findings: (1) the acquisition of new knowledge is influenced by its relevance to pre-existing knowledge; (2) the evolution of knowledge circuits exhibits a distinct phase shift from formation to optimization; (3) the evolution of knowledge circuits follows a deep-to-shallow pattern. These insights not only advance our theoretical understanding of the mechanisms of new knowledge acquisition in LLMs, but also provide potential implications for improving continual pre-training strategies to enhance model performance.
pdf
bib
abs
LLM-Symbolic Integration for Robust Temporal Tabular Reasoning
Atharv Kulkarni
|
Kushagra Dixit
|
Vivek Srikumar
|
Dan Roth
|
Vivek Gupta
Temporal tabular question answering presents a significant challenge for Large Language Models (LLMs), requiring robust reasoning over structured data, a task where traditional prompting methods often fall short. These methods face challenges such as memorization, sensitivity to table size, and reduced performance on complex queries. To overcome these limitations, we introduce TEMPTABQA-C, a synthetic dataset designed for systematic and controlled evaluations, alongside a symbolic intermediate representation that transforms tables into database schemas. This structured approach allows LLMs to generate and execute SQL queries, enhancing generalization and mitigating biases. By incorporating adaptive few-shot prompting with contextually tailored examples, our method achieves superior robustness, scalability, and performance. Experimental results consistently highlight improvements across key challenges, setting a new benchmark for robust temporal reasoning with LLMs. Code and TEMPTABQA-C dataset: https://coral-lab-asu.github.io/llm_symbolic.
pdf
bib
abs
Multimodal Large Language Models for Text-rich Image Understanding: A Comprehensive Review
Pei Fu
|
Tongkun Guan
|
Zining Wang
|
Zhentao Guo
|
Chen Duan
|
Hao Sun
|
Boming Chen
|
Qianyi Jiang
|
Jiayao Ma
|
Kai Zhou
|
Junfeng Luo
The recent emergence of Multi-modal Large Language Models (MLLMs) has introduced a new dimension to the Text-rich Image Understanding (TIU) field, with models demonstrating impressive and inspiring performance. However, their rapid evolution and widespread adoption have made it increasingly challenging to keep up with the latest advancements. To address this, we present a systematic and comprehensive survey to facilitate further research on TIU MLLMs. Initially, we outline the timeline, architecture, and pipeline of nearly all TIU MLLMs. Then, we review the performance of selected models on mainstream benchmarks. Finally, we explore promising directions, challenges, and limitations within the field.
pdf
bib
abs
PruneVid: Visual Token Pruning for Efficient Video Large Language Models
Xiaohu Huang
|
Hao Zhou
|
Kai Han
We introduce PruneVid, a training-free visual token pruning method designed to enhance the efficiency of multimodal video understanding. While Large Language Models (LLMs) have shown promising performance on video tasks due to their advanced visual comprehension capabilities, the substantial redundancy inherent in video data poses significant computational challenges. To address this issue, PruneVid (1) reduces intrinsic video redundancy by merging temporally static and spatially similar tokens, and (2) leverages LLMs’ inherent ability to selectively prune visual tokens irrelevant to specific queries, thereby improving model efficiency. We validate our method across multiple video benchmarks, demonstrating that PruneVid can prune over 80% of tokens while maintaining competitive performance when combined with different video LLMs. Our results highlight PruneVid’s superior effectiveness and efficiency compared to existing pruning methods.
pdf
bib
abs
PromptWizard: Optimizing Prompts via Task-Aware, Feedback-Driven Self-Evolution
Eshaan Agarwal
|
Raghav Magazine
|
Joykirat Singh
|
Vivek Dani
|
Tanuja Ganu
|
Akshay Nambi
Large language models (LLMs) have transformed AI across diverse domains, with prompting being central to their success in guiding model outputs. However, manual prompt engineering is both labor-intensive and domain-specific, necessitating automated solutions. We introduce PromptWizard, a novel, fully automated framework for discrete prompt optimization, utilizing a self-evolving, self-adapting mechanism. Through a feedback-driven critique and synthesis process, PromptWizard achieves an effective balance between exploration and exploitation, iteratively refining both prompt instructions and in-context examples to generate human-readable, task-specific prompts. This guided approach systematically improves prompt quality, resulting in superior performance across 45 tasks. PromptWizard excels even with limited training data, smaller LLMs, and various LLM architectures. Additionally, our cost analysis reveals a substantial reduction in API calls, token usage, and overall cost, demonstrating PromptWizard’s efficiency, scalability, and advantages over existing prompt optimization strategies.
pdf
bib
abs
Exposing Numeracy Gaps: A Benchmark to Evaluate Fundamental Numerical Abilities in Large Language Models
Haoyang Li
|
Xuejia Chen
|
Zhanchao Xu
|
Darian Li
|
Nicole Hu
|
Fei Teng
|
Yiming Li
|
Luyu Qiu
|
Chen Jason Zhang
|
Li Qing
|
Lei Chen
Large Language Models (LLMs) have demonstrated impressive capabilities in natural language processing tasks, such as text generation and semantic understanding. However, their performance on numerical reasoning tasks, such as basic arithmetic, numerical retrieval, and magnitude comparison, remains surprisingly poor. This gap arises from their reliance on surface-level statistical patterns rather than understanding numbers as continuous magnitudes. Existing benchmarks primarily focus on either linguistic competence or structured mathematical problem-solving, neglecting fundamental numerical reasoning required in real-world scenarios. To bridge this gap, we propose NumericBench, a comprehensive benchmark to evaluate six fundamental numerical capabilities: number recognition, arithmetic operations, contextual retrieval, comparison, summary, and multi-step reasoning. NumericBench includes datasets ranging from synthetic number lists to crawled real-world data, addressing challenges like long contexts, noise, and multi-step reasoning. Extensive experiments on state-of-the-art LLMs, including GPT-4 and DeepSeek, reveal persistent weaknesses in numerical reasoning, highlighting the urgent need to improve numerically-aware language modeling. The benchmark is released at https://github.com/TreeAI-Lab/NumericBench.
pdf
bib
abs
TABGEN-ICL: Residual-Aware In-Context Example Selection for Tabular Data Generation
Liancheng Fang
|
Aiwei Liu
|
Hengrui Zhang
|
Henry Peng Zou
|
Weizhi Zhang
|
Philip S. Yu
Large Language Models (LLMs) have achieved encouraging results in tabular data generation. However, existing approaches require fine-tuning, which is computationally expensive. This paper explores an alternative: prompting a fixed LLM with in-context examples. We observe that using randomly selected in-context examples hampers the LLM’s performance, resulting in sub-optimal generation quality. To address this, we propose a novel in-context learning framework, TabGen-ICL, to enhance the in-context learning ability of LLMs for tabular data generation. TabGen-ICL operates iteratively, retrieving a subset of real samples that represent the residual between currently generated samples and true data distributions. This approach serves two purposes: locally, it provides more effective in-context learning examples for the LLM in each iteration; globally, it progressively narrows the gap between generated and real data. Extensive experiments on five real-world tabular datasets demonstrate that TabGen-ICL significantly outperforms the random selection strategy. Specifically, it reduces the error rate by a margin of up to 42.2% on the fidelity metric. We demonstrate for the first time that prompting a fixed LLM can yield high-quality synthetic tabular data.
pdf
bib
abs
Benchmarking Multi-National Value Alignment for Large Language Models
Chengyi Ju
|
Weijie Shi
|
Chengzhong Liu
|
Jiaming Ji
|
Jipeng Zhang
|
Ruiyuan Zhang
|
Jiajie Xu
|
Yaodong Yang
|
Sirui Han
|
Yike Guo
Do Large Language Models (LLMs) hold positions that conflict with your country’s values? Occasionally they do! However, existing works primarily focus on ethical reviews, failing to capture the diversity of national values, which encompass broader policy, legal, and moral considerations. Furthermore, current benchmarks that rely on spectrum tests using manually designed questionnaires are not easily scalable. To address these limitations, we introduce NaVAB, a comprehensive benchmark to evaluate the alignment of LLMs with the values of five major nations: China, the United States, the United Kingdom, France, and Germany. NaVAB implements a national value extraction pipeline to efficiently construct value assessment datasets. Specifically, we propose a modeling procedure with instruction tagging to process raw data sources, a screening process to filter value-related topics and a generation process with a Conflict Reduction mechanism to filter non-conflicting values. We conduct extensive experiments on various LLMs across countries, and the results provide insights into assisting in the identification of misaligned scenarios. Moreover, we demonstrate that NaVAB can be combined with alignment techniques to effectively reduce value concerns by aligning LLMs’ values with the target country.
pdf
bib
abs
MotiveBench: How Far Are We From Human-Like Motivational Reasoning in Large Language Models?
Xixian Yong
|
Jianxun Lian
|
Xiaoyuan Yi
|
Xiao Zhou
|
Xing Xie
Large language models (LLMs) have been widely adopted as the core of agent frameworks in various scenarios, such as social simulations and AI companions. However, the extent to which they can replicate human-like motivations remains an underexplored question. Existing benchmarks are constrained by simplistic scenarios and the absence of character identities, resulting in an information asymmetry with real-world situations. To address this gap, we propose MotiveBench, which consists of 200 rich contextual scenarios and 600 reasoning tasks covering multiple levels of motivation. Using MotiveBench, we conduct extensive experiments on seven popular model families, comparing different scales and versions within each family. The results show that even the most advanced LLMs still fall short in achieving human-like motivational reasoning. Our analysis reveals key findings, including the difficulty LLMs face in reasoning about “love & belonging” motivations and their tendency toward excessive rationality and idealism. These insights highlight a promising direction for future research on the humanization of LLMs.
pdf
bib
abs
Confidence Improves Self-Consistency in LLMs
Amir Taubenfeld
|
Tom Sheffer
|
Eran Ofek
|
Amir Feder
|
Ariel Goldstein
|
Zorik Gekhman
|
Gal Yona
Self-consistency decoding enhances LLMs’ performance on reasoning tasks by sampling diverse reasoning paths and selecting the most frequent answer. However, it is computationally expensive, as sampling many of these (lengthy) paths is required to increase the chances that the correct answer emerges as the most frequent one. To address this, we introduce Confidence-Informed Self-Consistency (CISC). CISC performs a weighted majority vote based on confidence scores obtained directly from the model. By prioritizing high-confidence paths, it can identify the correct answer with a significantly smaller sample size. When tested on nine models and four datasets, CISC outperforms self-consistency in nearly all configurations, reducing the required number of reasoning paths by over 40% on average. In addition, we introduce the notion of within-question confidence evaluation, after showing that standard evaluation methods are poor predictors of success in distinguishing correct and incorrect answers to the same question. In fact, the most calibrated confidence method proved to be the least effective for CISC. Lastly, beyond these practical implications, our results and analyses show that LLMs can effectively judge the correctness of their own outputs, contributing to the ongoing debate on this topic.
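The difference between plain self-consistency and the confidence-weighted vote described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function names, the softmax normalisation of confidences, and the `temperature` parameter are assumptions, and the confidence values are made up for the example.

```python
import math
from collections import defaultdict

def self_consistency(answers):
    """Plain majority vote over sampled final answers."""
    counts = defaultdict(float)
    for answer in answers:
        counts[answer] += 1.0
    return max(counts, key=counts.get)

def confidence_weighted_vote(answers, confidences, temperature=1.0):
    """Weighted majority vote: each reasoning path votes with its confidence.

    Assumption: confidences are softmax-normalised across the sampled paths,
    with a temperature controlling how sharply high-confidence paths dominate.
    """
    peak = max(confidences)
    weights = [math.exp((c - peak) / temperature) for c in confidences]
    total = sum(weights)
    scores = defaultdict(float)
    for answer, weight in zip(answers, weights):
        scores[answer] += weight / total
    return max(scores, key=scores.get)

# Three sampled paths: two low-confidence votes for "A", one confident "B".
answers = ["A", "A", "B"]
confidences = [0.1, 0.1, 0.9]
majority = self_consistency(answers)
weighted = confidence_weighted_vote(answers, confidences)
```

With these toy inputs the majority vote picks "A", while the weighted vote picks "B" because the single confident path outweighs the two unconfident ones. This is the mechanism by which a confidence-weighted scheme can reach a decision with fewer sampled paths.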
pdf
bib
abs
None of the Above, Less of the Right: Parallel Patterns in Human and LLM Performance on Multi-Choice Questions Answering
Zhi Rui Tam
|
Cheng-Kuang Wu
|
Chieh-Yen Lin
|
Yun-Nung Chen
Multiple-choice exam questions with “None of the above” (NA) options have been extensively studied in educational testing, where existing research suggests that they better assess true knowledge. However, their impact on Large Language Model (LLM) evaluation remains underexplored. Through systematic experiments with 28 LLMs on the MMLU benchmark, we examine how NA options affect model performance and confidence calibration. Our analysis reveals that NA options, when used as the correct answer, lead to a consistent 30-50% performance drop across models regardless of scale, suggesting that LLMs lack the meta-cognitive ability to systematically evaluate and reject all given options when none are correct. This degradation shows strong domain dependence, with minimal impact on mathematical reasoning (14.6% drop) but severe effects on tasks requiring uncertainty handling like business ethics (48.1% drop). Our results highlight important implications for benchmark design and raise questions about LLMs’ ability to handle uncertainty in real-world applications.
pdf
bib
abs
In Search of the Lost Arch in Dialogue: A Dependency Dialogue Acts Corpus for Multi-Party Dialogues
Jon Cai
|
Brendan King
|
Peyton Cameron
|
Susan Windisch Brown
|
Miriam Eckert
|
Dananjay Srinivas
|
George Arthur Baker
|
V Kate Everson
|
Martha Palmer
|
James Martin
|
Jeffrey Flanigan
Understanding the structure of multi-party conversation and the intentions and dialogue acts of each speaker remains a significant challenge in NLP. While a number of corpora annotated using theoretical frameworks of dialogue have been proposed, these typically focus on either utterance-level labeling of speaker intent, missing wider context, or the rhetorical structure of a dialogue, losing fine-grained intents captured in dialogue acts. Recently, the Dependency Dialogue Acts (DDA) framework has been proposed for modeling both the fine-grained intents of each speaker and the structure of multi-party dialogues. However, there is not yet a corpus annotated with this framework available for the community to study. To address this gap, we introduce a new corpus of 33 dialogues and over 9,000 utterance units, densely annotated using the DDA framework. Our dataset spans four genres of multi-party conversations from different modalities: (1) physics classroom discussions, (2) engineering classroom discussions, (3) board game interactions, and (4) written online game chat logs. Each session is doubly annotated and adjudicated to ensure high-quality labeling. We present a description of the dataset and annotation process, an analysis of speaker dynamics enabled by our annotation, and a baseline evaluation of LLMs as DDA parsers. We discuss the implications of this dataset for understanding dynamics between speakers and for developing more controllable dialogue agents.
pdf
bib
abs
ProMind-LLM: Proactive Mental Health Care via Causal Reasoning with Sensor Data
Xinzhe Zheng
|
Sijie Ji
|
Jiawei Sun
|
Renqi Chen
|
Wei Gao
|
Mani Srivastava
Mental health risk is a critical global public health challenge, necessitating innovative and reliable assessment methods. Large language models (LLMs) have emerged as a promising tool for explainable mental health care applications. Nevertheless, existing approaches predominantly rely on subjective textual mental records, which can be distorted by inherent mental uncertainties, leading to inconsistent and unreliable predictions. To address these limitations, this paper introduces ProMind-LLM. We investigate an innovative approach integrating objective behavior data as complementary information alongside subjective mental records for robust mental health risk assessment. Specifically, ProMind-LLM incorporates a comprehensive pipeline that includes domain-specific pretraining to tailor the LLM for mental health contexts, a self-refine mechanism to optimize the processing of numerical behavioral data, and causal chain-of-thought reasoning to enhance the reliability and interpretability of its predictions. Evaluations on two real-world datasets, PMData and Globem, demonstrate the effectiveness of our proposed methods, achieving substantial improvements over general LLMs. We anticipate that ProMind-LLM will pave the way for more dependable, interpretable, and scalable mental health case solutions.
pdf
bib
abs
Debiasing Online Preference Learning via Preference Feature Preservation
Dongyoung Kim
|
Jinsung Yoon
|
Jinwoo Shin
|
Jaehyung Kim
Recent preference learning frameworks for large language models (LLMs) simplify human preferences with binary pairwise comparisons and scalar rewards. This simplification could bias LLMs’ responses toward the most commonly preferred features, and the bias would be exacerbated over the iterations of online preference learning. To address these challenges, we propose a novel framework coined PFP (Preference Feature Preservation). The key idea of PFP is maintaining the distribution of human preference features and utilizing such rich signals throughout the online preference learning process. Specifically, PFP first extracts preference features from offline pairwise human preference data and trains a feature classifier. Then, using the trained classifier and distribution-preserving optimization, PFP maps appropriate preference features to each new input instruction during online learning. Lastly, PFP trains the LLM using an existing preference learning method, incorporating the preference features into system prompts and enabling the LLM to explicitly handle various human preferences. Our experiments demonstrate that PFP successfully mitigates the bias in preference features during online learning, and hence achieves superior performance compared to previous preference learning methods on standard benchmarks for evaluating LLM alignment.
pdf
bib
abs
ShortGPT: Layers in Large Language Models are More Redundant Than You Expect
Xin Men
|
Mingyu Xu
|
Qingyu Zhang
|
Qianhao Yuan
|
Bingning Wang
|
Hongyu Lin
|
Yaojie Lu
|
Xianpei Han
|
Weipeng Chen
As Large Language Models (LLMs) continue to advance, their computational overhead has increased significantly. In this study, we identify notable redundancy across the layers of LLMs, where some layers contribute minimally to the overall network functionality. To quantify this, we introduce a metric called Block Influence (BI), which measures the importance of each layer based on the similarity between its input and output. Based on the observation of layer redundancy, we propose straightforward pruning methods for different tasks: ShortGPT for multiple-choice tasks and ShortGPT-gen for generative tasks. They prune redundant layers based on their BI scores. Our methods demonstrate superior performance over previous pruning methods. The ability to achieve better results through simple layer pruning, as opposed to more complex pruning techniques, suggests a high degree of redundancy across layers. We hope this work will contribute to future research for improving LLM efficiency.
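A minimal sketch of a similarity-based layer score in the spirit of Block Influence follows. The abstract states only that BI is based on the similarity between a layer's input and output; the concrete form used here (one minus the mean cosine similarity over token positions), the function names, and the toy hidden states are assumptions for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def block_influence(layer_inputs, layer_outputs):
    """Assumed score: 1 - mean cosine similarity between input and output.

    A layer that barely changes its hidden states scores close to 0.
    """
    sims = [cosine(x, y) for x, y in zip(layer_inputs, layer_outputs)]
    return 1.0 - sum(sims) / len(sims)

def layers_to_prune(hidden_states, n_prune):
    """Pick the n_prune layers with the lowest influence score.

    hidden_states[i] holds the token vectors entering layer i, and
    hidden_states[i + 1] the vectors leaving it (hypothetical layout).
    """
    scores = [block_influence(hidden_states[i], hidden_states[i + 1])
              for i in range(len(hidden_states) - 1)]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i])
    return sorted(ranked[:n_prune]), scores

# Two toy layers over one token: layer 0 is an identity, layer 1 rotates.
states = [[[1.0, 0.0]], [[1.0, 0.0]], [[0.0, 1.0]]]
pruned, scores = layers_to_prune(states, n_prune=1)
```

Here layer 0 leaves its input unchanged and so receives a score of 0 and is selected for pruning, while layer 1 maps its input to an orthogonal vector and receives a score of 1. In practice the hidden states would come from running a model on calibration text, which this sketch does not cover.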
pdf
bib
abs
ProjectEval: A Benchmark for Programming Agents Automated Evaluation on Project-Level Code Generation
Kaiyuan Liu
|
Youcheng Pan
|
Yang Xiang
|
Daojing He
|
Jing Li
|
Yexing Du
|
Tianrun Gao
Recently, LLM agents have made rapid progress in improving their programming capabilities. However, existing benchmarks cannot automatically evaluate from the user’s perspective and offer little explainability for the results of LLM agents’ code generation. Thus, we introduce ProjectEval, a new benchmark for the automated evaluation of LLM agents’ project-level code generation by simulating user interaction. ProjectEval is constructed by an LLM with human review. It offers inputs at three different levels, as natural language or code skeletons. ProjectEval can evaluate the generated projects by user interaction simulation for execution, and by code similarity through existing objective indicators. Through ProjectEval, we find that systematic engineering of project code, overall understanding of the project, and comprehensive analysis capability are the keys for LLM agents to complete practical projects. Our findings and benchmark provide valuable insights for developing more effective programming agents that can be deployed in future real-world production.
pdf
bib
abs
Unveiling the Lack of LVLM Robustness to Fundamental Visual Variations: Why and Path Forward
Zhiyuan Fan
|
Yumeng Wang
|
Sandeep Polisetty
|
Yi R. Fung
Large Vision Language Models (LVLMs) have shown impressive performance on various vision-language tasks. However, while objects in natural scenes inevitably exhibit visual variations in position, scale, orientation, and context due to changes in viewpoint and environment, the robustness of LVLMs to these fundamental visual variations remains largely unexplored. To address this gap, we introduce V²R-Bench, a comprehensive benchmark framework for evaluating Visual Variation Robustness of LVLMs, which encompasses automated evaluation dataset generation and principled metrics for thorough robustness assessment. Through extensive evaluation of 13 LVLMs, we reveal a surprising vulnerability to visual variations, affecting even advanced models that excel at complex vision-language tasks yet significantly underperform on simple tasks like object recognition. Interestingly, these models exhibit a distinct visual position bias that contradicts theories of effective receptive fields and demonstrate a human-like visual acuity threshold. To identify the source of these vulnerabilities, we propose a systematic framework for component-level analysis, featuring a novel visualization approach for aligned visual features. Results show that these vulnerabilities stem from error accumulation in the pipeline architecture and inadequate multimodal alignment. Complementary experiments with synthetic data further demonstrate that these limitations are fundamentally architectural challenges, underscoring the need for architectural innovations in future LVLM designs.
pdf
bib
abs
DYNTEXT: Semantic-Aware Dynamic Text Sanitization for Privacy-Preserving LLM Inference
Juhua Zhang
|
Zhiliang Tian
|
Minghang Zhu
|
Yiping Song
|
Taishu Sheng
|
Siyi Yang
|
Qiunan Du
|
Xinwang Liu
|
Minlie Huang
|
Dongsheng Li
LLMs face privacy risks when handling sensitive data. To ensure privacy, researchers use differential privacy (DP) to provide protection by adding noise during LLM training. However, users may be hesitant to share complete data with LLMs. Researchers follow local DP to sanitize the text on the user side and feed non-sensitive text to LLMs. The sanitization usually uses a fixed non-sensitive token list or a fixed noise distribution, which introduces the risk of attacks or semantic distortion. We argue that the token’s protection level should be adaptively adjusted according to its semantic-based information to balance the privacy-utility trade-off. In this paper, we propose DYNTEXT, an LDP-based Dynamic Text sanitization method for privacy-preserving LLM inference, which dynamically constructs semantic-aware adjacency lists of sensitive tokens to sample non-sensitive tokens for perturbation. Specifically, DYNTEXT first develops a semantic-based density modeling under DP to extract each token’s density information. We propose token-level smoothing sensitivity by combining the idea of global sensitivity (GS) and local sensitivity (LS), which dynamically adjusts the noise scale to avoid excessive noise in GS and privacy leakage in LS. Then, we dynamically construct an adjacency list for each sensitive token based on its semantic density information. Finally, we apply the replacement mechanism to sample non-sensitive, semantically similar tokens from the adjacency list to replace sensitive tokens. Experiments show that DYNTEXT outperforms strong baselines on three datasets.
pdf
bib
abs
InImageTrans: Multimodal LLM-based Text Image Machine Translation
Fei Zuo
|
Kehai Chen
|
Yu Zhang
|
Zhengshan Xue
|
Min Zhang
Multimodal large language models (MLLMs) have shown remarkable capabilities across various downstream tasks. However, when MLLMs are transferred to the text image machine translation (TiMT) task, preliminary experiments reveal that MLLMs suffer from serious repetition and omission hallucinations. To alleviate these issues, this paper first designs an efficient MLLM named InImageTrans for TiMT and then proposes a simple and effective method named multi-conditional direct preference optimization (mcDPO) for advancing the TiMT. Particularly, the proposed mcDPO not only guides the MLLM in rejecting repetition output by creating text output preference pairs automatically, but also guides the MLLM in paying more attention to text information in images by creating image input preference pairs. Furthermore, we build a high-quality benchmark called MCiT for comprehensively evaluating the TiMT capabilities of InImageTrans. Experimental results show that the proposed method significantly outperforms existing open-source MLLMs on MCiT.
pdf
bib
abs
FRAME: Boosting LLMs with A Four-Quadrant Multi-Stage Pretraining Strategy
Xuemiao Zhang
|
Feiyu Duan
|
Xu Liangyu
|
Yongwei Zhou
|
Sirui Wang
|
Rongxiang Weng
|
Jingang Wang
|
Xunliang Cai
Large language models (LLMs) have significantly advanced human language understanding and generation, with pretraining data quality and organization being crucial to their performance. Multi-stage pretraining is a promising approach, but existing methods often lack quantitative criteria for data partitioning and instead rely on intuitive heuristics. In this paper, we propose the novel Four-quadRAnt Multi-stage prEtraining strategy (FRAME), guided by the established principle of organizing the pretraining process into four stages to achieve significant loss reductions four times. This principle is grounded in two key findings: first, training on high Perplexity (PPL) data followed by low PPL data, and second, training on low PPL difference (PD) data followed by high PD data, each of which causes the loss to drop significantly twice and improves performance. By partitioning data into four quadrants and strategically organizing them, FRAME achieves a remarkable 16.8% average improvement over random across MMLU and CMMLU for the 3B model, effectively boosting LLM performance.
pdf
bib
abs
When Large Language Models Meet Speech: A Survey on Integration Approaches
Zhengdong Yang
|
Shuichiro Shimizu
|
Yahan Yu
|
Chenhui Chu
Recent advancements in large language models (LLMs) have spurred interest in expanding their application beyond text-based tasks. A large number of studies have explored integrating other modalities with LLMs, notably speech modality, which is naturally related to text. This paper surveys the integration of speech with LLMs, categorizing the methodologies into three primary approaches: text-based, latent-representation-based, and audio-token-based integration. We also demonstrate how these methods are applied across various speech-related applications and highlight the challenges in this field to offer inspiration for future research.
pdf
bib
abs
KE-MHISTO: Towards a Multilingual Historical Knowledge Extraction Benchmark for Addressing the Long-Tail Problem
Arianna Graciotti
|
Leonardo Piano
|
Nicolas Lazzari
|
Enrico Daga
|
Rocco Tripodi
|
Valentina Presutti
|
Livio Pompianu
Large Language Models (LLMs) face significant challenges when queried about long-tail knowledge, i.e., information that is rarely encountered during their training process. These difficulties arise due to the inherent sparsity of such data. Furthermore, LLMs often lack the ability to verify or ground their responses in authoritative sources, which can lead to plausible yet inaccurate outputs when addressing infrequent subject matter. Our work aims to investigate these phenomena by introducing KE-MHISTO, a multilingual benchmark for Entity Linking and Question Answering in the domain of historical music knowledge, available in both Italian and English. We demonstrate that KE-MHISTO provides significantly broader coverage of long-tail knowledge compared to existing alternatives. Moreover, it poses substantial challenges for state-of-the-art models. Our experiments reveal that smaller, multilingual models can achieve performance comparable to significantly larger counterparts, highlighting the potential of efficient, language-aware approaches for long-tail knowledge extraction. KE-MHISTO is available at: https://github.com/polifonia-project/KE-MHISTO.
pdf
bib
abs
TailorKV: A Hybrid Framework for Long-Context Inference via Tailored KV Cache Optimization
Dingyu Yao
|
Bowen Shen
|
Zheng Lin
|
Wei Liu
|
Jian Luan
|
Bin Wang
|
Weiping Wang
The Key-Value (KV) cache in generative large language models (LLMs) introduces substantial memory overhead. Existing works mitigate this burden by offloading or compressing the KV cache. However, loading the entire cache incurs significant latency due to PCIe bandwidth bottlenecks in CPU-GPU communication, while aggressive compression causes notable performance degradation. We identify that certain layers in the LLM need to maintain global information and are unsuitable for selective loading. In contrast, other layers primarily focus on a few tokens with dominant activations that potentially incur substantial quantization error. This observation leads to a key insight that loading dominant tokens and quantizing all tokens can complement each other. Building on this insight, we propose a hybrid compression method, TailorKV, which seamlessly integrates quantization and offloading. TailorKV develops an inference framework along with a hardware-friendly implementation that leverages these complementary characteristics. Extensive long-context evaluations show that TailorKV achieves nearly lossless performance under aggressive compression settings, outperforming the state-of-the-art. Particularly, the Llama-3.1-8B with 128k context can be served within a single RTX 3090 GPU, reaching 82 ms per token during decoding.
pdf
bib
abs
The Elephant in the Room: Exploring the Role of Neutral Words in Language Model Group-Agnostic Debiasing
Xinwei Guo
|
Jiashi Gao
|
Junlei Zhou
|
Jiaxin Zhang
|
Guanhua Chen
|
Xiangyu Zhao
|
Quanying Liu
|
Haiyan Wu
|
Xin Yao
|
Xuetao Wei
Large Language Models (LLMs) are increasingly integrated into our daily lives, raising significant ethical concerns, especially about perpetuating stereotypes. While group-specific debiasing methods have made progress, they often fail to address multiple biases simultaneously. In contrast, group-agnostic debiasing has the potential to mitigate a variety of biases at once, but remains underexplored. In this work, we investigate the role of neutral words (the group-agnostic component) in enhancing the group-agnostic debiasing process. We first reveal that neutral words are essential for preserving semantic modeling, and we propose 𝜖-DPCE, a method that incorporates a neutral word semantics-based loss function to effectively alleviate the deterioration of the Language Modeling Score (LMS) during the debiasing process. Furthermore, by introducing the SCM-Projection method, we demonstrate that SCM-based debiasing eliminates stereotypes by indirectly disrupting the association between attribute and neutral words in the Stereotype Content Model (SCM) space. Our experiments show that neutral words, which often embed multi-group stereotypical objects, play a key role in contributing to the group-agnostic nature of SCM-based debiasing.
pdf
bib
abs
LLMs Can Achieve High-quality Simultaneous Machine Translation as Efficiently as Offline
Biao Fu
|
Minpeng Liao
|
Kai Fan
|
Chengxi Li
|
Liang Zhang
|
Yidong Chen
|
Xiaodong Shi
When the complete source sentence is provided, Large Language Models (LLMs) perform excellently in offline machine translation even with a simple prompt “Translate the following sentence from [src lang] into [tgt lang]:”. However, in many real scenarios, the source tokens arrive in a streaming manner and simultaneous machine translation (SiMT) is required. In this setting, the efficiency and performance of decoder-only LLMs are significantly limited by their auto-regressive nature. To enable LLMs to achieve high-quality SiMT as efficiently as offline translation, we propose a novel paradigm that includes constructing supervised fine-tuning (SFT) data for SiMT, along with new training and inference strategies. To replicate the token input/output stream in SiMT, the source and target tokens are rearranged into an interleaved sequence, separated by special tokens according to varying latency requirements. This enables powerful LLMs to learn read and write operations adaptively, based on varying latency prompts, while still maintaining efficient auto-regressive decoding. Experimental results show that, even with limited SFT data, our approach achieves state-of-the-art performance across various SiMT benchmarks and different evaluation metrics, and preserves the original capabilities of offline translation. Moreover, our approach generalizes well to the document-level SiMT setting without requiring specific fine-tuning, even beyond the offline translation model.
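To make the interleaved read/write sequence more concrete, here is a minimal sketch that rearranges source and target tokens under a wait-k-style schedule: the writer stays at most `k` source tokens behind the reader, and the final target token is written only after the whole source has been read. The schedule, the `<R>`/`<W>` tag names, and the function name are hypothetical choices for illustration; the abstract does not specify the special tokens or the latency policy.

```python
def interleave(source, target, k, read_tag="<R>", write_tag="<W>"):
    # Build one training sequence that alternates read and write segments.
    # k is the assumed latency: how many source tokens are read before the first write.
    if k < 1:
        raise ValueError("k must be at least 1")
    seq = []
    s = 0  # number of source tokens read so far
    for t, tgt_token in enumerate(target):
        is_last = t == len(target) - 1
        # Read up to k tokens ahead of the writer; read everything before the last write.
        need = len(source) if is_last else min(len(source), t + k)
        if s < need:
            seq.append(read_tag)
            seq.extend(source[s:need])
            s = need
        seq.append(write_tag)
        seq.append(tgt_token)
    return seq
```

With `source = ["a", "b", "c"]` and `target = ["x", "y", "z"]`, `k=1` yields a strictly alternating sequence, while `k=3` reads the whole source first and then writes, which corresponds to the offline case.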
pdf
bib
abs
Beyond Completion: A Foundation Model for General Knowledge Graph Reasoning
Yin Hua
|
Zhiqiang Liu
|
Mingyang Chen
|
Zheng Fang
|
Chi Man Wong
|
Lingxiao Li
|
Chi Man Vong
|
Huajun Chen
|
Wen Zhang
In natural language processing (NLP) and computer vision (CV), the successful application of foundation models across diverse tasks has demonstrated their remarkable potential. However, despite the rich structural and textual information embedded in knowledge graphs (KGs), existing research on foundation models for KGs has primarily focused on their structural aspects, with most efforts restricted to in-KG tasks (e.g., knowledge graph completion, KGC). This limitation has hindered progress in addressing more challenging out-of-KG tasks. In this paper, we introduce MERRY, a foundation model for general knowledge graph reasoning, and investigate its performance across two task categories: in-KG reasoning tasks (e.g., KGC) and out-of-KG tasks (e.g., KG question answering, KGQA). We utilize not only the structural information but also the textual information in KGs. Specifically, we propose a multi-perspective Conditional Message Passing (CMP) encoding architecture to bridge the gap between textual and structural modalities, enabling their seamless integration. Additionally, we introduce a dynamic residual fusion module to selectively retain relevant textual information and a flexible edge scoring mechanism to adapt to diverse downstream tasks. Comprehensive evaluations on 28 datasets demonstrate that MERRY outperforms existing baselines in most scenarios, showcasing strong reasoning capabilities within KGs and excellent generalization to out-of-KG tasks such as KGQA.
pdf
bib
abs
Generative Error Correction for Emotion-aware Speech-to-text Translation
Zhengdong Yang
|
Sheng Li
|
Chenhui Chu
This paper explores emotion-aware speech-to-text translation (ST) using generative error correction (GER) by large language models (LLMs). Despite recent advancements in ST, the impact of emotional content has been overlooked. First, we enhance the translation of emotional speech by adopting the GER paradigm: we finetune an LLM to generate the translation based on the decoded N-best hypotheses. Moreover, we incorporate the emotion and sentiment labels into the LLM finetuning process to enable the model to consider the emotional content. In addition, we project the ST model’s latent representation into the LLM embedding space to further improve emotion recognition and translation. Experiments on an English-Chinese dataset show the effectiveness of the combination of GER, emotion and sentiment labels, and the projector for emotion-aware ST. Our code is available at https://github.com/N-Orien/EmoST.
pdf
bib
abs
SynapticRAG: Enhancing Temporal Memory Retrieval in Large Language Models through Synaptic Mechanisms
Yuki Hou
|
Haruki Tamoto
|
Qinghua Zhao
|
Homei Miyashita
Existing retrieval methods in Large Language Models show degradation in accuracy when handling temporally distributed conversations, primarily due to their reliance on simple similarity-based retrieval. Unlike existing memory retrieval methods that rely solely on semantic similarity, we propose SynapticRAG, which uniquely combines temporal association triggers with biologically-inspired synaptic propagation mechanisms. Our approach uses temporal association triggers and synaptic-like stimulus propagation to identify relevant dialogue histories. A dynamic leaky integrate-and-fire mechanism then selects the most contextually appropriate memories. Experiments on four datasets in English, Chinese and Japanese show that, compared to state-of-the-art memory retrieval methods, SynapticRAG achieves consistent improvements across multiple metrics, of up to 14.66 percentage points. This work bridges the gap between cognitive science and language model development, providing a new framework for memory management in conversational systems.
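For readers unfamiliar with leaky integrate-and-fire dynamics, the following minimal sketch shows the basic mechanism applied to memory selection: each memory accumulates incoming stimulus in a potential that leaks over time, and the memory is selected once the potential crosses a threshold. The fixed `decay` and `threshold` values, the data layout, and the function name are assumptions for illustration; the paper's mechanism is described as dynamic, and its actual parameterization is not given in the abstract.

```python
def lif_select(stimuli, decay=0.5, threshold=1.0):
    # stimuli maps a memory id to the stimulus it receives at each time step.
    # Returns the ids whose potential reaches the threshold, in input order.
    fired = []
    for mem_id, inputs in stimuli.items():
        potential = 0.0
        for x in inputs:
            # Leak part of the stored potential, then integrate the new stimulus.
            potential = decay * potential + x
            if potential >= threshold:
                fired.append(mem_id)
                break
    return fired
```

In this sketch, a memory receiving `[0.6, 0.8]` fires on the second step (potential 1.1), whereas one receiving `[0.3, 0.3, 0.3]` never reaches the threshold because the leak offsets the weak stimulus.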
pdf
bib
abs
Localizing and Mitigating Errors in Long-form Question Answering
Rachneet Singh Sachdeva
|
Yixiao Song
|
Mohit Iyyer
|
Iryna Gurevych
Long-form question answering (LFQA) aims to provide thorough and in-depth answers to complex questions, enhancing comprehension. However, such detailed responses are prone to hallucinations and factual inconsistencies, challenging their faithful evaluation. This work introduces HaluQuestQA, the first hallucination dataset with localized error annotations for human-written and model-generated LFQA answers. HaluQuestQA comprises 698 QA pairs with 1.8k span-level error annotations for five different error types by expert annotators, along with preference judgments. Using our collected data, we thoroughly analyze the shortcomings of long-form answers and find that they lack comprehensiveness and provide unhelpful references. We train an automatic feedback model on this dataset that predicts error spans with incomplete information and provides associated explanations. Finally, we propose a prompt-based approach, Error-Informed Refinement, that uses signals from the learned feedback model to refine generated answers, which we show reduces errors and improves the quality of the answers across multiple models. Furthermore, humans find the answers generated by our approach comprehensive and highly prefer them (84%) over the baseline answers.
pdf
bib
abs
EMGLLM: Data-to-Text Alignment for Electromyogram Diagnosis Generation with Medical Numerical Data Encoding
Zefei Long
|
Zhenbiao Cao
|
Wei Chen
|
Zhongyu Wei
Electromyography (EMG) tables are crucial for diagnosing muscle and nerve disorders, and advancing the automation of EMG diagnostics is significant for improving medical efficiency. EMG tables contain extensive continuous numerical data, which current Large Language Models (LLMs) often struggle to interpret effectively. To address this issue, we propose EMGLLM, a data-to-text model specifically designed for medical examination tables. EMGLLM employs the EMG Alignment Encoder to simulate how doctors compare test values with reference values, aligning the data into word embeddings that reflect the degree of health. Additionally, we construct ETM, a dataset comprising 17,250 real cases and their corresponding diagnostic results, to support medical data-to-text tasks. Experimental results on ETM demonstrate that EMGLLM outperforms various baseline models in understanding EMG tables and generating high-quality diagnoses, which represents an effective paradigm for automatic diagnosis generation from medical examination tables.
pdf
bib
abs
LLMVoX: Autoregressive Streaming Text-to-Speech Model for Any LLM
Sambal Shikhar
|
Mohammed Irfan Kurpath
|
Sahal Shaji Mullappilly
|
Jean Lahoud
|
Fahad Shahbaz Khan
|
Rao Muhammad Anwer
|
Salman Khan
|
Hisham Cholakkal
Recent advancements in speech-to-speech dialogue systems leverage LLMs for multimodal interactions, yet they remain hindered by fine-tuning requirements, high computational overhead, and text-speech misalignment. Existing speech-enabled LLMs often degrade conversational quality by modifying the LLM, thereby compromising its linguistic capabilities. In contrast, we propose LLMVoX, a lightweight 30M-parameter, LLM-agnostic, autoregressive streaming TTS system that generates high-quality speech with low latency, while fully preserving the capabilities of the base LLM. Our approach achieves a significantly lower Word Error Rate compared to speech-enabled LLMs, while operating at comparable latency. By decoupling speech synthesis from LLM processing via a multi-queue token streaming system, LLMVoX enables seamless, infinite-length dialogues. Its plug-and-play design also facilitates extension to various tasks with different backbones. Furthermore, LLMVoX generalizes to new languages with minimal dataset adaptation, attaining a low Character Error Rate on an Arabic speech task. Evaluations demonstrate that LLMVoX matches or surpasses existing speech-enabled LLMs in both speech quality and latency, while maintaining the original linguistic strengths of the LLM. Additionally, we have integrated LLMVoX with a Vision-Language Model to create an omni-model with speech, text, and vision capabilities, without requiring additional multimodal training.
pdf
bib
abs
Act2P: LLM-Driven Online Dialogue Act Classification for Power Analysis
Zhangwenbo Zhangwenbo
|
Wang Yuhan
In team communication, dialogue acts play a crucial role in helping team members understand each other’s intentions and revealing the roles and communication patterns within interactions. Although existing studies have focused on using Dialogue Act classification to capture the speaker’s intentions, few have explored the underlying power dynamics reflected by these dialogue acts. To this end, we present Act2P, an online Dialogue Act Classification and Dynamic Power Analysis framework based on large language models. The framework combines the zero-shot learning capability of LLMs and introduces an online feedback classification method that allows for online classification with iterative feedback to previous stages, achieving efficient and accurate classification without labeled data. Additionally, we propose the PowerRank algorithm, which quantifies power dynamics through a graph-based structure. Through comparative experiments with existing methods, we demonstrate the significant superiority of Act2P in online scenarios and successfully visualize dialogue power online, clearly presenting the distribution and dynamic transfer of power. This framework provides new scientific insights and practical tools for optimizing team collaboration.
pdf
bib
abs
MELABenchv1: Benchmarking Large Language Models against Smaller Fine-Tuned Models for Low-Resource Maltese NLP
Kurt Micallef
|
Claudia Borg
Large Language Models (LLMs) have demonstrated remarkable performance across various Natural Language Processing (NLP) tasks, largely due to their generalisability and ability to perform tasks without additional training. However, their effectiveness for low-resource languages remains limited. In this study, we evaluate the performance of 55 publicly available LLMs on Maltese, a low-resource language, using a newly introduced benchmark covering 11 discriminative and generative tasks. Our experiments highlight that many models perform poorly, particularly on generative tasks, and that smaller fine-tuned models often perform better across all tasks. From our multidimensional analysis, we investigate various factors impacting performance. We conclude that prior exposure to Maltese during pre-training and instruction-tuning emerges as the most important factor. We also examine the trade-offs between fine-tuning and prompting, highlighting that while fine-tuning requires a higher initial cost, it yields better performance and lower inference costs. Through this work, we aim to highlight the need for more inclusive language technologies and recommend that researchers working with low-resource languages consider more “traditional” language modelling approaches.
pdf
bib
abs
TRATES: Trait-Specific Rubric-Assisted Cross-Prompt Essay Scoring
Sohaila Eltanbouly
|
Salam Albatarni
|
Tamer Elsayed
Research on holistic Automated Essay Scoring (AES) has a long history; yet there has been notably little attention to assessing essays according to individual traits. In this work, we propose TRATES, a novel trait-specific and rubric-based cross-prompt AES framework that is generic yet specific to the underlying trait. The framework leverages a Large Language Model (LLM) that utilizes the trait grading rubrics to generate trait-specific features (represented by assessment questions), then assesses those features given an essay. The trait-specific features are eventually combined with generic writing-quality and prompt-specific features to train a simple classical regression model that predicts trait scores of essays from an unseen prompt. Experiments show that TRATES achieves a new state-of-the-art performance across all traits on a widely-used dataset, with the generated LLM-based features being the most significant.
pdf
bib
abs
DAST: Context-Aware Compression in LLMs via Dynamic Allocation of Soft Tokens
Shaoshen Chen
|
Yangning Li
|
Zishan Xu
|
Yongqin Zeng
|
Shunlong Wu
|
Xinshuo Hu
|
Zifei Shan
|
Xin Su
|
Jiwei Tang
|
Yinghui Li
|
Hai-Tao Zheng
Large Language Models (LLMs) face computational inefficiencies and redundant processing when handling long context inputs, prompting a focus on compression techniques. While existing semantic vector-based compression methods achieve promising performance, these methods fail to account for the intrinsic information density variations between context chunks, instead allocating soft tokens uniformly across context chunks. This uniform distribution inevitably diminishes allocation to information-critical regions. To address this, we propose Dynamic Allocation of Soft Tokens (DAST), a simple yet effective method that leverages the LLM’s intrinsic understanding of contextual relevance to guide compression. DAST combines perplexity-based local information with attention-driven global information to dynamically allocate soft tokens to the information-rich chunks, enabling effective, context-aware compression. Experimental results across multiple benchmarks demonstrate that DAST surpasses state-of-the-art methods.
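One way to picture non-uniform soft-token allocation is the sketch below, which mixes a per-chunk perplexity signal with a per-chunk attention signal and splits a fixed token budget in proportion to the mixed score. Everything here is an assumption for illustration: the linear mix with weight `alpha`, the premise that higher perplexity marks a more informative chunk, the largest-remainder rounding, and the function name are not taken from the abstract.

```python
def allocate_soft_tokens(ppl, attn, budget, alpha=0.5):
    # ppl[i] and attn[i] are positive per-chunk signals; budget is the total
    # number of soft tokens to distribute across chunks.
    def normalize(xs):
        total = sum(xs)
        return [x / total for x in xs]

    local = normalize(ppl)    # assumed local (perplexity-based) signal
    glob = normalize(attn)    # assumed global (attention-based) signal
    scores = [alpha * p + (1 - alpha) * a for p, a in zip(local, glob)]

    # Proportional split with largest-remainder rounding so the total equals budget.
    raw = [s * budget for s in scores]
    alloc = [int(r) for r in raw]
    leftover = budget - sum(alloc)
    by_remainder = sorted(range(len(raw)), key=lambda i: raw[i] - alloc[i], reverse=True)
    for i in by_remainder[:leftover]:
        alloc[i] += 1
    return alloc
```

For instance, with `ppl = [1, 1, 2]`, `attn = [1, 1, 2]`, and a budget of 8, the third chunk receives 4 soft tokens and the other two receive 2 each, rather than a uniform split.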
pdf
bib
abs
A Multi-Expert Structural-Semantic Hybrid Framework for Unveiling Historical Patterns in Temporal Knowledge Graphs
Yimin Deng
|
Yuxia Wu
|
Yejing Wang
|
Guoshuai Zhao
|
Li Zhu
|
Qidong Liu
|
Derong Xu
|
Zichuan Fu
|
Xian Wu
|
Yefeng Zheng
|
Xiangyu Zhao
|
Xueming Qian
Temporal knowledge graph reasoning aims to predict future events with knowledge of existing facts and plays a key role in various downstream tasks. Previous methods focused on either graph structure learning or semantic reasoning, failing to integrate dual reasoning perspectives to handle different prediction scenarios. Moreover, they lack the capability to capture the inherent differences between historical and non-historical events, which limits their generalization across different temporal contexts. To this end, we propose a **M**ulti-**E**xpert **S**tructural-**S**emantic **H**ybrid (MESH) framework that employs three kinds of expert modules to integrate both structural and semantic information, guiding the reasoning process for different events. Extensive experiments on three datasets demonstrate the effectiveness of our approach.
pdf
bib
abs
MWPO: Enhancing LLMs Performance through Multi-Weight Preference Strength and Length Optimization
Shiyue Xu
|
Fu Zhang
|
Jingwei Cheng
|
Linfeng Zhou
Direct Preference Optimization (DPO) has been proposed as an offline alternative to Reinforcement Learning from Human Feedback (RLHF). In DPO, each preference pair, which serves as the foundation for learning, is typically constructed by first generating multiple responses to the same instruction and then annotating them to indicate the preferred choice. However, when the responses are highly similar, the weak preference signal can introduce annotation noise, which may hinder model optimization. Additionally, DPO suffers from the drawback of over-optimizing for verbose generation. A potential reason is the presence of length bias in preference datasets, which can lead to length exploitation. To address these issues, we propose a DPO-based **m**ulti-**w**eight **p**reference strength and length **o**ptimization (MWPO) method. Specifically, we propose to reweight preference pairs based on implicit reward margins and response length margins, unifying them through a geometric mixture to generate synthetic weights for optimization. This method allows preference pairs with stronger preference signals or more favorable length features to have a more pronounced impact on model parameters. Moreover, our method does not require additional annotators. We validate our method on models of four different scales across multiple benchmarks. Our method surpasses state-of-the-art (SOTA) baselines, outperforming DPO by up to 8.7% on AlpacaEval 2 while reducing generation length by 9.4% in the Mistral setting. Our code is available at https://github.com/AIR-hl/MWPO.
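The two ingredients named above can be sketched as follows. The first function computes the standard DPO implicit reward margin between a chosen and a rejected response from policy and reference log-probabilities. The second combines two positive per-pair weights through a weighted geometric mean. How MWPO actually maps margins to weights is not stated in the abstract, so the mixing form, the parameter `lam`, and both function names are hypothetical.

```python
def implicit_reward_margin(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    # Arguments are sequence log-probabilities under the policy (pi_*) and the
    # reference model (ref_*). The DPO implicit reward is beta * (log pi - log ref).
    reward_chosen = beta * (pi_chosen - ref_chosen)
    reward_rejected = beta * (pi_rejected - ref_rejected)
    return reward_chosen - reward_rejected

def geometric_mixture(strength_weight, length_weight, lam=0.5):
    # Assumed combination: weighted geometric mean of two positive weights,
    # with lam in [0, 1] controlling the balance between them.
    return (strength_weight ** lam) * (length_weight ** (1.0 - lam))
```

A geometric mean stays small whenever either weight is small, so in this sketch a pair needs both a clear preference signal and a favorable length profile to receive a large weight.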
pdf
bib
abs
CLEAR: Character Unlearning in Textual and Visual Modalities
Alexey Dontsov
|
Dmitrii Korzh
|
Alexey Zhavoronkin
|
Boris Mikheev
|
Denis Bobkov
|
Aibek Alanov
|
Oleg Rogov
|
Ivan Oseledets
|
Elena Tutubalina
Machine Unlearning (MU) is critical for removing private or hazardous information from deep learning models. While MU has advanced significantly in unimodal (text or vision) settings, multimodal unlearning (MMU) remains underexplored due to the lack of open benchmarks for evaluating cross-modal data removal. To address this gap, we introduce CLEAR, the first open-source benchmark designed specifically for MMU. CLEAR contains 200 fictitious individuals and 3,700 images linked with corresponding question-answer pairs, enabling a thorough evaluation across modalities. We conduct a comprehensive analysis of 11 MU methods (e.g., SCRUB, gradient ascent, DPO) across four evaluation sets, demonstrating that jointly unlearning both modalities outperforms single-modality approaches. The dataset is available at https://huggingface.co/datasets/therem/CLEAR.
pdf
bib
abs
Assessing the Reasoning Capabilities of LLMs in the context of Evidence-based Claim Verification
John Dougrez-Lewis
|
Mahmud Elahi Akhter
|
Federico Ruggeri
|
Sebastian Löbbers
|
Yulan He
|
Maria Liakata
Although LLMs have shown great performance on Mathematics and Coding related reasoning tasks, the reasoning capabilities of LLMs regarding other forms of reasoning are still an open problem. Here, we examine the issue of reasoning from the perspective of claim verification. We propose a framework designed to break down any claim paired with evidence into atomic reasoning types that are necessary for verification. We use this framework to create RECV, the first claim verification benchmark, incorporating real-world claims, to assess the deductive and abductive reasoning capabilities of LLMs. The benchmark comprises three datasets, covering reasoning problems of increasing complexity. We evaluate three state-of-the-art proprietary LLMs under multiple prompt settings. Our results show that while LLMs can address deductive reasoning problems, they consistently fail in cases of abductive reasoning. Moreover, we observe that enhancing LLMs with rationale generation is not always beneficial. Nonetheless, we find that generated rationales are semantically similar to those provided by humans, especially in deductive reasoning cases.
pdf
bib
abs
Language Models Lack Temporal Generalization and Bigger is Not Better
Stella Verkijk
|
Piek Vossen
|
Pia Sommerauer
This paper presents elaborate testing of various LLMs on their generalization capacities. We finetune six encoder models that have been pretrained with very different data (varying in size, language, and period) on a challenging event detection task in Early Modern Dutch archival texts. Each model is finetuned with 5 seeds on 15 different data splits, resulting in 450 finetuned models. We also pre-train a domain-specific Language Model on the target domain and fine-tune and evaluate it in the same way to provide an upper bound. Our experimental setup allows us to look at underresearched aspects of generalizability, namely i) shifts at multiple places in a modeling pipeline, ii) temporal and crosslingual shifts and iii) generalization over different initializations. The results show that none of the models reaches domain-specific model performance, demonstrating their incapacity to generalize. mBERT reaches highest F1 performance, and is relatively stable over different seeds and datasplits, contrary to XLM-R. We find that contemporary Dutch models do not generalize well to Early Modern Dutch as they underperform compared to crosslingual as well as historical models. We conclude that encoder LLMs lack temporal generalization capacities and that bigger models are not better, since even a model pre-trained with five hundred GPUs on 2.5 terabytes of training data (XLM-R) underperforms considerably compared to our domain-specific model, pre-trained on one GPU and 6 GB of data. All our code, data, and the domain-specific model are openly available.
pdf
bib
abs
DiffLM: Controllable Synthetic Data Generation via Diffusion Language Models
Ying Zhou
|
Xinyao Wang
|
Yulei Niu
|
Yaojie Shen
|
Lexin Tang
|
Fan Chen
|
Ben He
|
Le Sun
|
Longyin Wen
Recent advancements in large language models (LLMs) have significantly enhanced their knowledge and generative capabilities, leading to a surge of interest in leveraging LLMs for high-quality data synthesis. However, synthetic data generation via prompting LLMs remains challenging due to LLMs’ limited understanding of target data distributions and the complexity of prompt engineering, especially for structured formatted data. To address these issues, we introduce DiffLM, a controllable data synthesis framework based on a variational autoencoder (VAE), which further (1) leverages diffusion models to preserve more information about the original distribution and format structure in the learned latent distribution and (2) decouples the learning of target distribution knowledge from the LLM’s generative objectives via a plug-and-play latent feature injection module. As we observed significant discrepancies between the VAE’s latent representations and the real data distribution, the latent diffusion module is introduced into our framework to learn a fully expressive latent distribution. Evaluations on seven real-world datasets with structured formatted data (i.e., Tabular, Code, and Tool data) demonstrate that DiffLM generates high-quality data, with performance on downstream tasks surpassing that of real data by 2%–7% in certain cases. Data and code are available at https://github.com/bytedance/DiffLM.
pdf
bib
abs
Uncertainty Unveiled: Can Exposure to More In-context Examples Mitigate Uncertainty for Large Language Models?
Yifei Wang
|
Yu Sheng
|
Linjing Li
|
Daniel Dajun Zeng
Recent advances in handling long sequences have unlocked new possibilities for long-context in-context learning (ICL). While existing research predominantly focuses on performance gains driven by additional in-context examples, the impact on the trustworthiness of generated responses remains underexplored. This paper addresses this gap by investigating how increased examples influence predictive uncertainty—an essential aspect in trustworthiness. We begin by systematically quantifying uncertainty across different “shot” configurations in ICL, emphasizing the role of example quantity. Through uncertainty decomposition, we introduce a novel perspective on performance enhancement, focusing on epistemic uncertainty (EU). Our results reveal that additional examples reduce total uncertainty in both simple and complex tasks by injecting task-specific knowledge, thereby diminishing EU and enhancing performance. For complex tasks, these advantages emerge only after addressing the increased noise and uncertainty associated with longer inputs. Finally, we investigate the progression of internal confidence across layers, uncovering the underlying mechanisms that drive the reduction in uncertainty.
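The abstract does not specify which decomposition it uses. A common entropy-based formulation, assumed here purely for illustration, splits total uncertainty into an aleatoric part (the average entropy of individual predictions) and an epistemic part (the remainder, which measures disagreement between predictions). In this minimal sketch, each prediction is assumed to come from a different sampled set of in-context demonstrations; the function names are hypothetical.

```python
import math


def entropy(p):
    """Shannon entropy (in nats) of a discrete probability distribution."""
    return -sum(x * math.log(x) for x in p if x > 0)


def decompose_uncertainty(predictions):
    """Split total uncertainty into aleatoric and epistemic components.

    predictions: list of probability distributions over the same label set,
    e.g. one per sampled demonstration set (an assumption of this sketch).
    Returns (total, aleatoric, epistemic).
    """
    n = len(predictions)
    k = len(predictions[0])
    # Average the predictive distributions label by label.
    mean = [sum(p[i] for p in predictions) / n for i in range(k)]
    total = entropy(mean)                                 # entropy of the averaged prediction
    aleatoric = sum(entropy(p) for p in predictions) / n  # average per-prediction entropy
    epistemic = total - aleatoric                         # disagreement between predictions
    return total, aleatoric, epistemic
```

Under this formulation, predictions that agree with one another yield zero epistemic uncertainty even if each is individually uncertain, whereas confident but conflicting predictions yield purely epistemic uncertainty.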
pdf
bib
abs
ToolSpectrum: Towards Personalized Tool Utilization for Large Language Models
Zihao Cheng
|
Hongru Wang
|
Zeming Liu
|
Yuhang Guo
|
Yuanfang Guo
|
Yunhong Wang
|
Haifeng Wang
While integrating external tools into large language models (LLMs) enhances their ability to access real-time information and domain-specific services, existing approaches focus narrowly on functional tool selection following user instructions while overlooking the critical role of context-aware personalization in tool selection. This oversight leads to suboptimal user satisfaction and inefficient tool utilization, particularly when overlapping toolsets require nuanced selection based on contextual factors. To bridge this gap, we introduce ToolSpectrum, a benchmark designed to evaluate LLMs’ capabilities in personalized tool utilization. Specifically, we formalize two key dimensions of personalization, user profile and environmental factors, and analyze their individual and synergistic impacts on tool selection. Through extensive experiments on ToolSpectrum, we demonstrate that personalized tool selection significantly improves user experience across diverse scenarios. However, even state-of-the-art LLMs exhibit limited ability to reason jointly about user profiles and environmental factors, often prioritizing one dimension at the expense of the other. Our findings underscore the necessity of context-aware personalization in tool-augmented LLMs and reveal critical limitations of current models. Our data and code will be released soon.
pdf
bib
abs
Reverse Preference Optimization for Complex Instruction Following
Xiang Huang
|
Ting-En Lin
|
Feiteng Fang
|
Yuchuan Wu
|
Hangyu Li
|
Yuzhong Qu
|
Fei Huang
|
Yongbin Li
Instruction following (IF) is a critical capability for large language models (LLMs). However, handling complex instructions with multiple constraints remains challenging. Previous methods typically select preference pairs based on the number of constraints they satisfy, introducing noise where chosen examples may fail to follow some constraints and rejected examples may excel in certain respects over the chosen ones. To address the challenge of aligning with multiple preferences, we propose a simple yet effective method called Reverse Preference Optimization (RPO). It mitigates noise in preference pairs by dynamically reversing the constraints within the instruction to ensure the chosen response is perfect, alleviating the burden of extensive sampling and filtering to collect perfect responses. Besides, reversal also enlarges the gap between chosen and rejected responses, thereby clarifying the optimization direction and making it more robust to noise. We evaluate RPO on two multi-turn IF benchmarks, Sysbench and Multi-IF, demonstrating average improvements over the DPO baseline of 4.6 and 2.5 points (on Llama-3.1 8B), respectively. Moreover, RPO scales effectively across model sizes (8B to 70B parameters), with the 70B RPO model surpassing GPT-4o.
pdf
bib
abs
MMS-LLaMA: Efficient LLM-based Audio-Visual Speech Recognition with Minimal Multimodal Speech Tokens
Jeong Hun Yeo
|
Hyeongseop Rha
|
Se Jin Park
|
Yong Man Ro
Audio-Visual Speech Recognition (AVSR) achieves robust speech recognition in noisy environments by combining auditory and visual information. However, recent Large Language Model (LLM) based AVSR systems incur high computational costs due to the high temporal resolution of audio-visual speech processed by LLMs. In this work, we introduce an efficient multimodal speech LLM framework that minimizes token length while preserving essential linguistic content. Our approach employs an early AV-fusion module for streamlined feature integration, an audio-visual speech Q-Former that dynamically allocates tokens based on input duration, and a refined query allocation strategy with a speech rate predictor to adjust token allocation according to the speaking speed of each audio sample. Extensive experiments on the LRS3 dataset show that our method achieves state-of-the-art performance with a WER of 0.72% while using only 3.5 tokens per second. Moreover, our approach not only reduces token usage by 86% compared to the previous multimodal speech LLM framework, but also improves computational efficiency by reducing FLOPs by 35.7%. The code and models are available at https://github.com/JeongHun0716/MMS-LLaMA.
pdf
bib
abs
Def-DTS: Deductive Reasoning for Open-domain Dialogue Topic Segmentation
Seungmin Lee
|
Yongsang Yoo
|
Minhwa Jung
|
Min Song
Dialogue Topic Segmentation (DTS) aims to divide dialogues into coherent segments. DTS plays a crucial role in various NLP downstream tasks, but suffers from chronic problems: data shortage, labeling ambiguity, and the increasing complexity of recently proposed solutions. Meanwhile, despite advances in Large Language Models (LLMs) and reasoning strategies, these have rarely been applied to DTS. This paper introduces Def-DTS: Deductive Reasoning for Open-domain Dialogue Topic Segmentation, which utilizes LLM-based multi-step deductive reasoning to enhance DTS performance and enable case studies using intermediate results. Our method employs a structured prompting approach for bidirectional context summarization, utterance intent classification, and deductive topic shift detection. In the intent classification process, we propose a generalizable intent list for domain-agnostic dialogue intent classification. Experiments in various dialogue settings demonstrate that Def-DTS consistently outperforms traditional and state-of-the-art approaches, with each subtask contributing to improved performance, particularly in reducing type 2 errors. We also explore the potential for autolabeling, emphasizing the importance of LLM reasoning techniques in DTS.
pdf
bib
abs
Exploring Jailbreak Attacks on LLMs through Intent Concealment and Diversion
Tiehan Cui
|
Yanxu Mao
|
Peipei Liu
|
Congying Liu
|
Datao You
Although large language models (LLMs) have achieved remarkable advancements, their security remains a pressing concern. One major threat is jailbreak attacks, where adversarial prompts bypass model safeguards to generate harmful or objectionable content. Researchers study jailbreak attacks to understand the security and robustness of LLMs. However, existing jailbreak attack methods face two main challenges: (1) an excessive number of iterative queries, and (2) poor generalization across models. In addition, recent jailbreak evaluation datasets focus primarily on question-answering scenarios, lacking attention to text generation tasks that require accurate regeneration of toxic content. To tackle these challenges, we propose two contributions: (1) **ICE**, a novel black-box jailbreak method that employs **I**ntent **C**oncealment and div**E**rsion to effectively circumvent security constraints. **ICE** achieves high attack success rates (ASR) with a single query, significantly improving efficiency and transferability across different models. (2) **BiSceneEval**, a comprehensive dataset designed for assessing LLM robustness in question-answering and text-generation tasks. Experimental results demonstrate that **ICE** outperforms existing jailbreak techniques, revealing critical vulnerabilities in current defense mechanisms. Our findings underscore the necessity of a hybrid security strategy that integrates predefined security mechanisms with real-time semantic decomposition to enhance the security of LLMs.
pdf
bib
abs
Verbosity-Aware Rationale Reduction: Sentence-Level Rationale Reduction for Efficient and Effective Reasoning
Joonwon Jang
|
Jaehee Kim
|
Wonbin Kweon
|
Seonghyeon Lee
|
Hwanjo Yu
Large Language Models (LLMs) rely on generating extensive intermediate reasoning units (e.g., tokens, sentences) to enhance final answer quality across a wide range of complex tasks. While this approach has proven effective, it inevitably incurs substantial inference costs. Previous methods that adopt token-level reduction without clear criteria perform poorly compared to models trained with complete rationales. To address this challenge, we propose a novel sentence-level rationale reduction framework that leverages a likelihood-based criterion, *verbosity*, to identify and remove redundant reasoning sentences. Unlike previous approaches, our method leverages *verbosity* to selectively remove redundant reasoning sentences while preserving reasoning capabilities. Our experimental results across various reasoning tasks demonstrate that our method improves performance by an average of 7.71% while reducing token generation by 19.87% compared to models trained with complete reasoning paths.
pdf
bib
abs
Exploring the Role of Mental Health Conversational Agents in Training Medical Students and Professionals: A Systematic Literature Review
Thushari Atapattu
|
Menasha Thilakaratne
|
Duc Nhan Do
|
Mahen Herath
|
Katrina E. Falkner
The integration of Artificial Intelligence (AI) into mental health education and training (MHET) has become a promising solution to meet the increasing demand for skilled mental health professionals. This systematic review analyses 38 studies on AI-powered conversational agents (CAs) in MHET, selected from a total of 1003 studies published between 2019 and 2024. Following the PRISMA protocol, we reviewed papers from computer science, medicine, and interdisciplinary databases, assessing key aspects such as technological approaches, data characteristics, application areas, and evaluation methodologies. Our findings reveal that AI-based approaches, including Large Language Models (LLMs), dominate the field, with training being the most prevalent application area. These technologies show promise in simulating therapeutic interactions but face challenges such as limited public datasets, lack of standardised evaluation frameworks, and difficulty in ensuring authentic emotional responses, along with gaps in ethical considerations and clinical efficacy. This review presents a comprehensive framework for understanding the role of CAs in MHET while providing valuable recommendations to guide future research.
pdf
bib
abs
Bandit-Based Prompt Design Strategy Selection Improves Prompt Optimizers
Rin Ashizawa
|
Yoichi Hirose
|
Nozomu Yoshinari
|
Kento Uchida
|
Shinichi Shirakawa
Prompt optimization aims to search for effective prompts that enhance the performance of large language models (LLMs). Although existing prompt optimization methods have discovered effective prompts, they often differ from sophisticated prompts carefully designed by human experts. Prompt design strategies, representing best practices for improving prompt performance, can be key to improving prompt optimization. Recently, a method termed the Autonomous Prompt Engineering Toolbox (APET) has incorporated various prompt design strategies into the prompt optimization process. In APET, the LLM must implicitly select and apply the appropriate strategies because prompt design strategies can have negative effects. This implicit selection may be suboptimal due to the limited optimization capabilities of LLMs. This paper introduces Optimizing Prompts with sTrategy Selection (OPTS), which implements explicit selection mechanisms for prompt design. We propose three mechanisms, including a Thompson sampling-based approach, and integrate them into EvoPrompt, a well-known prompt optimizer. Experiments optimizing prompts for two LLMs, Llama-3-8B-Instruct and GPT-4o mini, were conducted using BIG-Bench Hard. Our results show that the selection of prompt design strategies improves the performance of EvoPrompt, and the Thompson sampling-based mechanism achieves the best overall results. Our experimental code is provided at https://github.com/shiralab/OPTS.
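To make the idea of explicit strategy selection concrete, the following is a minimal sketch of generic Beta-Bernoulli Thompson sampling over prompt design strategies. It is not the OPTS implementation: the class name, the strategy labels, and the binary reward (whether applying a strategy improved the prompt's score) are all assumptions made for illustration.

```python
import random


class ThompsonStrategySelector:
    """Beta-Bernoulli Thompson sampling over a fixed set of strategies (hypothetical)."""

    def __init__(self, strategies, seed=None):
        self.strategies = list(strategies)
        # Beta(1, 1) prior for every strategy: no initial preference.
        self.successes = {s: 1.0 for s in self.strategies}
        self.failures = {s: 1.0 for s in self.strategies}
        self.rng = random.Random(seed)

    def select(self):
        """Sample a success rate per strategy and pick the highest sample."""
        samples = {
            s: self.rng.betavariate(self.successes[s], self.failures[s])
            for s in self.strategies
        }
        return max(samples, key=samples.get)

    def update(self, strategy, improved):
        """Record whether applying the strategy improved the prompt (assumed reward)."""
        if improved:
            self.successes[strategy] += 1.0
        else:
            self.failures[strategy] += 1.0
```

In an optimizer loop, `select()` would be called before each prompt mutation and `update()` after the new prompt has been scored, so that strategies that tend to help are chosen more often while the others are still explored occasionally.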
pdf
bib
abs
STORYTELLER: An Enhanced Plot-Planning Framework for Coherent and Cohesive Story Generation
Jiaming Li
|
Yukun Chen
|
Ziqiang Liu
|
Minghuan Tan
|
Lei Zhang
|
Yunshui Li
|
Run Luo
|
Longze Chen
|
Jing Luo
|
Ahmadreza Argha
|
Hamid Alinejad-Rokny
|
Wei Zhou
|
Min Yang
Stories are central to human culture, serving to share ideas, preserve traditions, and foster connections. Automatic story generation, a key advancement in artificial intelligence (AI), offers new possibilities for creating personalized content, exploring creative ideas, and enhancing interactive experiences. However, existing methods struggle to maintain narrative coherence and logical consistency. This disconnect compromises the overall storytelling experience, underscoring the need for substantial improvements. Inspired by human cognitive processes, we introduce Storyteller, a novel approach that systemically improves the coherence and consistency of automatically generated stories. Storyteller introduces a plot node structure based on linguistically grounded subject-verb-object (SVO) triplets, which capture essential story events and ensure a consistent logical flow. Unlike previous methods, Storyteller integrates two dynamic modules—the STORYLINE and narrative entity knowledge graph (NEKG)—that continuously interact with the story generation process. This integration produces structurally sound, cohesive and immersive narratives. Extensive experiments demonstrate that Storyteller significantly outperforms existing approaches, achieving an 84.33% average win rate through human preference evaluation. At the same time, it is also far ahead in other aspects including creativity, coherence, engagement, and relevance.
pdf
bib
abs
SelectLLM: Query-Aware Efficient Selection Algorithm for Large Language Models
Kaushal Kumar Maurya
|
Kv Aditya Srivatsa
|
Ekaterina Kochmar
Large language models (LLMs) have been widely adopted due to their remarkable performance across various applications, driving the accelerated development of a large number of diverse models. However, these individual LLMs show limitations in generalization and performance on complex tasks due to inherent training biases, model size constraints, and the quality or diversity of pre-training datasets. A promising direction is to efficiently harness the diverse capabilities of LLMs to overcome these individual limitations. To address these limitations, we introduce a novel LLM selection algorithm called SelectLLM, which efficiently directs input queries to the most suitable subset of LLMs from a large pool, ensuring that the selected models collectively provide accurate responses. SelectLLM employs a multi-label classifier and policy based on the classifier’s predictions and confidence scores in selecting an optimal, query-aware, and lightweight subset of LLMs. Our findings indicate that the proposed model outperforms existing ensemble-based baselines and achieves competitive performance with similarly sized top-performing LLMs while maintaining efficiency. Specifically, it achieves a huge reduction in inference latency on two challenging reasoning benchmarks: 13% on GSM8K and 70% on MMLU, compared to the top-performing baseline. Also, we establish a theoretical upper bound by an Oracle with LLMs and perform an in-depth linguistic analysis to understand the performance gap between the Oracle and SelectLLM.
pdf
bib
abs
SkyLLM: Cross-LLM-APIs Federation for Cost-effective Query Processing
Heng Zhao
|
Yifei Zhu
Large language models (LLMs) have demonstrated exceptional capabilities across a wide range of tasks, from text generation to complex problem-solving. LLM APIs provide easy access to these models by streamlining deployment and usage. Combining LLMs with complementary strengths has been shown to yield substantial performance gains over a monolithic LLM. However, invoking a fixed set of LLM APIs for each query incurs higher API costs and increased inference latency. To address these limitations, we propose SkyLLM, a system composed of a set of estimators and an API selector, which federates multiple LLM APIs and dynamically assigns a non-empty subset of these APIs to each query prior to inference under cost and latency budgets. The selected subset consists of either a single LLM or multiple LLMs. A single LLM efficiently handles simple queries at low cost, whereas multiple LLMs are employed for more complex queries to overcome performance limitations. We evaluate SkyLLM against individual LLMs and representative ensemble LLM methods from the literature. SkyLLM achieves the highest accuracy under a high budget. It can also be cost-effective, matching the most accurate individual LLM while cutting costs by 67.8%.
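The selection step can be pictured as a small constrained search. The sketch below is not the SkyLLM algorithm; it assumes per-query estimates of each API's probability of answering correctly, cost, and latency, and it further assumes that costs add up, that APIs are called in parallel (so latency is the maximum), and that errors are independent. All names and numbers are hypothetical.

```python
from itertools import combinations


def select_apis(estimates, cost_budget, latency_budget):
    """Pick a non-empty API subset that maximizes estimated accuracy within budgets.

    estimates: dict mapping API name -> (p_correct, cost, latency).
    Returns a tuple of API names, or None if no subset fits the budgets.
    """
    best, best_score = None, -1.0
    names = sorted(estimates)
    for size in range(1, len(names) + 1):
        for subset in combinations(names, size):
            cost = sum(estimates[n][1] for n in subset)
            latency = max(estimates[n][2] for n in subset)  # parallel calls (assumption)
            if cost > cost_budget or latency > latency_budget:
                continue
            # Probability that at least one API is correct, assuming independence.
            miss = 1.0
            for n in subset:
                miss *= 1.0 - estimates[n][0]
            score = 1.0 - miss
            # Strict improvement only, so smaller subsets win ties.
            if score > best_score + 1e-12:
                best, best_score = subset, score
    return best
```

Exhaustive enumeration is only practical for a handful of APIs; it is used here to keep the example short, not as a recommendation for large pools.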
pdf
bib
abs
Matina: A Culturally-Aligned Persian Language Model Using Multiple LoRA Experts
Sara Bourbour Hosseinbeigi
|
MohammadAli SeifKashani
|
Javad Seraj
|
Fatemeh Taherinezhad
|
Ali Nafisi
|
Fatemeh Nadi
|
Iman Barati
|
Hosein Hasani
|
Mostafa Amiri
|
Mostafa Masoudi
Large language models (LLMs) are powerful tools for a variety of applications, but to interact effectively with users, they must align with the cultural values and linguistic nuances of their audience. However, existing LLMs often fall short in adequately modeling underrepresented languages and cultures, such as Persian, limiting their applicability and acceptance. To address this, we construct diverse, high-quality datasets specifically tailored to Persian linguistic and cultural contexts, ensuring a more authentic and context-aware training process. Using these datasets, we develop Matina, a Persian-focused multi-expert model designed to embody Iranian cultural values and linguistic structures. Matina is trained by fine-tuning LLaMA3.1 8B-Instruct models across five domains: culinary, tourism, socio-culture, translation, and summarization. These experts are combined using a classifier to create a unified multi-expert system. By leveraging culturally aligned datasets, Matina outperforms baseline models in both task performance and user satisfaction, demonstrating the importance of data-driven cultural adaptation in LLM development.
pdf
bib
abs
PM3-KIE: A Probabilistic Multi-Task Meta-Model for Document Key Information Extraction
Birgit Kirsch
|
Héctor Allende-Cid
|
Stefan Rueping
Key Information Extraction (KIE) from visually rich documents is commonly approached as either fine-grained token classification or coarse-grained entity extraction. While token-level models capture spatial and visual cues, entity-level models better represent logical dependencies and align with real-world use cases. We introduce PM3-KIE, a probabilistic multi-task meta-model that incorporates both fine-grained and coarse-grained models. It serves as a lightweight reasoning layer that jointly predicts entities and all appearances in a document. PM3-KIE incorporates domain-specific schema constraints to enforce logical consistency and integrates large language models for semantic validation, thereby reducing extraction errors. Experiments on two public datasets, DeepForm and FARA, show that PM3-KIE outperforms three state-of-the-art models and a stacked ensemble, achieving a statistically significant 2% improvement in F1 score.
pdf
bib
abs
TechniqueRAG: Retrieval Augmented Generation for Adversarial Technique Annotation in Cyber Threat Intelligence Text
Ahmed Lekssays
|
Utsav Shukla
|
Husrev Taha Sencar
|
Md Rizwan Parvez
Accurately identifying adversarial techniques in security texts is critical for effective cyber defense. However, existing methods face a fundamental trade-off: they either rely on generic models with limited domain precision or require resource-intensive pipelines that depend on large labeled datasets and task-specific optimizations (such as custom hard-negative mining and denoising), resources rarely available in specialized domains. We propose TechniqueRAG, a domain-specific retrieval-augmented generation (RAG) framework that bridges this gap by integrating off-the-shelf retrievers, instruction-tuned LLMs, and minimal text–technique pairs. Our approach addresses data scarcity by fine-tuning only the generation component on limited in-domain examples, circumventing the need for resource-intensive retrieval training. While conventional RAG mitigates hallucination by coupling retrieval and generation, its reliance on generic retrievers often introduces noisy candidates, limiting domain-specific precision. To address this, we enhance retrieval quality and domain specificity through zero-shot LLM re-ranking, which explicitly aligns retrieved candidates with adversarial techniques. Experiments on multiple security benchmarks demonstrate that TechniqueRAG achieves state-of-the-art performance without extensive task-specific optimizations or labeled data, while comprehensive analysis provides further insights.
pdf
bib
abs
G2S: A General-to-Specific Learning Framework for Temporal Knowledge Graph Forecasting with Large Language Models
Long Bai
|
Zixuan Li
|
Xiaolong Jin
|
Jiafeng Guo
|
Xueqi Cheng
|
Tat-Seng Chua
Forecasting over Temporal Knowledge Graphs (TKGs), which predicts future facts based on historical ones, has received much attention. Recent studies have introduced Large Language Models (LLMs) for this task to enhance the models’ generalization abilities. However, these models perform forecasting by simultaneously learning two kinds of entangled knowledge in the TKG: (1) general patterns, i.e., invariant temporal structures shared across different scenarios; and (2) scenario information, i.e., factual knowledge involved in a specific scenario, such as entities and relations. As a result, the learning processes of these two kinds of knowledge may interfere with each other, which can impair the generalization abilities of the models. To enhance the generalization ability of LLMs on this task, we propose a General-to-Specific learning framework (G2S) that disentangles the learning processes of the above two kinds of knowledge. In the general learning stage, we mask the scenario information in different TKGs and convert it into anonymous temporal structures. After training on these structures, the model is able to capture the general patterns across different TKGs. In the specific learning stage, we inject the scenario information into the structures via either in-context learning or fine-tuning modes. Experimental results show that G2S effectively improves the generalization abilities of LLMs.
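The masking step of the general learning stage can be illustrated with a small sketch. The abstract does not describe the exact anonymization scheme, so the version below is an assumption: it replaces entity and relation names in (subject, relation, object, time) quadruples with identifiers assigned in order of first appearance. The function name and example facts are hypothetical.

```python
def anonymize_history(quadruples):
    """Replace entity and relation names with order-of-appearance identifiers.

    quadruples: iterable of (subject, relation, object, timestamp) tuples.
    Returns the anonymized quadruples plus the entity and relation mappings,
    which could later be used to inject the scenario information back.
    """
    entity_ids, relation_ids = {}, {}
    anonymous = []
    for subj, rel, obj, time in quadruples:
        # setdefault assigns a fresh identifier only on first appearance.
        s_id = entity_ids.setdefault(subj, f"E{len(entity_ids)}")
        r_id = relation_ids.setdefault(rel, f"R{len(relation_ids)}")
        o_id = entity_ids.setdefault(obj, f"E{len(entity_ids)}")
        anonymous.append((s_id, r_id, o_id, time))
    return anonymous, entity_ids, relation_ids
```

Two histories that involve different entities but share the same temporal pattern map to the same anonymous structure, which is the property that lets a model learn scenario-independent patterns.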
pdf
bib
abs
Disentangling Reasoning Tokens and Boilerplate Tokens For Language Model Fine-tuning
Ziang Ye
|
Zhenru Zhang
|
Yang Zhang
|
Jianxin Ma
|
Junyang Lin
|
Fuli Feng
When using agent-task datasets to enhance agent capabilities for Large Language Models (LLMs), current methodologies often treat all tokens within a sample equally. However, we argue that tokens serving different roles—specifically, reasoning tokens versus boilerplate tokens (e.g., those governing output format)—differ significantly in importance and learning complexity, necessitating their disentanglement and distinct treatment. To address this, we propose a novel Shuffle-Aware Discriminator (SHAD) for adaptive token discrimination. SHAD classifies tokens by exploiting predictability differences observed after shuffling input-output combinations across samples: boilerplate tokens, due to their repetitive nature among samples, maintain predictability, whereas reasoning tokens do not. Using SHAD, we propose the Reasoning-highlighted Fine-Tuning (RFT) method, which adaptively emphasizes reasoning tokens during fine-tuning, yielding notable performance gains over common Supervised Fine-Tuning (SFT).
pdf
bib
abs
APT: Improving Specialist LLM Performance with Weakness Case Acquisition and Iterative Preference Training
Jun Rao
|
Zepeng Lin
|
Xuebo Liu
|
Xiaopeng Ke
|
Lian Lian
|
Dong Jin
|
Shengjun Cheng
|
Jun Yu
|
Min Zhang
Large Language Models (LLMs) often require domain-specific fine-tuning to address targeted tasks, which risks degrading their general capabilities. Maintaining a balance between domain-specific enhancements and general model utility is a key challenge. This paper proposes a novel approach named APT (Weakness Case Acquisition and Iterative Preference Training) to enhance domain-specific performance with self-generated dis-preferred weakness data (bad cases and similar cases). APT uniquely focuses on training the model using only those samples where errors occur, alongside a small, similar set of samples retrieved for this purpose. This targeted training minimizes interference with the model’s existing knowledge base, effectively retaining generic capabilities. Experimental results on the LLama-2 and Mistral-V0.3 models across various benchmarks demonstrate that APT ensures no reduction in generic capacity and achieves superior performance on downstream tasks compared to various existing methods. This validates our method as an effective strategy for enhancing domain-specific capabilities without sacrificing the model’s broader applicability.
pdf
bib
abs
EasyEA: Large Language Model is All You Need in Entity Alignment Between Knowledge Graphs
Jingwei Cheng
|
Chenglong Lu
|
Linyan Yang
|
Guoqing Chen
|
Fu Zhang
Entity alignment (EA) aims to identify entities in different knowledge graphs (KGs) that represent the same real-world objects. Traditional EA methods typically embed entity information into vector space under the guidance of seed entity pairs, and align entities by calculating and comparing the similarity between entity embeddings. With the advent of large language models (LLMs), emerging methods are increasingly integrating LLMs with traditional methods to leverage external knowledge and improve EA accuracy. However, this integration also introduces additional computational complexity and operational overhead, and still requires seed pairs that are scarce and expensive to obtain. To address these challenges, we propose EasyEA, the first end-to-end EA framework based on LLMs that requires no training. EasyEA consists of three main stages: (1) Information Summarization, (2) Embedding and Feature Fusion, and (3) Candidate Selection. By automating the EA process, EasyEA significantly reduces the reliance on seed entity pairs while demonstrating superior performance across various datasets, covering crosslingual, sparse, large-scale, and heterogeneous scenarios. Extensive experimental results show that EasyEA not only simplifies the EA process but also achieves state-of-the-art (SOTA) performance on diverse datasets, providing a promising solution for advancing EA tasks.
pdf
bib
abs
An Adaptive Multi-Threshold Loss and a General Framework for Collaborating Losses in Document-Level Relation Extraction
Huangming Xu
|
Fu Zhang
|
Jingwei Cheng
The goal of document-level relation extraction (DocRE) is to identify relations for a given entity pair within a document. As a multilabel classification task, the most commonly employed method involves introducing an adaptive threshold. Specifically, for an entity pair, if the scores of predicted relations exceed the threshold, the relations exist. However, we observe two phenomena that significantly weaken the model’s performance in DocRE: (1) as the label space (the number of relations) expands, the model’s performance gradually declines; (2) the model tends to prioritize predicting high-frequency relations in the long-tail problem. To address these challenges, we propose an innovative **A**daptive **M**ulti-**T**hreshold **L**oss (AMTL), which, for the first time, partitions the label space into different sub-label spaces (thus reducing its overall size) and learns an adaptive threshold for each sub-label space. This approach allows for more precise tuning of the model’s sensitivity to diverse relations, mitigating the performance degradation associated with label space expansion and the long-tail problem. Moreover, our adaptive multi-threshold method can be considered a general framework that seamlessly integrates different losses in different sub-label spaces, facilitating the concurrent application of multiple losses. Experimental results demonstrate that AMTL significantly enhances the performance of existing DocRE models across four datasets, achieving state-of-the-art results. Experiments on the concurrent application of multiple losses show that our framework performs stably and outperforms single-loss methods. Code is available at https://github.com/xhm-code/AMTL.
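At inference time, the per-sub-space thresholding rule described above amounts to comparing each relation's score against the threshold of the sub-label space it belongs to. The sketch below illustrates only that decision rule, not the AMTL loss itself; the partition into "frequent" and "rare" relations, the scores, and the function name are hypothetical.

```python
def predict_relations(scores, thresholds, partition):
    """Return relations whose score exceeds the threshold of their sub-label space.

    scores: dict mapping relation -> model score for one entity pair.
    thresholds: dict mapping sub-label space -> learned threshold score.
    partition: dict mapping relation -> sub-label space it belongs to.
    """
    return sorted(
        relation
        for relation, score in scores.items()
        if score > thresholds[partition[relation]]
    )


# Hypothetical example: a lower threshold for the rare sub-space lets a
# long-tail relation through that a single global threshold would reject.
scores = {"born_in": 2.0, "capital_of": 0.5, "spouse": 0.2, "sibling": -1.0}
partition = {"born_in": "frequent", "capital_of": "frequent",
             "spouse": "rare", "sibling": "rare"}
multi = predict_relations(scores, {"frequent": 1.0, "rare": 0.0}, partition)
single = predict_relations(scores, {"frequent": 1.0, "rare": 1.0}, partition)
```

Setting both thresholds to the same value recovers the single-threshold behaviour, which makes the effect of the partition easy to compare.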
pdf
bib
abs
RoleMRC: A Fine-Grained Composite Benchmark for Role-Playing and Instruction-Following
Junru Lu
|
Jiazheng Li
|
Guodong Shen
|
Lin Gui
|
Siyu An
|
Yulan He
|
Di Yin
|
Xing Sun
Role-playing is important for Large Language Models (LLMs) to follow diverse instructions while maintaining role identity and the role’s pre-defined ability limits. Existing role-playing datasets mostly contribute to controlling role style and knowledge boundaries, but overlook role-playing in instruction-following scenarios. We introduce a fine-grained role-playing and instruction-following composite benchmark, named RoleMRC, including: (1) Multi-turn dialogues between ideal roles and humans, including free chats or discussions upon given passages; (2) Role-playing machine reading comprehension, involving response, refusal, and attempts according to passage answerability and role ability; (3) More complex scenarios with nested, multi-turn and prioritized instructions. The final RoleMRC features a 10.2k role profile meta-pool, 37.9k well-synthesized role-playing instructions, and 1.4k testing samples. We develop a pipeline to quantitatively evaluate the fine-grained role-playing and instruction-following capabilities of several mainstream LLMs, as well as models that are fine-tuned on our data. Moreover, cross-evaluation on external role-playing datasets confirms that models fine-tuned on RoleMRC enhance instruction-following without compromising general role-playing and reasoning capabilities. We also probe the neural-level activation maps of different capabilities over post-tuned LLMs. RoleMRC, RoleMRC-mix, and our code are available at https://github.com/LuJunru/RoleMRC.
pdf
bib
abs
C²RBench: A Chinese Complex Reasoning Benchmark for Large Language Models
Junru Wu
|
Tianhao Shen
|
Linxi Su
|
Deyi Xiong
Large language models (LLMs) have achieved remarkable progress in autonomous reasoning, evolving from basic text processing to sophisticated multimodal reasoning, a critical capability for general-purpose AI assistants. However, existing benchmarks usually fail to adequately capture the intricate multi-step reasoning demands inherent in real-world scenarios. To bridge this gap, we propose **C²RBench**: a **C**hinese **C**omplex **R**easoning **Bench**mark for evaluating multi-step, multimodal advanced reasoning capability of LLMs. C²RBench comprises 1,115 carefully curated Chinese tasks, which are organized into eight domain-specific subsets, each meticulously designed to mirror real-world challenges. This hierarchical benchmark features three difficulty tiers based on the number of reasoning steps required (average 8.44 steps per task), significantly exceeding existing benchmarks in cognitive complexity. Extensive evaluations of 20 LLMs (including DeepSeek-R1) and 24 multimodal large language models (MLLMs) on C²RBench reveal critical performance gaps: GPT-4.1 achieves only 52.11% accuracy, indicating substantial room for improvement. The dataset and evaluation code are publicly available.
pdf
bib
abs
Unlocking LLMs’ Self-Improvement Capacity with Autonomous Learning for Domain Adaptation
Ke Ji
|
Junying Chen
|
Anningzhe Gao
|
Wenya Xie
|
Xiang Wan
|
Benyou Wang
Self-supervised pre-training and instruction fine-tuning demonstrate the potential of large language models (LLMs) for domain adaptation (DA). In pursuit of superhuman performance, LLMs have demonstrated significant potential in math and coding through self-improvement algorithms that rely on iterative training with self-generated data. This success stems from the clear reward signals in these environments, which provide a solid foundation for self-improvement. However, when it comes to general DA scenarios, two main challenges emerge: 1) ambiguous self-improvement reward signals and 2) lack of high-quality instruction fine-tuning datasets. This motivates the question this paper addresses: how can LLMs adapt autonomously to new domains using only a large amount of unlabeled target corpora? Inspired by the human practice of self-reflection through open- and closed-book exercises to achieve domain generalization, we propose autonomous learning, which creates a self-improvement learning environment for DA. Here, the model generates questions from documents and conducts two explorations, one with the original document and one with a masked version. By comparing these explorations, the LLM can independently identify and enhance its policy for reducing knowledge gaps. Experiments across various DA tasks demonstrate that autonomous learning enhances the DA performance of existing models, outperforming traditional fine-tuning and self-improvement methods. Our code is publicly available at https://github.com/FreedomIntelligence/AL.
pdf
bib
abs
How Personality Traits Shape LLM Risk-Taking Behaviour
John Hartley
|
Conor Brian Hamill
|
Dale Seddon
|
Devesh Batra
|
Ramin Okhrati
|
Raad Khraishi
Large Language Models (LLMs) are increasingly deployed as autonomous agents for simulation and decision-making, necessitating a deeper understanding of their decision-making behaviour under risk. We investigate the relationship between LLMs’ personality traits and risk-propensity, applying Cumulative Prospect Theory (CPT) and the Big Five personality framework. We compare the behaviour of several LLMs to human baselines. Our findings show that the majority of the models investigated are risk-neutral rational agents, whilst displaying higher Conscientiousness and Agreeableness traits, coupled with lower Neuroticism. Interventions on Big Five traits, particularly Openness, influence the risk-propensity of several LLMs. Advanced models mirror human personality-risk patterns, suggesting that cognitive biases can be surfaced by optimal prompting. However, their distilled variants show no cognitive bias, suggesting limitations to knowledge transfer processes. Notably, Openness emerges as the most influential factor to risk-propensity, aligning with human baselines. In contrast, less advanced models demonstrate inconsistent generalization of the personality-risk relationship. This research advances our understanding of LLM behaviour under risk and highlights the potential and limitations of personality-based interventions in shaping LLM decision-making.
pdf
bib
abs
Word-Level Detection of Code-Mixed Hate Speech with Multilingual Domain Transfer
Karin Niederreiter
|
Dagmar Gromann
The exponential growth of offensive language on social media tends to fuel online harassment and challenges detection mechanisms. Hate speech detection is commonly treated as a monolingual or multilingual sentence-level classification task. However, profane language tends to contain code-mixing, a combination of more than one language, which requires a more nuanced detection approach than binary classification. A general lack of available code-mixed datasets aggravates the problem. To address this issue, we propose five word-level annotated hate speech datasets, EN and DE from social networks, one subset of the DE-EN Offensive Content Detection Code-Switched Dataset, one DE-EN code-mixed German rap lyrics held-out test set, and a cross-domain held-out test set. We investigate the capacity of fine-tuned German-only, German-English bilingual, and German-English code-mixed token classification XLM-R models to generalize to code-mixed hate speech in German rap lyrics in zero-shot domain transfer as well as across different domains. The results show that bilingual fine-tuning facilitates not only the detection of code-mixed hate speech, but also neologisms, addressing the inherent dynamics of profane language use.
pdf
bib
abs
Evaluation of Attribution Bias in Generator-Aware Retrieval-Augmented Large Language Models
Amin Abolghasemi
|
Leif Azzopardi
|
Seyyed Hadi Hashemi
|
Maarten de Rijke
|
Suzan Verberne
Attributing answers to source documents is an approach used to enhance the verifiability of a model’s output in retrieval-augmented generation (RAG). Prior work has mainly focused on improving and evaluating the attribution quality of large language models (LLMs) in RAG, but this may come at the expense of inducing biases in the attribution of answers. We define and examine two aspects in the evaluation of LLMs in RAG pipelines, namely attribution sensitivity and bias with respect to authorship information. We explicitly inform an LLM about the authors of source documents, instruct it to attribute its answers, and analyze (i) how sensitive the LLM’s output is to the author of source documents, and (ii) whether the LLM exhibits a bias towards human-written or AI-generated source documents. We design an experimental setup in which we use counterfactual evaluation to study three LLMs in terms of their attribution sensitivity and bias in RAG pipelines. Our results show that adding authorship information to source documents can significantly change the attribution quality of LLMs by 3 to 18%. We show that LLMs can have an attribution bias towards explicit human authorship, which can serve as a competing hypothesis for findings of prior work showing that LLM-generated content may be preferred over human-written content. Our findings indicate that metadata of source documents can influence LLMs’ trust and how they attribute their answers. Furthermore, our research highlights attribution bias and sensitivity as a novel aspect of the vulnerability of LLMs.
pdf
bib
abs
Implicit Cross-Lingual Rewarding for Efficient Multilingual Preference Alignment
Wen Yang
|
Junhong Wu
|
Chen Wang
|
Chengqing Zong
|
Jiajun Zhang
Direct Preference Optimization (DPO) has become a prominent method for aligning Large Language Models (LLMs) with human preferences. While DPO has enabled significant progress in aligning English LLMs, multilingual preference alignment is hampered by data scarcity. To address this, we propose a novel approach that captures learned preferences from well-aligned English models by implicit rewards and transfers them to other languages through iterative training. Specifically, we derive an implicit reward model from the logits of an English DPO-aligned model and its corresponding reference model. This reward model is then leveraged to annotate preference relations in cross-lingual instruction-following pairs, using English instructions to evaluate multilingual responses. The annotated data is subsequently used for multilingual DPO fine-tuning, facilitating preference knowledge transfer from English to other languages. Fine-tuning Llama3 for two iterations resulted in a 12.72% average improvement in Win Rate and a 5.97% increase in Length Control Win Rate across all training languages on the X-AlpacaEval leaderboard. Our findings demonstrate that leveraging existing English-aligned models can enable efficient and effective multilingual preference alignment, significantly reducing the need for extensive multilingual preference data.
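The implicit reward mentioned in this abstract can be sketched with the standard DPO formulation, in which the reward of a response is a scaled log-probability ratio between the aligned policy and its reference model. The snippet below is a minimal illustration under that assumption, not the authors' pipeline: the function names, the `beta` value, and the token log-probabilities are hypothetical, and real log-probabilities would come from the two models' logits.

```python
from typing import Dict, List, Tuple


def implicit_reward(
    policy_token_logps: List[float],
    ref_token_logps: List[float],
    beta: float = 0.1,
) -> float:
    """DPO-style implicit reward: beta * (log pi_policy(y|x) - log pi_ref(y|x)).

    Sequence log-probabilities are the sums of per-token log-probabilities.
    """
    return beta * (sum(policy_token_logps) - sum(ref_token_logps))


def annotate_pair(
    responses: Dict[str, Dict[str, List[float]]],
    beta: float = 0.1,
) -> Tuple[str, str]:
    """Return (chosen, rejected) response ids according to the implicit reward."""
    ranked = sorted(
        responses,
        key=lambda rid: implicit_reward(
            responses[rid]["policy"], responses[rid]["ref"], beta
        ),
        reverse=True,
    )
    return ranked[0], ranked[-1]


# Made-up token log-probabilities for two candidate responses to one instruction.
responses = {
    "a": {"policy": [-0.2, -0.3], "ref": [-0.5, -0.6]},
    "b": {"policy": [-1.0, -1.2], "ref": [-0.9, -1.0]},
}
reward_a = implicit_reward(responses["a"]["policy"], responses["a"]["ref"])
chosen, rejected = annotate_pair(responses)
print(chosen, rejected, round(reward_a, 4))
```

Pairs labeled this way could then serve as preference data for DPO fine-tuning in another language, which is the transfer step the abstract describes.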
pdf
bib
abs
Diagnosing Failures in Large Language Models’ Answers: Integrating Error Attribution into Evaluation Framework
Zishan Xu
|
Shuyi Xie
|
Qingsong Lv
|
Shupei Xiao
|
Linlin Song
|
Sui Wenjuan
|
Fan Lin
With the widespread application of Large Language Models (LLMs) in various tasks, the mainstream LLM platforms generate massive user-model interactions daily. In order to efficiently analyze the performance of models and diagnose failures in their answers, it is essential to develop an automated framework to systematically categorize and attribute errors. However, existing evaluation models lack error attribution capability. In this work, we establish a comprehensive Misattribution Framework with 6 primary and 15 secondary categories to facilitate in-depth analysis. Based on this framework, we present AttriData, a dataset specifically designed for error attribution, encompassing misattribution, along with the corresponding scores and feedback. We also propose MisAttributionLLM, a fine-tuned model on AttriData, which is the first general-purpose judge model capable of simultaneously generating score, misattribution, and feedback. Extensive experiments and analyses are conducted to confirm the effectiveness and robustness of our proposed method.
pdf
bib
abs
Encode Errors: Representational Retrieval of In-Context Demonstrations for Multilingual Grammatical Error Correction
Guangyue Peng
|
Wei Li
|
Wen Luo
|
Houfeng Wang
Grammatical Error Correction (GEC) involves detecting and correcting the wrong usage of grammar. While large language models (LLMs) with in-context learning (ICL) capabilities have shown significant progress on various natural language processing (NLP) tasks, their few-shot performance on GEC remains suboptimal. This is mainly due to the challenge of retrieving suitable in-context demonstrations that capture error patterns instead of semantic similarity. In this paper, we demonstrate that LLMs can inherently capture information related to grammatical errors through their internal states. From these states, we extract the Grammatical Error Representation (GER), an informative and semantically neutral encoding of grammatical errors. Our novel GER-based retrieval method significantly boosts performance in ICL settings on multilingual GEC datasets, improving the precision of correction. For high-resource languages, our results on 8B-sized open-source models match those of closed-source models such as Deepseek2.5 and GPT-4o-mini. For low-resource languages, our F0.5 scores surpass the baseline by up to a factor of 1.2. This method provides a more precise and resource-efficient solution for multilingual GEC, offering a promising direction for interpretable GEC research.
pdf
bib
abs
Preference Curriculum: LLMs Should Always Be Pretrained on Their Preferred Data
Xuemiao Zhang
|
Xu Liangyu
|
Feiyu Duan
|
Yongwei Zhou
|
Sirui Wang
|
Rongxiang Weng
|
Jingang Wang
|
Xunliang Cai
Large language models (LLMs) generally utilize a consistent data distribution throughout the pretraining process. However, as the model’s capability improves, it is intuitive that its data preferences dynamically change, indicating the need for pretraining with different data at various training stages. To achieve this, we propose the Perplexity Difference (PD) based Preference Curriculum learning (PDPC) framework, which always perceives and uses the data preferred by LLMs to train and boost them. First, we introduce the PD metric to quantify the difference in how challenging a sample is for weak versus strong models. Samples with high PD are more challenging for weak models to learn and are more suitable to be arranged in the later stage of pretraining. Second, we propose the preference function to approximate and predict the data preference of the LLM at any training step, so as to complete the arrangement of the dataset offline and ensure continuous training without interruption. Experimental results on 1.3B and 3B models demonstrate that PDPC significantly surpasses baselines. Notably, the 3B model trained on 1T tokens achieves an increased average accuracy of over 8.1% across MMLU and CMMLU.
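The idea of a perplexity difference between a weak and a strong model can be made concrete with a small sketch. The abstract does not give the formula, so the relative difference used here is an assumption, as are the function names and the made-up token log-probabilities. Sorting by PD is a simplification that stands in for the preference function the abstract mentions.

```python
import math
from typing import Dict, List, Tuple


def perplexity(token_logps: List[float]) -> float:
    """Perplexity of a sample from its per-token natural-log probabilities."""
    return math.exp(-sum(token_logps) / len(token_logps))


def perplexity_difference(weak_logps: List[float], strong_logps: List[float]) -> float:
    """Assumed PD: relative perplexity gap between the weak and the strong model."""
    ppl_weak = perplexity(weak_logps)
    ppl_strong = perplexity(strong_logps)
    return (ppl_weak - ppl_strong) / ppl_weak


# Hypothetical samples: (weak-model log-probs, strong-model log-probs).
samples: Dict[str, Tuple[List[float], List[float]]] = {
    "hard_for_weak": ([-3.0, -3.0], [-1.0, -1.0]),
    "similar_for_both": ([-1.0, -1.0], [-0.9, -0.9]),
}

pd_scores = {name: perplexity_difference(w, s) for name, (w, s) in samples.items()}
# Low-PD samples first, high-PD samples later in training.
curriculum = sorted(pd_scores, key=pd_scores.get)
print(curriculum)
```

Under this assumed definition, a sample that only the strong model handles well receives a high PD and is scheduled late, matching the abstract's statement that high-PD samples suit the later stage of pretraining.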
pdf
bib
abs
Can Input Attributions Explain Inductive Reasoning in In-Context Learning?
Mengyu Ye
|
Tatsuki Kuribayashi
|
Goro Kobayashi
|
Jun Suzuki
Interpreting the internal process of neural models has long been a challenge. This challenge remains relevant in the era of large language models (LLMs) and in-context learning (ICL); for example, ICL poses a new issue of interpreting which of the few-shot examples contributed to identifying/solving the task. To this end, in this paper, we design synthetic diagnostic tasks of inductive reasoning, inspired by the generalization tests in linguistics; here, most in-context examples are ambiguous w.r.t. their underlying rule, and one critical example disambiguates the task demonstrated. The question is whether conventional input attribution (IA) methods can track such a reasoning process, i.e., identify the influential example, in ICL. Our experiments provide several practical findings; for example, a certain simple IA method works the best, and, in general, the larger the model, the harder it is to interpret the ICL with gradient-based IA methods.
pdf
bib
abs
Modal Dependency Parsing via Biaffine Attention with Self-Loop
Jayeol Chun
|
Nianwen Xue
A modal dependency structure represents a web of connections between events and sources of information in a document that allows for tracing of who-said-what with what levels of certainty, thereby establishing factuality in an event-centric approach. Obtaining such graphs defines the task of modal dependency parsing, which involves event and source identification along with the modal relations between them. In this paper, we propose a simple yet effective solution based on biaffine attention that specifically optimizes against the domain-specific challenges of modal dependency parsing by integrating self-loops. We show that our approach, when coupled with data augmentation that uses Large Language Models to translate annotations from one language to another, outperforms the previous state-of-the-art on English and Chinese datasets by 2% and 4% respectively.
pdf
bib
abs
Beyond Profile: From Surface-Level Facts to Deep Persona Simulation in LLMs
Zixiao Wang
|
Duzhen Zhang
|
Ishita Agarwal
|
Shen Gao
|
Le Song
|
Xiuying Chen
Previous approaches to persona simulation with large language models (LLMs) have typically relied on learning basic biographical information, or using limited role-play dialogue datasets to capture a character’s responses. However, a holistic representation of an individual goes beyond surface-level facts or conversations to deeper thoughts and thinking. In this work, we introduce CharacterBot, a model designed to replicate both the linguistic patterns and distinctive thought patterns as manifested in the textual works of a character. Using Lu Xun, a renowned Chinese writer, as a case study, we propose four training tasks derived from his 17 essay collections. These include a pre-training task focused on mastering external linguistic structures and knowledge, as well as three fine-tuning tasks: multiple-choice question answering, generative question answering, and style transfer, each aligning the LLM with Lu Xun’s internal ideation and writing style. To optimize learning across these tasks, we introduce a CharLoRA parameter updating mechanism, where a general linguistic style expert collaborates with other task-specific experts to better study both the language style and the understanding of deeper thoughts. We evaluate CharacterBot on three tasks for linguistic accuracy and opinion comprehension, demonstrating that it significantly outperforms the baselines on our adapted metrics. We hope this work inspires future research on deep character persona simulation with LLMs: https://github.com/zxwang63/characterbot
pdf
bib
abs
Measuring What Makes You Unique: Difference-Aware User Modeling for Enhancing LLM Personalization
Yilun Qiu
|
Xiaoyan Zhao
|
Yang Zhang
|
Yimeng Bai
|
Wenjie Wang
|
Hong Cheng
|
Fuli Feng
|
Tat-Seng Chua
Personalizing Large Language Models (LLMs) has become a critical step in facilitating their widespread application to enhance individual life experiences. In pursuit of personalization, distilling key preference information from an individual’s historical data as instructional preference context to customize LLM generation has emerged as a promising direction. However, these methods face a fundamental limitation by overlooking the inter-user comparative analysis, which is essential for identifying the inter-user differences that truly shape preferences. To address this limitation, we propose Difference-aware Personalization Learning (DPL), a novel approach that emphasizes extracting inter-user differences to enhance LLM personalization. DPL strategically selects representative users for comparison and establishes a structured standard to extract meaningful, task-relevant differences for customizing LLM generation. Extensive experiments on real-world datasets demonstrate that DPL significantly enhances LLM personalization. We release our code at https://github.com/SnowCharmQ/DPL.
pdf
bib
abs
VideoRAG: Retrieval-Augmented Generation over Video Corpus
Soyeong Jeong
|
Kangsan Kim
|
Jinheon Baek
|
Sung Ju Hwang
Retrieval-Augmented Generation (RAG) is a powerful strategy for improving the factual accuracy of models by retrieving external knowledge relevant to queries and incorporating it into the generation process. However, existing approaches primarily focus on text, with some recent advancements considering images, and they largely overlook videos, a rich source of multimodal knowledge capable of representing contextual details more effectively than any other modality. Also, while very recent studies explore the use of videos in response generation, they either predefine query-associated videos without retrieval or convert videos into textual descriptions, losing multimodal richness. To tackle these issues, we introduce VideoRAG, a novel framework that not only dynamically retrieves videos based on their relevance to queries but also utilizes both visual and textual information. The operation of VideoRAG is powered by recent Large Video Language Models (LVLMs), which enable the direct processing of video content to represent it for retrieval and the seamless integration of retrieved videos jointly with queries for response generation. Also, because the context size of LVLMs may not be sufficient to process all frames in extremely long videos and not all frames are equally important, we introduce a video frame selection mechanism to extract the most informative subset of frames, along with a strategy to extract textual information from videos (as it can aid the understanding of video content) when their subtitles are not available. We experimentally validate the effectiveness of VideoRAG, showing that it is superior to relevant baselines. Our code is available at https://github.com/starsuzi/VideoRAG.
pdf
bib
abs
Synergistic Augmentation: Enhancing Cross-Domain Zero-Shot Slot Filling with Small Model-Assisted Large Language Models
Weizhen Li
|
Junbao Huang
|
Peijie Huang
|
Yuhong Xu
|
Jiekun Fan
In real-world scenarios, cross-domain slot filling in spoken language understanding remains a significant challenge due to data scarcity. Previous works exhibit limited generalization ability in the target domain, demonstrating effective knowledge transfer only on seen slots while performing poorly on unseen slots. Although large language models (LLMs) can alleviate this issue to some extent, they underperform on seen slots compared to small models. To address these challenges, we introduce a novel framework that harnesses the power of a small model to augment the inferential capabilities of LLMs without additional training. Initially, we utilize target domain samples synthesized by LLMs as pre-calculated demonstrations, which are curated and chosen using confidence metrics derived from a small model. We further extract slot predictions from the small model to fully exploit its robust learning of familiar slots. Finally, during the inference process for test inputs, we integrate these demonstrations and slot prediction insights as references to enhance the slot filling performance of LLMs. Experiments on a slot filling dataset and a NER dataset including eight cross-domain settings show that our framework achieves the best results. Our code is publicly available at https://github.com/SIGSDSscau/SLSF.
pdf
bib
abs
A Classifier of Word-Level Variants in Witnesses of Biblical Hebrew Manuscripts
Iglika Nikolova-Stoupak
|
Maxime Amblard
|
Sophie Robert-Hayek
|
Frédérique Rey
The current project is inscribed within the field of stemmatology, or the study and/or reconstruction of textual transmission based on the relationship between the available witnesses of given texts. In particular, the variants (differences) at the word level in manuscripts written in Biblical Hebrew are considered. A strong classifier (F1 value of 0.80) is trained to predict the category of difference between word pairs (‘plus/minus’, ‘inversion’, ‘morphological’, ‘lexical’ or ‘unclassifiable’) as present in collated (aligned) pairs of witnesses. The classifier is non-neural and makes use of the two words themselves as well as part-of-speech (POS) tags, hand-crafted rules per category and synthetically derived data. Other models experimented with include neural ones based on the state-of-the-art model for Modern Hebrew, DictaBERT. Other features whose relevance is tested are different types of morphological information pertaining to the word pairs and the Levenshtein distance between words. A selection of the strongest classifiers as well as the synthetic data used and the steps taken in its derivation are made available. Coincidentally, the correlation between two sets of morphological labels is investigated: professionally established as per the Qumran-Digital online library and automatically derived with the sub-model DictaBERT-morph.
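One of the features mentioned in this abstract, the Levenshtein distance between the two words of a variant pair, can be computed with the standard dynamic-programming algorithm sketched below. The transliterated word pair is a hypothetical illustration of a one-letter spelling variant and is not taken from the paper's data.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, and substitutions."""
    # previous[j] holds the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance between a[:i] and the empty string
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,  # delete char_a
                    current[j - 1] + 1,  # insert char_b
                    previous[j - 1] + substitution_cost,  # substitute or match
                )
            )
        previous = current
    return previous[-1]


# Hypothetical transliterated variant pair differing by one inserted letter.
variant_distance = levenshtein("dwd", "dwyd")
print(variant_distance)
```

A small distance relative to word length may hint at a spelling or morphological variant rather than a lexical one, although the abstract only says that the relevance of this feature was tested.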
pdf
bib
abs
NOVA: An Iterative Planning Framework for Enhancing Scientific Innovation with Large Language Models
Xiang Hu
|
Hongyu Fu
|
Jinge Wang
|
Yifeng Wang
|
Zhikun Li
|
Renjun Xu
|
Yu Lu
|
Yaochu Jin
|
Lili Pan
|
Zhenzhong Lan
Scientific innovation is pivotal for humanity, and harnessing large language models (LLMs) to generate research ideas could transform discovery. However, existing LLMs often produce simplistic and repetitive suggestions due to their limited ability in acquiring external knowledge for innovation. To address this problem, we introduce an enhanced planning and search methodology designed to boost the creative potential of LLM-based systems. Our approach involves an iterative process to purposely plan the retrieval of external knowledge, progressively enriching the idea generation with broader and deeper insights. Validation through automated and human assessments demonstrates that our framework substantially elevates the quality of generated ideas, particularly in novelty and diversity. The number of unique novel ideas produced by our framework is 3.4 times higher than without it. Moreover, our method outperforms the current state-of-the-art, generating at least 2.5 times more top-rated ideas based on 170 seed papers in a Swiss Tournament evaluation. Our code is available at https://github.com/hflyzju/Nova
pdf
bib
abs
Query-Driven Multimodal GraphRAG: Dynamic Local Knowledge Graph Construction for Online Reasoning
Chenyang Bu
|
Guojie Chang
|
Zihao Chen
|
CunYuan Dang
|
Zhize Wu
|
Yi He
|
Xindong Wu
The increasing adoption of Large Language Models (LLMs) in complex reasoning tasks necessitates their interpretability and reliability. Recent advances to that end include retrieval-augmented generation (RAG) and knowledge graph-enhanced RAG (GraphRAG), but they are constrained by static knowledge bases and ineffective multimodal data integration. In response, we propose a Query-Driven Multimodal GraphRAG framework that dynamically constructs local knowledge graphs tailored to query semantics. Our approach 1) derives graph patterns from query semantics to guide knowledge extraction, 2) employs a multi-path retrieval strategy to pinpoint core knowledge, and 3) supplements missing multimodal information ad hoc. Experimental results on the MultimodalQA and WebQA datasets demonstrate that our framework achieves state-of-the-art performance among unsupervised competitors, particularly excelling in cross-modal understanding of complex queries.
pdf
bib
abs
A Survey of Uncertainty Estimation Methods on Large Language Models
Zhiqiu Xia
|
Jinxuan Xu
|
Yuqian Zhang
|
Hang Liu
Large language models (LLMs) have demonstrated remarkable capabilities across various tasks. However, these models could offer biased, hallucinated, or non-factual responses camouflaged by their fluency and realistic appearance. Uncertainty estimation is the key method to address this challenge. While research efforts in uncertainty estimation are ramping up, there is a lack of comprehensive and dedicated surveys on LLM uncertainty estimation. This survey presents four major avenues of LLM uncertainty estimation. Furthermore, we perform extensive experimental evaluations across multiple methods and datasets. Finally, we provide critical and promising future directions for LLM uncertainty estimation.
pdf
bib
abs
Beyond Single-Value Metrics: Evaluating and Enhancing LLM Unlearning with Cognitive Diagnosis
Yicheng Lang
|
Kehan Guo
|
Yue Huang
|
Yujun Zhou
|
Haomin Zhuang
|
Tianyu Yang
|
Yao Su
|
Xiangliang Zhang
Due to the widespread use of LLMs and the rising critical ethical and safety concerns, LLM unlearning methods have been developed to remove harmful knowledge and undesirable capabilities. In this context, evaluations are mostly based on single-value metrics such as QA accuracy. However, these metrics often fail to capture the nuanced retention of harmful knowledge components, making it difficult to assess the true effectiveness of unlearning. To address this issue, we propose UNCD (UNlearning evaluation using Cognitive Diagnosis), a novel framework that leverages Cognitive Diagnosis Modeling for fine-grained evaluation of LLM unlearning. Our dedicated benchmark, UNCD-Cyber, provides a detailed assessment of the removal of dangerous capabilities. Moreover, we introduce UNCD-Agent, which refines unlearning by diagnosing knowledge remnants and generating targeted unlearning data. Extensive experiments across eight unlearning methods and two base models demonstrate that UNCD not only enhances evaluation but also effectively facilitates the removal of harmful LLM abilities.
pdf
bib
abs
Natural Language Processing in Support of Evidence-based Medicine: A Scoping Review
Zihan Xu
|
Haotian Ma
|
Yihao Ding
|
Gongbo Zhang
|
Chunhua Weng
|
Yifan Peng
Evidence-based medicine (EBM) is at the forefront of modern healthcare, emphasizing the use of the best available scientific evidence to guide clinical decisions. Due to the sheer volume and rapid growth of medical literature and the high cost of curation, there is a critical need to investigate Natural Language Processing (NLP) methods to identify, appraise, synthesize, summarize, and disseminate evidence in EBM. This survey presents an in-depth review of 129 research studies on leveraging NLP for EBM, illustrating its pivotal role in enhancing clinical decision-making processes. The paper systematically explores how NLP supports the five fundamental steps of EBM—Ask, Acquire, Appraise, Apply, and Assess. The review not only identifies current limitations within the field but also proposes directions for future research, emphasizing the potential for NLP to revolutionize EBM by refining evidence extraction, evidence synthesis, appraisal, summarization, enhancing data comprehensibility, and facilitating a more efficient clinical workflow.
pdf
bib
abs
How do Transformer Embeddings Represent Compositions? A Functional Analysis
Aishik Nagar
|
Ishaan Singh Rawal
|
Mansi Dhanania
|
Cheston Tan
Compositionality is a key aspect of human intelligence, essential for reasoning and generalization. While transformer-based models have become the de facto standard for many language modeling tasks, little is known about how they represent compound words, and whether these representations are compositional. In this study, we test compositionality in Mistral, OpenAI Large, and Google embedding models, and compare them with BERT. First, we evaluate compositionality in the representations by examining six diverse models of compositionality (addition, multiplication, dilation, regression, etc.). We find that ridge regression, albeit linear, best accounts for compositionality. Surprisingly, we find that the classic vector addition model performs almost as well as any other model. Next, we verify that most embedding models are highly compositional, while BERT shows much poorer compositionality. We verify and visualize our findings with a synthetic dataset consisting of fully transparent adjective-noun compositions. Overall, we present a thorough investigation of compositionality.
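The vector addition model named in this abstract predicts the embedding of a compound as the sum of its constituents' embeddings and can be scored by cosine similarity against the observed compound embedding. The sketch below uses made-up three-dimensional vectors purely for illustration; real embeddings would come from the models the abstract lists, and the paper's exact evaluation protocol is not reproduced here.

```python
import math
from typing import List


def additive_composition(first: List[float], second: List[float]) -> List[float]:
    """Predict a compound embedding as the element-wise sum of its parts."""
    return [x + y for x, y in zip(first, second)]


def cosine_similarity(u: List[float], v: List[float]) -> float:
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)


# Toy embeddings (hypothetical values, not produced by any real model).
red = [1.0, 0.0, 0.5]
car = [0.0, 1.0, 0.5]
red_car_observed = [0.9, 1.1, 1.0]

red_car_predicted = additive_composition(red, car)
score = cosine_similarity(red_car_predicted, red_car_observed)
print(round(score, 4))
```

A score near 1 means the additive model explains the observed compound embedding well; comparing such scores across composition functions is one way to frame the question the abstract studies.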
pdf
bib
abs
Entriever: Energy-based Retriever for Knowledge-Grounded Dialog Systems
Yucheng Cai
|
Ke Li
|
Yi Huang
|
Junlan Feng
|
Zhijian Ou
The retriever, which retrieves relevant knowledge pieces from a knowledge base given a context, is an important component in many natural language processing (NLP) tasks. Retrievers have been introduced in knowledge-grounded dialog systems to improve knowledge acquisition. In knowledge-grounded dialog systems, when conditioning on a given context, there may be multiple relevant and correlated knowledge pieces. However, knowledge pieces are usually assumed to be conditionally independent in current retriever models. To address this issue, we propose Entriever, an energy-based retriever. The Entriever directly models the candidate retrieval results as a whole instead of modeling the knowledge pieces separately, with the relevance score defined by an energy function. We explore various architectures of energy functions and different training methods for Entriever, and show that Entriever substantially outperforms the strong cross-encoder baseline in knowledge retrieval tasks. Furthermore, we show that in semi-supervised training of knowledge-grounded dialog systems, Entriever enables the effective scoring of retrieved knowledge pieces and leads to a significant improvement in the end-to-end performance of the dialog system.
pdf
bib
abs
MONTROSE: LLM-driven Monte Carlo Tree Search Self-Refinement for Cross-Domain Rumor Detection
Shanshan Liu
|
Menglong Lu
|
Zhen Huang
|
Zejiang He
|
Liu Liu
|
Zhigang Sun
|
Dongsheng Li
With the emergence of new topics on social media as sources of rumor dissemination, addressing the distribution shifts between source and target domains remains a crucial task in cross-domain rumor detection. Existing feature alignment methods, which aim to reduce the discrepancies between domains, are often susceptible to task interference during training. Additionally, data distribution alignment methods, which rely on existing data to synthesize new training samples, inherently introduce noise. To deal with these challenges, a new cross-domain rumor detection method, MONTROSE, is proposed. It combines LLM-driven Monte Carlo Tree Search (MCTS) data synthesis to generate high-quality synthetic data for the target domain and a domain-sharpness-aware (DSAM) self-refinement approach to train rumor detection models with these synthetic data effectively. Experiments demonstrate the superior performance of MONTROSE in cross-domain rumor detection.
pdf
bib
abs
PEToolLLM: Towards Personalized Tool Learning in Large Language Models
Qiancheng Xu
|
Yongqi Li
|
Heming Xia
|
Fan Liu
|
Min Yang
|
Wenjie Li
Tool learning has emerged as a promising direction by extending Large Language Models’ (LLMs) capabilities with external tools. Existing tool learning studies primarily focus on the general-purpose tool-use capability, which addresses explicit user requirements in instructions. However, they overlook the importance of personalized tool-use capability, leading to an inability to handle implicit user preferences. To address this limitation, we first formulate the task of personalized tool learning, which integrates a user’s interaction history towards personalized tool usage. To fill the gap of missing benchmarks, we construct PEToolBench, featuring diverse user preferences reflected in interaction history under three distinct personalized settings, and encompassing a wide range of tool-use scenarios. Moreover, we propose a framework PEToolLLaMA to adapt LLMs to the personalized tool learning task, which is trained through supervised fine-tuning and direct preference optimization. Extensive experiments on PEToolBench demonstrate the superiority of PEToolLLaMA over existing LLMs. We release code and data at https://github.com/travis-xu/PEToolBench.
pdf
bib
abs
A Comprehensive Graph Framework for Question Answering with Mode-Seeking Preference Alignment
Quanwei Tang
|
Sophia Yat Mei Lee
|
Junshuang Wu
|
Dong Zhang
|
Shoushan Li
|
Erik Cambria
|
Guodong Zhou
Recent advancements in retrieval-augmented generation (RAG) have enhanced large language models in question answering by integrating external knowledge. However, challenges persist in achieving global understanding and aligning responses with human ethical and quality preferences. To address these issues, we propose GraphMPA, a comprehensive graph-based framework with mode-seeking preference alignment. Our approach constructs a hierarchical document graph using a general similarity measurement, mimicking human cognitive processes for information understanding and synthesis. Additionally, we introduce mode-seeking preference optimization to better align model outputs with human preferences through probability-matching constraints. Extensive experiments on six datasets demonstrate the effectiveness of our GraphMPA.
pdf
bib
abs
A MISMATCHED Benchmark for Scientific Natural Language Inference
Firoz Shaik
|
Mobashir Sadat
|
Nikita Gautam
|
Doina Caragea
|
Cornelia Caragea
Scientific Natural Language Inference (NLI) is the task of predicting the semantic relation between a pair of sentences extracted from research articles. Existing datasets for this task are derived from various computer science (CS) domains, whereas non-CS domains are completely ignored. In this paper, we introduce a novel evaluation benchmark for scientific NLI, called MisMatched. The new MisMatched benchmark covers three non-CS domains (Psychology, Engineering, and Public Health) and contains 2,700 human annotated sentence pairs. We establish strong baselines on MisMatched using both Pre-trained Small Language Models (SLMs) and Large Language Models (LLMs). Our best performing baseline shows a Macro F1 of only 78.17%, illustrating the substantial headroom for future improvements. In addition to introducing the MisMatched benchmark, we show that incorporating sentence pairs having an implicit scientific NLI relation between them in model training improves their performance on scientific NLI. We make our dataset and code publicly available on GitHub.
pdf
bib
abs
TagRouter: Learning Route to LLMs through Tags for Open-Domain Text Generation Tasks
Zhou Chen
|
Zhiqiang Wei
|
Yuqi Bai
|
Xue Xiong
|
Jianmin Wu
Model routing allocates queries to the suitable model, improving system performance while reducing costs. However, existing routing methods face practical limitations that hinder scalability in large-scale applications and struggle to keep up with the rapid growth of the large language model (LLM) ecosystem. To tackle these challenges, we propose TagRouter, a training-free model routing method designed to optimize the synergy among multiple LLMs for open-domain text generation tasks. Experimental results demonstrate that TagRouter outperforms 13 baseline methods, increasing the accept rate of the system by 6.15% and reducing costs by 17.20%, achieving optimal cost-efficiency. Our findings provide the LLM community with an efficient and scalable solution for model ensembling, offering users an evolvable “super model.”
pdf
bib
abs
The Reasoning-Memorization Interplay in Language Models Is Mediated by a Single Direction
Yihuai Hong
|
Meng Cao
|
Dian Zhou
|
Lei Yu
|
Zhijing Jin
Large language models (LLMs) excel on a variety of reasoning benchmarks, but previous studies suggest they sometimes struggle to generalize to unseen questions, potentially due to over-reliance on memorized training examples. However, the precise conditions under which LLMs switch between reasoning and memorization during text generation remain unclear. In this work, we provide a mechanistic understanding of LLMs’ reasoning-memorization dynamics by identifying a set of linear features in the model’s residual stream that govern the balance between genuine reasoning and memory recall. These features not only distinguish reasoning tasks from memory-intensive ones but can also be manipulated to causally influence model performance on reasoning tasks. Additionally, we show that intervening in these reasoning features helps the model more accurately activate the most relevant problem-solving capabilities during answer generation. Our findings offer new insights into the underlying mechanisms of reasoning and memory in LLMs and pave the way for the development of more robust and interpretable generative AI systems. Our code and data are at https://github.com/yihuaihong/Linear_Reasoning_Memory_Features.
pdf
bib
abs
MPBench: A Comprehensive Multimodal Reasoning Benchmark for Process Errors Identification
xu Zhao Pan
|
Pengfei Zhou
|
Jiaxin Ai
|
Wangbo Zhao
|
Kai Wang
|
Xiaojiang Peng
|
Wenqi Shao
|
Hongxun Yao
|
Kaipeng Zhang
Reasoning is an essential capacity for large language models (LLMs) to address complex tasks, whereas the identification of process errors is vital for improving this ability. Recently, process-level reward models (PRMs) were proposed to provide step-wise rewards that facilitate reinforcement learning and data production during training and guide LLMs toward correct steps during inference, thereby improving reasoning accuracy. However, existing benchmarks of PRMs are text-based and focus on error detection, neglecting other scenarios like reasoning search. To address this gap, we introduce MPBench, a comprehensive, multi-task, multimodal benchmark designed to systematically assess the effectiveness of PRMs in diverse scenarios. MPBench employs three evaluation paradigms, each targeting a specific role of PRMs in the reasoning process: (1) Step Correctness, which assesses the correctness of each intermediate reasoning step; (2) Answers Aggregation, which aggregates multiple solutions and selects the best one; and (3) Reasoning Process Search, which guides the search for optimal reasoning steps during inference. Through these paradigms, MPBench makes comprehensive evaluations and provides insights into the development of multimodal PRMs.
pdf
bib
abs
CRAB: Cross-environment Agent Benchmark for Multimodal Language Model Agents
Tianqi Xu
|
Linyao Chen
|
Dai-Jie Wu
|
Yanjun Chen
|
Zecheng Zhang
|
Xiang Yao
|
Zhiqiang Xie
|
Yongchao Chen
|
Shilong Liu
|
Bochen Qian
|
Anjie Yang
|
Zhaoxuan Jin
|
Jianbo Deng
|
Philip Torr
|
Bernard Ghanem
|
Guohao Li
The development of autonomous agents increasingly relies on Multimodal Language Models (MLMs) to perform tasks described in natural language with GUI environments, such as websites, desktop computers, or mobile phones. Existing benchmarks for MLM agents in interactive environments are limited by their focus on a single environment, lack of detailed and generalized evaluation methods, and the complexities of constructing tasks and evaluators. To overcome these limitations, we introduce CRAB, the first cross-environment agent benchmark framework, incorporating a graph-based fine-grained evaluation method and an efficient task generation method. Our framework supports multiple devices and can be easily extended to any environment with a Python interface. Leveraging CRAB, we develop CRAB Benchmark-v0 comprising 120 tasks in computer desktop and mobile phone environments. We evaluated 6 advanced MLMs using different single and multi-agent system configurations on this benchmark. The experimental results demonstrate that the single agent with GPT-4o achieves the best completion ratio of 38.01%.
pdf
bib
abs
Towards A “Novel” Benchmark: Evaluating Literary Fiction with Large Language Models
Wenqing Wang
|
Mingqi Gao
|
Xinyu Hu
|
Xiaojun Wan
Current exploration on creative generation focuses mainly on short stories, poetry, and scripts. With the expansion of Large Language Models’ (LLMs) context windows, “novel” avenues emerge. This study aims to extend the boundaries of Natural Language Generation (NLG) evaluation by exploring LLMs’ capabilities in more challenging long-form fiction. We propose a new multi-level evaluation framework that incorporates ten metrics across the Macro, Meso, and Micro levels. An annotated fiction dataset, sourced from human authors, LLMs, and human-AI collaborations in both English and Chinese, is then constructed. Human evaluation reveals notable disparities between LLM-generated and human-authored fiction, particularly the “high-starting, low-ending” pattern in LLM outputs. We further probe ten high-performing LLMs through different prompt templates, achieving moderate correlations by strategically utilizing diverse LLMs tailored to different levels, as an initial step towards better automatic fiction evaluation. Finally, we offer a fine-grained analysis of LLMs’ capabilities through six issues, providing promising insights for future advancements.
pdf
bib
abs
A Reinforcement Learning Framework for Cross-Lingual Stance Detection Using Chain-of-Thought Alignment
Binghui Li
|
Minghui Zou
|
Xiaowang Zhang
|
Shizhan Chen
|
Zhiyong Feng
Cross-lingual stance detection identifies users’ attitudes toward specific targets in texts by transferring knowledge from source languages to target languages. Previous studies have typically facilitated this transfer by translating and aligning labels or targets. However, these methods cannot effectively perform cross-lingual transfer of the complex reasoning processes in stance detection. To address this challenge, we propose a reinforcement learning framework using cross-lingual Chain-of-Thought (CoT) alignment, referred to as RCCA. Specifically, we adopt a cross-lingual CoT alignment strategy to obtain the high-quality CoTs generated from target language inputs. After that, we leverage reinforcement learning by sampling CoTs and assigning rewards according to predefined rules, aiming to enhance the model’s generalization capabilities in the target language. Experimental results on four multilingual datasets demonstrate that our approach outperforms competitive methods.
pdf
bib
abs
CARE-STaR: Constraint-aware Self-taught Reasoner
Zhiliang Li
|
Bo Tang
|
Yijun Niu
|
Beihong Jin
|
Qiwen Shi
|
Yuchen Feng
|
Zhiyu Li
|
Jie Hu
|
Mingchuan Yang
|
Feiyu Xiong
In real-world applications, large language models (LLMs) often need to handle diverse and complex instructions. Specifically, when instructions are subject to multiple constraints, some of which are somewhat ambiguous, LLMs often fail to produce answers that satisfy all constraints, limiting their effectiveness in various tasks. To address this challenge, we examine the different constraints in the instructions and discover that the difficulty of determining whether an answer meets a constraint varies widely, from extremely straightforward to exceptionally perplexing. Correspondingly, we propose to assign constraints to different constraint levels. Furthermore, inspired by chain-of-thought (CoT) and self-taught reasoner (STaR), we propose a two-stage method named CARE-STaR (Constraint-AwaRE STaR). Our method distinguishes constraints within instructions by generating different CoTs and guides LLMs to autonomously learn optimal answers by setting the positive rewards to the CoTs that are beneficial to generating accurate answers and iteratively optimizing these answers. We have conducted extensive experiments on three instruction-following benchmarks, taking three existing LLMs as base LLMs, respectively. Experimental results indicate that our method substantially enhances the capability of these LLMs to handle complex instructions, outperforming supervised fine-tuning (SFT). Our code is available at https://github.com/lzl0124/carestar.
pdf
bib
abs
Is It JUST Semantics? A Case Study of Discourse Particle Understanding in LLMs
William Berkeley Sheffield
|
Kanishka Misra
|
Valentina Pyatkin
|
Ashwini Deo
|
Kyle Mahowald
|
Junyi Jessy Li
Discourse particles are crucial elements that subtly shape the meaning of text. These words, often polyfunctional, give rise to nuanced and often quite disparate semantic/discourse effects, as exemplified by the diverse uses of the particle *just* (e.g., exclusive, temporal, emphatic). This work investigates the capacity of LLMs to distinguish the fine-grained senses of English *just*, a well-studied example in formal semantics, using data meticulously created and labeled by expert linguists. Our findings reveal that while LLMs exhibit some ability to differentiate between broader categories, they struggle to fully capture more subtle nuances, highlighting a gap in their understanding of discourse particles.
pdf
bib
abs
War of Thoughts: Competition Stimulates Stronger Reasoning in Large Language Models
Yibin Chen
|
Jinyi Liu
|
Yan Zheng
|
Yifu Yuan
|
Jianye Hao
Recent advances in Large Language Models (LLMs) have reshaped the landscape of reasoning tasks, particularly through test-time scaling (TTS) to enhance LLM reasoning. Prior research has used structures such as trees or graphs to guide LLMs in searching for optimal solutions. These methods are time-consuming and require a strong reward model (RM) to support effective solution space exploration. Tournament-style approaches eliminate the reliance on RMs through comparative evaluation but suffer from transitivity dilemmas, leading to unstable ordering. To address these issues, we propose War of Thoughts (**WoT**), a novel post-hoc method that enhances reasoning without finetuning. WoT comprises two distinct stages: (1) *Exploration*, in which diverse and meaningful candidate solutions are generated through contrastive demonstrations and multi-granularity reasoning specifications; and (2) *Competition*, where these candidate solutions are subjected to multiple rounds of matchups within a competitive arena. Throughout this iterative process, the solutions are optimized and improved, with the optimal solution being determined based on Elo ratings. Extensive experiments across various LLMs demonstrate the superiority of WoT, surpassing baselines by **10–30%**. WoT can effectively stimulate stronger reasoning abilities, achieving impressive TTS performance in both generation budget and model size. It shows higher scalability efficiency compared to the baseline within the same budget. Notably, WoT exhibits excellent scalability with model size, even outperforming a 72B model despite using a 7B model.
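The abstract states that the final solution is chosen by Elo ratings after repeated matchups. The sketch below shows the standard Elo update applied to a list of pairwise outcomes. The K-factor of 32, the initial rating of 1000, and the helper `rank_candidates` are illustrative assumptions and are not taken from the paper, which may organize its arena differently.

```python
def expected_score(r_a, r_b):
    """Probability that A beats B under the standard Elo logistic model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def elo_update(r_a, r_b, score_a, k=32.0):
    """Return updated (r_a, r_b); score_a is 1.0 for a win, 0.5 draw, 0.0 loss."""
    e_a = expected_score(r_a, r_b)
    new_a = r_a + k * (score_a - e_a)
    new_b = r_b + k * ((1.0 - score_a) - (1.0 - e_a))
    return new_a, new_b

def rank_candidates(matches, candidates, initial=1000.0, k=32.0):
    """Replay (winner, loser) matchups in order and sort candidates by rating."""
    ratings = {c: initial for c in candidates}
    for winner, loser in matches:
        ratings[winner], ratings[loser] = elo_update(
            ratings[winner], ratings[loser], 1.0, k
        )
    return sorted(ratings, key=ratings.get, reverse=True)

# Hypothetical arena: three candidate solutions and three judged matchups.
order = rank_candidates([("a", "b"), ("a", "c"), ("b", "c")], ["a", "b", "c"])
print(order)
```

Because each update depends on the current rating gap, an upset win moves ratings more than an expected one, which is one way repeated matchups can yield a more stable ordering than a single-elimination bracket.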
pdf
bib
abs
Does Rationale Quality Matter? Enhancing Mental Disorder Detection via Selective Reasoning Distillation
Hoyun Song
|
Huije Lee
|
Jisu Shin
|
Sukmin Cho
|
Changgeon Ko
|
Jong C. Park
The detection of mental health problems from social media and the interpretation of these results have been extensively explored. Research has shown that incorporating clinical symptom information into a model enhances domain expertise, improving its detection and interpretation performance. While large language models (LLMs) are shown to be effective for generating explanatory rationales in mental health detection, their very large parameter counts and high computational cost limit their practicality. Reasoning distillation transfers this ability to smaller language models (SLMs), but inconsistencies in the relevance and domain alignment of LLM-generated rationales pose a challenge. This paper investigates how rationale quality impacts SLM performance in mental health detection and explanation generation. We hypothesize that ensuring high-quality and domain-relevant rationales enhances the distillation. To this end, we propose a framework that selects rationales based on their alignment with expert clinical reasoning. Experiments show that our quality-focused approach significantly enhances SLM performance in both mental disorder detection and rationale generation. This work highlights the importance of rationale quality and offers an insightful framework for knowledge transfer in mental health applications.
pdf
bib
abs
Rethinking Table Instruction Tuning
Naihao Deng
|
Rada Mihalcea
Recent advances in table understanding have focused on instruction-tuning large language models (LLMs) for table-related tasks. However, existing research has overlooked the impact of hyperparameter choices, and also lacks a comprehensive evaluation of the out-of-domain table understanding ability and the general capabilities of these table LLMs. In this paper, we evaluate these abilities in existing table LLMs, and find significant declines in both out-of-domain table understanding and general capabilities as compared to their base models. Through systematic analysis, we show that hyperparameters, such as learning rate, can significantly influence both table-specific and general capabilities. Contrary to the previous table instruction-tuning work, we demonstrate that smaller learning rates and fewer training instances can enhance table understanding while preserving general capabilities. Based on our findings, we introduce TAMA, a TAble LLM instruction-tuned from LLaMA 3.1 8B Instruct, which achieves performance on par with, or surpassing GPT-3.5 and GPT-4 on table tasks, while maintaining strong out-of-domain generalization and general capabilities. Our findings highlight the potential for reduced data annotation costs and more efficient model development through careful hyperparameter selection. We open-source the project and our models.
pdf
bib
abs
CliniDial: A Naturally Occurring Multimodal Dialogue Dataset for Team Reflection in Action During Clinical Operation
Naihao Deng
|
Kapotaksha Das
|
Rada Mihalcea
|
Vitaliy Popov
|
Mohamed Abouelenien
In clinical operations, teamwork can be the crucial factor that determines the final outcome. Prior studies have shown that sufficient collaboration is the key factor that determines the outcome of an operation. To understand how the team practices teamwork during the operation, we collected **CliniDial** from simulations of medical operations. **CliniDial** includes the audio data and its transcriptions, the simulated physiology signals of the patient manikins, and how the team operates from two camera angles. We annotate behavior codes following an existing framework to understand the teamwork process for **CliniDial**. We pinpoint three main characteristics of our dataset, including its label imbalances, rich and natural interactions, and multiple modalities, and conduct experiments to test existing LLMs’ capabilities on handling data with these characteristics. Experimental results show that **CliniDial** poses significant challenges to the existing models, inviting future effort on developing methods that can deal with real-world clinical data. We open-source the codebase at https://github.com/MichiganNLP/CliniDial.
pdf
bib
abs
Chumor 2.0: Towards Better Benchmarking Chinese Humor Understanding from 弱智吧 (Ruo Zhi Ba)
Ruiqi He
|
Yushu He
|
Longju Bai
|
Jiarui Liu
|
Zhenjie Sun
|
Zenghao Tang
|
He Wang
|
Hanchen Xia
|
Rada Mihalcea
|
Naihao Deng
Existing humor datasets and evaluations predominantly focus on English, leaving limited resources for culturally nuanced humor in non-English languages like Chinese. To address this gap, we construct **Chumor**, the first and the largest Chinese humor explanation dataset. **Chumor** is sourced from Ruo Zhi Ba (RZB, 弱智吧), a Chinese Reddit-like platform known for sharing intellectually challenging and culturally specific jokes. We test ten LLMs through direct and chain-of-thought prompting, revealing that **Chumor** poses significant challenges to existing LLMs, with their accuracy slightly above random and far below human. In addition, our analysis highlights that human-annotated humor explanations are significantly better than those generated by GPT-4o and ERNIE4-turbo. We release **Chumor** at https://huggingface.co/datasets/MichiganNLP/Chumor , our project page is at https://github.com/MichiganNLP/Chumor-2.0 , our leaderboard is at https://huggingface.co/spaces/MichiganNLP/Chumor-leaderboard , and our codebase is at https://github.com/MichiganNLP/Chumor-2.0 .
pdf
bib
abs
Explicit Bayesian Inference to Uncover the Latent Themes of Large Language Models
Raymond Li
|
Chuyuan Li
|
Gabriel Murray
|
Giuseppe Carenini
Large language models (LLMs) have demonstrated impressive generative capabilities, yet their inner mechanisms remain largely opaque. In this work, we introduce a novel approach to interpret LLMs’ generation process through the lens of an explicit Bayesian framework by inferring latent topic variables via variational inference. Specifically, we leverage a variational autoencoder-based neural topic model to dynamically approximate the posterior distribution over the high-level latent topic variables at each generation step. By reconstructing the LLM’s next-token predictions through these latent topics and maintaining a regularized latent space, our method not only yields interpretable and diverse topic representations but also effectively captures semantic shifts throughout the text. We validate our approach on multiple datasets, showing that our latent topics outperform state-of-the-art topic models on intrinsic measures of coherence and diversity. Furthermore, we demonstrate the utility of our approach in downstream applications by using the inferred topic distributions to retrieve relevant demonstration examples for in-context learning, resulting in significant gains on classification and summarization tasks.
pdf
bib
abs
Improving Occupational ISCO Classification of Multilingual Swiss Job Postings with LLM-Refined Training Data
Ann-Sophie Gnehm
|
Simon Clematide
Classifying occupations in multilingual job postings is challenging due to noisy labels, language variation, and domain-specific terminology. We present a method that refines silver-standard ISCO labels by consolidating them with predictions from pre-fine-tuned models, using large language model (LLM) evaluations to resolve discrepancies. The refined labels are used in Multiple Negatives Ranking (MNR) training for SentenceBERT-based classification. This approach substantially improves performance, raising Top-1 accuracy on silver data from 37.2% to 58.3% and reaching up to 80% precision on held-out data, a gain of over 30 points validated by both GPT and human raters. The model benefits from cross-lingual transfer, with particularly strong gains in French and Italian. These results demonstrate that LLM-guided label refinement can substantially improve multilingual occupation classification in fine-grained taxonomies such as CH-ISCO with 670 classes.
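For readers unfamiliar with Multiple Negatives Ranking training, the following is a minimal NumPy sketch of the loss as it is commonly formulated: each anchor is scored against every positive in the batch, and the other pairs' positives act as negatives. The scale of 20, the toy vectors, and the function name `mnr_loss` are assumptions for illustration; the paper's actual training setup is not described in the abstract.

```python
import numpy as np

def mnr_loss(anchors, positives, scale=20.0):
    """In-batch softmax cross-entropy over scaled cosine similarities.

    Row i of `anchors` is paired with row i of `positives`; every other
    row of `positives` serves as a negative for anchor i.
    """
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = scale * (a @ p.T)                           # (batch, batch) similarities
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))           # correct pairs lie on the diagonal

# Toy batch (hypothetical): four orthogonal "posting" and "label" embeddings.
postings = np.eye(4)
loss_aligned = mnr_loss(postings, np.eye(4))                       # correct pairing
loss_shuffled = mnr_loss(postings, np.roll(np.eye(4), 1, axis=0))  # wrong pairing
print(loss_aligned, loss_shuffled)
```

The loss is near zero when each posting is closest to its own label and large when the pairing is wrong, which is why cleaner labels matter for this objective: a mislabeled pair is pushed together while its true match is treated as a negative.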
pdf
bib
abs
Brevity is the soul of sustainability: Characterizing LLM response lengths
Soham Poddar
|
Paramita Koley
|
Janardan Misra
|
Niloy Ganguly
|
Saptarshi Ghosh
A significant portion of the energy consumed by Large Language Models (LLMs) arises from their inference processes; hence developing energy-efficient methods for inference is crucial. While several techniques exist for inference optimization, output compression remains relatively unexplored, with only a few preliminary efforts addressing this aspect. In this work, we first benchmark 12 decoder-only LLMs across 5 datasets, revealing that these models often produce responses that are substantially longer than necessary. We then conduct a comprehensive quality assessment of LLM responses, formally defining six information categories present in LLM responses. We show that LLMs often tend to include redundant or additional information besides the minimal answer. To address this issue of long responses by LLMs, we explore several simple and intuitive prompt-engineering strategies. Empirical evaluation shows that appropriate prompts targeting length reduction and controlling information content can achieve significant energy savings of 25-60% by reducing the response length while preserving the quality of LLM responses.
pdf
bib
abs
Adversarial Preference Learning for Robust LLM Alignment
Yuanfu Wang
|
Pengyu Wang
|
Chenyang Xi
|
Bo Tang
|
Junyi Zhu
|
Wenqiang Wei
|
Chen Chen
|
Chao Yang
|
Jingfeng Zhang
|
Chaochao Lu
|
Yijun Niu
|
Keming Mao
|
Zhiyu Li
|
Feiyu Xiong
|
Jie Hu
|
Mingchuan Yang
Modern language models often rely on Reinforcement Learning from Human Feedback (RLHF) to encourage safe behaviors. However, they remain vulnerable to adversarial attacks due to three key limitations: (1) the inefficiency and high cost of human annotation, (2) the vast diversity of potential adversarial attacks, and (3) the risk of feedback bias and reward hacking. To address these challenges, we introduce Adversarial Preference Learning (APL), an iterative adversarial training method incorporating three key innovations. First, a direct harmfulness metric based on the model’s intrinsic preference probabilities, eliminating reliance on external assessment. Second, a conditional generative attacker that synthesizes input-specific adversarial variations. Third, an iterative framework with automated closed-loop feedback, enabling continuous adaptation through vulnerability discovery and mitigation. Experiments on Mistral-7B-Instruct-v0.3 demonstrate that APL significantly enhances robustness, achieving 83.33% harmlessness win rate over the base model (evaluated by GPT-4o), reducing harmful outputs from 5.88% to 0.43% (measured by LLaMA-Guard), and lowering attack success rate by up to 65% according to HarmBench. Notably, APL maintains competitive utility, with an MT-Bench score of 6.59 (comparable to the baseline 6.78) and an LC-WinRate of 46.52% against the base model.
pdf
bib
abs
gMBA: Expression Semantic Guided Mixed Boolean-Arithmetic Deobfuscation Using Transformer Architectures
Youjeong Roh
|
Joon-Young Paik
|
Jingun Kwon
|
Eun-Sun Cho
Mixed Boolean-Arithmetic (MBA) obfuscation protects intellectual property by converting programs into forms that are more complex to analyze. However, MBA has been increasingly exploited by malware developers to evade detection and cause significant real-world problems. Traditional MBA deobfuscation methods often consider these expressions as part of a black box and overlook their internal semantic information. To bridge this gap, we propose a truth table, which is an automatically constructed semantic representation of an expression’s behavior that does not rely on external resources. The truth table is a mathematical form that represents the output of expression for all possible combinations of input. We also propose a general and extensible guided MBA deobfuscation framework (gMBA) that modifies a Transformer-based neural encoder-decoder Seq2Seq architecture to incorporate this semantic guidance. Experimental results and in-depth analysis show that integrating expression semantics significantly improves performance and highlights the importance of internal semantic expressions in recovering obfuscated code to its original form.
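To make the truth-table idea concrete, the sketch below enumerates every input combination for a small bit width and records the expression's output, then uses the table to confirm that an obfuscated expression matches its simple form. The bit-width handling, the function name `truth_table`, and the example identity are assumptions for illustration; the abstract does not specify how gMBA constructs or encodes its tables.

```python
from itertools import product

def truth_table(expr, num_vars, bits=1):
    """Outputs of `expr` for all input combinations, reduced modulo 2**bits."""
    mask = (1 << bits) - 1
    return [
        expr(*values) & mask
        for values in product(range(1 << bits), repeat=num_vars)
    ]

# A well-known MBA identity: x + y == (x ^ y) + 2 * (x & y).
simple = lambda x, y: x + y
obfuscated = lambda x, y: (x ^ y) + 2 * (x & y)

print(truth_table(simple, 2))      # one-bit inputs: (0,0), (0,1), (1,0), (1,1)
print(truth_table(obfuscated, 2))  # identical table for the obfuscated form
```

Two expressions with the same table over the chosen bit width behave identically on those inputs, so the table can serve as a compact semantic signature that is independent of how the expression is written.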
pdf
bib
abs
READoc: A Unified Benchmark for Realistic Document Structured Extraction
Zichao Li
|
Aizier Abulaiti
|
Yaojie Lu
|
Xuanang Chen
|
Jia Zheng
|
Hongyu Lin
|
Xianpei Han
|
Shanshan Jiang
|
Bin Dong
|
Le Sun
Document Structured Extraction (DSE) aims to extract structured content from raw documents. Despite the emergence of numerous DSE systems, their unified evaluation remains inadequate, significantly hindering the field’s advancement. This problem is largely attributed to existing benchmark paradigms, which exhibit fragmented and localized characteristics. To offer a thorough evaluation of DSE systems, we introduce a novel benchmark named READoc, which defines DSE as a realistic task of converting unstructured PDFs into semantically rich Markdown. The READoc dataset is derived from 3,576 diverse and real-world documents from arXiv, GitHub, and Zenodo. In addition, we develop a DSE Evaluation S³uite comprising Standardization, Segmentation and Scoring modules, to conduct a unified evaluation of state-of-the-art DSE approaches. By evaluating a range of pipeline tools, expert visual models, and general Vision-Language Models, we identify the gap between current work and the unified, realistic DSE objective for the first time. We hope that READoc will catalyze future research in DSE, fostering more comprehensive and practical solutions.
pdf
bib
abs
TicTac: Time-aware Supervised Fine-tuning for Automatic Text Dating
Han Ren
|
Minna Peng
Pre-trained language models have achieved success in many natural language processing tasks, but they are constrained by a time-agnostic setting, which hurts their performance in automatic text dating. This paper introduces TicTac, a supervised fine-tuning model for automatic text dating. Unlike existing models that ignore the temporal relatedness of documents, TicTac has the ability to learn temporal semantic information, which is helpful for capturing the temporal implications over long-time span corpora. As a fine-tuning framework, TicTac employs a contrastive learning-based approach to model two types of temporal relations of diachronic documents. TicTac also adopts a metric learning approach, where the temporal distance between a historical text and its category label is estimated, which helps the model learn temporal semantic information on texts with temporal ordering. Experiments on two diachronic corpora show that our model effectively captures the temporal semantic information and outperforms state-of-the-art baselines.
pdf
bib
abs
Dolphin: Document Image Parsing via Heterogeneous Anchor Prompting
Hao Feng
|
Shu Wei
|
Xiang Fei
|
Wei Shi
|
Yingdong Han
|
Lei Liao
|
Jinghui Lu
|
Binghong Wu
|
Qi Liu
|
Chunhui Lin
|
Jingqun Tang
|
Hao Liu
|
Can Huang
Document image parsing is challenging due to its complexly intertwined elements such as text paragraphs, figures, formulas, and tables. Current approaches either assemble specialized expert models or directly generate page-level content autoregressively, facing integration overhead, efficiency bottlenecks, and layout structure degradation despite their decent performance. To address these limitations, we present Dolphin (Document Image Parsing via Heterogeneous Anchor Prompting), a novel multimodal document image parsing model following an analyze-then-parse paradigm. In the first stage, Dolphin generates a sequence of layout elements in reading order. These heterogeneous elements, serving as anchors and coupled with task-specific prompts, are fed back to Dolphin for parallel content parsing in the second stage. To train Dolphin, we construct a large-scale dataset of over 30 million samples, covering multi-granularity parsing tasks. Through comprehensive evaluations on both prevalent benchmarks and self-constructed ones, Dolphin achieves state-of-the-art performance across diverse page-level and element-level settings, while ensuring superior efficiency through its lightweight architecture and parallel parsing mechanism. The code and pre-trained models are publicly available at https://github.com/ByteDance/Dolphin
pdf
bib
abs
FanChuan: A Multilingual and Graph-Structured Benchmark For Parody Detection and Analysis
Yilun Zheng
|
Sha Li
|
Fangkun Wu
|
Yang Ziyi
|
Lin Hongchao
|
Zhichao Hu
|
Cai Xinjun
|
Ziming Wang
|
Jinxuan Chen
|
Sitao Luan
|
Jiahao Xu
|
Lihui Chen
Parody is an emerging phenomenon on social media, where individuals imitate a role or position opposite to their own, often for humor, provocation, or controversy. Detecting and analyzing parody can be challenging and is often reliant on context, yet it plays a crucial role in understanding cultural values, promoting subcultures, and enhancing self-expression. However, the study of parody is hindered by limited available data and deficient diversity in current datasets. To bridge this gap, we built seven parody datasets from both English and Chinese corpora, with 14,755 annotated users and 21,210 annotated comments in total. To provide sufficient context, we also collect replies and construct user-interaction graphs, which are lacking in existing datasets. With these datasets, we test traditional methods and Large Language Models (LLMs) on three key tasks: (1) parody detection, (2) comment sentiment analysis with parody, and (3) user sentiment analysis with parody. Our extensive experiments reveal that parody-related tasks remain challenging for all models, and contextual information plays a critical role. Interestingly, we find that, in certain scenarios, traditional sentence embedding methods combined with simple classifiers can outperform advanced LLMs, i.e., DeepSeek-R1 and GPT-o3, highlighting parody as a significant challenge for LLMs.
pdf
bib
abs
P-CoT: A Pedagogically-motivated Participatory Chain-of-Thought Prompting for Phonological Reasoning in LLMs
Dongjun Jang
|
Youngchae Ahn
|
Hyopil Shin
This study explores the potential of phonological reasoning within text-based large language models (LLMs). Utilizing the PhonologyBench benchmark, we assess tasks like rhyme word generation, g2p conversion, and syllable counting. Our evaluations across 12 LLMs reveal that while few-shot learning offers inconsistent gains, the introduction of a novel Pedagogically-motivated Participatory Chain-of-Thought (P-CoT) prompt, which is anchored in educational theories like scaffolding and discovery learning, consistently enhances performance. This method leverages structured guidance to activate latent phonological abilities, achieving up to 52% improvement and even surpassing human baselines in certain tasks. Future work could aim to optimize P-CoT prompts for specific models or explore their application across different linguistic domains.
pdf
bib
abs
DynaCode: A Dynamic Complexity-Aware Code Benchmark for Evaluating Large Language Models in Code Generation
Wenhao Hu
|
Jinhao Duan
|
Chunchen Wei
|
Li Zhang
|
Yue Zhang
|
Kaidi Xu
The rapid advancement of large language models (LLMs) has significantly improved their performance in code generation tasks. However, existing code benchmarks remain static, consisting of fixed datasets with predefined problems. This makes them vulnerable to memorization during training, where LLMs recall specific test cases instead of generalizing to new problems, leading to data contamination and unreliable evaluation results. To address these issues, we introduce DynaCode, a dynamic, complexity-aware benchmark that overcomes the limitations of static datasets. DynaCode evaluates LLMs systematically using a complexity-aware metric, incorporating both code complexity and call-graph structures. DynaCode achieves large-scale diversity, generating up to 189 million unique nested code problems across 4 units of code complexity and 16 types of call graphs. Results on 12 latest LLMs show an average performance drop of 16.8 to 45.7 compared to MBPP+, with performance progressively decreasing as complexity increases. This demonstrates DynaCode’s ability to effectively differentiate model performance based on code complexity and how different parts of a program interact. Our benchmark and evaluation code are available at https://github.com/HWH-2000/DynaCode.
pdf
bib
abs
Small Encoders Can Rival Large Decoders in Detecting Groundedness
Istabrak Abbes
|
Gabriele Prato
|
Quentin Fournier
|
Fernando Rodriguez
|
Alaa Boukhary
|
Adam Elwood
|
Sarath Chandar
Augmenting large language models (LLMs) with external context significantly improves their performance in natural language processing (NLP) tasks. However, LLMs struggle to answer queries reliably when the provided context lacks information, often resorting to ungrounded speculation or internal knowledge. Groundedness – generating responses strictly supported by the context – is essential for ensuring factual consistency and trustworthiness. This study focuses on detecting whether a given query is grounded in a document provided in context before the costly answer generation by LLMs. Such a detection mechanism can significantly reduce both inference time and resource consumption. We show that lightweight, task-specific encoder models such as RoBERTa and NomicBERT, fine-tuned on curated datasets, can achieve accuracy comparable to state-of-the-art LLMs, such as Llama3 8B and GPT4o, in groundedness detection while reducing inference latency by orders of magnitude.
pdf
bib
abs
KITAB-Bench: A Comprehensive Multi-Domain Benchmark for Arabic OCR and Document Understanding
Ahmed Heakl
|
Muhammad Abdullah Sohail
|
Mukul Ranjan
|
Rania Elbadry
|
Ghazi Shazan Ahmad
|
Mohamed El-Geish
|
Omar Maher
|
Zhiqiang Shen
|
Fahad Shahbaz Khan
|
Salman Khan
With the growing adoption of Retrieval-Augmented Generation (RAG) in document processing, robust text recognition has become increasingly critical for knowledge extraction. While OCR (Optical Character Recognition) for English and other languages benefits from large datasets and well-established benchmarks, Arabic OCR faces unique challenges due to its cursive script, right-to-left text flow, and complex typographic and calligraphic features. We present KITAB-Bench, a comprehensive Arabic OCR benchmark that fills the gaps in current evaluation systems. Our benchmark comprises 8,809 samples across 9 major domains and 36 subdomains, encompassing diverse document types including handwritten text, structured tables, and specialized coverage of 21 chart types for business intelligence. Our findings show that modern vision language models (such as GPT-4o, Gemini, and Qwen) outperform traditional OCR approaches (such as EasyOCR, PaddleOCR, and Surya) by an average of 60% in the character error rate (CER). Furthermore, we highlight significant limitations of current Arabic OCR models, particularly in PDF-to-Markdown conversion, where the best model Gemini-2.0-Flash achieves only 65% accuracy. This underscores the challenges of accurately recognizing Arabic text, including issues with complex fonts, numeral recognition errors, word elongation, and table structure detection. This work establishes a rigorous evaluation framework that can drive improvements in Arabic document analysis methods and bridge the performance gap with English OCR technologies.
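The character error rate (CER) mentioned above is commonly computed as the character-level edit distance between the OCR output and the reference, divided by the reference length. The following is a minimal sketch of that standard definition, not the benchmark's own scoring code: the function names are illustrative, and any text normalization KITAB-Bench applies before scoring (for example, handling of Arabic diacritics) is not specified here and is not modeled.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, substitutions."""
    # prev[j] holds the distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete r
                            curr[j - 1] + 1,      # insert h
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def character_error_rate(ref: str, hyp: str) -> float:
    """CER = edit distance / number of reference characters (assumed definition)."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return edit_distance(ref, hyp) / len(ref)
```

Because Python strings are sequences of Unicode code points, a combining diacritic counts as its own character in this sketch; a real evaluation would need to fix a normalization policy first.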
pdf
bib
abs
Robustness and Confounders in the Demographic Alignment of LLMs with Human Perceptions of Offensiveness
Shayan Alipour
|
Indira Sen
|
Mattia Samory
|
Tanu Mitra
Despite a growing literature finding that large language models (LLMs) exhibit demographic biases, reports on whom they align with best are hard to generalize or even contradictory. In this work, we examine the alignment of LLMs with human annotations in five offensive language datasets, comprising approximately 220K annotations. While demographic traits, particularly race, influence alignment, these effects vary across datasets and are often entangled with other factors. Confounders introduced in the annotation process, such as document difficulty, annotator sensitivity, and within-group agreement, account for more variation in alignment patterns than demographic traits. Alignment increases with annotator sensitivity and group agreement, and decreases with document difficulty. Our results underscore the importance of multi-dataset analyses and confounder-aware methodologies in developing robust measures of demographic bias.
pdf
bib
abs
AL-QASIDA: Analyzing LLM Quality and Accuracy Systematically in Dialectal Arabic
Nathaniel Romney Robinson
|
Shahd Abdelmoneim
|
Kelly Marchisio
|
Sebastian Ruder
Dialectal Arabic (DA) varieties are under-served by language technologies, particularly large language models (LLMs). This trend threatens to exacerbate existing social inequalities and limits LLM applications, yet the research community lacks operationalized performance measurements in DA. We present a framework that comprehensively assesses LLMs’ DA modeling capabilities across four dimensions: fidelity, understanding, quality, and diglossia. We evaluate nine LLMs in eight DA varieties and provide practical recommendations. Our evaluation suggests that LLMs do not produce DA as well as they understand it, not because their DA fluency is poor, but because they are reluctant to generate DA. Further analysis suggests that current post-training can contribute to bias against DA, that few-shot examples can overcome this deficiency, and that otherwise no measurable features of input text correlate well with LLM DA performance.
pdf
bib
abs
Is Large Language Model Performance on Reasoning Tasks Impacted by Different Ways Questions Are Asked?
Seok Hwan Song
|
Mohna Chakraborty
|
Qi Li
|
Wallapak Tavanapong
Large Language Models (LLMs) have been evaluated using diverse question types, e.g., multiple-choice, true/false, and short/long answers. This study answers an unexplored question about the impact of different question types on LLM accuracy on reasoning tasks. We investigate the performance of five LLMs on three different types of questions using quantitative and deductive reasoning tasks. The performance metrics include accuracy in the reasoning steps and in choosing the final answer. Key Findings: (1) Significant differences exist in LLM performance across different question types. (2) Reasoning accuracy does not necessarily correlate with the final selection accuracy. (3) The number of options and the choice of words influence LLM performance.
pdf
bib
abs
MutantPrompt: Prompt Optimization via Mutation Under a Budget on Modest-sized LMs
Arijit Nag
|
Animesh Mukherjee
|
Niloy Ganguly
|
Soumen Chakrabarti
Prompts serve as a critical instruction interface to unlock the diverse capabilities of Large Language Models (LLMs), thus directly influencing the quality of their outputs. While prompt engineering has shown great promise, identifying optimal prompts remains a significant challenge, particularly for low-resource languages, which often face higher computational costs due to increased token generation and limited gold standard task data. In response, we propose MutantPrompt, a framework that leverages multi-armed bandit algorithms to efficiently identify optimal prompts tailored to low-resource languages. By framing prompt selection as an exploration-exploitation problem under a fixed computational budget, the framework dynamically balances exploring new prompts with exploiting known high-performing ones. We demonstrate the framework’s effectiveness across multiple low-resource Indic language tasks, including classification, question-answering and causal reasoning using three small parameter-size LLMs. The results highlight the cost efficiency of the search method in finding optimal prompts and resulting performance improvements.
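Framing prompt selection as an exploration-exploitation problem under a fixed budget can be illustrated with a classic bandit rule. The sketch below uses UCB1 as a stand-in; the abstract does not say which bandit algorithms MutantPrompt uses, so the choice of UCB1, the function name `ucb1_prompt_search`, and the `evaluate` callback (assumed to return a reward in [0, 1] for one scored task example) are all assumptions made for illustration.

```python
import math
from typing import Callable, List


def ucb1_prompt_search(prompts: List[str],
                       evaluate: Callable[[str], float],
                       budget: int) -> str:
    """Spend exactly `budget` evaluations, then return the best-mean prompt."""
    if budget < len(prompts):
        raise ValueError("budget must allow one evaluation per prompt")
    counts = [0] * len(prompts)    # evaluations spent on each prompt
    totals = [0.0] * len(prompts)  # summed reward for each prompt
    for t in range(1, budget + 1):
        if t <= len(prompts):
            arm = t - 1  # initialization: try every prompt once
        else:
            # Mean reward plus an exploration bonus that shrinks with use.
            arm = max(
                range(len(prompts)),
                key=lambda i: totals[i] / counts[i]
                + math.sqrt(2 * math.log(t) / counts[i]),
            )
        totals[arm] += evaluate(prompts[arm])
        counts[arm] += 1
    return prompts[max(range(len(prompts)), key=lambda i: totals[i] / counts[i])]
```

In practice `evaluate` would run the language model on a sampled example with the given prompt and score the output; each such call is one unit of the budget, which is what makes the search cost explicit.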
pdf
bib
abs
Heuristic-based Search Algorithm in Automatic Instruction-focused Prompt Optimization: A Survey
Wendi Cui
|
Jiaxin Zhang
|
Zhuohang Li
|
Hao Sun
|
Damien Lopez
|
Kamalika Das
|
Bradley A. Malin
|
Sricharan Kumar
Recent advances in Large Language Models (LLMs) have led to remarkable achievements across a variety of Natural Language Processing (NLP) tasks, making prompt engineering increasingly central to guiding model outputs. While manual methods (e.g., “chain-of-thought,” “step-by-step” prompts) can be effective, they typically rely on intuition and do not automatically refine prompts over time. In contrast, automatic prompt optimization employing heuristic-based search algorithms can systematically explore and improve prompts with minimal human oversight. This survey proposes a comprehensive taxonomy of these methods, categorizing them by where optimization occurs, what is optimized, what criteria drive the optimization, which operators generate new prompts, and which iterative search algorithms are applied. We further highlight specialized datasets and tools that support and accelerate automated prompt refinement. We conclude by discussing key open challenges, pointing toward future opportunities for more robust and versatile LLM applications.
pdf
bib
abs
CONSENSAGENT: Towards Efficient and Effective Consensus in Multi-Agent LLM Interactions Through Sycophancy Mitigation
Priya Pitre
|
Naren Ramakrishnan
|
Xuan Wang
Multi-agent large language model (LLM) systems have shown remarkable performance in tasks such as reasoning, planning, and decision-making. However, their applicability is limited by challenges such as high computational costs and robustness issues. In this work, we identify and systematically evaluate a critical yet overlooked challenge: sycophancy, where agents reinforce each other’s responses instead of critically engaging with the debate. This behavior inflates computational costs by requiring additional debate rounds to reach consensus, limiting the efficiency of multi-agent LLM systems. Through experiments on six benchmark reasoning datasets across three models, we analyze the impact of sycophancy and its role in reducing the reliability of multi-agent debate. Motivated by our findings, we propose CONSENSAGENT, a novel framework that dynamically refines prompts based on agent interactions to mitigate sycophancy. CONSENSAGENT improves accuracy of the debate while maintaining efficiency. It significantly outperforms both single-agent and multi-agent baselines, achieving state-of-the-art results across all benchmark datasets. Our findings highlight the crucial role of structured prompt optimization in multi-agent setups and establish a foundation for more reliable, efficient multi-agent LLM systems in real-world applications.
pdf
bib
abs
The Structural Safety Generalization Problem
Julius Broomfield
|
Tom Gibbs
|
George Ingebretsen
|
Ethan Kosak-Hine
|
Tia Nasir
|
Jason Zhang
|
Reihaneh Iranmanesh
|
Sara Pieri
|
Reihaneh Rabbany
|
Kellin Pelrine
LLM jailbreaks are a widespread safety challenge. Given this problem has not yet been tractable, we suggest targeting a key failure mechanism: the failure of safety to generalize across semantically equivalent inputs. We further focus the target by requiring desirable tractability properties of attacks to study: explainability, transferability between models, and transferability between goals. We perform red-teaming within this framework by uncovering new vulnerabilities to multi-turn, multi-image, and translation-based attacks. These attacks are semantically equivalent by our design to their single-turn, single-image, or untranslated counterparts, enabling systematic comparisons; we show that the different structures yield different safety outcomes. We then demonstrate the potential for this framework to enable new defenses by proposing a Structure Rewriting Guardrail, which converts an input to a structure more conducive to safety assessment. This guardrail significantly improves refusal of harmful inputs, without over-refusing benign ones. Thus, by framing this intermediate challenge—more tractable than universal defenses but essential for long-term safety—we highlight a critical milestone for AI safety research.
pdf
bib
abs
DPO Kernels: A Semantically-Aware, Kernel-Enhanced, and Divergence-Rich Paradigm for Direct Preference Optimization
Amitava Das
|
Suranjana Trivedy
|
Danush Khanna
|
Yaswanth Narsupalli
|
Basab Ghosh
|
Rajarshi Roy
|
Gurpreet Singh
|
Vinija Jain
|
Vasu Sharma
|
Aishwarya Naresh Reganti
|
Aman Chadha
The rapid advancement of large language models (LLMs) has revolutionized numerous applications, but presents significant challenges in aligning these models with diverse human values, ethical standards, and specific user preferences. Direct Preference Optimization (DPO) has become a cornerstone for preference alignment but is constrained by reliance on fixed divergence measures and limited feature transformations. We introduce DPO-Kernels, an innovative enhancement of DPO that integrates kernel methods to overcome these challenges through four key contributions: (i) Kernelized Representations: These representations enhance divergence measures by using polynomial, RBF, Mahalanobis, and spectral kernels for richer feature transformations. Additionally, we introduce a hybrid loss that combines embedding-based loss with probability-based loss; (ii) Divergence Alternatives: Beyond Kullback–Leibler (KL), we incorporate Jensen-Shannon, Hellinger, Rényi, Bhattacharyya, Wasserstein, and other f-divergences to boost stability and robustness; (iii) Data-Driven Selection: Choosing the optimal kernel-divergence pair among 28 combinations (4 kernels × 7 divergences) is challenging. We introduce automatic metrics that analyze the data to select the best kernel-divergence pair, eliminating the need for manual tuning; (iv) Hierarchical Mixture of Kernels (HMK): Combining local and global kernels for precise and large-scale semantic modeling. This approach automatically selects the optimal kernel mixture during training, enhancing modeling flexibility. DPO-Kernels achieve state-of-the-art generalization in factuality, safety, reasoning, and instruction following across 12 datasets. While alignment risks overfitting, Heavy-Tailed Self-Regularization (HT-SR) theory confirms that DPO-Kernels ensure robust generalization in LLMs. Comprehensive resources are available to facilitate further research and application of DPO-Kernels.
pdf
bib
abs
Model-Dependent Moderation: Inconsistencies in Hate Speech Detection Across LLM-based Systems
Neil Fasching
|
Yphtach Lelkes
Content moderation systems powered by large language models (LLMs) are increasingly deployed to detect hate speech; however, no systematic comparison exists between different systems. If different systems produce different outcomes for the same content, it undermines consistency and predictability, leading to moderation decisions that appear arbitrary or unfair. Analyzing seven leading models—dedicated Moderation Endpoints (OpenAI, Mistral), frontier LLMs (Claude 3.5 Sonnet, GPT-4o, Mistral Large, DeepSeek V3), and specialized content moderation APIs (Google Perspective API)—we demonstrate that moderation system choice fundamentally determines hate speech classification outcomes. Using a novel synthetic dataset of 1.3+ million sentences from a factorial design, we find identical content receives markedly different classification values across systems, with variations especially pronounced for specific demographic groups. Analysis across 125 distinct groups reveals these divergences reflect systematic differences in how models establish decision boundaries around harmful content, highlighting significant implications for automated content moderation.
pdf
bib
abs
Label-semantics Aware Generative Approach for Domain-Agnostic Multilabel Classification
Subhendu Khatuya
|
Shashwat Naidu
|
Saptarshi Ghosh
|
Pawan Goyal
|
Niloy Ganguly
The explosion of textual data has made manual document classification increasingly challenging. To address this, we introduce a robust, efficient domain-agnostic generative model framework for multi-label text classification. Instead of treating labels as mere atomic symbols, our approach utilizes predefined label descriptions and is trained to generate these descriptions based on the input text. During inference, the generated descriptions are matched to the predefined labels using a finetuned sentence transformer. We integrate this with a dual-objective loss function, combining cross-entropy loss and cosine similarity of the generated sentences with the predefined target descriptions, ensuring both semantic alignment and accuracy. Our proposed model LAGAMC stands out for its parameter efficiency and versatility across diverse datasets, making it well-suited for practical applications. We demonstrate the effectiveness of our proposed model by achieving new state-of-the-art performances across all evaluated datasets, surpassing several strong baselines. We achieve improvements of 13.94% in Micro-F1 and 24.85% in Macro-F1 compared to the closest baseline across all datasets.
pdf
bib
abs
Unsupervised Morphological Tree Tokenizer
Qingyang Zhu
|
Xiang Hu
|
Pengyu Ji
|
Wei Wu
|
Kewei Tu
As a cornerstone in language modeling, tokenization involves segmenting text inputs into pre-defined atomic units. Conventional statistical tokenizers often disrupt constituent boundaries within words, thereby corrupting semantic information. To address this drawback, we introduce morphological structure guidance to tokenization and propose a deep model to induce character-level structures of words. Specifically, the deep model jointly encodes internal structures and representations of words with a mechanism named MorphOverriding to ensure the indecomposability of morphemes. By training the model with self-supervised objectives, our method is capable of inducing character-level structures that align with morphological rules without annotated training data. Based on the induced structures, our algorithm tokenizes words through vocabulary matching in a top-down manner. Empirical results indicate that the proposed method effectively retains complete morphemes and outperforms widely adopted methods such as BPE and WordPiece on both morphological segmentation tasks and language modeling tasks.
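The top-down vocabulary matching step can be pictured as a recursive walk over a binary character tree: emit a subtree's text as one token if it is in the vocabulary, otherwise descend into its children. The sketch below is a toy illustration under several assumptions: the tree is written by hand rather than induced by the model, it is encoded as nested tuples, and single characters are always emitted when a leaf is reached. None of these details is taken from the paper.

```python
from typing import List, Set, Tuple, Union

# A tree is either a leaf string or a pair of subtrees (hypothetical encoding).
Tree = Union[str, Tuple["Tree", "Tree"]]


def surface(node: Tree) -> str:
    """Concatenate the leaves under `node` from left to right."""
    if isinstance(node, str):
        return node
    return surface(node[0]) + surface(node[1])


def tokenize(node: Tree, vocab: Set[str]) -> List[str]:
    """Top-down matching: keep a subtree whole if its text is in `vocab`."""
    text = surface(node)
    if isinstance(node, str) or text in vocab:
        return [text]  # matched span, or a leaf used as fallback
    left, right = node
    return tokenize(left, vocab) + tokenize(right, vocab)


# Hand-written structure for "unhappy": (un) + ((ha)(pp) y)
tree = (("u", "n"), ((("h", "a"), ("p", "p")), "y"))
print(tokenize(tree, {"un", "happy"}))  # ['un', 'happy']
```

Because a split can only happen at a tree boundary, a morpheme grouped under one subtree is never cut across its middle unless the vocabulary forces a descent into it.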
pdf
bib
abs
CausalLink: An Interactive Evaluation Framework for Causal Reasoning
Jinyue Feng
|
Frank Rudzicz
We present CausalLink, an innovative evaluation framework that interactively assesses the causal reasoning skill to identify the correct intervention in conversational language models. Each CausalLink test case creates a hypothetical environment in which the language models are instructed to apply interventions to entities whose interactions follow predefined causal relations generated from controllable causal graphs. Our evaluation framework isolates causal capabilities from the confounding effects of world knowledge and semantic cues. We evaluate a series of LLMs in a scenario featuring movements of geometric shapes and discover that models start to exhibit reliable reasoning on two or three variables at the 14-billion-parameter scale. However, the performance of state-of-the-art models such as GPT4o degrades below random chance as the number of variables increases. We identify and analyze several key failure modes.
pdf
bib
abs
Toward Global AI Inclusivity: A Large-Scale Multilingual Terminology Dataset (GIST)
Jiarui Liu
|
Iman Ouzzani
|
Wenkai Li
|
Lechen Zhang
|
Tianyue Ou
|
Houda Bouamor
|
Zhijing Jin
|
Mona T. Diab
The field of machine translation has achieved significant advancements, yet domain-specific terminology translation, particularly in AI, remains challenging. This work introduces GIST, a large-scale multilingual AI terminology dataset containing 5K terms extracted from top AI conference papers spanning 2000 to 2023. The terms were translated into Arabic, Chinese, French, Japanese, and Russian using a hybrid framework that combines LLMs for extraction with human expertise for translation. The dataset’s quality was benchmarked against existing resources, demonstrating superior translation accuracy through crowdsourced evaluation. GIST was integrated into translation workflows using post-translation refinement methods that required no retraining, where LLM prompting consistently improved BLEU and COMET scores. A web demonstration on the ACL Anthology platform highlights its practical application, showcasing improved accessibility for non-English speakers. We address a critical gap in AI terminology resources and foster global inclusivity and collaboration in AI research.
pdf
bib
abs
A Joint Optimization Framework for Enhancing Efficiency of Tool Utilization in LLM Agents
Bin Wu
|
Edgar Meij
|
Emine Yilmaz
Large Language Models (LLMs) augmented with external tools have demonstrated remarkable capabilities in complex problem solving. Existing efforts for tool utilization typically involve an LLM agent that contains instructions on using the description of the available tools to determine and call the tools required to solve the problem. Inference scaling techniques, such as chain-of-thought and tree-of-thought reasoning, are commonly used but incur significant computational overhead, rendering such methods impractical in real-world applications. In this work, we recognize and formalize the critical role of instructions provided in agent prompts and tool descriptions (collectively referred to as *context*) and show that incomplete *context* is one of the reasons for this computational overhead. To fill this efficiency gap, we propose an optimization framework that jointly refines both the instructions provided in the agent prompt and tool description, enhancing their interaction. Experiments on StableToolBench and RestBench demonstrate that our optimized agents achieve superior efficiency while maintaining effectiveness. Our findings underscore the critical role of context optimization in improving LLM agents for tool utilization, paving the way for more responsive and cost-effective LLM agents. Our code is available at https://github.com/Bingo-W/ToolOptimization.
pdf
bib
abs
When Claims Evolve: Evaluating and Enhancing the Robustness of Embedding Models Against Misinformation Edits
Jabez Magomere
|
Emanuele La Malfa
|
Manuel Tonneau
|
Ashkan Kazemi
|
Scott A. Hale
Online misinformation remains a critical challenge, and fact-checkers increasingly rely on claim matching systems that use sentence embedding models to retrieve relevant fact-checks. However, as users interact with claims online, they often introduce edits, and it remains unclear whether current embedding models used in retrieval are robust to such edits. To investigate this, we introduce a perturbation framework that generates valid and natural claim variations, enabling us to assess the robustness of a wide range of sentence embedding models in a multi-stage retrieval pipeline and evaluate the effectiveness of various mitigation approaches. Our evaluation reveals that standard embedding models exhibit notable performance drops on edited claims, while LLM-distilled embedding models offer improved robustness at a higher computational cost. Although a strong reranker helps to reduce the performance drop, it cannot fully compensate for first-stage retrieval gaps. To address these retrieval gaps, we evaluate train- and inference-time mitigation approaches, demonstrating that they can improve in-domain robustness by up to 17 percentage points and boost out-of-domain generalization by 10 percentage points. Overall, our findings provide practical improvements to claim-matching systems, enabling more reliable fact-checking of evolving misinformation.
pdf
bib
abs
Splintering Nonconcatenative Languages for Better Tokenization
Bar Gazit
|
Shaltiel Shmidman
|
Avi Shmidman
|
Yuval Pinter
Common subword tokenization algorithms like BPE and UnigramLM assume that text can be split into meaningful units by concatenative measures alone. This is not true for languages such as Hebrew and Arabic, where morphology is encoded in root-template patterns, or Malay and Georgian, where split affixes are common. We present SPLINTER, a pre-processing step which rearranges text into a linear form that better represents such nonconcatenative morphologies, enabling meaningful contiguous segments to be found by the tokenizer. We demonstrate SPLINTER’s merit using both intrinsic measures evaluating token vocabularies in Hebrew, Arabic, and Malay, and downstream tasks using BERT-architecture models trained for Hebrew.
pdf
bib
abs
Aria-UI: Visual Grounding for GUI Instructions
Yuhao Yang
|
Yue Wang
|
Dongxu Li
|
Ziyang Luo
|
Bei Chen
|
Chao Huang
|
Junnan Li
Digital agents for automating tasks across different platforms by directly manipulating the GUIs are increasingly important. For these agents, grounding from language instructions to target elements remains a significant challenge due to reliance on HTML or AXTree inputs. In this paper, we introduce Aria-UI, a large multimodal model specifically designed for GUI grounding. Aria-UI adopts a pure-vision approach, eschewing reliance on auxiliary inputs. To adapt to heterogeneous planning instructions, we propose a scalable data pipeline that synthesizes diverse and high-quality instruction samples for grounding. To handle dynamic contexts in task performing, Aria-UI incorporates textual and text-image interleaved action histories, enabling robust context-aware reasoning for grounding. Aria-UI sets new state-of-the-art results across offline and online agent benchmarks, outperforming both vision-only and AXTree-reliant baselines. We release all training data and model checkpoints to foster further research.
pdf
bib
abs
Revealing Hidden Mechanisms of Cross-Country Content Moderation with Natural Language Processing
Neemesh Yadav
|
Jiarui Liu
|
Francesco Ortu
|
Roya Ensafi
|
Zhijing Jin
|
Rada Mihalcea
The ability of Natural Language Processing (NLP) methods to categorize text into multiple classes has motivated their use in online content moderation tasks, such as hate speech and fake news detection. However, there is limited understanding of how or why these methods make such decisions, or why certain content is moderated in the first place. To investigate the hidden mechanisms behind content moderation, we explore multiple directions: 1) training classifiers to reverse-engineer content moderation decisions across countries; 2) explaining content moderation decisions by analyzing Shapley values and LLM-guided explanations. Our primary focus is on content moderation decisions made across countries, using pre-existing corpora sampled from the Twitter Stream Grab. Our experiments reveal interesting patterns in censored posts, both across countries and over time. Through human evaluations of LLM-generated explanations across three LLMs, we assess the effectiveness of using LLMs in content moderation. Finally, we discuss potential future directions, as well as the limitations and ethical considerations of this work.
pdf
bib
abs
Unilogit: Robust Machine Unlearning for LLMs Using Uniform-Target Self-Distillation
Stefan Vasilev
|
Christian Herold
|
Baohao Liao
|
Seyyed Hadi Hashemi
|
Shahram Khadivi
|
Christof Monz
This paper introduces Unilogit, a novel self-distillation method for machine unlearning in Large Language Models. Unilogit addresses the challenge of selectively forgetting specific information while maintaining overall model utility, a critical task in compliance with data privacy regulations like GDPR. Unlike prior methods that rely on static hyperparameters or starting model outputs, Unilogit dynamically adjusts target logits to achieve a uniform probability for the target token, leveraging the current model’s outputs for more accurate self-distillation targets. This approach not only eliminates the need for additional hyperparameters but also enhances the model’s ability to approximate the golden targets. Extensive experiments on public benchmarks and an in-house e-commerce dataset demonstrate Unilogit’s superior performance in balancing forget and retain objectives, outperforming state-of-the-art methods such as NPO and UnDIAL. Our analysis further reveals Unilogit’s robustness across various scenarios, highlighting its practical applicability and effectiveness in achieving efficacious machine unlearning.
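To make the uniform-target idea concrete: if only the target token's logit is changed and its post-softmax probability should equal 1/V for a vocabulary of size V, the new logit must be log(S / (V - 1)), where S is the sum of exp(z_j) over all other tokens. The sketch below works through that arithmetic on a toy logit vector. It illustrates the stated goal (uniform probability for the target token, computed from the current model's logits); it is not a reproduction of the paper's training objective, and the function names are illustrative.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def uniform_target_logits(logits: List[float], target: int) -> List[float]:
    """Return a copy of `logits` whose target entry yields probability 1/V."""
    vocab_size = len(logits)
    if vocab_size < 2:
        raise ValueError("need at least two tokens")
    others = [z for i, z in enumerate(logits) if i != target]
    # log-sum-exp over the non-target logits, computed stably.
    m = max(others)
    log_sum_others = m + math.log(sum(math.exp(z - m) for z in others))
    adjusted = list(logits)
    # Solve exp(z_t) / (exp(z_t) + S) = 1/V  =>  z_t = log(S) - log(V - 1).
    adjusted[target] = log_sum_others - math.log(vocab_size - 1)
    return adjusted
```

Only the target entry changes, so the relative probabilities among all other tokens are preserved; a softmax over the adjusted logits could then serve as a soft distillation target.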
Creating a Lens of Chinese Culture: A Multimodal Dataset for Chinese Pun Rebus Art Understanding
Tuo Zhang
|
Tiantian Feng
|
Yibin Ni
|
Mengqin Cao
|
Ruying Liu
|
Kiana Avestimehr
|
Katharine Butler
|
Yanjun Weng
|
Mi Zhang
|
Shrikanth Narayanan
|
Salman Avestimehr
Large vision-language models (VLMs) have demonstrated remarkable abilities in understanding everyday content. However, their performance in the domain of art, particularly culturally rich art forms, remains less explored. As a pearl of human wisdom and creativity, art encapsulates complex cultural narratives and symbolism. In this paper, we offer the Pun Rebus Art Dataset, a multimodal dataset for art understanding deeply rooted in traditional Chinese culture. We focus on three primary tasks: identifying salient visual elements, matching elements with their symbolic meanings, and explaining the conveyed messages. Our evaluation reveals that state-of-the-art VLMs struggle with these tasks, often providing biased and hallucinated explanations and showing limited improvement through in-context learning. By releasing the Pun Rebus Art Dataset, we aim to facilitate the development of VLMs that can better understand and interpret culturally specific content, promoting greater inclusiveness beyond English-based corpora. The dataset and evaluation code are available at [this link](https://github.com/zhang-tuo-pdf/Pun-Rebus-Art-Benchmark).
FastDraft: How to Train Your Draft
Ofir Zafrir
|
Igor Margulis
|
Dorin Shteyman
|
Shira Guskin
|
Guy Boudoukh
Speculative Decoding has gained popularity as an effective technique for accelerating the auto-regressive inference process of Large Language Models. However, Speculative Decoding relies entirely on the availability of efficient draft models, which are often lacking for many existing language models due to a stringent constraint of vocabulary compatibility. In this work we introduce FastDraft, a novel and efficient approach for pre-training and aligning a draft model to any large language model by incorporating efficient pre-training, followed by fine-tuning over synthetic datasets generated by the target model. We demonstrate FastDraft by training two highly parameter-efficient drafts for the popular Phi-3-mini and Llama-3.1-8B models. Using FastDraft, we were able to produce a draft model trained on approximately 10 billion tokens on a single server with 8 Intel Gaudi 2 accelerators in under 24 hours. Our results show that the draft model achieves strong results on the key metrics of acceptance rate and block efficiency, with a memory-bound speedup of up to 3x on code completion and up to 2x on summarization, text completion and instruction tasks. We validate our theoretical findings through benchmarking on the latest Intel Core Ultra, achieving a wall-clock speedup of up to 2x, indicating a significant reduction in runtime. Due to its high quality, FastDraft unlocks large language model inference on AI-PC and other edge devices.
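The two draft-quality metrics named in the abstract can be computed from simple per-block counts. The sketch below assumes the definitions commonly used for speculative decoding: acceptance rate is the share of proposed draft tokens that the target model accepts, and block efficiency is the average number of tokens emitted per target-model verification pass (accepted draft tokens plus the one token the target supplies itself). The function name and input format are hypothetical, and the paper may define the metrics slightly differently.

```python
def speculative_metrics(accepted_per_block, draft_len):
    """Compute (acceptance_rate, block_efficiency) from verification logs.

    accepted_per_block: number of draft tokens accepted in each block.
    draft_len: number of draft tokens proposed per block (assumed constant).
    """
    if not accepted_per_block or draft_len <= 0:
        raise ValueError("need at least one block and a positive draft length")
    if any(a < 0 or a > draft_len for a in accepted_per_block):
        raise ValueError("accepted counts must lie in [0, draft_len]")
    num_blocks = len(accepted_per_block)
    accepted = sum(accepted_per_block)
    acceptance_rate = accepted / (draft_len * num_blocks)
    # Each verification pass also yields one token from the target model.
    block_efficiency = (accepted + num_blocks) / num_blocks
    return acceptance_rate, block_efficiency
```

Under these definitions, block efficiency ranges from 1 (no draft token ever accepted) to `draft_len + 1` (every draft token accepted), which is why it is a natural upper bound on the achievable speedup.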
SignMusketeers: An Efficient Multi-Stream Approach for Sign Language Translation at Scale
Shester Gueuwou
|
Xiaodan Du
|
Greg Shakhnarovich
|
Karen Livescu
A persistent challenge in sign language video processing, including the task of sign language to written language translation, is how to train efficient models given the nature of videos. Informed by the nature and linguistics of signed languages, our proposed method focuses on just the most relevant parts in a signing video: the face, hands and body posture of the signer. However, instead of using pose estimation coordinates from off-the-shelf pose tracking models, which have inconsistent performance for hands and faces, we propose to learn the complex handshapes and rich facial expressions of sign languages in a self-supervised fashion. Our approach is based on learning from individual frames (rather than video sequences) and is therefore much more efficient than prior work on sign language pre-training. Compared to a recent model trained on publicly available data that established a new state of the art in sign language translation on the How2Sign dataset, our approach yields similar translation performance, using less than 3% of the compute.
GUI Agents: A Survey
Dang Nguyen
|
Jian Chen
|
Yu Wang
|
Gang Wu
|
Namyong Park
|
Zhengmian Hu
|
Hanjia Lyu
|
Junda Wu
|
Ryan Aponte
|
Yu Xia
|
Xintong Li
|
Jing Shi
|
Hongjie Chen
|
Viet Dac Lai
|
Zhouhang Xie
|
Sungchul Kim
|
Ruiyi Zhang
|
Tong Yu
|
Mehrab Tanjim
|
Nesreen K. Ahmed
|
Puneet Mathur
|
Seunghyun Yoon
|
Lina Yao
|
Branislav Kveton
|
Jihyung Kil
|
Thien Huu Nguyen
|
Trung Bui
|
Tianyi Zhou
|
Ryan A. Rossi
|
Franck Dernoncourt
Graphical User Interface (GUI) agents, powered by Large Foundation Models, have emerged as a transformative approach to automating human-computer interaction. These agents autonomously interact with digital systems via GUIs, emulating human actions such as clicking, typing, and navigating visual elements across diverse platforms. Motivated by the growing interest and fundamental importance of GUI agents, we provide a comprehensive survey that categorizes their benchmarks, evaluation metrics, architectures, and training methods. We propose a unified framework that delineates their perception, reasoning, planning, and acting capabilities. Furthermore, we identify important open challenges and discuss key future directions. Finally, this work serves as a basis for practitioners and researchers to gain an intuitive understanding of current progress, techniques, benchmarks, and critical open problems that remain to be addressed.
MEDEC: A Benchmark for Medical Error Detection and Correction in Clinical Notes
Asma Ben Abacha
|
Wen-wai Yim
|
Yujuan Fu
|
Zhaoyi Sun
|
Meliha Yetisgen
|
Fei Xia
|
Thomas Lin
Several studies have shown that Large Language Models (LLMs) can answer medical questions correctly, even outperforming the average human score in some medical exams. However, to our knowledge, no study has been conducted to assess the ability of language models to validate existing or generated medical text for correctness and consistency. In this paper, we introduce MEDEC (https://github.com/abachaa/MEDEC), the first publicly available benchmark for medical error detection and correction in clinical notes, covering five types of errors (Diagnosis, Management, Treatment, Pharmacotherapy, and Causal Organism). MEDEC consists of 3,848 clinical texts, including 488 clinical notes from three US hospital systems that were not previously seen by any LLM. The dataset has been used in the MEDIQA-CORR 2024 shared task to evaluate seventeen participating systems. In this paper, we describe the data creation methods and we evaluate recent LLMs (e.g., o1-preview, GPT-4, Claude 3.5 Sonnet, Gemini 2.0 Flash, and DeepSeek-R1) for the tasks of detecting and correcting medical errors requiring both medical knowledge and reasoning capabilities. We also conducted a comparative study where two medical doctors performed the same task on the MEDEC test set. The results showed that MEDEC is a sufficiently challenging benchmark to assess the ability of models to validate existing or generated notes and to correct medical errors. We also found that although recent LLMs perform well in error detection and correction, they are still outperformed by medical doctors in these tasks. We discuss the potential factors behind this gap, the insights from our experiments, the limitations of current evaluation metrics, and share potential pointers for future research.
Understanding the Influence of Synthetic Data for Text Embedders
Jacob Mitchell Springer
|
Vaibhav Adlakha
|
Siva Reddy
|
Aditi Raghunathan
|
Marius Mosbach
Recent progress in developing general purpose text embedders has been driven by training on ever-growing corpora of synthetic LLM-generated data. Nonetheless, no publicly available synthetic dataset exists, posing a barrier to studying its role for generalization. To address this issue, we first reproduce and publicly release the synthetic data proposed by Wang et al. (2024) (Mistral-E5). Our synthetic data is high quality and leads to consistent improvements in performance. Next, we critically examine where exactly synthetic data improves model generalization. Our analysis reveals that benefits from synthetic data are sparse and highly localized to individual datasets. Moreover, we observe trade-offs between the performance on different categories: data that benefits one task degrades performance on another. Our findings highlight the limitations of current synthetic data approaches for building general-purpose embedders and challenge the notion that training on synthetic data leads to more robust embedding models across tasks.
Dynamic Knowledge Integration for Evidence-Driven Counter-Argument Generation with Large Language Models
Anar Yeginbergen
|
Maite Oronoz
|
Rodrigo Agerri
This paper investigates the role of dynamic external knowledge integration in improving counter-argument generation using Large Language Models (LLMs). While LLMs have shown promise in argumentative tasks, their tendency to generate lengthy, potentially non-factual responses highlights the need for more controlled and evidence-based approaches. We introduce a reconstructed and manually curated dataset of argument and counter-argument pairs specifically designed to balance argumentative complexity with evaluative feasibility. We also propose a new LLM-as-a-Judge evaluation methodology that shows a stronger correlation with human judgments compared to traditional reference-based metrics. Our experimental results demonstrate that integrating dynamic external knowledge from the web significantly improves the quality of generated counter-arguments, particularly in terms of relatedness, persuasiveness, and factuality. The findings suggest that combining LLMs with real-time external knowledge retrieval offers a promising direction for developing more effective and reliable counter-argumentation systems. Data and code are publicly available: https://github.com/anaryegen/counter-argument-generation
Tell, Don’t Show: Leveraging Language Models’ Abstractive Retellings to Model Literary Themes
Li Lucy
|
Camilla Griffiths
|
Sarah Levine
|
Jennifer L Eberhardt
|
Dorottya Demszky
|
David Bamman
Conventional bag-of-words approaches for topic modeling, like latent Dirichlet allocation (LDA), struggle with literary text. Literature challenges lexical methods because narrative language focuses on immersive sensory details instead of abstractive description or exposition: writers are advised to *show, don’t tell*. We propose Retell, a simple, accessible topic modeling approach for literature. Here, we prompt resource-efficient, generative language models (LMs) to *tell* what passages *show*, thereby translating narratives’ surface forms into higher-level concepts and themes. By running LDA on LMs’ retellings of passages, we can obtain more precise and informative topics than by running LDA alone or by directly asking LMs to list topics. To investigate the potential of our method for cultural analytics, we compare our method’s outputs to expert-guided annotations in a case study on racial/cultural identity in high school English language arts books.
BottleHumor: Self-Informed Humor Explanation using the Information Bottleneck Principle
EunJeong Hwang
|
Peter West
|
Vered Shwartz
Humor is prevalent in online communications and it often relies on more than one modality (e.g., cartoons and memes). Interpreting humor in multimodal settings requires drawing on diverse types of knowledge, including metaphorical, sociocultural, and commonsense knowledge. However, identifying the most useful knowledge remains an open question. We introduce BottleHumor, a method inspired by the information bottleneck principle that elicits relevant world knowledge from vision and language models, which is iteratively refined for generating an explanation of the humor in an unsupervised manner. Our experiments on three datasets confirm the advantage of our method over a range of baselines. Our method can further be adapted in the future for additional tasks that can benefit from eliciting and conditioning on relevant world knowledge and open new research avenues in this direction.
Financial Language Model Evaluation (FLaME)
Glenn Matlin
|
Mika Okamoto
|
Huzaifa Pardawala
|
Yang Yang
|
Sudheer Chava
Language Models (LMs) have demonstrated impressive capabilities with core Natural Language Processing (NLP) tasks. The effectiveness of LMs for highly specialized knowledge-intensive tasks in finance remains difficult to assess due to major gaps in the methodologies of existing evaluation frameworks, which have caused an erroneous belief in a far lower bound of LMs’ performance on common Finance NLP (FinNLP) tasks. To demonstrate the potential of LMs for these FinNLP tasks, we present the first holistic benchmarking suite for Financial Language Model Evaluation (FLaME). We are the first research paper to comprehensively study LMs against ‘reasoning-reinforced’ LMs, with an empirical study of 23 foundation LMs over 20 core NLP tasks in finance. We open-source our framework software along with all data and results.
CausalRAG: Integrating Causal Graphs into Retrieval-Augmented Generation
Nengbo Wang
|
Xiaotian Han
|
Jagdip Singh
|
Jing Ma
|
Vipin Chaudhary
Large language models (LLMs) have revolutionized natural language processing (NLP), particularly through Retrieval-Augmented Generation (RAG), which enhances LLM capabilities by integrating external knowledge. However, traditional RAG systems face critical limitations, including disrupted contextual integrity due to text chunking, and over-reliance on semantic similarity for retrieval. To address these issues, we propose CausalRAG, a novel framework that incorporates causal graphs into the retrieval process. By constructing and tracing causal relationships, CausalRAG preserves contextual continuity and improves retrieval precision, leading to more accurate and interpretable responses. We evaluate CausalRAG against regular RAG and graph-based RAG approaches, demonstrating its superiority across multiple metrics. Our findings suggest that grounding retrieval in causal reasoning provides a promising approach to knowledge-intensive tasks.
Towards Safety Reasoning in LLMs: AI-agentic Deliberation for Policy-embedded CoT Data Creation
Tharindu Kumarage
|
Ninareh Mehrabi
|
Anil Ramakrishna
|
Xinyan Zhao
|
Richard Zemel
|
Kai-Wei Chang
|
Aram Galstyan
|
Rahul Gupta
|
Charith Peris
Safety reasoning is a recent paradigm where LLMs reason over safety policies before generating responses, thereby mitigating limitations in existing safety measures such as over-refusal and jailbreak vulnerabilities. However, implementing this paradigm is challenging due to the resource-intensive process of creating high-quality policy-embedded chain-of-thought (CoT) datasets while ensuring reasoning remains accurate and free from hallucinations or policy conflicts. To tackle this, we propose AIDSAFE: Agentic Iterative Deliberation for Safety Reasoning, a novel data generation recipe that leverages multi-agent deliberation to iteratively expand reasoning on safety policies. A data refiner stage in AIDSAFE ensures high-quality outputs by eliminating repetitive, redundant, and deceptive thoughts. AIDSAFE-generated CoTs provide a strong foundation for supervised fine-tuning (SFT)-based safety training. Additionally, to address the need for preference data in alignment stages, such as DPO training, we introduce a supplemental recipe that uses belief augmentation to create distinct selected and rejected CoT samples. Our evaluations demonstrate that AIDSAFE-generated CoTs achieve superior policy adherence and reasoning quality. Consequently, we show that fine-tuning open-source LLMs on these CoTs can significantly improve safety generalization and jailbreak robustness while maintaining acceptable utility and over-refusal accuracy.
Explain then Rank: Scale Calibration of Neural Rankers Using Natural Language Explanations from LLMs
Puxuan Yu
|
Daniel Cohen
|
Hemank Lamba
|
Joel R. Tetreault
|
Alejandro Jaimes
In search settings, calibrating the scores during the ranking process to quantities such as click-through rates or relevance levels enhances a system’s usefulness and trustworthiness for downstream users. While previous research has improved this notion of calibration for low complexity learning-to-rank models, the larger data demands and parameter count specific to modern neural text rankers produce unique obstacles that hamper the efficacy of methods intended for the learning-to-rank setting. This paper proposes exploiting large language models (LLMs) to provide relevance and uncertainty signals for these neural text rankers to produce scale-calibrated scores through Monte Carlo sampling of natural language explanations (NLEs). Our approach transforms the neural ranking task from ranking textual query-document pairs to ranking corresponding synthesized NLEs. Comprehensive experiments on two popular document ranking datasets show that the NLE-based calibration approach consistently outperforms past calibration methods and LLM-based methods for ranking, calibration, and query performance prediction tasks.
Beyond instruction-conditioning, MoTE: Mixture of Task Experts for Multi-task Embedding Models
Miguel Romero Calvo
|
Shuoyang Ding
|
Corey D Barrett
|
Georgiana Dinu
|
George Karypis
Dense embeddings are fundamental to modern machine learning systems, powering Retrieval-Augmented Generation (RAG), information retrieval, and representation learning. While instruction-conditioning has become the dominant approach for embedding specialization, its direct application to low-capacity models imposes fundamental representational constraints that limit the performance gains derived from specialization. In this paper, we analyze these limitations and introduce the Mixture of Task Experts (MoTE) transformer block, which leverages task-specialized parameters trained with Task-Aware Contrastive Learning to enhance the model’s ability to generate specialized embeddings. Empirical results show that MoTE achieves 64% higher performance gains in retrieval datasets (+3.27 → +5.21) and 43% higher performance gains across all datasets (+1.81 → +2.60). Critically, these gains are achieved without altering instructions, training data, inference time, or number of active parameters.
Metagent-P: A Neuro-Symbolic Planning Agent with Metacognition for Open Worlds
Yanfang Zhou
|
Yuntao Liu
|
Xiaodong Li
|
Yongqiang Zhao
|
Xintong Wang
|
Jinlong Tian
|
Zhenyu Li
|
Xinhai Xu
The challenge of developing agents capable of open-world planning remains fundamental to artificial general intelligence (AGI). While large language models (LLMs) have made progress with their vast world knowledge, their limitations in perception, memory, and reliable reasoning still hinder LLM-based agents from achieving human-level performance in long-term tasks. Drawing inspiration from human cognitive-metacognitive collaboration, we propose Metagent-P, integrating the world knowledge of LLMs, the symbolic reasoning capabilities of cognitive architectures, and the self-reflection characteristic of metacognition to construct a “planning-verification-execution-reflection” framework. Metagent-P improves experience utilization through multimodal memory integration. It uses a neural-symbolic hierarchical representation structure to ensure the plan’s reasoning correctness in advance. Finally, it actively adapts the agent to dynamic environments through monitoring, evaluation, and regulation mechanisms. Experimental results show Metagent-P significantly outperforms current state-of-the-art methods in Minecraft. In long-term tasks, Metagent-P reduces the average replanning counts by 34% and exceeds the average human success rate by 18.96%. Metagent-P also demonstrates self-evolution through step-by-step open-world exploration.
Q-STRUM Debate: Query-Driven Contrastive Summarization for Recommendation Comparison
George-Kirollos Saad
|
Scott Sanner
Query-driven recommendation with unknown items poses a challenge for users to understand why certain items are appropriate for their needs. Query-driven Contrastive Summarization (QCS) is a methodology designed to address this issue by leveraging language-based item descriptions to clarify contrasts between them. However, existing state-of-the-art contrastive summarization methods such as STRUM-LLM fall short of this goal. To overcome these limitations, we introduce Q-STRUM Debate, a novel extension of STRUM-LLM that employs debate-style prompting to generate focused and contrastive summarizations of item aspects relevant to a query. Leveraging modern large language models (LLMs) as powerful tools for generating debates, Q-STRUM Debate provides enhanced contrastive summaries. Experiments across three datasets demonstrate that Q-STRUM Debate yields significant performance improvements over existing methods on key contrastive summarization criteria, thus introducing a novel and performant debate prompting methodology for QCS.
Inductive Linguistic Reasoning with Large Language Models
Raghav Ramji
|
Keshav Ramji
Evaluating large language models (LLMs) on their linguistic reasoning capabilities is an important task to understand the gaps in their skills that may surface during large-scale adoption. In this work, we investigate the abilities of such models to perform abstract multilingual reasoning through the lens of linguistic puzzles on extremely low-resource languages. As these translation tasks involve inductive and deductive reasoning from reference instances, we examine whether diverse auxiliary demonstrations can be automatically induced from seed exemplars, through analogical prompting. We employ a two-stage procedure, first generating analogical exemplars with a language model, and then applying them in-context along with provided target language exemplars. Our results on the modeLing dataset show that analogical prompting is effective in eliciting models’ knowledge of language grammar similarities, boosting the performance of GPT-4o by as much as 8.1% and Llama-3.1-405B-Instruct by 5.9% over chain-of-thought approaches. Furthermore, we demonstrate that our method generalizes to other tasks present in Linguistics Olympiad competitions, achieving state-of-the-art results across nearly all problem types and difficulty levels in the LINGOLY dataset.
Evaluating LLMs’ Mathematical and Coding Competency through Ontology-guided Interventions
Pengfei Hong
|
Navonil Majumder
|
Deepanway Ghosal
|
Somak Aditya
|
Rada Mihalcea
|
Soujanya Poria
Recent advancements in Large Language Models (LLMs) have showcased striking results on existing logical reasoning benchmarks, with some models even surpassing human performance. However, the true depth of their competencies and robustness in reasoning tasks remains an open question. To this end, in this paper, we focus on two popular reasoning tasks: arithmetic reasoning and code generation. Particularly, we introduce (i) a general ontology of perturbations for math and coding questions, (ii) a semi-automatic method to apply these perturbations, and (iii) two datasets, GSMore and HumanEval-Core, respectively, of perturbed math and coding problems to probe LLM capabilities in numeric reasoning and coding tasks. Through comprehensive evaluations of both closed-source and open-source LLMs, we show a significant performance drop across all the models against the perturbed questions, suggesting that the current LLMs lack robust problem solving skills and structured reasoning abilities in many areas, as defined by our ontology.
Exploiting Phonetics and Glyph Representation at Radical-level for Classical Chinese Understanding
Junyi Xiang
|
Maofu Liu
The diachronic gap between classical and modern Chinese arises from century-scale language evolution through cumulative changes in phonological, syntactic, and lexical systems, resulting in substantial semantic variation that poses significant challenges for the computational modeling of historical texts. Current methods always enhance classical Chinese understanding of pre-trained language models through corpus pre-training or semantic integration. However, they overlook the synergistic relationship between phonetic and glyph features within Chinese characters, which is a critical factor in deciphering characters’ semantics. In this paper, we propose RPGCM, a radical-level phonetics and glyph representation enhanced Chinese model, with powerful fine-grained semantic modeling capabilities. Our model establishes robust contextualized representations through: (1) rule-based radical decomposition and byte pair encoding (BPE)-based radical aggregation for structural pattern recognition, (2) phonetic-glyph semantic mapping, and (3) dynamic semantic fusion. Experimental results on the CCMRC, WYWEB, and C³Bench benchmarks demonstrate RPGCM’s superiority and validate that explicit radical-level modeling effectively mitigates semantic variations.
Tokens for Learning, Tokens for Unlearning: Mitigating Membership Inference Attacks in Large Language Models via Dual-Purpose Training
Toan Tran
|
Ruixuan Liu
|
Li Xiong
Large language models (LLMs) have become the backbone of modern natural language processing but pose privacy concerns about leaking sensitive training data. Membership inference attacks (MIAs), which aim to infer whether a sample is included in a model’s training dataset, can serve as a foundation for broader privacy threats. Existing defenses designed for traditional classification models do not account for the sequential nature of text data. As a result, they either require significant computational resources or fail to effectively mitigate privacy risks in LLMs. In this work, we propose DuoLearn, a lightweight yet effective empirical privacy defense for protecting training data of language models by leveraging token-specific characteristics. By analyzing token dynamics during training, we propose a token selection strategy that categorizes tokens into hard tokens for learning and memorized tokens for unlearning. Subsequently, our training-phase defense optimizes a novel dual-purpose token-level loss to achieve a Pareto-optimal balance between utility and privacy. Extensive experiments demonstrate that our approach not only provides strong protection against MIAs but also improves language modeling performance by around 10% across various LLM architectures and datasets compared to the baselines.
Verify with Caution: The Pitfalls of Relying on Imperfect Factuality Metrics
Ameya Godbole
|
Robin Jia
Improvements in large language models have led to increasing optimism that they can serve as reliable evaluators of natural language generation outputs. In this paper, we challenge this optimism with regard to factuality evaluation. We re-evaluate five state-of-the-art factuality metrics on a collection of 11 datasets for summarization, retrieval-augmented generation, and question answering. We find that these evaluators are inconsistent with each other and often misestimate the factual accuracy of NLG systems, both of which can lead to a variety of pitfalls. We further show that these metrics exhibit biases against highly paraphrased outputs and outputs that draw upon faraway parts of the source documents. We urge users of factuality metrics to proceed with caution and manually validate the reliability of these metrics in their domain of interest.
TabXEval: Why this is a Bad Table? An eXhaustive Rubric for Table Evaluation
Vihang Pancholi
|
Jainit Sushil Bafna
|
Tejas Anvekar
|
Manish Shrivastava
|
Vivek Gupta
Evaluating tables qualitatively and quantitatively poses a significant challenge, as standard metrics often overlook subtle structural and content-level discrepancies. To address this, we propose a rubric-based evaluation framework that integrates multi-level structural descriptors with fine-grained contextual signals, enabling more precise and consistent table comparison. Building on this, we introduce TabXEval, an eXhaustive and eXplainable two-phase evaluation framework. TabXEval first aligns reference and predicted tables structurally via TabAlign, then performs semantic and syntactic comparison using TabCompare, offering interpretable and granular feedback. We evaluate TabXEval on TabXBench, a diverse, multi-domain benchmark featuring realistic table perturbations and human annotations. A sensitivity-specificity analysis further demonstrates the robustness and explainability of TabXEval across varied table tasks. Code and data are available at https://corallab-asu.github.io/tabxeval/.
LADDER: Language-Driven Slice Discovery and Error Rectification in Vision Classifiers
Shantanu Ghosh
|
Rayan Syed
|
Chenyu Wang
|
Vaibhav Choudhary
|
Binxu Li
|
Clare B Poynton
|
Shyam Visweswaran
|
Kayhan Batmanghelich
Slice discovery refers to identifying systematic biases in the mistakes of pre-trained vision models. Current slice discovery methods in computer vision rely on converting input images into sets of attributes and then testing hypotheses about configurations of these pre-computed attributes associated with elevated error patterns. However, such methods face several limitations: 1) they are restricted by the predefined attribute bank; 2) they lack the common sense reasoning and domain-specific knowledge often required for specialized fields such as radiology; 3) at best, they can only identify biases in image attributes while overlooking those introduced during preprocessing or data preparation. We hypothesize that bias-inducing variables leave traces in the form of language (logs), which can be captured as unstructured text. Thus, we introduce LADDER, which leverages the reasoning capabilities and latent domain knowledge of Large Language Models (LLMs) to generate hypotheses about these mistakes. Specifically, we project the internal activations of a pre-trained model into text using a retrieval approach and prompt the LLM to propose potential bias hypotheses. To detect biases from preprocessing pipelines, we convert the preprocessing data into text and prompt the LLM. Finally, LADDER generates pseudo-labels for each identified bias, thereby mitigating all biases without requiring expensive attribute annotations. Rigorous evaluations on 3 natural and 3 medical imaging datasets, 200+ classifiers, and 4 LLMs with varied architectures and pretraining strategies demonstrate that LADDER consistently outperforms current methods. Code is available: https://github.com/batmanlab/Ladder.
GSQ-Tuning: Group-Shared Exponents Integer in Fully Quantized Training for LLMs On-Device Fine-tuning
Sifan Zhou
|
Shuo Wang
|
Zhihang Yuan
|
Mingjia Shi
|
Yuzhang Shang
|
Dawei Yang
Large Language Model (LLM) fine-tuning techniques have achieved remarkable results. However, traditional LLM fine-tuning approaches face significant challenges: they require large Floating Point (FP) computation, raising privacy concerns when handling sensitive data, and are impractical for resource-constrained edge devices. While Parameter-Efficient Fine-Tuning (PEFT) techniques reduce trainable parameters, their reliance on floating-point arithmetic creates fundamental incompatibilities with edge hardware. In this work, we introduce a novel framework for on-device LLM fine-tuning that eliminates the need for floating-point operations in both inference and training, named GSQ-Tuning. At its core is the Group-Shared Exponents Integer format, which efficiently represents model parameters in integer format using shared exponents among parameter groups. When combined with LoRA-like adapters, this enables fully integer-based fine-tuning that is both memory and compute efficient. We demonstrate that our approach achieves accuracy comparable to FP16-based fine-tuning while significantly reducing memory usage (50%). Moreover, compared to FP8, at comparable performance levels, our method can reduce power consumption by 5x and chip area by 11x, making large-scale model adaptation feasible on edge devices.
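As a rough illustration of the shared-exponent idea, the sketch below quantizes one parameter group to signed integers that share a single power-of-two exponent. The function names, the 8-bit width, and the rounding rule are assumptions made for illustration; the abstract does not specify the actual Group-Shared Exponents Integer layout, so this should be read as a generic block-exponent scheme rather than the paper's format.

```python
import math


def quantize_group(values, bits=8):
    """Quantize a group of floats to signed ints with one shared exponent.

    Returns (ints, exponent) such that each value is approximated by
    ints[i] * 2**exponent. The layout is a hypothetical illustration.
    """
    qmax = 2 ** (bits - 1) - 1
    max_abs = max(abs(v) for v in values)
    if max_abs == 0:
        return [0] * len(values), 0
    # Smallest power-of-two exponent whose scale keeps max_abs within qmax.
    exponent = math.ceil(math.log2(max_abs / qmax))
    scale = 2.0 ** exponent
    # Clamp guards against floating-point edge cases in the exponent choice.
    ints = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return ints, exponent


def dequantize_group(ints, exponent):
    """Reconstruct approximate floats from shared-exponent integers."""
    scale = 2.0 ** exponent
    return [i * scale for i in ints]
```

Because the scale is a power of two, rescaling reduces to bit shifts on integer hardware, and only one small exponent needs to be stored per group instead of one per parameter.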
pdf
bib
abs
Evaluation of LLMs in Medical Text Summarization: The Role of Vocabulary Adaptation in High OOV Settings
Gunjan Balde
|
Soumyadeep Roy
|
Mainack Mondal
|
Niloy Ganguly
Large Language Models (LLMs) recently achieved great success in medical text summarization by simply using in-context learning. However, these recent efforts do not perform fine-grained evaluations under difficult settings where LLMs might fail. They typically report performance scores over the entire dataset. Through our benchmarking study, we show that LLMs show a significant performance drop for data points with a high concentration of out-of-vocabulary (OOV) words or with high novelty. Vocabulary adaptation is an intuitive solution to this vocabulary mismatch issue where the LLM vocabulary gets updated with certain expert domain (here, medical) words or subwords. An interesting finding from our study is that Llama-3.1, even with a vocabulary size of around 128K tokens, still faces an _over-fragmentation_ issue with medical words. To that end, we show vocabulary adaptation helps improve the LLM summarization performance even in difficult settings. Through extensive experimentation with multiple vocabulary adaptation strategies, two continual pretraining strategies, and three benchmark medical summarization datasets, we gain valuable insights into the role of vocabulary adaptation strategies for customizing LLMs to the medical domain. We also performed a human evaluation study with medical experts where they found that vocabulary adaptation results in more relevant and faithful summaries. Our codebase is made publicly available at https://github.com/gb-kgp/LLM-MedicalSummarization-Benchmark.
pdf
bib
abs
UniT: One Document, Many Revisions, Too Many Edit Intention Taxonomies
Fangping Lan
|
Abdullah Aljebreen
|
Eduard Dragut
Writing is inherently iterative, each revision enhancing information representation. One revision may contain many edits. Examination of the intentions behind edits provides valuable insights into an editor’s expertise, the dynamics of collaborative writing, and the evolution of a document. Current research on edit intentions lacks a comprehensive edit intention taxonomy (EIT) that spans multiple application domains. As a result, researchers often create new EITs tailored to specific needs, a process that is both time-consuming and costly. To address this gap, we propose UniT, a Unified edit intention Taxonomy that integrates existing EITs encompassing a wide range of edit intentions. We examine the lineage relationship and the construction of 24 EITs. They together have 232 categories across various domains. During the literature survey and integration process, we identify challenges such as one-to-many category matches, incomplete definitions, and varying hierarchical structures. We propose solutions for resolving these issues. Finally, our evaluation shows that our UniT achieves higher inter-annotator agreement scores compared to existing EITs and is applicable to a large set of application domains.
pdf
bib
abs
Predicting Depression in Screening Interviews from Interactive Multi-Theme Collaboration
Xianbing Zhao
|
Yiqing Lyu
|
Di Wang
|
Buzhou Tang
Automatic depression detection provides cues for early clinical intervention by clinicians. Clinical interviews for depression detection involve dialogues centered around multiple themes. Existing studies primarily design end-to-end neural network models to capture the hierarchical structure of clinical interview dialogues. However, these methods exhibit defects in modeling the thematic content of clinical interviews: 1) they fail to explicitly capture intra-theme and inter-theme correlation, and 2) they do not allow clinicians to intervene and focus on themes of interest. To address these issues, this paper introduces an interactive depression detection framework, namely Predicting Depression in Screening Interviews from Interactive Multi-Theme Collaboration (PDIMC). PDIMC leverages in-context learning techniques to identify themes in clinical interviews and then models both intra-theme and inter-theme correlation. Additionally, it employs AI-driven feedback to simulate the interests of clinicians, enabling interactive adjustment of theme importance. PDIMC achieves absolute improvements of 12% on Recall and 35% on F1-dep. metrics, compared to the previous state-of-the-art model on the depression detection dataset DAIC-WOZ, which demonstrates the effectiveness of capturing theme correlation and incorporating interactive external feedback.
pdf
bib
abs
Your Language Model May Think Too Rigidly: Achieving Reasoning Consistency with Symmetry-Enhanced Training
Yihang Yao
|
Zhepeng Cen
|
Miao Li
|
William Han
|
Yuyou Zhang
|
Emerson Liu
|
Zuxin Liu
|
Chuang Gan
|
Ding Zhao
Large Language Models (LLMs) have demonstrated strong reasoning capabilities across various tasks. However, even minor variations in query phrasing, despite preserving the underlying semantic meaning, can significantly affect their performance. To address this, we focus on enhancing LLMs’ awareness of symmetry in query variations and propose syMmetry-ENhanceD (MEND) data augmentation, a data-centric approach that improves the model’s ability to extract useful information from context. Unlike existing methods that emphasize reasoning chain augmentation, our approach improves model robustness at the knowledge extraction stage through query augmentation, enabling more data-efficient training and stronger generalization to Out-of-Distribution (OOD) settings. Extensive experiments on both logical and arithmetic reasoning tasks show that MEND enhances reasoning performance across diverse query variations, providing new insights into improving LLM robustness through structured dataset curation.
pdf
bib
abs
TritonBench: Benchmarking Large Language Model Capabilities for Generating Triton Operators
Jianling Li
|
ShangZhan Li
|
Zhenye Gao
|
Qi Shi
|
Yuxuan Li
|
Zefan Wang
|
Jiacheng Huang
|
WangHaojie WangHaojie
|
Jianrong Wang
|
Xu Han
|
Zhiyuan Liu
|
Maosong Sun
Triton, a high-level Python-like language designed for building efficient GPU kernels, is widely adopted in deep learning frameworks due to its portability, flexibility, and accessibility. However, programming and parallel optimization still require considerable trial and error from Triton developers. Despite advances in large language models (LLMs) for conventional code generation, these models struggle to generate accurate, performance-optimized Triton code, as they lack awareness of its specifications and the complexities of GPU programming. More critically, there is an urgent need for systematic evaluations tailored to Triton. In this work, we introduce TritonBench, the first comprehensive benchmark for Triton operator generation. TritonBench features two evaluation channels: a curated set of 184 real-world operators from GitHub and a collection of operators aligned with PyTorch interfaces. Unlike conventional code benchmarks prioritizing functional correctness, TritonBench also profiles efficiency performance on widely deployed GPUs aligned with industry applications. Our study reveals that current state-of-the-art code LLMs struggle to generate efficient Triton operators, highlighting a significant gap in high-performance code generation.
pdf
bib
abs
Just KIDDIN’ : Knowledge Infusion and Distillation for Detection of INdecent Memes
Rahul Garg
|
Trilok Padhi
|
Hemang Jain
|
Ugur Kursuncu
|
Ponnurangam Kumaraguru
Detecting toxicity in online multimodal environments, such as memes, remains a challenging task due to the complex contextual connections across modalities (e.g., text and visual), which demand both common-sense reasoning and contextual awareness. To bridge this gap, we propose a hybrid neurosymbolic framework that unifies (1) distillation of implicit contextual knowledge (e.g., sarcasm, cultural references) from Large Vision-Language Models (LVLMs) and (2) infusion of explicit relational semantics through sub-graphs from Knowledge Graphs (KGs). Experimental results on two benchmark datasets show the superior performance of our approach, Knowledge-Infused Distilled Vision-Language Model (KID-VLM), over the state-of-the-art baselines across AUC and F1, with improvements of 0.5% and 10.6%, respectively, in the HatefulMemes Benchmark across variants. Further, KID-VLM demonstrates better generalizability and achieves the best performance across all baselines in the HarMeme Dataset, with improvements of 6.3% and 3.2% in F1 and AUC, respectively. Given the contextual complexity of toxicity detection, KID-VLM showcases the significance of learning compact models (~500M parameters) from both explicit (i.e., KG) and implicit (i.e., LVLMs) contextual cues incorporated through a hybrid neurosymbolic approach. Our codes and pretrained models are publicly available.
pdf
bib
abs
Dynamic Personality in LLM Agents: A Framework for Evolutionary Modeling and Behavioral Analysis in the Prisoner’s Dilemma
Weiqi Zeng
|
Bo Wang
|
Dongming Zhao
|
Zongfeng Qu
|
Ruifang He
|
Yuexian Hou
|
Qinghua Hu
Using Large Language Model agents to simulate human game behaviors offers valuable insights for human social psychology in anthropomorphic AI research. While current models rely on static personality traits, real-world evidence shows personality evolves through environmental feedback. Recent work introduced dynamic personality traits but lacked natural selection processes and direct psychological metrics, failing to accurately capture authentic dynamic personality variations. To address these limitations, we propose an enhanced framework within the Prisoner’s Dilemma, a socially significant scenario. By using game payoffs as environmental feedback, we drive adaptive personality evolution and analyze correlations between personality metrics and behavior. Our framework reveals new behavioral patterns of agents and evaluates personality-behavior relationships, advancing agent-based social simulations and human-AI symbiosis research.
pdf
bib
abs
Building A Proof-Oriented Programmer That Is 64% Better Than GPT-4o Under Data Scarcity
Dylan Zhang
|
Justin Wang
|
Tianran Sun
Existing LMs struggle with proof-oriented programming due to data scarcity, which manifests in two key ways: (1) a lack of sufficient corpora for proof-oriented programming languages such as F*, and (2) the absence of large-scale, project-level proof-oriented implementations that can teach the model the intricate reasoning process when performing proof-oriented programming. We present the first work on synthetic data augmentation for project-level proof-oriented programming for both generation and repair. Our method addresses data scarcity by synthesizing basic proof-oriented programming problems for proficiency in that language, incorporating diverse coding data for reasoning capability elicitation, and creating new proofs and repair data within existing repositories. This approach enables language models to both synthesize and repair proofs for function- and repository-level code. We show that our fine-tuned 14B parameter model, PoPilot, can exceed the performance of the models that outperform GPT-4o in project-level proof-oriented programming by a 64% relative margin, and can improve GPT-4o’s performance by 54% by repairing its outputs over GPT-4o’s self-repair.
pdf
bib
abs
On the Robust Approximation of ASR Metrics
Abdul Waheed
|
Hanin Atwany
|
Rita Singh
|
Bhiksha Raj
Recent advances in speech foundation models are largely driven by scaling both model size and data, enabling them to perform a wide range of tasks, including speech recognition. Traditionally, ASR models are evaluated using metrics like Word Error Rate (WER) and Character Error Rate (CER), which depend on ground truth labels. As a result of limited labeled data from diverse domains and testing conditions, the true generalization capabilities of these models beyond standard benchmarks remain unclear. Moreover, labeling data is both costly and time-consuming. To address this, we propose a novel label-free approach for approximating ASR performance metrics, eliminating the need for ground truth labels. Our method utilizes multimodal embeddings in a unified space for speech and transcription representations, combined with a high-quality proxy model to compute proxy metrics. These features are used to train a regression model to predict key ASR metrics like Word Error Rate (WER) and Character Error Rate (CER). We experiment with over 40 models across 14 datasets representing both standard and in-the-wild testing conditions. Our results show that we approximate the metrics within a single-digit absolute difference across all experimental configurations, outperforming the most recent baseline by more than 50%.
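For reference, the label-dependent metrics this abstract sets out to approximate are conventionally defined as an edit distance divided by the reference length. The sketch below shows that standard definition in Python; it is not the paper's label-free estimator, and the function names are illustrative only.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (substitution, insertion, deletion)."""
    prev = list(range(len(hyp) + 1))  # distance from empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # deleting i reference items to reach an empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]

def wer(reference, hypothesis):
    """Word Error Rate: word-level edit distance over reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)

def cer(reference, hypothesis):
    """Character Error Rate: character-level edit distance over reference length."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return edit_distance(list(reference), list(hypothesis)) / len(reference)
```

Both functions require a ground-truth `reference`, which is exactly the dependency a label-free approximation aims to remove.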
pdf
bib
abs
Are the Values of LLMs Structurally Aligned with Humans? A Causal Perspective
Yipeng Kang
|
Junqi Wang
|
Yexin Li
|
Mengmeng Wang
|
Wenming Tu
|
Quansen Wang
|
Hengli Li
|
Tingjun Wu
|
Xue Feng
|
Fangwei Zhong
|
Zilong Zheng
As large language models (LLMs) become increasingly integrated into critical applications, aligning their behavior with human values presents significant challenges. Current methods, such as Reinforcement Learning from Human Feedback (RLHF), typically focus on a limited set of coarse-grained values and are resource-intensive. Moreover, the correlations between these values remain implicit, leading to unclear explanations for value-steering outcomes. Our work argues that a latent causal value graph underlies the value dimensions of LLMs and that, despite alignment training, this structure remains significantly different from human value systems. We leverage these causal value graphs to guide two lightweight value-steering methods: role-based prompting and sparse autoencoder (SAE) steering, effectively mitigating unexpected side effects. Furthermore, SAE provides a more fine-grained approach to value steering. Experiments on Gemma-2B-IT and Llama3-8B-IT demonstrate the effectiveness and controllability of our methods.
pdf
bib
abs
LLMs Can Also Do Well! Breaking Barriers in Semantic Role Labeling via Large Language Models
Xinxin Li
|
Huiyao Chen
|
Chengjun Liu
|
Jing Li
|
Meishan Zhang
|
Jun Yu
|
Min Zhang
Semantic role labeling (SRL) is a crucial task of natural language processing (NLP). Although generative decoder-based large language models (LLMs) have achieved remarkable success across various NLP tasks, they still lag behind state-of-the-art encoder-decoder (BERT-like) models in SRL. In this work, we seek to bridge this gap by equipping LLMs for SRL with two mechanisms: (a) retrieval-augmented generation and (b) self-correction. The first mechanism enables LLMs to leverage external linguistic knowledge such as predicate and argument structure descriptions, while the second allows LLMs to identify and correct inconsistent SRL outputs. We conduct extensive experiments on three widely-used benchmarks of SRL (CPB1.0, CoNLL-2009, and CoNLL-2012). Results demonstrate that our method achieves state-of-the-art performance in both Chinese and English, marking the first successful application of LLMs to surpass encoder-decoder approaches in SRL.
pdf
bib
abs
Lost in Transcription, Found in Distribution Shift: Demystifying Hallucination in Speech Foundation Models
Hanin Atwany
|
Abdul Waheed
|
Rita Singh
|
Monojit Choudhury
|
Bhiksha Raj
Speech foundation models trained at a massive scale, both in terms of model and data size, result in robust systems capable of performing multiple speech tasks, including automatic speech recognition (ASR). These models transcend language and domain barriers, yet effectively measuring their performance remains a challenge. Traditional metrics like word error rate (WER) and character error rate (CER) are commonly used to evaluate ASR performance but often fail to reflect transcription quality in critical contexts, particularly when detecting fabricated outputs. This phenomenon, known as hallucination, is especially concerning in high-stakes domains such as healthcare, legal, and aviation, where errors can have severe consequences. In our work, we address this gap by investigating hallucination in ASR models. We examine how factors such as distribution shifts, model size, and model architecture influence the hallucination error rate (HER), a metric we introduce to quantify hallucinations. Our analysis of over 20 ASR models reveals key insights: (1) High WERs can mask low hallucination rates, while low WERs may conceal dangerous hallucinations. (2) Synthetic noise, both adversarial and common perturbations like white noise, pitch shift, and time stretching, increases HER. (3) Distribution shift correlates strongly with HER (𝛼 = 0.91). Our findings highlight the importance of incorporating HER alongside traditional metrics like WER to better assess ASR model performance, particularly in high-stakes domains.
pdf
bib
abs
M2PA: A Multi-Memory Planning Agent for Open Worlds Inspired by Cognitive Theory
YanfangZhou YanfangZhou
|
Xiaodong Li
|
Yuntao Liu
|
Yongqiang Zhao
|
Xintong Wang
|
Zhenyu Li
|
Jinlong Tian
|
Xinhai Xu
Open-world planning poses a significant challenge for general artificial intelligence due to environmental complexity and task diversity, especially in long-term tasks and lifelong learning. Inspired by cognitive theories, we propose M2PA, an open-world multi-memory planning agent. M2PA innovates by combining Large Language Models (LLMs) with human-like multi-memory systems, aiming to fully leverage the strengths of both while mitigating their respective limitations. By integrating the expansive world knowledge and language processing capabilities of LLMs with the perception and experience accumulation abilities of the human memory system, M2PA exhibits situation awareness and experience generalization capabilities, as well as the potential for lifelong learning. In experiments, M2PA significantly outperforms current state-of-the-art agents across 50 Minecraft tasks in zero-shot learning. In exploratory lifelong learning experiments, M2PA demonstrates its continuous learning ability, achieving a 38.33% success rate in the “ObtainDiamond” task. Our findings provide a novel paradigm for constructing more effective agents in open-world environments.
pdf
bib
abs
AnnaAgent: Dynamic Evolution Agent System with Multi-Session Memory for Realistic Seeker Simulation
Ming Wang
|
Peidong Wang
|
Lin Wu
|
Xiaocui Yang
|
Daling Wang
|
Shi Feng
|
Yuxin Chen
|
Bixuan Wang
|
Yifei Zhang
Constrained by the cost and ethical concerns of involving real seekers in AI-driven mental health, researchers develop LLM-based conversational agents (CAs) with tailored configurations, such as profiles, symptoms, and scenarios, to simulate seekers. While these efforts advance AI in mental health, achieving more realistic seeker simulation remains hindered by two key challenges: dynamic evolution and multi-session memory. Seekers’ mental states often fluctuate during counseling, which typically spans multiple sessions. To address this, we propose **AnnaAgent**, an emotional and cognitive dynamic agent system equipped with tertiary memory. AnnaAgent incorporates an emotion modulator and a complaint elicitor trained on real counseling dialogues, enabling dynamic control of the simulator’s configurations. Additionally, its tertiary memory mechanism effectively integrates short-term and long-term memory across sessions. Evaluation results, both automated and manual, demonstrate that AnnaAgent achieves more realistic seeker simulation in psychological counseling compared to existing baselines. The ethically reviewed and screened code can be found at https://github.com/sci-m-wang/AnnaAgent.
pdf
bib
abs
Diversification Catalyzes Language Models’ Instruction Generalization To Unseen Semantics
Dylan Zhang
|
Justin Wang
|
Francois Charton
Instruction-tuned language models excel in knowledge, reasoning, and instruction-following. While knowledge and reasoning are well-explored, the factors enabling generalization to unseen instructions remain underexplored due to challenges in isolating instruction-following dynamics. In this work, we model instruction-following as a computational process and design controlled experiments inspired by the Turing-complete Markov algorithm to disentangle its dynamics. Our findings reveal that the ability to generalize to instructions with unseen semantics emerges only when training data is strategically diversified across rich semantics. This finding gives us the hammer that breaks down the wall separating training instructions from unseen ones encountered in the wild. For specialist models, a balanced mix of in-domain and diverse out-of-domain tasks enhances performance more effectively than simply increasing in-domain data. For generalist models, domain diversification consistently outweighs the costs of reduced task-specific data, regardless of data budgets. Furthermore, we show that proper diversification with a lower data budget can outperform simply scaling up data volume. These findings highlight strategic data diversification as key to optimizing instruction-following and improving model performance across applications.
pdf
bib
abs
DecompileBench: A Comprehensive Benchmark for Evaluating Decompilers in Real-World Scenarios
Zeyu Gao
|
Yuxin Cui
|
Hao Wang
|
Siliang Qin
|
Yuanda Wang
|
Zhang Bolun
|
Chao Zhang
Decompilers are fundamental tools for critical security tasks, from vulnerability discovery to malware analysis, yet their evaluation remains fragmented. Existing approaches primarily focus on syntactic correctness through synthetic micro-benchmarks or subjective human ratings, failing to address real-world requirements for semantic fidelity and analyst usability. We present **DecompileBench**, the first comprehensive framework that enables effective evaluation of decompilers in reverse engineering workflows through three key components: real-world function extraction (comprising 23,400 functions from 130 real-world programs), runtime-aware validation, and automated human-centric assessment using LLM-as-Judge to quantify the effectiveness of decompilers in reverse engineering workflows. Through a systematic comparison between six industrial-strength decompilers and six recent LLM-powered approaches, we demonstrate that LLM-based methods surpass commercial tools in code understandability despite 52.2% lower functionality correctness. These findings highlight the potential of LLM-based approaches to transform human-centric reverse engineering. We open source **DecompileBench** to provide a framework to advance research on decompilers and assist security experts in making informed tool selections based on their specific requirements.
pdf
bib
abs
Thinking Before Running! Efficient Code Generation with Thorough Exploration and Optimal Refinement
Xiaoqing Zhang
|
Yuhan Liu
|
Flood Sung
|
Xiuying Chen
|
Shuo Shang
|
Rui Yan
Code generation is crucial in software engineering for automating the coding process efficiently. While test-time computation methods show promise, they suffer from high latency due to multiple computation rounds. To overcome this, we introduce ThinkCoder, a framework that combines thorough exploration with optimal refinement. The exploration phase diversifies the solution space by searching for potential solutions, followed by a refinement phase that enhances precision. This approach allows us to select the best solution through careful consideration before taking action, avoiding excessive trial and error. To further minimize test-time computation overhead, we introduce preference-driven optimization with Reinforced Self-Training (ReST), which uses exploration trajectories from ThinkCoder to guide LLM’s evolution. This approach enhances LLM’s exploration efficiency via preference learning, cutting costs while maintaining accuracy. ThinkCoder boosts the performance with a single LLM, excelling on benchmarks like HumanEval and MBPP. Compared to SOTA models, it improves Pass@1 by 3.0% over MapCoder with just 6.4% of the computation cost. Against AgentCoder, ThinkCoder achieves a 0.5% higher Pass@1 after 2 rounds, outperforming AgentCoder’s 5 rounds. Additionally, ReST with success trajectories enhances efficiency, allowing models like LLaMA2-7B to achieve competitive results using only 20% of the computational resources. These results highlight the framework’s effectiveness and scalability.
pdf
bib
abs
Edit Once, Update Everywhere: A Simple Framework for Cross-Lingual Knowledge Synchronization in LLMs
Yuchen Wu
|
Liang Ding
|
Li Shen
|
Dacheng Tao
Knowledge editing allows for efficient adaptation of large language models (LLMs) to new information or corrections without requiring full retraining. However, prior methods typically focus on either single-language editing or basic multilingual editing, failing to achieve true cross-linguistic knowledge synchronization. To address this, we present a simple and practical state-of-the-art (SOTA) recipe, Cross-Lingual Knowledge Democracy Edit (X-KDE), designed to propagate knowledge from a dominant language to other languages effectively. Our X-KDE comprises two stages: (i) Cross-lingual Edition Instruction Tuning (XE-IT), which fine-tunes the model on a curated parallel dataset to modify in-scope knowledge while preserving unrelated information, and (ii) Target-language Preference Optimization (TL-PO), which applies advanced optimization techniques to ensure consistency across languages, fostering the transfer of updates. Additionally, we contribute a high-quality, cross-lingual dataset, specifically designed to enhance knowledge transfer across languages. Extensive experiments on the Bi-ZsRE and MzsRE benchmarks show that X-KDE significantly enhances cross-lingual performance, achieving an average improvement of +8.19%, while maintaining high accuracy in monolingual settings.
pdf
bib
abs
SafeChain: Safety of Language Models with Long Chain-of-Thought Reasoning Capabilities
Fengqing Jiang
|
Zhangchen Xu
|
Yuetai Li
|
Luyao Niu
|
Zhen Xiang
|
Bo Li
|
Bill Yuchen Lin
|
Radha Poovendran
Emerging large reasoning models (LRMs), such as DeepSeek-R1 models, leverage long chain-of-thought (CoT) reasoning to generate structured intermediate steps, enhancing their reasoning capabilities. However, long CoT does not inherently guarantee safe outputs, potentially leading to harmful consequences such as the introduction of security vulnerabilities in code or the spread of misinformation. Current research on large language model (LLM) safety usually focuses on short-answer responses, overlooking the long CoT style outputs of LRMs. To bridge this gap, we conduct a systematic study of LRM safety. First, we investigate safety evaluators calibrated against human annotations. Using our newly developed metrics, we thoroughly assess the safety of 13 state-of-the-art LRMs on the StrongReject and WildJailbreak datasets. Our results show that LRMs are not safe compared to their reasoning advance. Further, we perform a fine-grained analysis of the reasoning trace and final answer. We find that three decoding strategies (ZeroThink, LessThink, and MoreThink) can improve model safety without additional training. However, these strategies either use constrained reasoning traces or incur high inference costs. To better strengthen LRM safety, we introduce SafeChain, the first-of-its-kind safety training dataset in CoT style. We fine-tune two LRMs with SafeChain, showing that it not only enhances model safety but also preserves performance across 6 reasoning benchmarks.
pdf
bib
abs
ETRQA: A Comprehensive Benchmark for Evaluating Event Temporal Reasoning Abilities of Large Language Models
Sigang Luo
|
Yinan Liu
|
Dongying Lin
|
Yingying Zhai
|
Bin Wang
|
Xiaochun Yang
|
Junpeng Liu
Event temporal reasoning (ETR) aims to model and reason about the relationships between events and time, as well as between events in the real world. Proficiency in ETR is a significant indicator that a large language model (LLM) truly understands the physical world. Previous question-answering datasets available for evaluating the ETR ability lack a systematic taxonomy and pay limited attention to compound questions. In this paper, we propose a unified taxonomy for event temporal questions and construct a comprehensive benchmark ETRQA, to evaluate the ETR abilities of LLMs based on this taxonomy. ETRQA not only inherits and expands the evaluation content of existing datasets but also contains multiple categories of compound questions. We evaluate two leading LLM series, Llama and Qwen, on ETRQA across various settings. Our experimental results indicate that large-scale LLMs exhibit certain ETR abilities. Yet they do not perform well in solving specific types of reasoning tasks, including reasoning involving time spans, reasoning for compound questions, and reasoning with fine temporal granularity. Additionally, we hope ETRQA can benefit the temporal reasoning research community for future studies.
pdf
bib
abs
The Law of Knowledge Overshadowing: Towards Understanding, Predicting and Preventing LLM Hallucination
Yuji Zhang
|
Sha Li
|
Cheng Qian
|
Jiateng Liu
|
Pengfei Yu
|
Chi Han
|
Yi R. Fung
|
Kathleen McKeown
|
ChengXiang Zhai
|
Manling Li
|
Heng Ji
Hallucination is a persistent challenge in large language models (LLMs), where even with rigorous quality control, models often generate distorted facts. This paradox, in which error generation continues despite high-quality training data, calls for a deeper understanding of the underlying LLM mechanisms. To address it, we propose a novel concept: knowledge overshadowing, where a model’s dominant knowledge can obscure less prominent knowledge during text generation, causing the model to fabricate inaccurate details. Building on this idea, we introduce a novel framework to quantify factual hallucinations by modeling knowledge overshadowing. Central to our approach is the log-linear law, which predicts that the rate of factual hallucination increases linearly with the logarithmic scale of (1) Knowledge Popularity, (2) Knowledge Length, and (3) Model Size. The law provides a means to preemptively quantify hallucinations, offering foresight into their occurrence even before model training or inference. Building on the overshadowing effect, we propose a new decoding strategy, CoDa, to mitigate hallucinations, which notably enhances model factuality on Overshadow (27.9%), MemoTrap (13.1%) and NQ-Swap (18.3%). Our findings not only deepen understanding of the underlying mechanisms behind hallucinations but also provide actionable insights for developing more predictable and controllable language models.
pdf
bib
abs
LegoMT2: Selective Asynchronous Sharded Data Parallel Training for Massive Neural Machine Translation
Fei Yuan
|
Yinquan Lu
|
Lei Li
|
Jingjing Xu
It is a critical challenge to learn a single model for massive languages. Prior methods focus on increasing the model size and training data size. However, large models are difficult to optimize efficiently even with distributed parallel training, and translation capacity can interfere among languages. To address the challenge, we propose LegoMT2, an efficient training approach with an asymmetric multi-way model architecture for massive multilingual neural machine translation. LegoMT2 shards 435 languages into 8 language-centric groups and assigns one local encoder to each group’s languages and a mix encoder-decoder to all languages. LegoMT2 trains the model through local data parallel and asynchronous distributed updating of parameters. LegoMT2 is 16.2× faster than the distributed training method for M2M-100-12B (which covers only 100 languages) while improving the translation performance by an average of 2.2 BLEU on Flores-101, performing especially well for low-resource languages.
pdf
bib
abs
Pruning General Large Language Models into Customized Expert Models
Yiran Zhao
|
Guizhen Chen
|
Kenji Kawaguchi
|
Lidong Bing
|
Wenxuan Zhang
Large Language Models (LLMs) have transformed natural language processing, yet their substantial model sizes often demand significant computational resources. To preserve computing resources and accelerate inference speed, it is crucial to prune redundant parameters, especially for experienced users who often need expert models tailored to specific downstream scenarios. However, current pruning methods primarily focus on maintaining models’ general capabilities, either requiring extensive post-training or performing poorly due to coarse-grained pruning. In this work, we design a Custom Pruning method (Cus-Prun) to prune a large general model into a smaller lightweight expert model, which is positioned along the “language”, “domain” and “task” dimensions. By identifying and pruning irrelevant neurons of each dimension, Cus-Prun creates expert models without any post-training. Our experiments demonstrate that Cus-Prun consistently outperforms other methods, achieving minimal loss in both expert and general capabilities across various models from different model families and sizes.
pdf
bib
abs
Enhance Multimodal Consistency and Coherence for Text-Image Plan Generation
Xiaoxin Lu
|
Ranran Haoran Zhang
|
Yusen Zhang
|
Rui Zhang
People learn about daily task plans through diverse media involving both texts and images. However, most prior research focuses only on LLMs’ capability of textual plan generation. The potential of large-scale models in providing text-image plans remains understudied. Generating high-quality text-image plans faces two main challenges: ensuring consistent alignment between the two modalities and keeping coherence among visual steps. To address these challenges, we propose a novel framework that generates and refines text-image plans step-by-step. At each iteration, our framework (1) drafts the next textual step based on the prediction history; (2) edits the last visual step to obtain the next one; (3) extracts PDDL-like visual information; and (4) refines the draft with the extracted visual information. The textual and visual steps produced in stages (4) and (2), respectively, then serve as inputs for the next iteration. Our approach offers a plug-and-play improvement to various backbone models, such as Mistral-7B, Gemini-1.5, and GPT-4o. To evaluate the effectiveness of our approach, we collect a new benchmark consisting of 1,100 tasks and their text-image pair solutions covering 11 daily topics. We also design and validate a new set of metrics to evaluate the multimodal consistency and coherence in text-image plans. Extensive experimental results show the effectiveness of our approach on a range of backbone models against competitive baselines.
pdf
bib
abs
Un-considering Contextual Information: Assessing LLMs’ Understanding of Indexical Elements
Metehan Oğuz
|
Yavuz Faruk Bakman
|
Duygu Nur Yaldiz
Large Language Models (LLMs) have demonstrated impressive performance in tasks related to coreference resolution. However, previous studies mostly assessed LLM performance on coreference resolution with nouns and third person pronouns. This study evaluates LLM performance on coreference resolution with indexicals such as I, you, here, and tomorrow, which come with unique challenges due to their linguistic properties. We present the first study examining how LLMs interpret indexicals in English, releasing the English Indexical Dataset with 1600 multiple-choice questions. We evaluate pioneering LLMs, including GPT-4o, Claude 3.5 Sonnet, Gemini 1.5 Pro, and DeepSeek V3. Our results reveal that LLMs exhibit impressive performance with some indexicals (I), while struggling with others (you, here, tomorrow), and that syntactic cues (e.g., quotation) contribute to LLM performance with some indexicals, while they reduce performance with others. Code and data are available at: https://github.com/metehanoguzz/LLMs-Indexicals-English
pdf
bib
abs
Behavioral Analysis of Information Salience in Large Language Models
Jan Trienes
|
Jörg Schlötterer
|
Junyi Jessy Li
|
Christin Seifert
Large Language Models (LLMs) excel at text summarization, a task that requires models to select content based on its importance. However, the exact notion of salience that LLMs have internalized remains unclear. To bridge this gap, we introduce an explainable framework to systematically derive and investigate information salience in LLMs through their summarization behavior. Using length-controlled summarization as a behavioral probe into the content selection process, and tracing the answerability of Questions Under Discussion throughout, we derive a proxy for how models prioritize information. Our experiments on 13 models across four datasets reveal that LLMs have a nuanced, hierarchical notion of salience, generally consistent across model families and sizes. While models show highly consistent behavior and hence salience patterns, this notion of salience cannot be accessed through introspection, and only weakly correlates with human perceptions of information salience.
pdf
bib
abs
The Behavior Gap: Evaluating Zero-shot LLM Agents in Complex Task-Oriented Dialogs
Avinash Baidya
|
Kamalika Das
|
Xiang Gao
Large Language Model (LLM)-based agents have significantly impacted Task-Oriented Dialog Systems (TODS) but continue to face notable performance challenges, especially in zero-shot scenarios. While prior work has noted this performance gap, the behavioral factors driving it remain under-explored. This study proposes a comprehensive evaluation framework to quantify the behavior gap between AI agents and human experts, focusing on discrepancies in dialog acts, tool usage, and knowledge utilization. Our findings reveal that this behavior gap is a critical factor negatively impacting the performance of LLM agents. Notably, as task complexity increases, the behavior gap widens (correlation: 0.963), leading to a degradation of agent performance on complex task-oriented dialogs. For the most complex task in our study, even the GPT-4o-based agent exhibits low alignment with human behavior, with a low F1 score for dialog acts (0.464), excessive and often misaligned tool usage with an F1 score of 0.139, and ineffective usage of external knowledge. Reducing such behavior gaps leads to significant performance improvement (24.3% on average). This study highlights the importance of comprehensive behavioral evaluations and improved alignment strategies to enhance the effectiveness of LLM-based TODS in handling complex tasks.
pdf
bib
abs
Task Facet Learning: A Structured Approach To Prompt Optimization
Gurusha Juneja
|
Gautam Jajoo
|
Hua Li
|
Jian Jiao
|
Nagarajan Natarajan
|
Amit Sharma
Given a task in the form of a basic description and its training examples, prompt optimization is the problem of synthesizing the given information into a text prompt for a large language model. Humans solve this problem by also considering the different facets that define a task (e.g., counter-examples, explanations, analogies) and including them in the prompt. However, it is unclear whether existing algorithmic approaches, based on iteratively editing a given prompt or automatically selecting a few in-context examples, can cover the multiple facets required to solve a complex task. In this work, we view prompt optimization as that of learning multiple facets of a task from a set of training examples. We exploit structure in the prompt optimization problem and break down a prompt into loosely coupled semantic sections. The proposed algorithm, UniPrompt, (1) clusters the input space and uses clustered batches so that each batch likely corresponds to a different facet of the task, and (2) utilizes a feedback mechanism to propose adding, editing or deleting a section, which in turn is aggregated over a batch to capture generalizable facets. Empirical evaluation on multiple datasets and a real-world task shows that prompts generated using UniPrompt obtain higher accuracy than human-tuned prompts and those from state-of-the-art methods. In particular, our algorithm can generate long, complex prompts that existing methods are unable to generate.
pdf
bib
abs
LLM as Effective Streaming Processor: Bridging Streaming-Batch Mismatches with Group Position Encoding
Junlong Tong
|
Jinlan Fu
|
Zixuan Lin
|
Yingqi Fan
|
Anhao Zhao
|
Hui Su
|
Xiaoyu Shen
Large Language Models (LLMs) are primarily designed for batch processing. Existing methods for adapting LLMs to streaming rely either on expensive re-encoding or specialized architectures with limited scalability. This work identifies three key mismatches in adapting batch-oriented LLMs to streaming: (1) input-attention, (2) output-attention, and (3) position-ID mismatches. While it is commonly assumed that the latter two mismatches require frequent re-encoding, our analysis reveals that only the input-attention mismatch significantly impacts performance, indicating that re-encoding outputs is largely unnecessary. To better understand this discrepancy with the common assumption, we provide the first comprehensive analysis of the impact of position encoding on LLMs in streaming, showing that preserving relative positions within source and target contexts is more critical than maintaining absolute order. Motivated by the above analysis, we introduce a group position encoding paradigm built on batch architectures to enhance consistency between streaming and batch modes. Extensive experiments on cross-lingual and cross-modal tasks demonstrate that our method outperforms existing approaches. Our method requires no architectural modifications and exhibits strong generalization in both streaming and batch modes. The code is available at https://github.com/EIT-NLP/StreamingLLM.
pdf
bib
abs
YinYang-Align: A new Benchmark for Competing Objectives and Introducing Multi-Objective Preference based Text-to-Image Alignment
Amitava Das
|
Yaswanth Narsupalli
|
Gurpreet Singh
|
Vinija Jain
|
Vasu Sharma
|
Suranjana Trivedy
|
Aman Chadha
|
Amit Sheth
Precise alignment in Text-to-Image (T2I) systems is crucial for generating visuals that reflect user intent while adhering to ethical and policy standards. Recent controversies, such as the Google Gemini-generated Pope image backlash, highlight the urgent need for robust alignment mechanisms. Building on alignment successes in Large Language Models (LLMs), this paper introduces YinYangAlign, a benchmarking framework designed to evaluate and optimize T2I systems across six inherently contradictory objectives. These objectives highlight core trade-offs, such as balancing faithfulness to prompts with artistic freedom and maintaining cultural sensitivity without compromising creativity. Alongside this benchmark, we propose the Contradictory Alignment Optimization (CAO) framework, an extension of Direct Preference Optimization (DPO), which employs multi-objective optimization techniques to address these competing goals. By leveraging per-axiom loss functions, synergy-driven global preferences, and innovative tools like the Synergy Jacobian, CAO achieves superior alignment across all objectives. Experimental results demonstrate significant improvements in fidelity, diversity, and ethical adherence, setting new benchmarks for the field. This work provides a scalable, effective approach to resolving alignment challenges in T2I systems while offering insights into broader AI alignment paradigms.
pdf
bib
abs
FREE: Fast and Robust Vision Language Models with Early Exits
Divya Jyoti Bajpai
|
Manjesh Kumar Hanawal
In recent years, Vision-Language Models (VLMs) have shown remarkable performance improvements in Vision-Language tasks. However, their large size poses challenges for real-world applications where inference latency is a concern. To tackle this issue, we propose employing Early Exit (EE) strategies in VLMs. However, training exit classifiers in VLMs is challenging, particularly with limited labeled training data. To address this, we introduce FREE, an adversarial training approach within a GAN-based framework. Here, each exit consists of a transformer layer and a classifier. The transformer layer is adversarially trained to produce feature representations similar to the final layer, while a feature classifier serves as the discriminator. Our method focuses on performing input-adaptive inference that increases inference speed with minimal drop in performance. Experimental results demonstrate the effectiveness of our approach in enhancing accuracy and model robustness by mitigating overthinking and the phenomenon of mid-crisis that we highlight. We experimentally validate that our method speeds up the inference process by more than 1.51× while retaining comparable performance. The anonymized source code is available at https://github.com/Div290/BLIPEE.
pdf
bib
abs
REPRO-Bench: Can Agentic AI Systems Assess the Reproducibility of Social Science Research?
Chuxuan Hu
|
Liyun Zhang
|
Yeji Lim
|
Aum Wadhwani
|
Austin Peters
|
Daniel Kang
Assessing the reproducibility of social science papers is essential for promoting rigor in research processes, but manual assessment is costly. With recent advances in agentic AI systems (i.e., AI agents), we seek to evaluate their capability to automate this process. However, existing benchmarks for reproducing research papers (1) focus solely on reproducing results using provided code and data without assessing their consistency with the paper, (2) oversimplify real-world scenarios, and (3) lack necessary diversity in data formats and programming languages. To address these issues, we introduce REPRO-Bench, a collection of 112 task instances, each representing a social science paper with a publicly available reproduction report. The agents are tasked with assessing the reproducibility of the paper based on the original paper PDF and the corresponding reproduction package. REPRO-Bench features end-to-end evaluation tasks on the reproducibility of social science papers with complexity comparable to real-world assessments. We evaluate three representative AI agents on REPRO-Bench, with the best-performing agent achieving an accuracy of only 21.4%. Building on our empirical analysis, we develop REPRO-Agent, which improves the highest accuracy achieved by existing agents by 71%. We conclude that more advanced AI agents should be developed to automate real-world reproducibility assessment. REPRO-Bench is publicly available at https://github.com/uiuc-kang-lab/REPRO-Bench.
pdf
bib
abs
Time Travel: A Comprehensive Benchmark to Evaluate LMMs on Historical and Cultural Artifacts
Sara Ghaboura
|
Ketan Pravin More
|
Ritesh Thawkar
|
Wafa Al Ghallabi
|
Omkar Thawakar
|
Fahad Shahbaz Khan
|
Hisham Cholakkal
|
Salman Khan
|
Rao Muhammad Anwer
Understanding historical and cultural artifacts demands human expertise and advanced computational techniques, yet the process remains complex and time-intensive. While large multimodal models offer promising support, their evaluation and improvement require a standardized benchmark. To address this, we introduce TimeTravel, a benchmark of 10,250 expert-verified samples spanning 266 distinct cultures across 10 major historical regions. Designed for AI-driven analysis of manuscripts, artworks, inscriptions, and archaeological discoveries, TimeTravel provides a structured dataset and robust evaluation framework to assess AI models’ capabilities in classification, interpretation, and historical comprehension. By integrating AI with historical research, TimeTravel fosters AI-powered tools for historians, archaeologists, researchers, and cultural tourists to extract valuable insights while ensuring technology contributes meaningfully to historical discovery and cultural heritage preservation. We evaluate contemporary AI models on TimeTravel, highlighting their strengths and identifying areas for improvement. Our goal is to establish AI as a reliable partner in preserving cultural heritage, ensuring that technological advancements contribute meaningfully to historical discovery. We release the TimeTravel dataset and evaluation suite as open-source resources for culturally and historically informed research.
pdf
bib
abs
Unveiling and Addressing Pseudo Forgetting in Large Language Models
Huashan Sun
|
Yizhe Yang
|
Yinghao Li
|
Jiawei Li
|
Yang Gao
Although substantial efforts have been made to mitigate catastrophic forgetting in continual learning, the intrinsic mechanisms are not well understood. In this work, we demonstrate the existence of “pseudo forgetting”: the performance degradation in previous tasks is not attributed to a loss of capabilities, but rather to the failure of the instructions to activate the appropriate model capabilities. We show that the model’s performance on previous tasks can be restored through two simple interventions: (1) providing a partial external correct rationale, and (2) appending semantically meaningless suffixes to the original instructions, to guide the generation of correct rationales. Through empirical analysis of the internal mechanisms governing rationale generation, we reveal that models exhibiting pseudo forgetting show reduced instruction dependence during rationale generation, leading to suboptimal activation of their inherent capabilities. Based on this insight, we propose the Rationale-Guidance Difficulty based Replay (RGD-R) framework, which dynamically allocates replay data based on the model’s ability to correctly leverage its intrinsic capabilities. Experimental results demonstrate that RGD-R effectively mitigates pseudo forgetting while maintaining model plasticity.
pdf
bib
abs
Improving MLLM’s Document Image Machine Translation via Synchronously Self-reviewing Its OCR Proficiency
Yupu Liang
|
Yaping Zhang
|
Zhiyang Zhang
|
Zhiyuan Chen
|
Yang Zhao
|
Lu Xiang
|
Chengqing Zong
|
Yu Zhou
Multimodal Large Language Models (MLLMs) have shown strong performance in document image tasks, especially Optical Character Recognition (OCR). However, they struggle with Document Image Machine Translation (DIMT), which requires handling both cross-modal and cross-lingual challenges. Previous efforts to enhance DIMT capability through Supervised Fine-Tuning (SFT) on the DIMT dataset often result in the forgetting of the model’s existing monolingual abilities, such as OCR. To address these challenges, we introduce a novel fine-tuning paradigm, named Synchronously Self-Reviewing (SSR) its OCR proficiency, inspired by the concept of “Bilingual Cognitive Advantage”. Specifically, SSR prompts the model to generate OCR text before producing translation text, which allows the model to leverage its strong monolingual OCR ability while learning to translate text across languages. Comprehensive experiments demonstrate that the proposed SSR learning helps mitigate catastrophic forgetting, improving the generalization ability of MLLMs on both OCR and DIMT tasks. The code will be released upon acceptance.
pdf
bib
abs
HG-InsightLog: Context Prioritization and Reduction for Question Answering with Non-Natural Language Construct Log Data
Supriya Bajpai
|
Athira Gopal
|
Chandrakant Harjpal
|
Niraj Kumar
Modern IT systems generate vast amounts of log data, which pose challenges for Large Language Models (LLMs) due to their large size, irrelevant entries, and non-Natural Language (non-NL) constructs (e.g., domain-specific jargon, error codes, file paths, and abbreviations). Traditional methods like Retrieval-Augmented Generation (RAG) and GraphRAG fail to preserve temporal sequences, handle non-NL constructs for context and entity extraction, and dynamically prioritize query-relevant context. To address these limitations, we propose HG-InsightLog, a novel framework that constructs a multi-entity temporal hypergraph representing log attribute-value pairs as nodes and connecting them with hyperedges, capturing critical connections in the data. HG-InsightLog introduces a multi-step query personalization mechanism that enhances the Personalized PageRank algorithm to rank hyperedges based on query relevance and contextual centrality, prioritizing critical connections. Top-ranked hyperedges are extracted and converted back into log formats, preserving temporal order and reducing context. Experimental results across multiple datasets demonstrate its superiority over existing methods, enhancing factual, causal, and analytical reasoning. Our approach enables smaller LLMs like LLaMA-8B to perform effective log-based QA. Being model-agnostic and training-free, it scales with evolving open-source LLMs without relying on proprietary systems.
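For readers unfamiliar with Personalized PageRank, which this abstract builds on, the following is a minimal sketch of the standard algorithm by power iteration. It runs on a plain directed graph, not on the temporal hypergraph described above, and it does not reproduce the paper's multi-step query personalization. The function name, node names, and parameter defaults are all hypothetical.

```python
def personalized_pagerank(adjacency, personalization, damping=0.85, iterations=50):
    """Standard Personalized PageRank via power iteration (illustrative only).

    adjacency: dict mapping each node to a list of its out-neighbours
               (every neighbour must also be a key of the dict).
    personalization: dict mapping nodes to non-negative restart weights,
                     e.g. higher weights for nodes that match the query.
    """
    nodes = list(adjacency)
    total = sum(personalization.get(n, 0.0) for n in nodes)
    if total <= 0:
        raise ValueError("personalization must give positive weight to some node")
    # Normalised restart distribution: where the random walk jumps back to.
    restart = {n: personalization.get(n, 0.0) / total for n in nodes}
    rank = dict(restart)
    for _ in range(iterations):
        nxt = {n: (1.0 - damping) * restart[n] for n in nodes}
        for n in nodes:
            out = adjacency[n]
            if out:
                # Spread this node's rank evenly over its out-neighbours.
                share = damping * rank[n] / len(out)
                for m in out:
                    nxt[m] += share
            else:
                # Dangling node: send its rank back to the restart distribution.
                for m in nodes:
                    nxt[m] += damping * rank[n] * restart[m]
        rank = nxt
    return rank


# Hypothetical toy graph of log entities; the query mentions "err_42".
graph = {
    "err_42": ["host_a"],
    "host_a": ["err_42", "svc_db"],
    "svc_db": ["host_a"],
    "svc_web": ["host_a"],
}
scores = personalized_pagerank(graph, {"err_42": 1.0})
ranked = sorted(scores, key=scores.get, reverse=True)
```

Nodes reachable from the query seed accumulate rank, while `svc_web`, which nothing links to and which has no restart weight, ends with a score of zero.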
pdf
bib
abs
Dialect Normalization using Large Language Models and Morphological Rules
Antonios Dimakis
|
John Pavlopoulos
|
Antonios Anastasopoulos
Natural language understanding systems struggle with low-resource languages, including many dialects of high-resource ones. Dialect-to-standard normalization attempts to tackle this issue by transforming dialectal text so that it can be used by standard-language tools downstream. In this study, we tackle this task by introducing a new normalization method that combines rule-based linguistically informed transformations and large language models (LLMs) with targeted few-shot prompting, without requiring any parallel data. We implement our method for Greek dialects and apply it on a dataset of regional proverbs, evaluating the outputs using human annotators. We then use this dataset to conduct downstream experiments, finding that previous results regarding these proverbs relied solely on superficial linguistic information, including orthographic artifacts, while new observations can still be made through the remaining semantics.
pdf
bib
abs
USDC: A Dataset of User Stance and Dogmatism in Long Conversations
Mounika Marreddy
|
Subba Reddy Oota
|
Venkata Charan Chinni
|
Manish Gupta
|
Lucie Flek
Analyzing user opinion changes in long conversation threads is extremely critical for applications like enhanced personalization, market research, political campaigns, customer service, targeted advertising, and content moderation. Unfortunately, previous studies on stance and dogmatism in user conversations have focused on training models using datasets annotated at the post level, treating each post as independent and randomly sampling posts from conversation threads. Hence, first, we build a dataset for studying user opinion fluctuations in 764 long multi-user Reddit conversation threads, called USDC. USDC contains annotations for 2 tasks: i) User Stance classification, which involves labeling a user’s stance in a post within a conversation on a five-point scale; ii) User Dogmatism classification, which involves labeling a user’s overall opinion in the conversation on a four-point scale. Besides being time-consuming and costly, manual annotations for USDC are challenging because: 1) Conversation threads could be very long, increasing the chances of noisy annotations; and 2) Interpreting instances where a user changes their opinion within a conversation is difficult because often such transitions are subtle and not expressed explicitly. Hence, we leverage majority voting on zero-shot, one-shot, and few-shot annotations from Mistral Large and GPT-4 to automate the annotation process. Human annotations on 200 test conversations achieved inter-annotator agreement scores of 0.49 for stance and 0.50 for dogmatism with these LLM annotations, indicating a reasonable level of consistency between human and LLM annotations. USDC is then used to finetune and instruction-tune multiple deployable small language models like LLaMA, Falcon and Vicuna for the stance and dogmatism classification tasks. We make the code and dataset publicly available [https://github.com/mounikamarreddy/USDC].
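The majority-voting step mentioned in this abstract can be sketched as follows. This is a generic illustration, not the authors' pipeline: the label strings, the assumption of six votes (two models times three prompting settings), and the tie-breaking rule are all hypothetical.

```python
from collections import Counter


def majority_vote(labels, tie_breaker=None):
    """Return the most frequent label among several annotations.

    If several labels tie for first place, prefer `tie_breaker` when it is
    one of them; otherwise fall back to the tied label seen first.
    The tie rule is an assumption made for this sketch.
    """
    if not labels:
        raise ValueError("need at least one annotation")
    counts = Counter(labels)
    top = max(counts.values())
    # Counter keeps insertion order, so winners are in first-seen order.
    winners = [label for label, count in counts.items() if count == top]
    if len(winners) > 1 and tie_breaker in winners:
        return tie_breaker
    return winners[0]


# Hypothetical stance votes: two LLMs x (zero-shot, one-shot, few-shot).
votes = [
    "somewhat_against",
    "somewhat_against",
    "stance_not_inferrable",
    "somewhat_against",
    "strongly_against",
    "somewhat_against",
]
final_label = majority_vote(votes)
```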
pdf
bib
abs
Learning to Insert [PAUSE] Tokens for Better Reasoning
Eunki Kim
|
Sangryul Kim
|
James Thorne
To enhance reasoning capabilities, previous works have explored incorporating special-purpose tokens into the training process. These strategies strengthen the learning mechanism of transformer-based large language models (LLMs). Building on prior research, in which inserting dummy tokens consecutively just before reasoning steps can enhance effectiveness, we introduce a novel approach termed Dynamic Inserting Tokens Training (DIT). Our method identifies positions within sequences where model confidence is lowest according to token log-likelihood. Strategically inserting [PAUSE] tokens at these positions bolsters the model’s predictive capabilities for subsequent tokens. Experimental results across diverse datasets and models, from the 2.7B model to the 8B model, demonstrate that DIT consistently outperforms traditional fine-tuning and previous token insertion methods. With this simple yet effective method, we achieve accuracy gains of up to 4.7%p on GSM8K, 3.23%p on AQUA-RAT, and pass@1 improvements of up to 3.4%p on MBPP datasets. Our work shows a model-based, dynamic approach rather than a heuristic one, thereby broadening the scope of research in reasoning.
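The position-selection idea in this abstract can be illustrated with a small sketch: given per-token log-likelihoods from some model, find the least confident positions and insert a [PAUSE] token there. The function name, the choice to insert immediately before the low-confidence token, and the toy values are assumptions for illustration; the abstract does not specify these details.

```python
def insert_pause_tokens(tokens, logprobs, num_pauses=1, pause="[PAUSE]"):
    """Insert `pause` before the `num_pauses` tokens with the lowest log-likelihood.

    tokens:   list of token strings.
    logprobs: list of per-token log-likelihoods, aligned with `tokens`
              (assumed to come from a forward pass of the model being trained).
    """
    if len(tokens) != len(logprobs):
        raise ValueError("tokens and logprobs must have the same length")
    # Indices sorted from least to most confident; keep the lowest few.
    by_confidence = sorted(range(len(tokens)), key=lambda i: logprobs[i])
    targets = set(by_confidence[:num_pauses])
    augmented = []
    for i, token in enumerate(tokens):
        if i in targets:
            augmented.append(pause)  # placed before the hard token (assumption)
        augmented.append(token)
    return augmented


# Toy example: the model is least confident about the final answer token.
tokens = ["2", "+", "3", "=", "5"]
logprobs = [-0.1, -0.3, -0.2, -0.4, -2.5]
augmented = insert_pause_tokens(tokens, logprobs, num_pauses=1)
```

The resulting sequence would then serve as fine-tuning data in place of the original one.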
pdf
bib
abs
Understand the Implication: Learning to Think for Pragmatic Understanding
Settaluri Lakshmi Sravanthi
|
Kishan Maharaj
|
Sravani Gunnu
|
Abhijit Mishra
|
Pushpak Bhattacharyya
Pragmatics, the ability to infer meaning beyond literal interpretation, is crucial for social cognition and communication. While LLMs have been benchmarked for their pragmatic understanding, improving their performance remains underexplored. Existing methods rely on annotated labels but overlook the reasoning process humans naturally use to interpret implicit meaning. To bridge this gap, we introduce a novel pragmatic dataset, ImpliedMeaningPreference, that includes explicit reasoning (‘thoughts’) for both correct and incorrect interpretations. Through preference-tuning and supervised fine-tuning, we demonstrate that thought-based learning significantly enhances LLMs’ pragmatic understanding, improving accuracy by 11.12% across model families. We further discuss a transfer-learning study in which we evaluate the performance of thought-based training on other pragmatics tasks (presupposition, deixis) that are not seen during training and observe an improvement of 16.10% compared to label-trained models.
pdf
bib
abs
WASA: WAtermark-based Source Attribution for Large Language Model-Generated Data
Xinyang Lu
|
Jingtan Wang
|
Zitong Zhao
|
Zhongxiang Dai
|
Chuan-Sheng Foo
|
See-Kiong Ng
|
Bryan Kian Hsiang Low
The impressive performances of Large Language Models (LLMs) and their immense potential for commercialization have given rise to serious concerns over the Intellectual Property (IP) of their training data. In particular, the synthetic texts generated by LLMs may infringe the IP of the data being used to train the LLMs. To this end, it is imperative to be able to perform source attribution by identifying the data provider who contributed to the generation of a synthetic text by an LLM. In this paper, we show that this problem can be tackled by watermarking, i.e., by enabling an LLM to generate synthetic texts with embedded watermarks that contain information about their source(s). We identify the key properties of such watermarking frameworks (e.g., source attribution accuracy, robustness against adversaries), and propose a source attribution framework that satisfies these key properties due to our algorithmic designs. Our framework enables an LLM to learn an accurate mapping from the generated texts to data providers, which sets the foundation for effective source attribution. Extensive empirical evaluations show that our framework achieves effective source attribution.
pdf
bib
abs
Dense Retrieval with Quantity Comparison Intent
Prayas Agrawal
|
Nandeesh Kumar K M
|
Muthusamy Chelliah
|
Surender Kumar
|
Soumen Chakrabarti
Pre-trained language models (PLMs) fragment numerals and units that express quantities in arbitrary ways, depending on their subword vocabulary. Consequently, they are unable to contextualize the fragment embeddings well enough to be proficient with dense retrieval in domains like e-commerce and finance. Arithmetic inequality constraints (“laptop under 2 lb”) offer additional challenges. In response, we propose DeepQuant, a dense retrieval system built around a dense multi-vector index, but carefully engineered to elicit and exploit quantities and associated comparison intents. A novel component of our relevance score compares two quantities with compatible units, conditioned on a proposed comparison operator. The uncertain extractions of numerals, units and comparators are marginalized in a suitable manner. On two public and one proprietary e-commerce benchmark, DeepQuant is both faster and more accurate than popular PLMs. It also beats several competitive sparse and dense retrieval systems that do not take special cognizance of quantities.
pdf
bib
abs
Reflection on Knowledge Graph for Large Language Models Reasoning
Yigeng Zhou
|
Wu Li
|
Yifan Lu
|
Jing Li
|
Fangming Liu
|
Meishan Zhang
|
Yequan Wang
|
Daojing He
|
Honghai Liu
|
Min Zhang
Recent research shows that supplementing Large Language Models (LLMs) with knowledge graphs can enhance their performance. However, existing methods often introduce noise in the retrieval and reasoning pipeline, hindering LLMs’ ability to effectively integrate external knowledge for complex multi-hop question answering. To address this, we propose RefKG, a novel framework designed to enhance the reasoning capabilities of LLMs through reflective engagement with knowledge graphs. RefKG autonomously conducts retrieval and reflection on knowledge graphs. It consists of three modules: Query Decoupling, LLM-Driven Knowledge Graph Exploration, and Inference with Knowledge Reconstruction. We also introduce a multi-task tuning strategy that not only integrates external knowledge into LLMs but also trains them to leverage this knowledge for answering questions. This significantly improves their performance on knowledge-intensive tasks. Experiments on fact verification and knowledge graph question answering demonstrate RefKG’s effectiveness.
pdf
bib
abs
Revisiting 3D LLM Benchmarks: Are We Really Testing 3D Capabilities?
Jiahe Jin
|
Yanheng He
|
Mingyan Yang
In this work, we identify the “2D-Cheating” problem in 3D LLM evaluation, where these tasks might be easily solved by VLMs with rendered images of point clouds, exposing ineffective evaluation of 3D LLMs’ unique 3D capabilities. We test VLM performance across multiple 3D LLM benchmarks and, using this as a reference, propose principles for better assessing genuine 3D understanding. We also advocate explicitly separating 3D abilities from 1D or 2D aspects when evaluating 3D LLMs.
pdf
bib
abs
DIESEL: A Lightweight Inference-Time Safety Enhancement for Language Models
Ben Ganon
|
Alon Zolfi
|
Omer Hofman
|
Inderjeet Singh
|
Hisashi Kojima
|
Yuval Elovici
|
Asaf Shabtai
Large language models (LLMs) have demonstrated impressive performance across a wide range of tasks, including open-ended dialogue, driving advancements in virtual assistants and other interactive systems. However, these models often generate outputs misaligned with human values, such as ethical norms and safety constraints, resulting in potentially harmful or inappropriate responses. While several techniques have been proposed to address this problem, they typically involve computationally intensive training procedures or introduce substantial inference-time latency. In this paper, we present DIESEL, a lightweight inference-guidance technique that can be seamlessly integrated into any autoregressive LLM to semantically filter undesirable content during generation. DIESEL guides generation by reranking token candidates according to their semantic similarity to predefined negative concepts in the latent space. It can serve either as a standalone safeguard or as an auxiliary defense layer, enhancing response safety without requiring model fine-tuning or additional data. We demonstrate DIESEL’s effectiveness on state-of-the-art conversational models, including in adversarial jailbreak scenarios. Furthermore, we show that DIESEL generalizes beyond safety applications, enabling flexible and domain-specific response filtering.
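The reranking idea described in this abstract can be sketched generically: each candidate token is penalized according to how similar its embedding is to any predefined negative concept. This is not DIESEL's actual scoring rule; the linear penalty, the weight `alpha`, the use of the maximum similarity, and the two-dimensional toy embeddings are all assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rerank(candidates, negative_concepts, alpha=1.0):
    """Reorder token candidates, penalizing similarity to negative concepts.

    candidates:        list of (token, logprob, embedding) triples, where the
                       embedding is assumed to come from some sentence encoder.
    negative_concepts: list of embeddings of concepts to steer away from.
    alpha:             hypothetical weight of the similarity penalty.
    """
    scored = []
    for token, logprob, embedding in candidates:
        # Penalty is the closest match to any negative concept.
        penalty = max(
            (cosine(embedding, neg) for neg in negative_concepts), default=0.0
        )
        scored.append((logprob - alpha * penalty, token))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [token for _, token in scored]


# Toy example: "harm" is more likely under the model but close to a negative concept.
candidates = [("harm", -0.2, [1.0, 0.0]), ("help", -0.5, [0.0, 1.0])]
negative_concepts = [[1.0, 0.1]]
order = rerank(candidates, negative_concepts)
```

With `alpha=0` the original likelihood order is kept, so the weight controls how strongly the filter intervenes.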
pdf
bib
abs
Toward Structured Knowledge Reasoning: Contrastive Retrieval-Augmented Generation on Experience
Jiawei Gu
|
Ziting Xian
|
Yuanzhen Xie
|
Ye Liu
|
Enjie Liu
|
Ruichao Zhong
|
Mochi Gao
|
Yunzhi Tan
|
Bo Hu
|
Zang Li
Large language models (LLMs) achieve strong performance on plain text tasks but underperform on structured data like tables and databases. Potential challenges arise from their underexposure during pre-training and rigid text-to-structure transfer mechanisms. Unlike humans who seamlessly apply learned patterns across data modalities, LLMs struggle to infer implicit relationships embedded in tabular formats, especially in the absence of explicit structural guidance. To bridge this cognitive gap, we introduce Contrastive Retrieval-Augmented Generation on Experience (CoRE), a framework that builds experience memory representations and enhances generalization through contrastive In-Context Learning (ICL) to simulate human-like knowledge transfer. Experiments on Text-to-SQL and TableQA show CoRE significantly improves performance, achieving average gains of 3.44% and 4.24%, with up to 17.2% on challenging tasks. Our Monte Carlo Tree Search (MCTS)-generated Experience Memory expands training data 8-9×, enhancing diversity and domain coverage. This training-free and continual method propels LLMs toward structured knowledge expertise.
pdf
bib
abs
Structured Pruning for Diverse Best-of-N Reasoning Optimization
Hieu Trung Nguyen
|
Bao Nguyen
|
Viet Anh Nguyen
Model pruning in transformer-based language models, traditionally seen as a means of saving computation, can also enhance the model’s reasoning capabilities. In this work, we uncover the surprising phenomenon that the selective pruning of certain attention heads leads to improvements in reasoning performance, particularly on challenging tasks. Motivated by this observation, we propose SPRINT, a novel contrastive learning framework that dynamically selects the optimal head and layer to prune during inference. By aligning question embeddings with head embeddings, our approach identifies those pruned-head configurations that result in more accurate reasoning. Extensive experiments demonstrate that our method significantly outperforms traditional best-of-N and random head selection strategies on the MATH500 and GSM8K datasets.
pdf
bib
abs
PodAgent: A Comprehensive Framework for Podcast Generation
Yujia Xiao
|
Lei He
|
Haohan Guo
|
Feng-Long Xie
|
Tan Lee
Existing automatic audio generation methods struggle to generate podcast-like audio programs effectively. The key challenges lie in in-depth content generation and in appropriate, expressive voice production. This paper proposes PodAgent, a comprehensive framework for creating audio programs. PodAgent 1) generates informative topic-discussion content by designing a Host-Guest-Writer multi-agent collaboration system, 2) builds a voice pool for suitable voice-role matching, and 3) utilizes an LLM-enhanced speech synthesis method to generate expressive conversational speech. Given the absence of standardized evaluation criteria for podcast-like audio generation, we developed comprehensive assessment guidelines to effectively evaluate the model’s performance. Experimental results demonstrate PodAgent’s effectiveness, significantly surpassing direct GPT-4 generation in topic-discussion dialogue content, achieving an 87.4% voice-matching accuracy, and producing more expressive speech through LLM-guided synthesis. Demo page: https://podcast-agent.github.io/demo/. Source code: https://github.com/yujxx/PodAgent.
pdf
bib
abs
STORM-BORN: A Challenging Mathematical Derivations Dataset Curated via a Human-in-the-Loop Multi-Agent Framework
Wenhao Liu
|
Zhenyi Lu
|
Xinyu Hu
|
Jerry Zhang
|
Dailin Li
|
Jiacheng Cen
|
Huilin Cao
|
Haiteng Wang
|
Yuhan Li
|
Xie Kun
|
Dandan Li
|
Pei Zhang
|
Chengbo Zhang
|
Yuxiang Ren
|
Xiaohong Huang
|
Yan Ma
High-quality math datasets are crucial for advancing the reasoning abilities of large language models (LLMs). However, existing datasets often suffer from three key issues: outdated and insufficiently challenging content, neglect of human-like reasoning, and limited reliability due to single-LLM generation. To address these, we introduce STORM-BORN, an ultra-challenging dataset of mathematical derivations sourced from cutting-edge academic papers, which includes dense human-like approximations and heuristic cues. To ensure reliability and quality, we propose a novel human-in-the-loop, multi-agent data generation framework, integrating reasoning-dense filters, multi-agent collaboration, and human mathematicians’ evaluations. We curated a set of 2,000 synthetic samples and deliberately selected the 100 most difficult problems. Even the most advanced models, such as GPT-o1, solved fewer than 5% of them. Fine-tuning on STORM-BORN boosts accuracy by 7.84% (LLaMA3-8B) and 9.12% (Qwen2.5-7B). As AI approaches mathematician-level reasoning, STORM-BORN provides both a high-difficulty benchmark and a human-like reasoning training resource. Our code and dataset are publicly available at https://github.com/lwhere/STORM-BORN.
pdf
bib
abs
iMOVE: Instance-Motion-Aware Video Understanding
Jiaze Li
|
Yaya Shi
|
Zongyang Ma
|
Haoran Xu
|
Yandong.bai Yandong.bai
|
Huihui Xiao
|
Ruiwen Kang
|
Fan Yang
|
Tingting Gao
|
Di Zhang
Enhancing the fine-grained instance spatiotemporal motion perception capabilities of Video Large Language Models is crucial for improving their temporal and general video understanding. However, current models struggle to perceive detailed and complex instance motions. To address these challenges, we have made improvements from both data and model perspectives. In terms of data, we have meticulously curated iMOVE-IT, the first large-scale instance-motion-aware video instruction-tuning dataset. This dataset is enriched with comprehensive instance motion annotations and spatiotemporal mutual-supervision tasks, providing extensive training for the model’s instance-motion-awareness. Building on this foundation, we introduce iMOVE, an instance-motion-aware video foundation model that utilizes Event-aware Spatiotemporal Efficient Modeling to retain informative instance spatiotemporal motion details while maintaining computational efficiency. It also incorporates Relative Spatiotemporal Position Tokens to ensure awareness of instance spatiotemporal positions. Evaluations indicate that iMOVE excels not only in video temporal understanding and general video understanding but also demonstrates significant advantages in long-term video understanding. We will release the data, code, and model weights after acceptance.
pdf
bib
abs
SceneGram: Conceptualizing and Describing Tangrams in Scene Context
Simeon Junker
|
Sina Zarrieß
Research on reference and naming suggests that humans can come up with very different ways of conceptualizing and referring to the same object, e.g. the same abstract tangram shape can be a “crab”, “sink” or “space ship”. Another common assumption in cognitive science is that scene context fundamentally shapes our visual perception of objects and conceptual expectations. This paper contributes SceneGram, a dataset of human references to tangram shapes placed in different scene contexts, allowing for systematic analyses of the effect of scene context on conceptualization. Based on this data, we analyze references to tangram shapes generated by multimodal LLMs, showing that these models do not account for the richness and variability of conceptualizations found in human references.
pdf
bib
abs
Relevant or Random: Can LLMs Truly Perform Analogical Reasoning?
Chengwei Qin
|
Wenhan Xia
|
Tan Wang
|
Fangkai Jiao
|
Yuchen Hu
|
Bosheng Ding
|
Ruirui Chen
|
Shafiq Joty
Analogical reasoning is a unique ability of humans to address unfamiliar challenges by transferring strategies from relevant past experiences. One key finding in psychology is that, compared with irrelevant past experiences, recalling relevant ones can help humans better handle new tasks. Coincidentally, the NLP community has also recently found that self-generating relevant examples in the context can help large language models (LLMs) better solve a given problem than hand-crafted prompts. However, it is not yet clear whether relevance is the key factor eliciting such capability, i.e., can LLMs benefit more from self-generated relevant examples than irrelevant ones? In this work, we systematically explore whether LLMs can truly perform analogical reasoning on a diverse set of reasoning tasks. With extensive experiments and analysis, we show that self-generated random examples can surprisingly achieve comparable or even better performance on certain tasks, e.g., a 4% performance boost on GSM8K with random biological examples. We find that the accuracy of self-generated examples is the key factor and subsequently design two novel methods with improved performance and significantly reduced inference costs. Overall, we aim to advance a deeper understanding of LLM analogical reasoning and hope this work stimulates further research in the design of self-generated contexts.
pdf
bib
abs
MERIT: Multi-Agent Collaboration for Unsupervised Time Series Representation Learning
Shu Zhou
|
Yunyang Xuan
|
Yuxuan Ao
|
Xin Wang
|
Tao Fan
|
Hao Wang
This paper studies the problem of unsupervised time series representation learning, which aims to map unlabeled time series data into a low-dimensional latent space for various downstream tasks. Previous works usually combine a range of augmentation strategies with contrastive learning to generate discriminative representations. However, these augmentation strategies could alter the original semantics of time series data, which could degrade the performance of representation learning. To solve this problem, this paper incorporates the large language model (LLM) agent to guide unsupervised time series representation learning and proposes a novel framework named Multi-Agent Collaboration for Time-series Representation Learning (MERIT). The core of our MERIT is to utilize three LLM agents to collaboratively generate positive views for time series data. In particular, we first design a retrieval agent to automatically identify the relevant time series data from a coarse candidate set. Then, these selected sequences are further utilized to enhance an augmentation agent which automatically selects reliable augmentation strategies from an augmentation strategy library. We also design a review agent to evaluate the quality of generated views and stop the generation process. These three agents are designed to work in a loop for effective time series representation learning. Extensive experiments on multiple time series datasets demonstrate the effectiveness of our MERIT in comparison with state-of-the-art baselines.
pdf
bib
abs
JsonTuning: Towards Generalizable, Robust, and Controllable Instruction Tuning
Chang Gao
|
Wenxuan Zhang
|
Guizhen Chen
|
Wai Lam
Instruction tuning is vital for enhancing the performance of large language models (LLMs), but existing text-to-text methods, referred to as TextTuning, struggle with issues such as generalization, robustness, and controllability due to their lack of explicit task structures. We introduce JsonTuning, a structure-to-structure approach that uses JSON structures to represent tasks. This method improves generalization by clarifying task elements and their relations, boosts robustness by minimizing ambiguity, and enhances controllability by allowing precise control over outputs. We conduct an extensive comparative analysis between JsonTuning and TextTuning using various language models and benchmarks. Our findings reveal that JsonTuning consistently surpasses TextTuning in terms of performance, robustness, and controllability across different scenarios. By overcoming the limitations of TextTuning, JsonTuning demonstrates significant potential for developing more effective and reliable LLMs capable of handling diverse scenarios.
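To make the structure-to-structure idea concrete, here is a minimal sketch of how one classification example might be serialized as a JSON input and a JSON output. The field names (`instruction`, `text`, `candidate_labels`, `output_schema`) and the helper name are hypothetical and chosen only for illustration; the abstract does not specify the actual JsonTuning format.

```python
import json

def build_json_tuning_example(text, candidate_labels, gold_label):
    """Serialize one task instance as a (source, target) pair of JSON strings.

    All keys below are hypothetical; they only illustrate how task elements
    and an output constraint could be made explicit in the structure.
    """
    task_input = {
        "instruction": "Classify the sentiment of the text.",
        "text": text,
        "candidate_labels": candidate_labels,
        # Explicit description of the expected output, to constrain generation.
        "output_schema": {"label": {"type": "string", "enum": candidate_labels}},
    }
    task_output = {"label": gold_label}
    # sort_keys gives a deterministic serialization for training data.
    return (
        json.dumps(task_input, sort_keys=True),
        json.dumps(task_output, sort_keys=True),
    )

source, target = build_json_tuning_example(
    "The battery lasts all day.", ["positive", "negative"], "positive"
)
```

Because the target is itself JSON, a model's prediction can be parsed with `json.loads` and checked against the declared schema instead of being matched as free text.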
pdf
bib
abs
RedundancyLens: Revealing and Exploiting Visual Token Processing Redundancy for Efficient Decoder-Only MLLMs
Hongliang Li
|
Jiaxin Zhang
|
Wenhui Liao
|
Dezhi Peng
|
Kai Ding
|
Lianwen Jin
Current Multimodal Large Language Model (MLLM) architectures face a critical tradeoff between performance and efficiency: decoder-only architectures achieve higher performance but lower efficiency, while cross-attention-based architectures offer greater efficiency but lower performance. The key distinction lies in how visual tokens are processed. Decoder-only architectures apply self-attention and FFN operations on visual tokens, while cross-attention architectures skip these computations. To investigate whether redundancy exists in this computationally expensive process, we propose a training-free framework for analyzing trained MLLMs. It consists of Probe-Activated Dynamic FFN and Hollow Attention, which enable adjustable reductions in computations for visual tokens, as well as a Layer Ranking Algorithm that prioritizes layers for these reductions. Extensive experiments demonstrate substantial, structured, and clustered redundancy unique to decoder-only MLLMs, offering valuable insights for future MLLM architecture design. Furthermore, by leveraging our reduction framework as a training-free inference acceleration approach, we achieve performance comparable to or better than state-of-the-art methods while remaining compatible with them. Code is available at https://github.com/L-Hugh/RedundancyLens.
pdf
bib
abs
Memory-augmented Query Reconstruction for LLM-based Knowledge Graph Reasoning
Mufan Xu
|
Gewen Liang
|
Kehai Chen
|
Wei Wang
|
Xun Zhou
|
Muyun Yang
|
Tiejun Zhao
|
Min Zhang
Large language models (LLMs) have achieved remarkable performance on knowledge graph question answering (KGQA) tasks by planning and interacting with knowledge graphs. However, existing methods often confuse tool utilization with knowledge reasoning, harming the readability of model outputs and giving rise to hallucinatory tool invocations, which hinder the advancement of KGQA. To address this issue, we propose Memory-augmented Query Reconstruction for LLM-based Knowledge Graph Reasoning (MemQ) to decouple the LLM from tool invocation tasks using an LLM-built query memory. By establishing a memory module with explicit descriptions of query statements, the proposed MemQ facilitates the KGQA process with natural language reasoning and memory-augmented query reconstruction. Meanwhile, we design an effective and readable reasoning process to enhance the LLM’s reasoning capability in KGQA. Experimental results show that MemQ achieves state-of-the-art performance on the widely used benchmarks WebQSP and CWQ.
pdf
bib
abs
KaFT: Knowledge-aware Fine-tuning for Boosting LLMs’ Domain-specific Question-Answering Performance
Qihuang Zhong
|
Liang Ding
|
Xiantao Cai
|
Juhua Liu
|
Bo Du
|
Dacheng Tao
Supervised fine-tuning (SFT) is a common approach to improve the domain-specific question-answering (QA) performance of large language models (LLMs). However, recent literature reveals that due to the conflicts between LLMs’ internal knowledge and the context knowledge of training data, vanilla SFT using the full QA training set is usually suboptimal. In this paper, we first design a query diversification strategy for robust conflict detection and then conduct a series of experiments to analyze the impact of knowledge conflict. We find that 1) training samples with varied conflicts contribute differently, where SFT on the data with large conflicts leads to catastrophic performance drops; 2) compared to directly filtering out the conflict data, appropriately applying the conflict data would be more beneficial. Motivated by this, we propose a simple-yet-effective Knowledge-aware Fine-tuning (namely KaFT) approach to effectively boost LLMs’ performance. The core of KaFT is to adapt the training weight by assigning different rewards to different training samples according to their conflict level. Extensive experiments show that KaFT brings consistent and significant improvements (up to +5.73% average scores) across four LLMs. Further analyses show that KaFT effectively improves model generalization and alleviates hallucination.
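The core idea of weighting training samples by conflict level can be sketched as a weighted average of per-sample losses. The level names, the weight values, and the function name below are assumptions made for illustration; the abstract does not state how KaFT maps conflict levels to rewards.

```python
def conflict_weighted_loss(sample_losses, conflict_levels, level_weights):
    """Weighted mean of per-sample losses, with one weight per conflict level.

    sample_losses: per-sample SFT losses (floats).
    conflict_levels: a level name for each sample, e.g. "low" or "high".
    level_weights: hypothetical mapping from level name to training weight.
    """
    if len(sample_losses) != len(conflict_levels):
        raise ValueError("each loss needs exactly one conflict level")
    weighted_sum = 0.0
    weight_total = 0.0
    for loss, level in zip(sample_losses, conflict_levels):
        weight = level_weights[level]
        weighted_sum += weight * loss
        weight_total += weight
    if weight_total == 0.0:
        raise ValueError("all samples have zero weight")
    return weighted_sum / weight_total

losses = [1.0, 3.0]
levels = ["low", "high"]
# High-conflict data is down-weighted rather than filtered out entirely.
soft = conflict_weighted_loss(losses, levels, {"low": 1.0, "high": 0.5})
uniform = conflict_weighted_loss(losses, levels, {"low": 1.0, "high": 1.0})
```

Setting every weight to 1.0 recovers vanilla SFT averaging, and setting the high-conflict weight to 0.0 corresponds to filtering those samples out, the option the abstract reports as less beneficial than applying them appropriately.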
pdf
bib
abs
Are Multimodal Large Language Models Pragmatically Competent Listeners in Simple Reference Resolution Tasks?
Simeon Junker
|
Manar Ali
|
Larissa Koch
|
Sina Zarrieß
|
Hendrik Buschmeier
We investigate the linguistic abilities of multimodal large language models in reference resolution tasks featuring simple yet abstract visual stimuli, such as color patches and color grids. Although the task may not seem challenging for today’s language models, being straightforward for human dyads, we consider it to be a highly relevant probe of the pragmatic capabilities of MLLMs. Our results and analyses indeed suggest that basic pragmatic capabilities, such as context-dependent interpretation of color descriptions, still constitute major challenges for state-of-the-art MLLMs.
pdf
bib
abs
Removing Prompt-template Bias in Reinforcement Learning from Human Feedback
Chaojie Wang
|
Haonan Shi
|
Long Tian
|
Bo An
|
Shuicheng Yan
Reinforcement Learning from Human Feedback (RLHF) has become an essential technique for enhancing pre-trained large language models (LLMs) to generate responses that align with human preferences and societal values. Although RLHF has shown promise, the training of reward models (RMs) still faces the challenge of reward hacking, motivating recent works to prevent RMs from finding shortcuts that bypass the intended optimization objectives by identifying simplistic patterns such as response length. Beyond length bias, our work is the first to reveal that prompt-template bias learned by RMs can also cause reward hacking when dealing with some marginal samples, resulting in LLMs preferring to generate responses in a specific format after RLHF fine-tuning, regardless of the format requested in the prompt. To this end, we propose a low-cost but effective method, namely Prompt Bias Calibration (PBC), to estimate the prompt-template bias term during reward modeling, which can be used to calibrate reward scores in the subsequent RL fine-tuning process. We then show that our PBC method can be flexibly combined with existing algorithms for removing length bias, leading to a further improvement in the quality of generated responses.
pdf
bib
abs
Latent Distribution Decouple for Uncertain-Aware Multimodal Multi-label Emotion Recognition
Jingwang Huang
|
Jiang Zhong
|
Qin Lei
|
Gaojinpeng Gaojinpeng
|
Ymyang Ymyang
|
Sirui Wang
|
PeiguangLi PeiguangLi
|
Kaiwen Wei
Multimodal multi-label emotion recognition (MMER) aims to identify the concurrent presence of multiple emotions in multimodal data. Existing studies primarily focus on improving fusion strategies and modeling modality-to-label dependencies. However, they often overlook the impact of aleatoric uncertainty, which is the inherent noise in the multimodal data and hinders the effectiveness of modality fusion by introducing ambiguity into feature representations. To address this issue and effectively model aleatoric uncertainty, this paper proposes the Latent emotional Distribution Decomposition with Uncertainty perception (LDDU) framework from a novel perspective of latent emotional space probabilistic modeling. Specifically, we introduce a contrastive disentangled distribution mechanism within the emotion space to model the multimodal data, allowing for the extraction of semantic features and uncertainty. Furthermore, we design an uncertainty-aware multimodal fusion method that accounts for the dispersed distribution of uncertainty and integrates distribution information. Experimental results show that LDDU achieves state-of-the-art performance on the CMU-MOSEI and M3ED datasets, highlighting the importance of uncertainty modeling in MMER. Code is available at https://github.com/201983290498/lddu_mmer.git.
pdf
bib
abs
Are LLMs Rational Investors? A Study on the Financial Bias in LLMs
Yuhang Zhou
|
Yuchen Ni
|
Zhiheng Xi
|
Zhangyue Yin
|
Yu He
|
Gan Yunhui
|
Xiang Liu
|
Zhang Jian
|
Sen Liu
|
Xipeng Qiu
|
Yixin Cao
|
Guangnan Ye
|
Hongfeng Chai
Large language models (LLMs) excel in natural language generation but also exhibit biases, particularly in gender, race, and religion, which can be amplified with widespread use. However, research on biases in specific domains, such as finance, remains limited. To address this gap, we conducted a comprehensive evaluation of 23 leading LLMs and found varying degrees of financial bias, including more pronounced biases in financial-specific LLMs (FinLLMs). In response, we propose the Financial Bias Indicators (FBI) framework, which includes components like the Bias Unveiler, Bias Detective, Bias Tracker, and Bias Antidote, designed to identify, detect, analyze, and mitigate financial biases. Our analysis explores the root causes of these biases and introduces a debiasing method based on financial causal knowledge, alongside three other debiasing techniques. For the most biased model, we successfully reduced bias by 68% according to key metrics. This study advances our understanding of LLM biases in finance and highlights the need for greater scrutiny in their application within this critical domain.
pdf
bib
abs
Seeing What Tastes Good: Revisiting Multimodal Distributional Semantics in the Billion Parameter Era
Dan Oneata
|
Desmond Elliott
|
Stella Frank
Human learning and conceptual representation are grounded in sensorimotor experience, in contrast to state-of-the-art foundation models. In this paper, we investigate how well such large-scale models, trained on vast quantities of data, represent the semantic feature norms of concrete object concepts, e.g. a ROSE is red, smells sweet, and is a flower. More specifically, we use probing tasks to test which properties of objects these models are aware of. We evaluate image encoders trained on image data alone, as well as multimodally-trained image encoders and language-only models, on predicting an extended denser version of the classic McRae norms and the newer Binder dataset of attribute ratings. We find that multimodal image encoders slightly outperform language-only approaches, and that image-only encoders perform comparably to the language models, even on non-visual attributes that are classified as “encyclopedic” or “function”. These results offer new insights into what can be learned from pure unimodal learning, and the complementarity of the modalities.
pdf
bib
abs
Communication-Efficient and Tensorized Federated Fine-Tuning of Large Language Models
Sajjad Ghiasvand
|
Yifan Yang
|
Zhiyu Xue
|
Mahnoosh Alizadeh
|
Zheng Zhang
|
Ramtin Pedarsani
Parameter-efficient fine-tuning (PEFT) methods typically assume that Large Language Models (LLMs) are trained on data from a single device or client. However, real-world scenarios often require fine-tuning these models on private data distributed across multiple devices. Federated Learning (FL) offers an appealing solution by preserving user privacy, as sensitive data remains on local devices during training. Nonetheless, integrating PEFT methods into FL introduces two main challenges: communication overhead and data heterogeneity. In this paper, we introduce FedTT and FedTT+, methods for adapting LLMs by integrating tensorized adapters into client-side models’ encoder/decoder blocks. FedTT is versatile and can be applied to both cross-silo FL and large-scale cross-device FL. FedTT+, an extension of FedTT tailored for cross-silo FL, enhances robustness against data heterogeneity by adaptively freezing portions of tensor factors, further reducing the number of trainable parameters. Experiments on BERT and LLaMA models demonstrate that our proposed methods successfully address data heterogeneity challenges and perform on par or even better than existing federated PEFT approaches while achieving up to 10× reduction in communication cost.
pdf
bib
abs
A rebuttal of two common deflationary stances against LLM cognition
Zak Hussain
|
Rui Mata
|
Dirk U. Wulff
Large language models (LLMs) are arguably the most predictive models of human cognition available. Despite their impressive human alignment, LLMs are often labeled as “just next-token predictors” that purportedly fall short of genuine cognition. We argue that these deflationary claims need further justification. Drawing on prominent cognitive and artificial intelligence research, we critically evaluate two forms of “Justaism” that dismiss LLM cognition by labeling LLMs as “just” simplistic entities without specifying or substantiating the critical capacities these models supposedly lack. Our analysis highlights the need for a more measured discussion of LLM cognition, to better inform future research and the development of artificial intelligence.
pdf
bib
abs
COVER: Context-Driven Over-Refusal Verification in LLMs
Giovanni Sullutrone
|
Riccardo A. Vigliermo
|
Sonia Bergamaschi
|
Luca Sala
We introduce the concept of context-driven over-refusal, an abstention arising when a model’s safety guardrails are triggered by the grounding knowledge provided alongside the user’s request. Distinct from question-driven over-refusal, this occurs in both retrieval-augmented generation (RAG) and natural language processing (NLP) task completion (e.g. summarization, translation) where external content can unexpectedly trigger refusals. In this work, we present a novel two-stage evaluation framework named COVER, designed to quantify and analyze this behavior. Through a comprehensive empirical study on two public corpora, we show that over-refusal rates strongly depend on the task, system prompts, model family, and the number of retrieved documents. We observe that tasks such as translation and summarization yield disproportionately high over-refusal rates, while question-answering remains relatively robust, especially in newer models. Moreover, increasing the number of contextual documents tends to reduce refusals, yet broadens the pool of prompts at risk of encountering at least one “unsafe” text. Interestingly, strict system prompts do not necessarily lead to higher over-refusal rates, suggesting that in the absence of explicit directives, some models may default to a more cautious behavior. These findings highlight the need for fine-grained alignment and benchmarking strategies sensitive to both user intent and contextual nuances, offering a roadmap for future research in model training and evaluation.
pdf
bib
abs
MOSAIC: Multiple Observers Spotting AI Content
Matthieu Dubois
|
François Yvon
|
Pablo Piantanida
The dissemination of Large Language Models (LLMs), trained at scale, and endowed with powerful text-generating abilities, has made it easier for all to produce harmful, toxic, faked or forged content. In response, various proposals have been made to automatically discriminate artificially generated from human-written texts, typically framing the problem as a binary classification problem. Early approaches evaluate an input document with a well-chosen detector LLM, assuming that low-perplexity scores reliably signal machine-made content. More recent systems instead consider two LLMs and compare their probability distributions over the document to further discriminate when perplexity alone cannot. However, using a fixed pair of models can induce brittleness in performance. We extend these approaches to the ensembling of several LLMs and derive a new, theoretically grounded approach to combine their respective strengths. Our experiments, using a variety of generator LLMs, suggest that this approach effectively harnesses each model’s capabilities, leading to strong detection performance on a variety of domains.
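As background for the single-detector baseline mentioned in this abstract, the sketch below computes perplexity from per-token log-probabilities and flags a document when the score falls below a threshold. The log-probabilities are assumed to come from some detector LLM, and the threshold and function names are hypothetical; this illustrates the early perplexity-based approach, not the ensembling method the paper proposes.

```python
import math

def perplexity(token_logprobs):
    """Perplexity = exp of the negative mean natural-log probability per token."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def looks_machine_generated(token_logprobs, threshold):
    """Baseline rule: low perplexity under the detector suggests machine-made text.

    `threshold` is a hypothetical value that would have to be tuned on held-out data.
    """
    return perplexity(token_logprobs) < threshold

# Toy log-probabilities, invented for illustration only.
predictable = [math.log(0.5)] * 4    # detector assigns probability 0.5 to every token
surprising = [math.log(0.01)] * 4    # detector assigns probability 0.01 to every token
ppl_predictable = perplexity(predictable)
ppl_surprising = perplexity(surprising)
```

A fixed threshold on one model's perplexity is exactly the kind of brittle decision rule that motivates comparing or combining several models.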
pdf
bib
abs
GUIDEX: Guided Synthetic Data Generation for Zero-Shot Information Extraction
Neil De La Fuente
|
Oscar Sainz
|
Iker García-Ferrero
|
Eneko Agirre
Information Extraction (IE) systems are traditionally domain-specific, requiring costly adaptation that involves expert schema design, data annotation, and model training. While Large Language Models have shown promise in zero-shot IE, performance degrades significantly in unseen domains where label definitions differ. This paper introduces GUIDEX, a novel method that automatically defines domain-specific schemas, infers guidelines, and generates synthetically labeled instances, allowing for better out-of-domain generalization. Fine-tuning Llama 3.1 with GUIDEX sets a new state-of-the-art across seven zero-shot Named Entity Recognition benchmarks. Models trained with GUIDEX gain up to 7 F1 points over previous methods without human-labeled data, and nearly 2 F1 points higher when combined with it. Models trained on GUIDEX demonstrate enhanced comprehension of complex, domain-specific annotation schemas. Code, models, and synthetic datasets are available at neilus03.github.io/guidex.com
pdf
bib
abs
Missing the Margins: A Systematic Literature Review on the Demographic Representativeness of LLMs
Indira Sen
|
Marlene Lutz
|
Elisa Rogers
|
David Garcia
|
Markus Strohmaier
Many applications of Large Language Models (LLMs) require them to either simulate people or offer personalized functionality, making the demographic representativeness of LLMs crucial for equitable utility. At the same time, we know little about the extent to which these models actually reflect the demographic attributes and behaviors of certain groups or populations, with conflicting findings in empirical research. To shed light on this debate, we review 211 papers on the demographic representativeness of LLMs. We find that while 29% of the studies report positive conclusions on the representativeness of LLMs, 30% of these do not evaluate LLMs across multiple demographic categories or within demographic subcategories. Another 35% and 47% of the papers concluding positively fail to specify these subcategories altogether for gender and race, respectively. Of the articles that do report subcategories, fewer than half include marginalized groups in their study. Finally, more than a third of the papers do not define the target population to whom their findings apply; of those that do define it either implicitly or explicitly, a large majority study only the U.S. Taken together, our findings suggest an inflated perception of LLM representativeness in the broader community. We recommend more precise evaluation methods and comprehensive documentation of demographic attributes to ensure the responsible use of LLMs for social applications.
pdf
bib
abs
LlamaV-o1: Rethinking Step-by-step Visual Reasoning in LLMs
Omkar Thawakar
|
Dinura Dissanayake
|
Ketan Pravin More
|
Ritesh Thawkar
|
Ahmed Heakl
|
Noor Ahsan
|
Yuhao Li
|
Ilmuz Zaman Mohammed Zumri
|
Jean Lahoud
|
Rao Muhammad Anwer
|
Hisham Cholakkal
|
Ivan Laptev
|
Mubarak Shah
|
Fahad Shahbaz Khan
|
Salman Khan
Step-by-step reasoning is crucial for solving complex visual tasks, yet existing approaches lack a comprehensive framework for evaluating this capability and do not emphasize step-wise problem-solving. To this end, we propose a comprehensive framework for advancing multi-step visual reasoning in large multimodal models (LMMs) through three key contributions. First, we introduce a Visual Reasoning Chain Benchmark, the most comprehensive benchmark for multi-step visual reasoning, covering eight diverse categories and over 4k reasoning steps. This enables rigorous evaluation of LMMs’ ability to reason accurately and interpretably across multiple steps. Second, we propose a fine-grained reasoning metric that evaluates correctness and logical coherence at each step, providing deeper insights beyond traditional accuracy metrics. Third, we introduce LlamaV-o1, a state-of-the-art multimodal reasoning model trained using a multi-step curriculum learning approach. LlamaV-o1 is optimized for structured, step-by-step reasoning and significantly outperforms existing open-source models. It surpasses Llava-CoT with a 3.8% absolute gain across six benchmarks, achieving an average score of 67.3 while being 5x faster during inference scaling. Our benchmark, model, and code are available at https://github.com/mbzuai-oryx/LlamaV-o1.
pdf
bib
abs
Burn After Reading: Do Multimodal Large Language Models Truly Capture Order of Events in Image Sequences?
Yingjin Song
|
Yupei Du
|
Denis Paperno
|
Albert Gatt
This paper introduces the TempVS benchmark, which focuses on temporal grounding and reasoning capabilities of Multimodal Large Language Models (MLLMs) in image sequences. TempVS consists of three main tests (i.e., event relation inference, sentence ordering and image ordering), each accompanied by a basic grounding test. TempVS requires MLLMs to rely on both visual and linguistic modalities to understand the temporal order of events. We evaluate 38 state-of-the-art MLLMs, demonstrating that models struggle to solve TempVS, with a substantial performance gap compared to human capabilities. We also provide fine-grained insights that suggest promising directions for future research. Our TempVS benchmark data and code are available at https://github.com/yjsong22/TempVS.
pdf
bib
abs
Full-Step-DPO: Self-Supervised Preference Optimization with Step-wise Rewards for Mathematical Reasoning
Huimin Xu
|
Xin Mao
|
Feng-Lin Li
|
Xiaobao Wu
|
Wang Chen
|
Wei Zhang
|
Anh Tuan Luu
Direct Preference Optimization (DPO) often struggles with long-chain mathematical reasoning. Existing approaches, such as Step-DPO, typically improve this by focusing on the first erroneous step in the reasoning chain. However, they overlook all other steps and rely heavily on humans or GPT-4 to identify erroneous steps. To address these issues, we propose Full-Step-DPO, a novel DPO framework tailored for mathematical reasoning. Instead of optimizing only the first erroneous step, it leverages step-wise rewards from the entire reasoning chain. This is achieved by training a self-supervised process reward model, which automatically scores each step, providing rewards while avoiding reliance on external signals. Furthermore, we introduce a novel step-wise DPO loss, which dynamically updates gradients based on these step-wise rewards. This endows language models with stronger reasoning capabilities. Extensive evaluations on both in-domain and out-of-domain mathematical reasoning benchmarks across various base language models demonstrate that Full-Step-DPO achieves superior performance compared to state-of-the-art baselines.
pdf
bib
abs
Do Emotions Really Affect Argument Convincingness? A Dynamic Approach with LLM-based Manipulation Checks
Yanran Chen
|
Steffen Eger
Emotions have been shown to play a role in argument convincingness, yet this aspect is underexplored in the natural language processing (NLP) community. Unlike prior studies that use static analyses, focus on a single text domain or language, or treat emotion as just one of many factors, we introduce a dynamic framework inspired by manipulation checks commonly used in psychology and social science; leveraging LLM-based manipulation checks, this framework examines the extent to which perceived emotional intensity influences perceived convincingness. Through human evaluation of arguments across different languages, text domains, and topics, we find that in over half of cases, human judgments of convincingness remain unchanged despite variations in perceived emotional intensity; when emotions do have an impact, they more often enhance rather than weaken convincingness. We further analyze whether 11 LLMs behave like humans in the same scenario, finding that while LLMs generally mirror human patterns, they struggle to capture nuanced emotional effects in individual judgments.
pdf
bib
abs
SCOPE: Compress Mathematical Reasoning Steps for Efficient Automated Process Annotation
Huimin Xu
|
Xin Mao
|
Feng-Lin Li
|
Xiaobao Wu
|
Wang Chen
|
Wei Zhang
|
Anh Tuan Luu
Process Reward Models (PRMs) have demonstrated promising results in mathematical reasoning, but existing process annotation approaches, whether through human annotations or Monte Carlo simulations, remain computationally expensive. In this paper, we introduce Step COmpression for Process Estimation (SCOPE), a novel compression-based approach that significantly reduces annotation costs. We first translate natural language reasoning steps into code and normalize them through an Abstract Syntax Tree, then merge equivalent steps to construct a prefix tree. Unlike simulation-based methods that waste numerous samples on estimation, SCOPE leverages a compression-based prefix tree where each root-to-leaf path serves as a training sample, reducing the complexity from O(NMK) to O(N). We construct a large-scale dataset containing 509K samples with only 5% of the computational resources required by previous methods. Empirical results demonstrate that PRMs trained on our dataset consistently outperform existing automated annotation approaches on both Best-of-N strategy and ProcessBench.
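The prefix-tree step described in this abstract can be illustrated with a minimal sketch. It assumes the reasoning steps have already been translated to code and normalized into comparable strings, which the sketch does not attempt; the function names `build_prefix_tree` and `root_to_leaf_paths` are hypothetical and are not taken from the SCOPE release.

```python
from typing import Dict, List, Tuple


def build_prefix_tree(solutions: List[List[str]]) -> Dict:
    """Merge step sequences into a nested-dict trie.

    Assumption: each step is already a normalized string, so two
    equivalent steps compare equal and share one node.
    """
    root: Dict = {}
    for steps in solutions:
        node = root
        for step in steps:
            # Reuse the child if this step was seen at this position before.
            node = node.setdefault(step, {})
    return root


def root_to_leaf_paths(tree: Dict, prefix: Tuple[str, ...] = ()) -> List[List[str]]:
    """List every root-to-leaf path; a solution that is a strict prefix
    of another one is absorbed into the longer path."""
    if not tree:
        return [list(prefix)] if prefix else []
    paths: List[List[str]] = []
    for step, child in tree.items():
        paths.extend(root_to_leaf_paths(child, prefix + (step,)))
    return paths


# Three sampled solutions; the first two are identical after normalization.
solutions = [
    ["x = 2 + 3", "y = x * 4"],
    ["x = 2 + 3", "y = x * 4"],
    ["x = 2 + 3", "y = x + 4"],
]
paths = root_to_leaf_paths(build_prefix_tree(solutions))
```

In this toy input the three solutions collapse into two paths that share their first step, which is the kind of merging the abstract attributes to the compression-based prefix tree.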
pdf
bib
abs
Compositional Syntactico-SemBanking for English as a Second or Foreign Language
Wenxi Li
|
Xihao Wang
|
Weiwei Sun
Despite the widespread use of English as a Second or Foreign Language (ESFL), the development of syntactico-semantic representations for it remains limited: the irregularities in ESFL complicate systematic composition and subsequently the derivation of its semantics. This paper draws on constructivism and proposes a novel Synchronous Hyperedge Replacement Grammar (SHRG)-based constructivist approach to address the challenges. By using constructions as fundamental units, this approach not only accommodates both the idiosyncrasies and the compositional nature of ESFL, but also bridges the gap between literal cues and intended meaning. The feasibility of this constructivist approach is demonstrated using real ESFL data, resulting in a gold-standard, medium-sized syntactico-semantic bank that covers a wide range of ESFL phenomena.
pdf
bib
abs
Semantics-aware prompting for translating NOtices To AirMen
Minal Nitin Dani
|
Aishwarya Maheswaran
|
Maunendra Sankar Desarkar
A NOTAM, or NOtice To AirMen, is a crucial notice for different aviation stakeholders, particularly flight crews. It delivers essential notifications about abnormal conditions of Aviation System components, such as changes to facilities, hazards, services, or procedures that are not known far enough in advance to be publicized through other means. NOTAM messages are short, contain acronyms, and look cryptic in most cases. Writing and understanding these messages puts a heavy cognitive load on their end users. In this work, we take up the task of translating NOTAMs into English natural language using LLMs. Since NOTAMs do not adhere to English grammar rules and have their own decoding rules, large language models (LLMs) cannot translate them without effective prompting. In this paper, we develop a framework to come up with effective prompts to achieve the translations. Our approach uses context-aware semantic prompting techniques, paired with domain-specific rules, to improve the accuracy and clarity of translations. The framework is evaluated using comprehensive experiments (6 LLMs of varying sizes, and with 5 different prompting setups for each) and eight evaluation metrics measuring different aspects of the translation. The results demonstrate that our methodology can produce clear translations that accurately convey the information contained in NOTAMs.
pdf
bib
abs
Stereotype or Personalization? User Identity Biases Chatbot Recommendations
Anjali Kantharuban
|
Jeremiah Milbauer
|
Maarten Sap
|
Emma Strubell
|
Graham Neubig
While personalized recommendations are often desired by users, it can be difficult in practice to distinguish cases of bias from cases of personalization: we find that models generate racially stereotypical recommendations regardless of whether the user revealed their identity intentionally through explicit indications or unintentionally through implicit cues. We demonstrate that when people use large language models (LLMs) to generate recommendations, the LLMs produce responses that reflect both what the user wants and who the user is. We argue that chatbots ought to transparently indicate when recommendations are influenced by a user’s revealed identity characteristics, but observe that they currently fail to do so. Our experiments show that even though a user’s revealed identity significantly influences model recommendations (p < 0.001), model responses obfuscate this fact in response to user queries. This bias and lack of transparency occur consistently across multiple popular consumer LLMs and for four American racial groups.
pdf
bib
abs
Automated main concept generation for narrative discourse assessment in aphasia
Ankita Gupta
|
Marisa Hudspeth
|
Polly Stokes
|
Jacquie Kurland
|
Brendan O’Connor
We present an interesting application of narrative understanding in the clinical assessment of aphasia, where story retelling tasks are used to evaluate a patient’s communication abilities. This clinical setting provides a framework to help operationalize narrative discourse analysis and an application-focused evaluation method for narrative understanding systems. In particular, we highlight the use of main concepts (MCs)—a list of statements that capture a story’s gist—for aphasic discourse analysis. We then propose automatically generating MCs from novel stories, which experts can edit manually, thus enabling wider adaptation of current assessment tools. We further develop a prompt ensemble method using large language models (LLMs) to automatically generate MCs for a novel story. We evaluate our method on an existing narrative summarization dataset to establish its intrinsic validity. We further apply it to a set of stories that have been annotated with MCs through extensive analysis of retells from non-aphasic and aphasic participants (Kurland et al., 2021, 2025). Our results show that our proposed method can generate most of the gold-standard MCs for stories from this dataset. Finally, we release this dataset of stories with annotated MCs to spur more research in this area.
pdf
bib
abs
Can VLMs Actually See and Read? A Survey on Modality Collapse in Vision-Language Models
Mong Yuan Sim
|
Wei Emma Zhang
|
Xiang Dai
|
Biaoyan Fang
Vision-language models (VLMs) integrate textual and visual information, enabling the model to process visual inputs and leverage visual information to generate predictions. Such models are in demand for tasks such as visual question answering, image captioning, and visual grounding. However, some recent work found that VLMs often rely heavily on textual information, ignoring visual information, but are still able to achieve competitive performance in vision-language (VL) tasks. This survey reviews modality collapse analysis work to provide insights into the reason for this unintended behavior. It also reviews probing studies for fine-grained vision-language understanding, presenting current findings on information encoded in VL representations and highlighting potential directions for future research.
pdf
bib
abs
“You are Beautiful, Body Image Stereotypes are Ugly!” BIStereo: A Benchmark to Measure Body Image Stereotypes in Language Models
Narjis Asad
|
Nihar Ranjan Sahoo
|
Rudra Murthy
|
Swaprava Nath
|
Pushpak Bhattacharyya
While a few high-quality bias benchmark datasets exist to address stereotypes in Language Models (LMs), a notable lack of focus remains on body image stereotypes. To bridge this gap, we propose BIStereo, a suite to uncover LMs’ biases towards people of certain physical appearance characteristics, namely, skin complexion, body shape, height, attire, and a miscellaneous category including hair texture, eye color, and more. Our dataset comprises 40k sentence pairs designed to assess LMs’ biased preference for certain body types. We further include 60k premise-hypothesis pairs designed to comprehensively assess LMs’ preference for fair skin tone. Additionally, we curate 553 tuples consisting of a body image descriptor, gender, and a stereotypical attribute, validated by a diverse pool of annotators for physical appearance stereotypes. We propose a metric, TriSentBias, that captures the biased preferences of LMs towards a certain body type over others. Using BIStereo, we assess the presence of body image biases in ten different language models, revealing significant biases in models Muril, XLMR, Llama3, and Gemma. We further evaluate the LMs through downstream NLI and Analogy tasks. Our NLI experiments highlight notable patterns in the LMs that align with the well-documented cognitive bias in humans known as the Halo Effect.
pdf
bib
abs
Retrieval Models Aren’t Tool-Savvy: Benchmarking Tool Retrieval for Large Language Models
Zhengliang Shi
|
Yuhan Wang
|
Lingyong Yan
|
Pengjie Ren
|
Shuaiqiang Wang
|
Dawei Yin
|
Zhaochun Ren
Tool learning aims to augment large language models (LLMs) with diverse tools, enabling them to act as agents for solving practical tasks. Due to the limited context length of tool-using LLMs, adopting information retrieval (IR) models to select useful tools from large toolsets is a critical initial step. However, the performance of IR models in tool retrieval tasks remains underexplored and unclear. Most tool-use benchmarks simplify this step by manually pre-annotating a small set of relevant tools for each task, which is far from real-world scenarios. In this paper, we propose ToolRet, a heterogeneous tool retrieval benchmark comprising 7.6k diverse retrieval tasks, and a corpus of 43k tools, collected from existing datasets. We benchmark six types of models on ToolRet. Surprisingly, even the models with strong performance in conventional IR benchmarks exhibit poor performance on ToolRet. This low retrieval quality degrades the task pass rate of tool-use LLMs. As a further step, we contribute a large-scale training dataset with over 200k instances, which substantially optimizes the tool retrieval ability of IR models.
pdf
bib
abs
FineCite: A Novel Approach For Fine-Grained Citation Context Analysis
Lasse M. Jantsch
|
Dong-Jae Koh
|
Seonghwan Yoon
|
Jisu Lee
|
Anne Lauscher
|
Young-Kyoon Suh
Citation context analysis (CCA) is a field of research studying the role and purpose of citation in scientific discourse. While most of the efforts in CCA have been focused on elaborate characterization schemata to assign function or intent labels to individual citations, the citation context as the basis for such a classification has received rather limited attention. This relative neglect, however, has led to the prevalence of vague definitions and restrictive assumptions, limiting the citation context in its expressiveness. It is a common practice, for example, to restrict the context to the citing sentence. While this simple context conceptualization might be sufficient to assign intent or function classes, it fails to cover the rich information of scientific discourse. To address this concern, we analyze the context conceptualizations of previous works and, to our knowledge, construct the first comprehensive context definition based on the semantic properties of the citing text. To evaluate this definition, we construct and publish the FineCite corpus containing 1,056 manually annotated citation contexts. Our experiments on established CCA benchmarks demonstrate the effectiveness of our fine-grained context definition, showing improvements of up to 25% compared to state-of-the-art approaches. We make our code and data publicly available at https://github.com/lab-paper-code/FineCite.
pdf
bib
abs
Decoupling Reasoning and Knowledge Injection for In-Context Knowledge Editing
Changyue Wang
|
Weihang Su
|
Qingyao Ai
|
Yujia Zhou
|
Yiqun Liu
Knowledge editing enables efficient updates to Large Language Models (LLMs) by modifying specific knowledge without full-model retraining. Among knowledge editing approaches, in-context editing (ICE) stands out for its ability to inject knowledge without modifying the model’s parameters. However, existing ICE approaches directly edit model context without isolating target knowledge from the reasoning path of model inference, resulting in unreliable and low-quality outputs, particularly in multi-hop tasks. To investigate this issue, we analyze the interaction between reasoning path planning and knowledge injection, showing that the reasoning ability of an LLM is usually coupled with its original knowledge, and that directly replacing old knowledge with new knowledge can simultaneously hurt the LLM’s performance in task reasoning. Based on these findings, we propose DecKER, a novel ICE framework that separates model reasoning from knowledge editing. Extensive experiments show that DecKER significantly improves multi-hop reasoning performance by mitigating knowledge conflicts and preserving reasoning integrity.
pdf
bib
abs
Entrospect: Information-Theoretic Self-Reflection Elicits Better Response Refinement of Small Language Models
Tianqiang Yan
|
Ziqiao Lin
|
Lin Zhang
|
Zhenglong Sun
|
Yuan Gao
Self-reflection helps de-hallucinate Large Language Models (LLMs). However, the effectiveness of self-reflection remains insufficiently validated in the context of Small Language Models (SLMs), which exhibit limited semantic capacities. In particular, we demonstrate that the conventional self-reflection paradigm, such as Self-Refine, fails to deliver robust response refinement for models with parameter sizes of 10 billion or smaller, even when compared to generations elicited through Chain-of-Thought (CoT) prompting. To improve SLMs’ self-reflection, we redesign Self-Refine and introduce Entrospect (ENTROpy-aware IntroSPECTion), an information-theoretic framework based on prompt engineering. We evaluated Entrospect using accuracy and average time consumption metrics to comprehensively assess its precision and computational efficiency. Experiments conducted across four distinct SLMs and four baseline methods demonstrate that Entrospect achieves state-of-the-art performance on validation tasks. Notably, under identical model and data settings, Entrospect delivers a remarkable improvement of up to 36.2 in reasoning accuracy while enhancing computational efficiency by as much as 10 times compared to its predecessor, Self-Refine.
pdf
bib
abs
Iterative Repair with Weak Verifiers for Few-shot Transfer in KBQA with Unanswerability
Riya Sawhney
|
Samrat Yadav
|
Indrajit Bhattacharya
|
Mausam Mausam
Real-world applications of KBQA require models to detect different types of unanswerable questions with a limited volume of in-domain labeled training data. We propose the novel task of few-shot transfer for KBQA with unanswerable questions. The state-of-the-art KBQA few-shot transfer model (FuSIC-KBQA) uses an iterative repair strategy that assumes that all questions are answerable. As a remedy, we present FUn-FuSIC – a novel solution for our task that extends FuSIC-KBQA with Feedback for Unanswerability (FUn), which is an iterative repair strategy for answerable as well as unanswerable questions. FUn uses feedback from a suite of strong and weak verifiers, and an adaptation of self-consistency for unanswerability for assessing answerability of questions. Our experiments show that FUn-FuSIC significantly outperforms suitable adaptations of multiple LLM-based and supervised SoTA models on our task, while establishing a new SoTA performance for answerable few-shot transfer as well. We have made datasets and other resources publicly available at https://github.com/dair-iitd/funfusic/.
pdf
bib
abs
Safeguarding RAG Pipelines with GMTP: A Gradient-based Masked Token Probability Method for Poisoned Document Detection
San Kim
|
Jonghwi Kim
|
Yejin Jeon
|
Gary Lee
Retrieval-Augmented Generation (RAG) enhances Large Language Models (LLMs) by providing external knowledge for accurate and up-to-date responses. However, this reliance on external sources exposes a security risk; attackers can inject poisoned documents into the knowledge base to steer the generation process toward harmful or misleading outputs. In this paper, we propose Gradient-based Masked Token Probability (GMTP), a novel defense method to detect and filter out adversarially crafted documents. Specifically, GMTP identifies high-impact tokens by examining gradients of the retriever’s similarity function. These key tokens are then masked, and their probabilities are checked via a Masked Language Model (MLM). Since injected tokens typically exhibit markedly low masked-token probabilities, this enables GMTP to easily detect malicious documents and achieve high-precision filtering. Experiments demonstrate that GMTP is able to eliminate over 90% of poisoned content while retaining relevant documents, thus maintaining robust retrieval and generation performance across diverse datasets and adversarial settings.
pdf
bib
abs
EnSToM: Enhancing Dialogue Systems with Entropy-Scaled Steering Vectors for Topic Maintenance
Heejae Suh
|
Yejin Jeon
|
Deokhyung Kang
|
Taehee Park
|
Yejin Min
|
Gary Lee
Small large language models (sLLMs) offer the advantage of being lightweight and efficient, which makes them suitable for resource-constrained environments. However, sLLMs often struggle to maintain topic consistency in task-oriented dialogue systems, which is critical for scenarios such as service chatbots. Specifically, it is important to ensure that the model denies off-topic or malicious inputs and adheres to its intended functionality so as to prevent potential misuse and uphold reliability. Towards this, existing activation engineering approaches have been proposed to manipulate internal activations during inference. While these methods are effective in certain scenarios, our preliminary experiments reveal their limitations in ensuring topic adherence. Therefore, to address this, we propose a novel approach termed Entropy-scaled Steering vectors for Topic Maintenance (EnSToM). EnSToM dynamically adjusts the steering intensity based on input uncertainty, which allows the model to handle off-topic distractors effectively while preserving on-topic accuracy. Our experiments demonstrate that EnSToM achieves significant performance gain with a relatively small data size compared to fine-tuning approaches. By improving topic adherence without compromising efficiency, our approach provides a robust solution for enhancing sLLM-based dialogue systems.
pdf
bib
abs
MultiTEND: A Multilingual Benchmark for Natural Language to NoSQL Query Translation
Zhiqian Qin
|
Yuanfeng Song
|
Jinwei Lu
|
Yuanwei Song
|
Shuaimin Li
|
Chen Jason Zhang
Natural language interfaces for NoSQL databases are increasingly vital in the big data era, enabling users to interact with complex, unstructured data without deep technical expertise. However, most recent advancements focus on English, leaving a gap for multilingual support. This paper introduces MultiTEND, the first and largest multilingual benchmark for natural language to NoSQL query generation, covering six languages: English, German, French, Russian, Japanese and Mandarin Chinese. Using MultiTEND, we analyze challenges in translating natural language to NoSQL queries across diverse linguistic structures, including lexical and syntactic differences. Experiments show that performance accuracy in both English and non-English settings remains relatively low, with a 4%-6% gap across scenarios like fine-tuned SLM, zero-shot LLM, and RAG for LLM. To address the aforementioned challenges, we introduce MultiLink, a novel framework that bridges the multilingual input to NoSQL query generation gap through a Parallel Linking Process. It breaks down the task into multiple steps, integrating parallel multilingual processing, Chain-of-Thought (CoT) reasoning, and Retrieval-Augmented Generation (RAG) to tackle lexical and structural challenges inherent in multilingual NoSQL generation. MultiLink shows enhancements in all metrics for every language against the top baseline, boosting execution accuracy by about 15% for English and averaging a 10% improvement for non-English languages.
pdf
bib
abs
Tool learning via Inference-time Scaling and Cycle Verifier
Xiaobo Liang
|
Wenjin Xie
|
Juntao Li
|
Wanfu Wang
|
Yibin Chen
|
Kehai Chen
|
Min Zhang
In inference-time scaling, Chain-of-Thought (CoT) plays a crucial role in enabling large language models (LLMs) to exhibit reasoning capabilities. However, in many scenarios, high-quality CoT data is scarce or even unavailable. In such cases, STaR-like methods can help LLMs synthesize CoT based on user queries and responses, but they inevitably suffer from the risk of compounding errors. In this work, we tackle an even more challenging scenario: tool learning in the absence of user queries. We design a data scaling method using back-translation, which establishes an inference cycle to synthesize both user queries and CoT data. To reduce the compounding error of inference time, we introduce two rule-based verifiers to assess the validity of the synthesized CoT data. In particular, the Cycle Verifier facilitates performance improvement by continuously accumulating new data over multiple iterations. Our approach achieves a 75.4% pass rate and a 79.6% win rate using small models (7B) in StableToolBench. Notably, these results are obtained exclusively from self-synthesized high-quality data, without relying on external supervision or expert trajectories for warm-up.
pdf
bib
abs
When Benchmarks Talk: Re-Evaluating Code LLMs with Interactive Feedback
Jane Pan
|
Ryan Shar
|
Jacob Pfau
|
Ameet Talwalkar
|
He He
|
Valerie Chen
Programming with a coding assistant is a fundamentally interactive process, yet existing static benchmarks fail to capture key features of model-user collaboration. We introduce an interactive evaluation pipeline to examine how LLMs incorporate different types of feedback in a collaborative setting, in which we obfuscate the input of static coding benchmarks so that the code model must interact with a simulated user. Across 10 models and 3 datasets, the relative rankings of models often permute greatly between static and interactive settings, despite models being fairly robust to feedback that contains errors. We also observe that similarly effective feedback types differ in terms of how models respond to higher- vs. lower-quality feedback. Moreover, feedback type impacts the degree to which the models make aesthetic or behavioral edits to their output. Our work aims to “re-evaluate” model coding capabilities through an interactive lens toward bridging the gap between existing evaluations and real-world usage.
pdf
bib
abs
Reranking-based Generation for Unbiased Perspective Summarization
Narutatsu Ri
|
Nicholas Deas
|
Kathleen McKeown
Generating unbiased summaries in real-world settings such as political perspective summarization remains a crucial application of Large Language Models (LLMs). Yet, existing evaluation frameworks rely on traditional metrics for measuring key attributes such as coverage and faithfulness without verifying their applicability, and efforts to develop improved summarizers are still nascent. We address these gaps by (1) identifying reliable metrics for measuring perspective summary quality, and (2) investigating the efficacy of LLM-based methods beyond zero-shot inference. Namely, we build a test set for benchmarking metric reliability using human annotations and show that traditional metrics underperform compared to language model–based metrics, which prove to be strong evaluators. Using these metrics, we show that reranking-based methods yield strong results, and preference tuning with synthetically generated and reranking-labeled data further boosts performance. Our findings aim to contribute to the reliable evaluation and development of perspective summarization methods.
pdf
bib
abs
KARPA: A Training-free Method of Adapting Knowledge Graph as References for Large Language Model’s Reasoning Path Aggregation
Siyuan Fang
|
Kaijing Ma
|
Tianyu Zheng
|
Xeron Du
|
Ningxuan Lu
|
Ge Zhang
|
Qingkun Tang
Large language models (LLMs) demonstrate exceptional performance across a variety of tasks, yet they are often affected by hallucinations and the timeliness of knowledge. Leveraging knowledge graphs (KGs) as external knowledge sources has emerged as a viable solution, but existing methods for LLM-based knowledge graph question answering (KGQA) are often limited by step-by-step decision-making on KGs, restricting the global planning and reasoning capabilities of LLMs, or they require fine-tuning or pre-training on specific KGs. To address these challenges, we propose Knowledge graph Assisted Reasoning Path Aggregation (KARPA), a novel framework that harnesses the global planning abilities of LLMs for efficient and accurate KG reasoning. KARPA operates in three steps: pre-planning relation paths using the LLM’s global planning capabilities, matching semantically relevant paths via an embedding model, and reasoning over these paths to generate answers. Unlike existing KGQA methods, KARPA avoids stepwise traversal, requires no additional training, and is adaptable to various LLM architectures. Extensive experimental results show that KARPA achieves state-of-the-art performance in KGQA tasks, delivering both high efficiency and accuracy.
pdf
bib
abs
Enhancing LLM-based Hatred and Toxicity Detection with Meta-Toxic Knowledge Graph
Yibo Zhao
|
Jiapeng Zhu
|
Can Xu
|
Yao Liu
|
Xiang Li
The rapid growth of social media platforms has raised significant concerns regarding online content toxicity. When Large Language Models (LLMs) are used for toxicity detection, two key challenges emerge: 1) the absence of domain-specific toxicity knowledge leads to false negatives; 2) the excessive sensitivity of LLMs to toxic speech results in false positives, limiting freedom of speech. To address these issues, we propose a novel method called MetaTox, leveraging graph search on a meta-toxic knowledge graph to enhance hatred and toxicity detection. First, we construct a comprehensive meta-toxic knowledge graph by utilizing LLMs to extract toxic information through a three-step pipeline. Second, we query the graph via retrieval and ranking processes to supplement accurate, relevant toxicity knowledge. Extensive experiments and case studies across multiple datasets demonstrate that our MetaTox boosts overall toxicity detection performance, particularly in out-of-domain settings. In addition, under in-domain scenarios, we surprisingly find that small language models are more competent. Our code is available at https://github.com/YiboZhao624/MetaTox.
pdf
bib
abs
Mixture-of-Personas Language Models for Population Simulation
Ngoc Bui
|
Hieu Trung Nguyen
|
Shantanu Kumar
|
Julian Theodore
|
Weikang Qiu
|
Viet Anh Nguyen
|
Rex Ying
Advances in Large Language Models (LLMs) paved the way for their emerging applications in various domains, such as human behavior simulations, where LLMs could augment human-generated data in social science research and machine learning model training. However, pretrained LLMs often fail to capture the behavioral diversity of target populations due to the inherent variability across individuals and groups. To address this, we propose Mixture of Personas (MoP), a probabilistic prompting method that aligns LLM responses with the target population. MoP is a contextual mixture model, where each component is an LM agent characterized by a persona and an exemplar that represents the behaviors of a subpopulation. The persona and the exemplar are randomly chosen according to the learned mixing weights to elicit diverse LLM responses during simulation. MoP is flexible, does not require model fine-tuning, and is transferable between base models. Experiments for synthetic data generation show that MoP outperforms competing methods in alignment and diversity metrics.
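The sampling step in this abstract, choosing a persona and an exemplar according to mixing weights, can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: the weights are fixed numbers instead of learned, context-dependent ones, the prompt template and the name `sample_prompt` are hypothetical, and the call to the language model is left out.

```python
import random
from typing import List


def sample_prompt(
    personas: List[str],
    persona_weights: List[float],
    exemplars: List[str],
    exemplar_weights: List[float],
    question: str,
    rng: random.Random,
) -> str:
    """Draw one persona and one exemplar by weight and build a prompt.

    Assumption: the weights are given as plain non-negative numbers;
    in the abstract they are learned mixing weights.
    """
    persona = rng.choices(personas, weights=persona_weights, k=1)[0]
    exemplar = rng.choices(exemplars, weights=exemplar_weights, k=1)[0]
    # Hypothetical template; the paper's actual prompt format is not given here.
    return (
        f"{persona}\n"
        f"Example response: {exemplar}\n"
        f"Question: {question}\n"
        f"Response:"
    )


rng = random.Random(0)
prompt = sample_prompt(
    personas=["You are a college student.", "You are a retired teacher."],
    persona_weights=[0.7, 0.3],
    exemplars=["I loved the pacing.", "The plot felt thin."],
    exemplar_weights=[0.5, 0.5],
    question="What did you think of the film?",
    rng=rng,
)
```

Calling `sample_prompt` repeatedly yields prompts whose personas and exemplars follow the given weights, which is how the mixture produces varied responses without any fine-tuning of the underlying model.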
pdf
bib
abs
ClusComp: A Simple Paradigm for Model Compression and Efficient Finetuning
Baohao Liao
|
Christian Herold
|
Seyyed Hadi Hashemi
|
Stefan Vasilev
|
Shahram Khadivi
|
Christof Monz
As large language models (LLMs) scale, model compression is crucial for edge deployment and accessibility. Weight-only quantization reduces model size but suffers from performance degradation at lower bit widths. Moreover, standard finetuning is incompatible with quantized models, and alternative methods often fall short of full finetuning. In this paper, we propose ClusComp, a simple yet effective compression paradigm that clusters weight matrices into codebooks and finetunes them block-by-block. ClusComp (1) achieves superior performance in 2-4 bit quantization, (2) pushes compression to 1-bit while outperforming ultra-low-bit methods with minimal finetuning, and (3) enables efficient finetuning, even surpassing existing quantization-based approaches and rivaling full FP16 finetuning. Notably, ClusComp supports compression and finetuning of 70B LLMs on a single A6000-48GB GPU.
pdf
bib
abs
Decomposed Opinion Summarization with Verified Aspect-Aware Modules
Miao Li
|
Jey Han Lau
|
Eduard Hovy
|
Mirella Lapata
Opinion summarization plays a key role in deriving meaningful insights from large-scale online reviews. To make the process more explainable and grounded, we propose a domain-agnostic modular approach guided by review aspects (e.g., cleanliness for hotel reviews) which separates the tasks of aspect identification, opinion consolidation, and meta-review synthesis to enable greater transparency and ease of inspection. We conduct extensive experiments across datasets representing scientific research, business, and product domains. Results show that our approach generates more grounded summaries compared to strong baseline models, as verified through automated and human evaluations. Additionally, our modular approach, which incorporates reasoning based on review aspects, produces more informative intermediate outputs than other knowledge-agnostic decomposition approaches. Lastly, we provide empirical results to show that these intermediate outputs can support humans in summarizing opinions from large volumes of reviews.
pdf
bib
abs
Token-Budget-Aware LLM Reasoning
Tingxu Han
|
Zhenting Wang
|
Chunrong Fang
|
Shiyu Zhao
|
Shiqing Ma
|
Zhenyu Chen
Reasoning is critical for large language models (LLMs) to excel in a wide range of tasks. While methods like Chain-of-Thought (CoT) reasoning enhance LLM performance by decomposing problems into intermediate steps, they also incur significant overhead in token usage, leading to increased costs. We find that the reasoning process of current LLMs is unnecessarily lengthy and it can be compressed by including a reasonable token budget in the prompt, but the choice of token budget plays a crucial role in the actual compression effectiveness. We then propose a token-budget-aware LLM reasoning framework that dynamically adjusts the number of reasoning tokens based on the reasoning complexity of each problem. Experiments show that our method effectively reduces token costs in CoT reasoning with only a slight performance reduction, offering a practical solution to balance efficiency and accuracy in LLM reasoning. Code: https://github.com/GeniusHTX/TALE.
pdf
bib
abs
HATA: Trainable and Hardware-Efficient Hash-Aware Top-k Attention for Scalable Large Model Inference
Ping Gong
|
Jiawei Yi
|
Shengnan Wang
|
Juncheng Zhang
|
Zewen Jin
|
Ouxiang Zhou
|
Ruibo Liu
|
Guanbin Xu
|
Youhui Bai
|
Bowen Ye
|
Kun Yuan
|
Tong Yang
|
Gong Zhang
|
Renhai Chen
|
Feng Wu
|
Cheng Li
Large Language Models (LLMs) have emerged as a pivotal research area, yet the attention module remains a critical bottleneck in LLM inference, even with techniques like KVCache to mitigate redundant computations. While various top-k attention mechanisms have been proposed to accelerate LLM inference by exploiting the inherent sparsity of attention, they often struggled to strike a balance between efficiency and accuracy. In this paper, we introduce HATA (Hash-Aware Top-k Attention), a novel approach that systematically integrates low-overhead learning-to-hash techniques into the Top-k attention process. Different from the existing top-k attention methods which are devoted to seeking an absolute estimation of qk score, typically with a great cost, HATA maps queries and keys into binary hash codes, and acquires the relative qk score order with a quite low cost, which is sufficient for realizing top-k attention. Extensive experiments demonstrate that HATA achieves up to 7.2× speedup compared to vanilla full attention while maintaining model accuracy. In addition, HATA outperforms the state-of-the-art top-k attention methods in both accuracy and efficiency across multiple mainstream LLM models and diverse tasks. HATA is open source at https://github.com/gpzlx1/HATA.
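The core idea in this abstract, ranking keys by the distance between binary codes instead of computing exact qk scores, can be sketched as follows. The sketch uses random hyperplane hashing as a stand-in; HATA learns its hash functions, so the `planes` matrix, the helper names, and all sizes here are assumptions for illustration only.

```python
import random

def hash_code(vec, planes):
    """Binary code with one bit per hyperplane, set when the projection is positive."""
    code = 0
    for plane in planes:
        dot = sum(v * p for v, p in zip(vec, plane))
        code = (code << 1) | (1 if dot > 0 else 0)
    return code

def topk_by_hamming(query, keys, planes, k):
    """Return indices of the k keys whose codes are closest to the query's code."""
    q_code = hash_code(query, planes)
    scored = []
    for idx, key in enumerate(keys):
        distance = bin(q_code ^ hash_code(key, planes)).count("1")
        scored.append((distance, idx))
    scored.sort()  # smaller Hamming distance first, ties broken by index
    return [idx for _, idx in scored[:k]]

rng = random.Random(0)
dim, n_bits = 8, 16
planes = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_bits)]  # random, not learned
query = [rng.gauss(0, 1) for _ in range(dim)]
keys = [list(query)]                                    # index 0: same direction as the query
keys += [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(6)]
keys.append([-x for x in query])                        # last index: opposite direction
selected = topk_by_hamming(query, keys, planes, k=3)
```

Only the relative order of the distances matters for selecting the top-k keys; full attention would then be computed over the selected keys alone.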
pdf
bib
abs
Answer When Needed, Forget When Not: Language Models Pretend to Forget via In-Context Knowledge Unlearning
Shota Takashiro
|
Takeshi Kojima
|
Andrew Gambardella
|
Qi Cao
|
Yusuke Iwasawa
|
Yutaka Matsuo
As large language models (LLMs) are applied across diverse domains, the ability to selectively unlearn specific information is becoming increasingly essential. For instance, LLMs are expected to selectively provide confidential information to authorized internal users, such as employees or trusted partners, while withholding it from external users, including the general public and unauthorized entities. Therefore, we propose a novel method termed “in-context knowledge unlearning”, which enables the model to selectively forget information at test time based on the query context. Our method fine-tunes pre-trained LLMs to enable prompt unlearning of target knowledge within the context, while preserving unrelated information. Experiments on TOFU, AGE and RWKU datasets using Llama2-7B/13B and Mistral-7B models demonstrate that our method achieves up to 95% forget accuracy while retaining 80% of unrelated knowledge, significantly outperforming baselines in both in-domain and out-of-domain scenarios. Further investigation of the model’s internal behavior revealed that fine-tuned LLMs generate correct predictions in the middle layers and preserve them up to the final layer. However, the decision to forget is made only at the last layer, i.e., “LLMs pretend to forget”. Our findings offer valuable insight into the improvement of the robustness of the unlearning mechanisms in LLMs, laying a foundation for future research in the field.
pdf
bib
abs
LIST: Linearly Incremental SQL Translator for Single-Hop Reasoning, Generation and Verification
Kaiyuan Guan
|
Ruoxin Li
|
Xudong Guo
|
Zhenning Huang
|
Xudong Weng
|
Hehuan Liu
|
Zheng Wei
|
Zang Li
SQL languages often feature nested structures that require robust interaction with databases. Aside from the well-validated schema linking methods on PLMs and LLMs, we introduce the Linearly Incremental SQL Translator (LIST), a novel algorithmic toolkit designed to leverage the notable reasoning and tool interaction capabilities inherent in LLMs. LIST transforms complex SQL queries into grammatically verifiable sub-queries which are arranged sequentially to reflect single-hop reasoning steps, enhancing both the granularity and accuracy of database interactions. With in-context learning, our experiments demonstrated significant improvements, achieving notable performance of 60.56% and 56.32% on the BIRD dataset with GPT-4o and Llama-3-70B-Instruct. To the best of our knowledge, this achieves SOTA performance among non-schema linking methods, also surpassing a series of schema linking based approaches at a comparable or better cost.
pdf
bib
abs
MAGI: Multi-Agent Guided Interview for Psychiatric Assessment
Guanqun Bi
|
Zhuang Chen
|
Zhoufu Liu
|
Hongkai Wang
|
Xiyao Xiao
|
Yuqiang Xie
|
Wen Zhang
|
Yongkang Huang
|
Yuxuan Chen
|
Libiao Peng
|
Minlie Huang
Automating structured clinical interviews could revolutionize mental healthcare accessibility, yet existing large language model (LLM) approaches fail to align with psychiatric diagnostic protocols. We present MAGI, the first framework that transforms the gold-standard Mini International Neuropsychiatric Interview (MINI) into automatic computational workflows through coordinated multi-agent collaboration. MAGI dynamically navigates clinical logic via four specialized agents: 1) an interview tree guided navigation agent adhering to the MINI’s branching structure, 2) an adaptive question agent blending diagnostic probing, explaining, and empathy, 3) a judgment agent validating whether the response from participants meets the node, and 4) a diagnosis agent generating Psychometric Chain-of-Thought (PsyCoT) traces that explicitly map symptoms to clinical criteria. Experimental results on 1,002 real-world participants covering depression, generalized anxiety, social anxiety and suicide show that MAGI advances LLM-assisted mental health assessment by combining clinical rigor, conversational adaptability, and explainable reasoning.
pdf
bib
abs
TituLLMs: A Family of Bangla LLMs with Comprehensive Benchmarking
Shahriar Kabir Nahin
|
Rabindra Nath Nandi
|
Sagor Sarker
|
Quazi Sarwar Muhtaseem
|
Md Kowsher
|
Apu Chandraw Shill
|
Md Ibrahim
|
Mehadi Hasan Menon
|
Tareq Al Muntasir
|
Firoj Alam
In this paper, we present TituLLMs, the first large pretrained Bangla LLMs, available in 1b and 3b parameter sizes. Due to computational constraints during both training and inference, we focused on smaller models. To train TituLLMs, we collected a pretraining dataset of approximately 37 billion tokens. We extended the Llama-3.2 tokenizer to incorporate language- and culture-specific knowledge, which also enables faster training and inference. There was a lack of benchmarking datasets to benchmark LLMs for Bangla. To address this gap, we developed five benchmarking datasets. We benchmarked various LLMs, including TituLLMs, and demonstrated that TituLLMs outperform their initial multilingual versions. However, this is not always the case, highlighting the complexities of language adaptation. Our work lays the groundwork for adapting existing multilingual open models to other low-resource languages. To facilitate broader adoption and further research, we have made the TituLLMs models and benchmarking datasets publicly available.
pdf
bib
abs
WikiMixQA: A Multimodal Benchmark for Question Answering over Tables and Charts
Negar Foroutan
|
Angelika Romanou
|
Matin Ansaripour
|
Julian Martin Eisenschlos
|
Karl Aberer
|
Rémi Lebret
Documents are fundamental to preserving and disseminating information, often incorporating complex layouts, tables, and charts that pose significant challenges for automatic document understanding (DU). While vision-language large models (VLLMs) have demonstrated improvements across various tasks, their effectiveness in processing long-context vision inputs remains unclear. This paper introduces WikiMixQA, a benchmark comprising 1,000 multiple-choice questions (MCQs) designed to evaluate cross-modal reasoning over tables and charts extracted from 4,000 Wikipedia pages spanning seven distinct topics. Unlike existing benchmarks, WikiMixQA emphasizes complex reasoning by requiring models to synthesize information from multiple modalities. We evaluate 12 state-of-the-art vision-language models, revealing that while proprietary models achieve ~70% accuracy when provided with direct context, their performance deteriorates significantly when retrieval from long documents is required. Among these, GPT-4-o is the only model exceeding 50% accuracy in this setting, whereas open-source models perform considerably worse, with a maximum accuracy of 27%. These findings underscore the challenges of long-context, multi-modal reasoning and establish WikiMixQA as a crucial benchmark for advancing document understanding research.
pdf
bib
abs
Let’s Fuse Step by Step: A Generative Fusion Decoding Algorithm with LLMs for Robust and Instruction-Aware ASR and OCR
Chan-Jan Hsu
|
Yi-Chang Chen
|
Feng-Ting Liao
|
Pei-Chen Ho
|
Yu-Hsiang Wang
|
Po-Chun Hsu
|
Da-shan Shiu
We introduce “Generative Fusion Decoding” (GFD), a novel shallow fusion framework, utilized to integrate large language models (LLMs) into cross-modal text recognition systems including automatic speech recognition (ASR) and optical character recognition (OCR). We derive the formulas necessary to enable GFD to operate across mismatched token spaces of different models by calculating likelihood at the byte level, thereby enabling seamless fusion and synchronous progression during the decoding process. GFD is plug-and-play by design, making it readily compatible with various auto-regressive models without the need for any re-training. GFD proves effective for general ASR and OCR tasks through intermediate and frequent interactions with LLMs, surpassing cascaded methods in English and Mandarin benchmarks. In addition, GFD transfers in-context learning abilities of LLMs and allows for adaptive ASR in instruction-aware and long-context settings, yielding significant WER reductions of up to 17.7%.
pdf
bib
abs
HPSS: Heuristic Prompting Strategy Search for LLM Evaluators
Bosi Wen
|
Pei Ke
|
Yufei Sun
|
Cunxiang Wang
|
Xiaotao Gu
|
Jinfeng Zhou
|
Jie Tang
|
Hongning Wang
|
Minlie Huang
Since the adoption of large language models (LLMs) for text evaluation has become increasingly prevalent in the field of natural language processing (NLP), a series of existing works attempt to optimize the prompts for LLM evaluators to improve their alignment with human judgment. However, their efforts are limited to optimizing individual factors of evaluation prompts, such as evaluation criteria or output formats, neglecting the combinatorial impact of multiple factors, which leads to insufficient optimization of the evaluation pipeline. Nevertheless, identifying well-behaved prompting strategies for adjusting multiple factors requires extensive enumeration. To this end, we comprehensively integrate 8 key factors for evaluation prompts and propose a novel automatic prompting strategy optimization method called Heuristic Prompting Strategy Search (HPSS). Inspired by the genetic algorithm, HPSS conducts an iterative search to find well-behaved prompting strategies for LLM evaluators. A heuristic function is employed to guide the search process, enhancing the performance of our algorithm. Extensive experiments across four evaluation tasks demonstrate the effectiveness of HPSS, consistently outperforming both human-designed evaluation prompts and existing automatic prompt optimization methods. Our code is available at https://github.com/thu-coai/HPSS.
pdf
bib
abs
A Fully Generative Motivational Interviewing Counsellor Chatbot for Moving Smokers Towards the Decision to Quit
Zafarullah Mahmood
|
Soliman Ali
|
Jiading Zhu
|
Mohamed Abdelwahab
|
Michelle Yu Collins
|
Sihan Chen
|
Yi Cheng Zhao
|
Jodi Wolff
|
Osnat C. Melamed
|
Nadia Minian
|
Marta Maslej
|
Carolynne Cooper
|
Matt Ratto
|
Peter Selby
|
Jonathan Rose
The conversational capabilities of Large Language Models (LLMs) suggest that they may be able to perform as automated talk therapists. It is crucial to know if these systems would be effective and adhere to known standards. We present a counsellor chatbot that focuses on motivating tobacco smokers to quit smoking. It uses a state-of-the-art LLM and a widely applied therapeutic approach called Motivational Interviewing (MI), and was evolved in collaboration with clinician-scientists with expertise in MI. We also describe and validate an automated assessment of both the chatbot’s adherence to MI and client responses. The chatbot was tested on 106 participants, and their confidence that they could succeed in quitting smoking was measured before the conversation and one week later. Participants’ confidence increased by an average of 1.7 on a 0-10 scale. The automated assessment of the chatbot showed adherence to MI standards in 98% of utterances, higher than human counsellors. The chatbot scored well on a participant-reported metric of perceived empathy but lower than typical human counsellors. Furthermore, participants’ language indicated a good level of motivation to change, a key goal in MI. These results suggest that the automation of talk therapy with a modern LLM has promise.
pdf
bib
abs
LegalCore: A Dataset for Event Coreference Resolution in Legal Documents
Kangda Wei
|
Xi Shi
|
Jonathan Tong
|
Sai Ramana Reddy
|
Anandhavelu Natarajan
|
Rajiv Jain
|
Aparna Garimella
|
Ruihong Huang
Recognizing events and their coreferential mentions in a document is essential for understanding semantic meanings of text. The existing research on event coreference resolution is mostly limited to news articles. In this paper, we present the first dataset for the legal domain, LegalCore, which has been annotated with comprehensive event and event coreference information. The legal contract documents we annotated in this dataset are several times longer than news articles, with an average length of around 25k tokens per document. The annotations show that legal documents have dense event mentions and feature both short-distance and super long-distance coreference links between event mentions. We further benchmark mainstream Large Language Models (LLMs) on this dataset for both event detection and event coreference resolution tasks, and find that this dataset poses significant challenges for state-of-the-art open-source and proprietary LLMs, which perform significantly worse than a supervised baseline. We will publish the dataset as well as the code.
pdf
bib
abs
Rectifying Belief Space via Unlearning to Harness LLMs’ Reasoning
Ayana Niwa
|
Masahiro Kaneko
|
Kentaro Inui
Large Language Models (LLMs) exhibit sophisticated reasoning yet still generate incorrect answers. We attribute these errors to Spurious Beliefs, defined as propositions the model internally considers as true despite being factually false. To reduce reasoning errors, we propose a belief space rectification framework. Our method first identifies the beliefs invoked during inference via an explanation‐based approach with Forward‐Backward Beam Search (FBBS). We subsequently apply unlearning via gradient ascent to suppress spurious beliefs and enhance true ones, thereby effectively rectifying the model’s belief space. Experiments on three QA datasets and three LLMs show that our method significantly reduces erroneous reasoning and improves generalization.
pdf
bib
abs
MemeDetoxNet: Balancing Toxicity Reduction and Context Preservation
Gitanjali Kumari
|
Jitendra Solanki
|
Asif Ekbal
Toxic memes often spread harmful and offensive content and pose a significant challenge in online environments. In this paper, we present MemeDetoxNet, a robust framework designed to mitigate toxicity in memes by leveraging fine-tuned pre-trained models. Our approach utilizes the interpretability of CLIP (Contrastive Language-Image Pre-Training) to identify toxic elements within the visual and textual components of memes. Our objective is to automatically assess the immorality of toxic memes and transform them into morally acceptable alternatives by employing large language models (LLMs) to replace offensive text and blurring toxic regions in the image. As a result, we proposed MemeDetoxNet that has three main primitives: (1) detection of toxic memes, (2) localizing and highlighting toxic visual and textual attributes, and (3) manipulating the toxic content to create a morally acceptable alternative. Empirical evaluation on several publicly available meme datasets shows a reduction in toxicity by approximately 10-20%. Both qualitative and quantitative analyses further demonstrate MemeDetoxNet’s superior performance in detoxifying memes compared to the other methods. These results underscore MemeDetoxNet’s potential as an effective tool for content moderation on online platforms.
pdf
bib
abs
Should I Trust You? Detecting Deception in Negotiations using Counterfactual RL
Wichayaporn Wongkamjan
|
Yanze Wang
|
Feng Gu
|
Denis Peskoff
|
Jonathan K. Kummerfeld
|
Jonathan May
|
Jordan Lee Boyd-Graber
An increasingly common socio-technical problem is people being taken in by offers that sound “too good to be true”, where persuasion and trust shape decision-making. This paper investigates how AI can help detect these deceptive scenarios. We analyze how humans strategically deceive each other in Diplomacy, a board game that requires both natural language communication and strategic reasoning. This requires extracting logical forms representing proposals—agreements that players suggest during communication—and computing their relative rewards using agents’ value functions. Combined with text-based features, this can improve our deception detection. Our method detects human deception with a high precision when compared to a Large Language Model approach that flags many true messages as deceptive. Future human-AI interaction tools can build on our methods for deception detection by triggering friction to give users a chance of interrogating suspicious proposals.
pdf
bib
abs
Multi-matrix Factorization Attention
Jingcheng Hu
|
Houyi Li
|
Yinmin Zhang
|
Zili Wang
|
Shuigeng Zhou
|
Xiangyu Zhang
|
Heung-Yeung Shum
We propose novel attention architectures, Multi-matrix Factorization Attention (MFA) and MFA-Key-Reuse (MFA-KR). Existing variants of standard Multi-Head Attention (MHA), including SOTA methods like MLA, fail to maintain comparably strong performance under stringent Key-Value cache (KV cache) constraints. MFA enhances model capacity by efficiently scaling up both the number and dimension of attention heads through low-rank matrix factorization in the Query-Key (QK) circuit. Extending MFA, MFA-KR further reduces memory requirements by repurposing the key cache as value through value projection re-parameterization. MFA’s design enables strong model capacity when working under a tight KV cache budget, while MFA-KR is suitable for even harsher KV cache limits with a minor performance trade-off. Notably, in our extensive and large-scale experiments, the proposed architecture outperforms MLA and performs comparably to MHA, while reducing KV cache usage by up to 56% and 93.7%, respectively.
pdf
bib
abs
Self-Training Elicits Concise Reasoning in Large Language Models
Tergel Munkhbat
|
Namgyu Ho
|
Seo Hyun Kim
|
Yongjin Yang
|
Yujin Kim
|
Se-Young Yun
Chain-of-thought (CoT) reasoning has enabled large language models (LLMs) to utilize additional computation through intermediate tokens to solve complex tasks. However, we posit that typical reasoning traces contain many redundant tokens, incurring extraneous inference costs. Upon examination of the output distribution of current LLMs, we find evidence on their latent ability to reason more concisely, relative to their default behavior. To elicit this capability, we propose simple fine-tuning methods which leverage self-generated concise reasoning paths obtained by best-of-N sampling and few-shot conditioning, in task-specific settings. Our combined method achieves a 30% reduction in output tokens on average, across five model families on GSM8K and MATH, while maintaining average accuracy. By exploiting the fundamental stochasticity and in-context learning capabilities of LLMs, our self-training approach robustly elicits concise reasoning on a wide range of models, including those with extensive post-training.
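One way to read the best-of-N step in this abstract is as a filter over sampled reasoning paths that keeps a short correct one as a fine-tuning target. The sketch below assumes the selection criterion is "shortest correct sample" and uses whitespace word counts as a stand-in for token counts; both choices, along with the `shortest_correct` helper and the sample data, are illustrative assumptions.

```python
def shortest_correct(samples, gold_answer):
    """Among N sampled reasoning paths, keep the shortest one with the right answer."""
    correct = [s for s in samples if s["answer"] == gold_answer]
    if not correct:
        return None  # no usable training target for this question
    # Whitespace word count is a crude proxy for output tokens.
    return min(correct, key=lambda s: len(s["reasoning"].split()))

# Hypothetical samples drawn from the same model for one question.
samples = [
    {"reasoning": "First, 6 times 7 means adding 6 seven times, which gives 42.", "answer": "42"},
    {"reasoning": "6 * 7 = 42.", "answer": "42"},
    {"reasoning": "6 * 7 = 36.", "answer": "36"},
]
picked = shortest_correct(samples, gold_answer="42")
missing = shortest_correct(samples, gold_answer="41")
```

The selected paths would then serve as self-generated fine-tuning data, so no external teacher model is involved.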
pdf
bib
abs
Reason from Future: Reverse Thought Chain Enhances LLM Reasoning
Yinlong Xu
|
Yanzhao Zheng
|
Shuoshuo Sun
|
Shuaihan Huang
|
Baohua Dong
|
Zhu Hangcheng
|
Ruohui Huang
|
Gang Yu
|
Hongxia Xu
|
Jian Wu
It has been demonstrated that carefully designed reasoning paradigms, like Chain-of-Thought (CoT) and Tree-of-Thought (ToT), can enhance the reasoning capabilities of small language models through detailed thinking and extensive thought searching, but unbounded branching factors in the search space create prohibitive reasoning costs. Moreover, these methods fall into the trap of local optimum reasoning, which means the model lacks a global perspective while solving problems. We propose a novel reasoning paradigm called Reason from Future (RFF), which generates reasoning paths by bidirectional reasoning that combines top-down planning with bottom-up reasoning accumulation. The essence of RFF lies in its reverse reasoning mechanism, which prioritizes core logical relationships and imposes goal-oriented constraints on intermediate steps, thereby reducing the search space and mitigating error accumulation inherent in sequential forward reasoning. Empirical evaluations across diverse experiments demonstrate that RFF outperforms conventional paradigms, solving complex tasks with higher accuracy and a smaller search space.
pdf
bib
abs
LLMs as Planning Formalizers: A Survey for Leveraging Large Language Models to Construct Automated Planning Models
Marcus Tantakoun
|
Christian Muise
|
Xiaodan Zhu
Large Language Models (LLMs) excel in various natural language tasks but often struggle with long-horizon planning problems requiring structured reasoning. This limitation has drawn interest in integrating neuro-symbolic approaches within the Automated Planning (AP) and Natural Language Processing (NLP) communities. However, identifying optimal AP deployment frameworks can be daunting and introduces new challenges. This paper aims to provide a timely survey of the current research with an in-depth analysis, positioning LLMs as tools for formalizing and refining planning specifications to support reliable off-the-shelf AP planners. By systematically reviewing the current state of research, we highlight methodologies, and identify critical challenges and future directions, hoping to contribute to the joint research on NLP and Automated Planning.
pdf
bib
abs
From Conversation to Automation: Leveraging LLMs for Problem-Solving Therapy Analysis
Elham Aghakhani
|
Lu Wang
|
Karla T. Washington
|
George Demiris
|
Jina Huh-Yoo
|
Rezvaneh Rezapour
Problem-Solving Therapy (PST) is a structured psychological approach that helps individuals manage stress and resolve personal issues by guiding them through problem identification, solution brainstorming, decision-making, and outcome evaluation. As mental health care increasingly adopts technologies like chatbots and large language models (LLMs), it is important to thoroughly understand how each session of PST is conducted before attempting to automate it. We developed a comprehensive framework for PST annotation using established PST Core Strategies and a set of novel Facilitative Strategies to analyze a corpus of real-world therapy transcripts to determine which strategies are most prevalent. Using various LLMs and transformer-based models, we found that GPT-4o outperformed all models, achieving the highest accuracy (0.76) in identifying all strategies. To gain deeper insights, we examined how strategies are applied by analyzing Therapeutic Dynamics (autonomy, self-disclosure, and metaphor), and linguistic patterns within our labeled data. Our research highlights LLMs’ potential to automate therapy dialogue analysis, offering a scalable tool for mental health interventions. Our framework enhances PST by improving accessibility, effectiveness, and personalized support for therapists.
pdf
bib
abs
Revisiting Self-Consistency from Dynamic Distributional Alignment Perspective on Answer Aggregation
Yiwei Li
|
Ji Zhang
|
Shaoxiong Feng
|
Peiwen Yuan
|
Xinglin Wang
|
Jiayi Shi
|
Yueqi Zhang
|
Chuyi Tan
|
Boyuan Pan
|
Yao Hu
|
Kan Li
Self-consistency improves reasoning by aggregating diverse stochastic samples, yet the dynamics behind its efficacy remain underexplored. We reframe self-consistency as a dynamic distributional alignment problem, revealing that decoding temperature not only governs sampling randomness but also actively shapes the latent answer distribution. Given that high temperatures require prohibitively large sample sizes to stabilize, while low temperatures risk amplifying biases, we propose a confidence-driven mechanism that dynamically calibrates temperature: sharpening the sampling distribution under uncertainty to align with high-probability modes, and promoting exploration when confidence is high. Experiments on mathematical reasoning tasks show this approach outperforms fixed-diversity baselines under limited samples, improving both average and best-case performance across varying initial temperatures without additional data or modules. This establishes self-consistency as a synchronization challenge between sampling dynamics and evolving answer distributions.
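The confidence-driven calibration described in this abstract can be sketched as a majority vote plus a temperature update rule. The sketch assumes that confidence is the vote share of the leading answer and that temperature moves by a fixed step; the threshold, step size, bounds, and function names are all hypothetical and not taken from the paper.

```python
from collections import Counter

def vote(answers):
    """Majority vote; returns the leading answer and its share of the samples."""
    counts = Counter(answers)
    answer, top = counts.most_common(1)[0]
    return answer, top / len(answers)

def adjust_temperature(temp, confidence, threshold=0.5, step=0.1, lo=0.1, hi=1.5):
    """Sharpen sampling when uncertain, explore more when confident."""
    if confidence < threshold:
        return max(lo, temp - step)  # low confidence: lower the temperature
    return min(hi, temp + step)      # high confidence: raise the temperature

answer, confidence = vote(["12", "12", "15", "12"])
next_temp = adjust_temperature(0.7, confidence)
cooler_temp = adjust_temperature(0.7, 0.3)
```

In a full loop, the updated temperature would be used for the next batch of samples, and the vote would be recomputed over all answers collected so far.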
pdf
bib
abs
Don’t Say No: Jailbreaking LLM by Suppressing Refusal
Yukai Zhou
|
Jian Lou
|
Zhijie Huang
|
Zhan Qin
|
Sibei Yang
|
Wenjie Wang
Ensuring the safety alignment of Large Language Models (LLMs) is critical for generating responses consistent with human values. However, LLMs remain vulnerable to jailbreaking attacks, where carefully crafted prompts manipulate them into producing toxic content. One category of such attacks reformulates the task as an optimization problem, aiming to elicit affirmative responses from the LLM. However, these methods heavily rely on predefined objectionable behaviors, limiting their effectiveness and adaptability to diverse harmful queries. In this study, we first identify why the vanilla target loss is suboptimal and then propose enhancements to the loss objective. We introduce DSN (Don’t Say No) attack, which combines a cosine decay schedule method with refusal suppression to achieve higher success rates. Extensive experiments demonstrate that DSN outperforms baseline attacks and achieves state-of-the-art attack success rates (ASR). DSN also shows strong universality and transferability to unseen datasets and black-box models.
pdf
bib
abs
From Perception to Reasoning: Enhancing Vision-Language Models for Mobile UI Understanding
Settaluri Lakshmi Sravanthi
|
Ankit Mishra
|
Debjyoti Mondal
|
Subhadarshi Panda
|
Rituraj Singh
|
Pushpak Bhattacharyya
Accurately grounding visual and textual elements within mobile user interfaces (UIs) remains a significant challenge for Vision-Language Models (VLMs). Visual grounding, a critical task in this domain, involves identifying the most relevant UI element or region based on a natural language query, a process that requires both precise perception and context-aware reasoning. In this work, we present MoUI, a light-weight mobile UI understanding model trained on MoIT, an instruction-tuning dataset specifically tailored for mobile screen understanding and grounding, designed to bridge the gap between user intent and visual semantics. Complementing this dataset, we also present a human-annotated reasoning benchmark MoIQ that rigorously evaluates complex inference capabilities over mobile UIs. To harness these resources effectively, we propose a two-stage training approach that separately addresses perception and reasoning tasks, leading to stronger perception capabilities and improvement in reasoning abilities. Through extensive experiments, we demonstrate that our MoUI models achieve significant gains in accuracy across all perception tasks and state-of-the-art results on public reasoning benchmark ComplexQA (78%) and our MoIQ (49%). We will be open-sourcing our dataset, code, and models to foster further research and innovation in the field.
pdf
bib
abs
Lemmas Matter, But Not Like That: Predictors of Lemma-Based Generalization in Morphological Inflection
Sarah Ruth Brogden Payne
|
Jordan Kodner
Recent work has suggested that overlap – whether a given lemma or feature set is attested independently in train – drives model performance on morphological inflection tasks. The impact of lemma overlap, however, is debated, with recent work reporting accuracy drops from 0% to 30% between seen and unseen test lemmas. In this paper, we introduce a novel splitting algorithm designed to investigate predictors of accuracy on seen and unseen lemmas. We find only an 11% average drop from seen to unseen test lemmas, but show that the number of lemmas in train has a much stronger effect on accuracy on unseen than seen lemmas. We also show that the previously reported 30% drop is inflated due to the introduction of a near-30% drop in the number of training lemmas from the original splits to their novel splits.
pdf
bib
abs
Mosaic-IT: Cost-Free Compositional Data Synthesis for Instruction Tuning
Ming Li
|
Pei Chen
|
Chenguang Wang
|
Hongyu Zhao
|
Yijun Liang
|
YuPeng Hou
|
Fuxiao Liu
|
Tianyi Zhou
Finetuning large language models with a variety of instruction-response pairs has enhanced their capability to understand and follow instructions. Current instruction tuning primarily relies on teacher models or human intervention to generate and refine the instructions and responses for training, which are costly, non-sustainable, and may lack diversity. In this paper, we introduce Mosaic Instruction Tuning (Mosaic-IT), a human/model-free compositional data synthesis method that can efficiently create rich and diverse augmentations from existing instruction tuning data to enhance the LLMs. Mosaic-IT randomly concatenates multiple instruction data into one and trains the model to produce the corresponding responses with predefined higher-level meta-instructions to strengthen its multi-step instruction-following and format-following skills. Our extensive evaluations demonstrate a superior performance and training efficiency of Mosaic-IT, which achieves consistent performance improvements over various benchmarks and an 80% reduction in training costs compared with original instruction tuning.
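The composition step in this abstract, concatenating several instruction-response pairs under a higher-level meta-instruction, can be sketched in a few lines. The meta-instruction wording, the `[i]` numbering format, and the `mosaic` helper are hypothetical; the abstract states only that samples are randomly concatenated and paired with predefined meta-instructions.

```python
import random

def mosaic(pairs, n, rng):
    """Combine n randomly chosen instruction-response pairs into one training sample."""
    chosen = rng.sample(pairs, n)
    # Hypothetical meta-instruction; it fixes the order and format of the answers.
    meta = f"Answer the following {n} instructions in order, numbering each response."
    instruction = meta + "\n" + "\n".join(
        f"[{i}] {p['instruction']}" for i, p in enumerate(chosen, 1)
    )
    response = "\n".join(f"[{i}] {p['response']}" for i, p in enumerate(chosen, 1))
    return {"instruction": instruction, "response": response}

# Hypothetical existing instruction-tuning data.
pairs = [
    {"instruction": "Translate 'bonjour' to English.", "response": "Hello."},
    {"instruction": "Name a prime number above 10.", "response": "11."},
    {"instruction": "Give an antonym of 'hot'.", "response": "Cold."},
    {"instruction": "What is 2 + 2?", "response": "4."},
]
example = mosaic(pairs, n=3, rng=random.Random(0))
```

Because each composed sample is built from existing data by string operations alone, no teacher model or human annotation is needed to produce it.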
pdf
bib
abs
MAM: Modular Multi-Agent Framework for Multi-Modal Medical Diagnosis via Role-Specialized Collaboration
Yucheng Zhou
|
Lingran Song
|
Jianbing Shen
Recent advancements in medical Large Language Models (LLMs) have showcased their powerful reasoning and diagnostic capabilities. Despite their success, current unified multimodal medical LLMs face limitations in knowledge update costs, comprehensiveness, and flexibility. To address these challenges, we introduce the Modular Multi-Agent Framework for Multi-Modal Medical Diagnosis (MAM). Inspired by our empirical findings highlighting the benefits of role assignment and diagnostic discernment in LLMs, MAM decomposes the medical diagnostic process into specialized roles: a General Practitioner, Specialist Team, Radiologist, Medical Assistant, and Director, each embodied by an LLM-based agent. This modular and collaborative framework enables efficient knowledge updates and leverages existing medical LLMs and knowledge bases. Extensive experimental evaluations conducted on a wide range of publicly accessible multimodal medical datasets, incorporating text, image, audio, and video modalities, demonstrate that MAM consistently surpasses the performance of modality-specific LLMs. Notably, MAM achieves significant performance improvements ranging from 18% to 365% compared to baseline models. Our code, data, and prompts are released at URL.
pdf
bib
abs
ATLAS: Agent Tuning via Learning Critical Steps
Zhixun Chen
|
Ming Li
|
Yuxuan Huang
|
Yali Du
|
Meng Fang
|
Tianyi Zhou
Large Language Model (LLM) agents have demonstrated remarkable generalization capabilities across multi-domain tasks. Existing agent tuning approaches typically employ supervised finetuning on entire expert trajectories. However, behavior-cloning of full trajectories can introduce expert bias and weaken generalization to states not covered by the expert data. Additionally, critical steps—such as planning, complex reasoning for intermediate subtasks, and strategic decision-making—are essential to success in agent tasks, so learning these steps is the key to improving LLM agents. For more effective and efficient agent tuning, we propose ATLAS that identifies the critical steps in expert trajectories and finetunes LLMs solely on these steps with reduced costs. By steering the training’s focus to a few critical steps, our method mitigates the risk of overfitting entire trajectories and promotes generalization across different environments and tasks. In extensive experiments, an LLM finetuned on only 30% critical steps selected by ATLAS outperforms the LLM finetuned on all steps and recent open-source LLM agents. ATLAS maintains and improves base LLM skills as generalist agents interacting with diverse environments.
pdf
bib
abs
Syntactic Control of Language Models by Posterior Inference
Vicky Xefteri
|
Tim Vieira
|
Ryan Cotterell
|
Afra Amini
Controlling the syntactic structure of text generated by language models is valuable for applications requiring clarity, stylistic consistency, or interpretability, yet it remains a challenging task. In this paper, we argue that sampling algorithms based on the posterior inference can effectively enforce a target constituency structure during generation. Our approach combines sequential Monte Carlo, which estimates the posterior distribution by sampling from a proposal distribution, with a syntactic tagger that ensures that each generated token aligns with the desired syntactic structure. Our experiments with GPT2 and Llama3-8B models show that with an appropriate proposal distribution, we can improve syntactic accuracy, increasing the F1 score from 12.31 (GPT2-large) and 35.33 (Llama3-8B) to about 93 in both cases without compromising the language model’s fluency. These results underscore both the complexity of syntactic control and the effectiveness of sampling algorithms, offering a promising approach for applications where precise control over syntax is essential.
pdf
bib
abs
Small Models Struggle to Learn from Strong Reasoners
Yuetai Li
|
Xiang Yue
|
Zhangchen Xu
|
Fengqing Jiang
|
Luyao Niu
|
Bill Yuchen Lin
|
Bhaskar Ramasubramanian
|
Radha Poovendran
Large language models (LLMs) excel in complex reasoning tasks, and distilling their reasoning capabilities into smaller models has shown promise. However, we uncover an interesting phenomenon, which we term the Small Model Learnability Gap: small models (3B parameters) do not consistently benefit from long chain-of-thought (CoT) reasoning or distillation from larger models. Instead, they perform better when fine-tuned on shorter, simpler reasoning chains that better align with their intrinsic learning capacity. To address this, we propose Mix Distillation, a simple yet effective strategy that balances reasoning complexity by combining long and short CoT examples or reasoning from both larger and smaller models. Our experiments demonstrate that Mix Distillation significantly improves small model reasoning performance compared to training on either data alone. These findings highlight the limitations of direct strong model distillation and underscore the importance of adapting reasoning complexity for effective reasoning capability transfer.
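As a rough illustration of the data-mixing idea, the sketch below builds a training set from long and short chain-of-thought examples. The function name, the `long_ratio` parameter, and the sampling scheme are assumptions; the abstract does not state the mixing ratio or procedure.

```python
import random

def mix_distillation(long_cot, short_cot, long_ratio, n, rng):
    """Draw n training examples, a fraction long_ratio of them from long_cot.

    long_ratio is a hypothetical knob; the abstract does not give its value.
    """
    n_long = round(n * long_ratio)
    mixed = rng.sample(long_cot, n_long) + rng.sample(short_cot, n - n_long)
    rng.shuffle(mixed)  # avoid ordering the data by reasoning length
    return mixed
```

The same function covers the second variant mentioned in the abstract if `long_cot` and `short_cot` are replaced by examples from a larger and a smaller teacher model.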
pdf
bib
abs
Sparse Rewards Can Self-Train Dialogue Agents
Barrett Martin Lattimer
|
Varun Prashant Gangal
|
Ryan McDonald
|
Yi Yang
Recent advancements in state-of-the-art (SOTA) Large Language Model (LLM) agents, especially in multi-turn dialogue tasks, have been primarily driven by supervised fine-tuning and high-quality human feedback. However, as base LLM models continue to improve, acquiring meaningful human feedback has become increasingly challenging and costly. In certain domains, base LLM agents may eventually exceed human capabilities, making traditional feedback-driven methods impractical. In this paper, we introduce a novel self-improvement paradigm that empowers LLM agents to autonomously enhance their performance without external human feedback. Our method, Juxtaposed Outcomes for Simulation Harvesting (JOSH), is a self-alignment algorithm that leverages a sparse reward simulation environment to extract ideal behaviors and further train the LLM on its own outputs. We present ToolWOZ, a sparse reward tool-calling simulation environment derived from MultiWOZ. We demonstrate that models trained with JOSH, both small and frontier, significantly improve tool-based interactions while preserving general model capabilities across diverse benchmarks. Our code and data are publicly available on GitHub.
pdf
bib
abs
Almost AI, Almost Human: The Challenge of Detecting AI-Polished Writing
Shoumik Saha
|
Soheil Feizi
The growing use of large language models (LLMs) for text generation has led to widespread concerns about AI-generated content detection. However, an overlooked challenge is AI-polished text, where human-written content undergoes subtle refinements using AI tools. This raises a critical question: should minimally polished text be classified as AI-generated? Such classification can lead to false plagiarism accusations and misleading claims about AI prevalence in online content. In this study, we systematically evaluate *twelve* state-of-the-art AI-text detectors using our **AI-Polished-Text Evaluation (APT-Eval)** dataset, which contains 15K samples refined at varying AI-involvement levels. Our findings reveal that detectors frequently flag even minimally polished text as AI-generated, struggle to differentiate between degrees of AI involvement, and exhibit biases against older and smaller models. These limitations highlight the urgent need for more nuanced detection methodologies.
pdf
bib
abs
The Reader is the Metric: How Textual Features and Reader Profiles Explain Conflicting Evaluations of AI Creative Writing
Guillermo Marco
|
Julio Gonzalo
|
Víctor Fresno
Recent studies comparing AI-generated and human-authored literary texts have produced conflicting results: some suggest AI already surpasses human quality, while others argue it still falls short. We start from the hypothesis that such divergences can be largely explained by genuine differences in how readers interpret and value literature, rather than by an intrinsic quality of the texts evaluated. Using five public datasets (1,471 stories, 101 annotators including critics, students, and lay readers), we (i) extract 17 reference-less textual features (e.g., coherence, emotional variance, average sentence length...); (ii) model individual reader preferences, deriving feature importance vectors that reflect their textual priorities; and (iii) analyze these vectors in a shared “preference space”. Reader vectors cluster into two profiles: _surface-focused readers_ (mainly non-experts), who prioritize readability and textual richness; and _holistic readers_ (mainly experts), who value thematic development, rhetorical variety, and sentiment dynamics. Our results quantitatively explain how measurements of literary quality are a function of how text features align with each reader’s preferences. These findings advocate for reader-sensitive evaluation frameworks in the field of creative text generation.
pdf
bib
abs
Summary Factual Inconsistency Detection Based on LLMs Enhanced by Universal Information Extraction
Anguo Li
|
Lei Yu
Automatic text summarization has a potential flaw that affects the factuality of summaries. Recently, Large Language Models (LLMs) have been introduced as detectors for factual inconsistencies in summaries. However, LLM-based methods rely on reasoning capabilities and face challenges in terms of efficiency and explainability. We focus on decoupling LLMs’ information extraction and reasoning capabilities to address prominent challenges, and propose a novel framework, UIEFID (Universal Information Extraction-enhanced Factual Inconsistency Detection). Our idea is to define a self-adaptive structured schema to guide fine-tuned LLMs in extracting unified structured information from documents and summaries, ultimately detecting the origins of inconsistencies in extraction information. The evaluation on 5 open-source models shows that UIEFID not only enhances the detection accuracy on the AGGREFACT benchmark but also significantly reduces redundant reasoning.
pdf
bib
abs
ELI-Why: Evaluating the Pedagogical Utility of Language Model Explanations
Brihi Joshi
|
Keyu He
|
Sahana Ramnath
|
Sadra Sabouri
|
Kaitlyn Zhou
|
Souti Chattopadhyay
|
Swabha Swayamdipta
|
Xiang Ren
Language models today are widely used in education, yet their ability to tailor responses for learners with varied informational needs and knowledge backgrounds remains under-explored. To this end, we introduce ELI-Why, a benchmark of 13.4K “Why” questions to evaluate the pedagogical capabilities of language models. We then conduct two extensive human studies to assess the utility of language model-generated explanatory answers (explanations) on our benchmark, tailored to three distinct educational grades: elementary, high-school and graduate school. In our first study, human raters assume the role of an “educator” to assess model explanations’ fit to different educational grades. We find that GPT-4-generated explanations match their intended educational background only 50% of the time, compared to 79% for lay human-curated explanations. In our second study, human raters assume the role of a learner to assess if an explanation fits their own informational needs. Across all educational backgrounds, users deemed GPT-4-generated explanations 20% less suited on average to their informational needs, when compared to explanations curated by lay people. Additionally, automated evaluation metrics reveal that explanations generated across different language model families for different informational needs remain indistinguishable in their grade-level, limiting their pedagogical effectiveness.
pdf
bib
abs
Beyond Generation: Leveraging LLM Creativity to Overcome Label Bias in Classification
Xiaoyue Wang
|
Xin Liu
Large Language Models (LLMs) exhibit impressive capabilities in In-Context Learning (ICL) but are prone to label bias, an undesirable tendency to favor certain answers. Existing calibration methods mitigate bias by leveraging in-domain data, yet such data is often unavailable in real-world scenarios. To address this limitation, we propose SDC (Synthetic Data Calibration), a simple-yet-effective approach that generates synthetic in-domain data from a few in-context demonstrations and utilizes it for calibration. By approximating the benefits of real in-domain data, SDC effectively reduces label bias without requiring access to actual domain-specific inputs. We conduct experimental evaluations on 279 classification and multiple-choice tasks from the Super-NaturalInstructions benchmark. The results show that SDC significantly reduces label bias, achieving an average Bias Score reduction of 57.5% and outperforming all competitive baselines. Moreover, when combined with Leave-One-Out Calibration (LOOC), SDC further improves performance, underscoring its effectiveness and generalizability in enhancing the reliability of LLMs.
pdf
bib
abs
CogSteer: Cognition-Inspired Selective Layer Intervention for Efficiently Steering Large Language Models
Xintong Wang
|
Jingheng Pan
|
Liang Ding
|
Longyue Wang
|
Longqin Jiang
|
Xingshan Li
|
Chris Biemann
Large Language Models (LLMs) achieve remarkable performance through pretraining on extensive data. This enables efficient adaptation to diverse downstream tasks. However, the lack of interpretability in their underlying mechanisms limits the ability to effectively steer LLMs for specific applications. In this work, we investigate the intrinsic mechanisms of LLMs from a cognitive perspective using eye movement measures. Specifically, we analyze the layer-wise correlation between human cognitive indicators and LLM representations. Building on these insights, we propose a heuristic approach for selecting the optimal steering layer to modulate LLM semantics. To this end, we introduce an efficient selective layer intervention based on prominent parameter-efficient fine-tuning methods, which conventionally adjust either all layers or only the final layer. Additionally, we present an implicit layer contrastive intervention during inference to steer LLMs away from toxic outputs. Extensive experiments on natural language understanding, reasoning, and generation tasks, conducted on GPT-2, LLaMa2-7B, and Mixtral-7B, demonstrate the effectiveness and efficiency of our approach. As a model-agnostic framework, it enhances the interpretability of LLMs while improving efficiency for safe deployment.
pdf
bib
abs
PASTEL: Polarity-Aware Sentiment Triplet Extraction with LLM-as-a-Judge
Aaditya Bodke
|
Avinoor Singh Kohli
|
Hemant Subhash Pardeshi
|
Prathamesh Bhosale
Aspect Sentiment Triplet Extraction (ASTE) is a subtask of Aspect-Based Sentiment Analysis (ABSA) that aims to extract aspect terms, corresponding opinion terms, and their associated sentiment polarities from text. Current end-to-end approaches, whether employing Large Language Models (LLMs) or complex neural network structures, struggle to effectively model the intricate latent relationships between aspects and opinions. Therefore, in this work, we propose Polarity-Aware Sentiment Triplet Extraction with LLM-as-a-judge (PASTEL), a novel pipeline that decomposes the ASTE task into structured subtasks. We employ finetuned LLMs to separately extract the aspect and opinion terms, incorporating a polarity-aware mechanism to enhance opinion extraction. After generating a candidate set through the Cartesian product of the extracted aspect and opinion-sentiment sets, we leverage an LLM-as-a-Judge to validate and prune these candidates. Experimental evaluations demonstrate that PASTEL outperforms existing baselines. Our findings highlight the necessity of modular decomposition in complex sentiment analysis tasks to fully exploit the capabilities of current LLMs.
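The candidate-generation and pruning steps described above can be sketched as follows. This is only an illustration: the function names are hypothetical, and `judge` stands in for the LLM-as-a-Judge call, whose prompt and model the abstract does not specify.

```python
from itertools import product

def candidate_triplets(aspects, opinion_sentiments):
    """Cartesian product of aspect terms with (opinion, sentiment) pairs."""
    return [(a, o, s) for a, (o, s) in product(aspects, opinion_sentiments)]

def prune(candidates, judge):
    """Keep only the candidates the judge accepts.

    judge is a placeholder callable (triplet -> bool); in the paper's pipeline
    this role is played by an LLM.
    """
    return [c for c in candidates if judge(c)]
```

With two aspects and two opinion-sentiment pairs, the product yields four candidates, and the judge is what removes mismatched pairings such as an aspect combined with an opinion about a different aspect.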
pdf
bib
abs
COSMIC: Generalized Refusal Direction Identification in LLM Activations
Vincent Siu
|
Nicholas Crispino
|
Zihao Yu
|
Sam Pan
|
Zhun Wang
|
Yang Liu
|
Dawn Song
|
Chenguang Wang
Large Language Models encode behaviors like refusal within their activation space, but identifying these behaviors remains challenging. Existing methods depend on predefined refusal templates detectable in output tokens or manual review. We introduce **COSMIC** (Cosine Similarity Metrics for Inversion of Concepts), an automated framework for direction selection that optimally identifies steering directions and target layers using cosine similarity, entirely independent of output text. COSMIC achieves steering effectiveness comparable to prior work without any prior knowledge or assumptions of a model’s refusal behavior such as the use of certain refusal tokens. Additionally, COSMIC successfully identifies refusal directions in adversarial scenarios and models with weak safety alignment, demonstrating its robustness across diverse settings.
pdf
bib
abs
Red Queen: Exposing Latent Multi-Turn Risks in Large Language Models
Yifan Jiang
|
Kriti Aggarwal
|
Tanmay Laud
|
Kashif Munir
|
Jay Pujara
|
Subhabrata Mukherjee
The rapid advancement of large language models (LLMs) has unlocked diverse opportunities across domains and applications but has also raised concerns about their tendency to generate harmful responses under jailbreak attacks. However, most existing jailbreak strategies are single-turn with explicit malicious intent, failing to reflect the real-world scenario where interactions can be multi-turn and users can conceal their intents. Recent studies on Theory of Mind (ToM) reveal that LLMs often struggle to infer users’ latent intent in such scenarios. Building on these limitations, we propose a novel jailbreak attack, RED QUEEN ATTACK, which constructs a multi-turn scenario, concealing the malicious intent under the guise of preventing harm. We generate 56k multi-turn concealment data points across 40 scenarios and 14 harmful categories, evaluating four LLM families of different sizes. Results show all models are vulnerable to RED QUEEN ATTACK, reaching 87.6% attack success rate (ASR) on GPT-4o and 77.1% on Llama3-70B. Compared to prior jailbreak attacks, the RED QUEEN ATTACK achieves superior performance on nine out of ten models, with ASR improvements ranging from 2% to 64%. Further analysis reveals that larger models exhibit greater vulnerability to our attack, primarily due to the combination of multi-turn structures and concealment strategies. To enhance safety, we propose RED QUEEN GUARD, a mitigation strategy reducing ASR to below 1% while maintaining model performance on standard benchmarks. Full implementation and dataset are publicly accessible at https://github.com/kriti-hippo/red_queen.
pdf
bib
abs
MDBench: A Synthetic Multi-Document Reasoning Benchmark Generated with Knowledge Guidance
Joseph J Peper
|
Wenzhao Qiu
|
Ali Payani
|
Lu Wang
Natural language processing evaluation has made significant progress, largely driven by the proliferation of powerful large language models (LLMs). New evaluation benchmarks are of increasing priority as the reasoning capabilities of LLMs are expanding at a rapid pace. In particular, while multi-document (MD) reasoning is an area of extreme relevance given LLM capabilities in handling longer-context inputs, few benchmarks exist to rigorously examine model behavior in this setting. Moreover, the multi-document setting is historically challenging for benchmark creation due to the expensive cost of annotating long inputs. In this work, we introduce MDBench, a new dataset for evaluating LLMs on the task of multi-document reasoning. Notably, MDBench is created through a novel synthetic generation process, allowing us to controllably and efficiently generate challenging document sets and the corresponding question-answer (QA) examples. Our novel technique operates on condensed structured seed knowledge, modifying it through LLM-assisted edits to induce MD-specific reasoning challenges. We then convert this structured knowledge into a natural text surface form, generating a document set and corresponding QA example. We analyze the behavior of popular LLMs and prompting techniques, finding that MDBench poses significant challenges for all methods, even with relatively short document sets. We also see our knowledge-guided generation technique (1) allows us to readily perform targeted analysis of MD-specific reasoning capabilities and (2) can be adapted quickly to account for new challenges and future modeling improvements.
pdf
bib
abs
DiaLLMs: EHR-Enhanced Clinical Conversational System for Clinical Test Recommendation and Diagnosis Prediction
Weijieying Ren
|
Tianxiang Zhao
|
Lei Wang
|
Tianchun Wang
|
Vasant G Honavar
Recent advances in Large Language Models (LLMs) have led to remarkable progress in medical consultation. However, existing medical LLMs overlook the essential role of Electronic Health Records (EHR) and focus primarily on diagnosis recommendation, limiting their clinical applicability. We propose DiaLLM, the first medical LLM that integrates heterogeneous EHR data into clinically grounded dialogues, enabling clinical test recommendation, result interpretation, and diagnosis prediction to better align with real-world medical practice. To construct clinically grounded dialogues from EHR, we design a Clinical Test Reference (CTR) strategy that maps each clinical code to its corresponding description and classifies test results as “normal” or “abnormal”. Additionally, DiaLLM employs a reinforcement learning framework for evidence acquisition and automated diagnosis. To handle the large action space, we introduce a reject sampling strategy to reduce redundancy and improve exploration efficiency. Furthermore, a confirmation reward and a class-sensitive diagnosis reward are designed to guide accurate diagnosis prediction. Extensive experimental results demonstrate that DiaLLM outperforms baselines in clinical test recommendation and diagnosis prediction. Our code is available at GitHub.
pdf
bib
abs
Can Hallucination Correction Improve Video-Language Alignment?
Lingjun Zhao
|
Mingyang Xie
|
Paola Cascante-Bonilla
|
Hal Daumé III
|
Kwonjoon Lee
Large Vision-Language Models often generate hallucinated content that is not grounded in its visual inputs. While prior work focuses on mitigating hallucinations, we instead explore leveraging hallucination correction as a training objective to improve video-language alignment. We introduce HACA, a self-training framework learning to correct hallucinations in descriptions that do not align with the video content. By identifying and correcting inconsistencies, HACA enhances the model’s ability to align video and textual representations for spatio-temporal reasoning. Our experimental results show consistent gains in video-caption binding and text-to-video retrieval tasks, demonstrating that hallucination correction-inspired tasks serve as an effective strategy for improving vision and language alignment.
pdf
bib
abs
IMPARA-GED: Grammatical Error Detection is Boosting Reference-free Grammatical Error Quality Estimator
Yusuke Sakai
|
Takumi Goto
|
Taro Watanabe
We propose IMPARA-GED, a novel reference-free automatic grammatical error correction (GEC) evaluation method with grammatical error detection (GED) capabilities. We focus on the quality estimator of IMPARA, an existing automatic GEC evaluation method, and construct that of IMPARA-GED using a pre-trained language model with enhanced GED capabilities. Experimental results on SEEDA, a meta-evaluation dataset for automatic GEC evaluation methods, demonstrate that IMPARA-GED achieves the highest correlation with human sentence-level evaluations.
pdf
bib
abs
Do Language Models Mirror Human Confidence? Exploring Psychological Insights to Address Overconfidence in LLMs
Chenjun Xu
|
Bingbing Wen
|
Bin Han
|
Robert Wolfe
|
Lucy Lu Wang
|
Bill Howe
Psychology research has shown that humans are poor at estimating their performance on tasks, tending towards underconfidence on easy tasks and overconfidence on difficult tasks. We examine three LLMs, Llama-3-70B-instruct, Claude-3-Sonnet, and GPT-4o, on a range of QA tasks of varying difficulty, and show that models exhibit subtle differences from human patterns of overconfidence: they are less sensitive to task difficulty, and when prompted to answer based on different personas (e.g., expert vs. layman, or different races, genders, and ages), they respond with stereotypically biased confidence estimations even though their underlying answer accuracy remains the same. Based on these observations, we propose Answer-Free Confidence Estimation (AFCE) to improve confidence calibration and LLM interpretability in these settings. AFCE is a self-assessment method that employs two stages of prompting, first eliciting only confidence scores on questions, then asking separately for the answer. Experiments on the MMLU and GPQA datasets spanning subjects and difficulty show that this separation of tasks significantly reduces overconfidence and delivers more human-like sensitivity to task difficulty.
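The two-stage prompting that defines AFCE can be sketched as below. The prompt wording, the 0-to-1 confidence scale, and the `ask` callable (a stand-in for any LLM client) are assumptions for illustration and are not taken from the paper.

```python
from typing import Callable, Tuple

def afce(question: str, ask: Callable[[str], str]) -> Tuple[str, float]:
    """Elicit confidence first, then the answer, in two separate prompts.

    ask is a hypothetical text-in, text-out wrapper around a language model.
    """
    # Stage 1: confidence only, with no answer requested.
    conf_prompt = (
        "Without answering, rate from 0 to 1 how confident you are "
        f"that you could answer this question correctly.\nQuestion: {question}"
    )
    confidence = float(ask(conf_prompt).strip())
    # Stage 2: the answer, asked in an independent prompt.
    answer_prompt = f"Answer the following question.\nQuestion: {question}"
    answer = ask(answer_prompt).strip()
    return answer, confidence
```

The point of the separation is that the confidence estimate is produced before the model has committed to any answer text.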
pdf
bib
abs
Why Multi-Interest Fairness Matters: Hypergraph Contrastive Multi-Interest Learning for Fair Conversational Recommender System
Yongsen Zheng
|
Zongxuan Xie
|
Guohua Wang
|
Ziyao Liu
|
Liang Lin
|
Kwok-Yan Lam
Unfairness is a well-known challenge in Recommender Systems (RSs), often resulting in biased outcomes that disadvantage users or items based on attributes such as gender, race, age, or popularity. Although some approaches have started to improve recommendation fairness in offline or static contexts, unfairness is often exacerbated over time, leading to significant problems like the Matthew effect, filter bubbles, and echo chambers. To address these challenges, we propose a novel framework, Hypergraph Contrastive Multi-Interest Learning for Fair Conversational Recommender System (HyFairCRS), aiming to promote multi-interest diversity fairness in dynamic and interactive Conversational Recommender Systems (CRSs). HyFairCRS first captures a wide range of user interests by establishing diverse hypergraphs through contrastive learning. These interests are then utilized in conversations to generate informative responses and ensure fair item predictions within the dynamic user-system feedback loop. Experiments on two CRS-based datasets show that HyFairCRS achieves new state-of-the-art performance while effectively alleviating unfairness.
pdf
bib
abs
Cautious Next Token Prediction
Yizhou Wang
|
Lingzhi Zhang
|
Yue Bai
|
Mang Tik Chiu
|
Zhengmian Hu
|
Mingyuan Zhang
|
Qihua Dong
|
Yu Yin
|
Sohrab Amirghodsi
|
Yun Fu
The next-token prediction paradigm has prevailed for autoregressive models in the era of LLMs. The current default sampling choice for popular LLMs is temperature scaling together with nucleus sampling to balance diversity and coherence. Nevertheless, such an approach leads to inferior performance in various NLP tasks when the model is not certain about test questions. To this end, we propose a new training-free decoding strategy, dubbed Cautious Next Token Prediction (CNTP). In the decoding process, if the model has comparatively high prediction entropy at a certain step, we independently sample multiple trials starting from that step and stop each trial when encountering any punctuation. Then we select the trial with the lowest perplexity score, viewed as the most probable and reliable trial path given the model’s capacity. The trial number is negatively correlated with the prediction confidence, i.e., the less confident the model is, the more trials it should sample. This is consistent with human behaviour: when feeling uncertain or unconfident, one tends to think more creatively, exploring multiple thinking paths, to cautiously select the path one feels most confident about. Extensive experiments on both LLMs and MLLMs show that our proposed CNTP approach outperforms existing standard decoding strategies consistently by a clear margin. Moreover, the integration of CNTP with self-consistency can further improve over vanilla self-consistency. We believe our proposed CNTP has the potential to become one of the default choices for LLM decoding. Code is available at https://github.com/wyzjack/CNTP.
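The decoding rule described in this abstract can be sketched on a toy model. This is a simplified illustration, not the released code: `next_dist` is a hypothetical callable returning a token-to-probability dict, and the entropy threshold, punctuation set, and linear trial schedule are assumptions.

```python
import math
import random

PUNCT = {".", ",", "!", "?", ";"}  # assumed stopping tokens

def entropy(dist):
    """Shannon entropy (nats) of a token -> probability dict."""
    return -sum(p * math.log(p) for p in dist.values() if p > 0)

def sample_trial(next_dist, prefix, rng, max_len=20):
    """Sample tokens until punctuation; return (tokens, mean negative log-prob)."""
    tokens, nll = [], 0.0
    for _ in range(max_len):
        dist = next_dist(prefix + tokens)
        toks, probs = zip(*dist.items())
        tok = rng.choices(toks, weights=probs)[0]
        tokens.append(tok)
        nll -= math.log(dist[tok])
        if tok in PUNCT:
            break
    return tokens, nll / len(tokens)

def cautious_step(next_dist, prefix, rng, threshold=0.5, max_trials=5):
    """Return the next token(s): one sample if confident, else the best trial."""
    dist = next_dist(prefix)
    h = entropy(dist)
    if h <= threshold:
        # Confident step: ordinary single-token sampling.
        toks, probs = zip(*dist.items())
        return [rng.choices(toks, weights=probs)[0]]
    # Less confident -> more trials (assumed schedule, linear in entropy).
    h_max = math.log(len(dist))
    n = max(2, math.ceil(max_trials * h / h_max))
    trials = [sample_trial(next_dist, prefix, rng) for _ in range(n)]
    # Lowest mean negative log-prob is equivalent to lowest perplexity.
    return min(trials, key=lambda t: t[1])[0]
```

Perplexity is the exponential of the mean negative log-probability, so ranking trials by the mean gives the same winner without the extra `exp`.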
pdf
bib
abs
Reasoning with Graphs: Structuring Implicit Knowledge to Enhance LLMs Reasoning
Haoyu Han
|
Yaochen Xie
|
Hui Liu
|
Xianfeng Tang
|
Sreyashi Nag
|
William Headden
|
Yang Li
|
Chen Luo
|
Shuiwang Ji
|
Qi He
|
Jiliang Tang
Large language models (LLMs) have demonstrated remarkable success across a wide range of tasks; however, they still encounter challenges in reasoning tasks that require understanding and inferring relationships between distinct pieces of information within text sequences. This challenge is particularly pronounced in tasks involving multi-step processes, such as logical reasoning and multi-hop question answering, where understanding implicit relationships between entities and leveraging multi-hop connections in the given context are crucial. Graphs, as fundamental data structures, explicitly represent pairwise relationships between entities, thereby offering the potential to enhance LLMs’ reasoning capabilities. External graphs have proven effective in supporting LLMs across multiple tasks. However, in many reasoning tasks, no pre-existing graph structure is provided. Can we structure implicit knowledge derived from context into graphs to assist LLMs in reasoning? In this paper, we propose Reasoning with Graphs (RwG) by first constructing explicit graphs from the context and then leveraging these graphs to enhance LLM reasoning performance on reasoning tasks. Extensive experiments demonstrate the effectiveness of the proposed method in improving both logical reasoning and multi-hop question answering tasks.
pdf
bib
abs
Enhancing Medical Dialogue Generation through Knowledge Refinement and Dynamic Prompt Adjustment
Hongda Sun
|
Jiaren Peng
|
Wenzhong Yang
|
Liang He
|
Bo Du
|
Rui Yan
Medical dialogue systems (MDS) have emerged as crucial online platforms for enabling multi-turn, context-aware conversations with patients. However, existing MDS often struggle to (1) identify relevant medical knowledge and (2) generate personalized, medically accurate responses. To address these challenges, we propose MedRef, a novel MDS that incorporates knowledge refining and dynamic prompt adjustment. First, we employ a knowledge refining mechanism to filter out irrelevant medical data, improving predictions of critical medical entities in responses. Additionally, we design a comprehensive prompt structure that incorporates historical details and evident details. To enable real-time adaptability to diverse patient conditions, we implement two key modules, Triplet Filter and Demo Selector, providing appropriate knowledge and demonstrations equipped in the system prompt. Extensive experiments on MedDG and KaMed benchmarks show that MedRef outperforms state-of-the-art baselines in both generation quality and medical entity accuracy, underscoring its effectiveness and reliability for real-world healthcare applications.
pdf
bib
abs
Feature-Level Insights into Artificial Text Detection with Sparse Autoencoders
Kristian Kuznetsov
|
Laida Kushnareva
|
Anton Razzhigaev
|
Polina Druzhinina
|
Anastasia Voznyuk
|
Irina Piontkovskaya
|
Evgeny Burnaev
|
Serguei Barannikov
Artificial Text Detection (ATD) is becoming increasingly important with the rise of advanced Large Language Models (LLMs). Despite numerous efforts, no single algorithm performs consistently well across different types of unseen text or guarantees effective generalization to new LLMs. Interpretability plays a crucial role in achieving this goal. In this study, we enhance ATD interpretability by using Sparse Autoencoders (SAE) to extract features from Gemma-2-2B’s residual stream. We identify both interpretable and efficient features, analyzing their semantics and relevance through domain- and model-specific statistics, a steering approach, and manual or LLM-based interpretation of obtained features. Our methods offer valuable insights into how texts from various models differ from human-written content. We show that modern LLMs have a distinct writing style, especially in information-dense domains, even though they can produce human-like outputs with personalized prompts. The code for this paper is available at https://github.com/pyashy/SAE_ATD.
pdf
bib
abs
Low-Resource Grammatical Error Correction: Selective Data Augmentation with Round-Trip Machine Translation
Frank Palma Gomez
|
Alla Rozovskaya
Supervised state-of-the-art methods for grammatical error correction require large amounts of parallel data for training. Due to lack of gold-labeled data, techniques that create synthetic training data have become popular. We show that models trained on synthetic data tend to correct a limited range of grammar and spelling mistakes that involve character-level changes, but perform poorly on (more complex) phenomena that require word-level changes. We propose to address the performance gap on such errors by generating synthetic data through selective data augmentation via round-trip machine translation. We show that the proposed technique, SeLex-RT, is capable of generating mistakes that are similar to those observed with language learners. Using the approach with two types of state-of-the-art learning frameworks and two low-resource languages (Russian and Ukrainian), we achieve substantial improvements, compared to training on synthetic data produced with standard techniques. Analysis of the output reveals that models trained on data noisified with the SeLex-RT approach are capable of making word-level changes and correcting lexical errors common with language learners.
pdf
bib
abs
Just Put a Human in the Loop? Investigating LLM-Assisted Annotation for Subjective Tasks
Hope Schroeder
|
Deb Roy
|
Jad Kabbara
LLM use in annotation is becoming widespread, and given LLMs’ overall promising performance and speed, putting humans in the loop to simply “review” LLM annotations can be tempting. In subjective tasks with multiple plausible answers, this can impact both evaluation of LLM performance, and analysis using these labels in a social science task downstream. In a pre-registered experiment with 350 unique annotators and 7,000 annotations across 4 conditions, 2 models, and 2 datasets, we find that presenting crowdworkers with LLM-generated annotation suggestions did not make them faster annotators, but did improve their self-reported confidence in the task. More importantly, annotators strongly took the LLM suggestions, significantly changing the label distribution compared to the baseline. We show that when these labels created with LLM assistance are used to evaluate LLM performance, reported model performance significantly increases. We show how changes in label distributions as a result of LLM assistance can affect conclusions drawn by analyzing even “human-approved” LLM-annotated datasets. We believe our work underlines the importance of understanding the impact of LLM-assisted annotation on subjective, qualitative tasks, on the creation of gold data for training and testing, and on the evaluation of NLP systems on subjective tasks.
pdf
bib
abs
Research Community Perspectives on “Intelligence” and Large Language Models
Bertram Højer
|
Terne Sasha Thorn Jakobsen
|
Anna Rogers
|
Stefan Heinrich
Despite the widespread use of ‘artificial intelligence’ (AI) framing in Natural Language Processing (NLP) research, it is not clear what researchers mean by “intelligence”. To that end, we present the results of a survey on the notion of “intelligence” among researchers and its role in the research agenda. The survey elicited complete responses from 303 researchers from a variety of fields including NLP, Machine Learning (ML), Cognitive Science, Linguistics, and Neuroscience. We identify 3 criteria of intelligence that the community agrees on the most: generalization, adaptability, and reasoning. Our results suggest that the perception of the current NLP systems as “intelligent” is a minority position (29%). Furthermore, only 16.2% of the respondents see developing intelligent systems as a research goal, and these respondents are more likely to consider the current systems intelligent.
pdf
bib
abs
LEMONADE: A Large Multilingual Expert-Annotated Abstractive Event Dataset for the Real World
Sina Semnani
|
Pingyue Zhang
|
Wanyue Zhai
|
Haozhuo Li
|
Ryan Beauchamp
|
Trey Billing
|
Katayoun Kishi
|
Manling Li
|
Monica Lam
This paper presents LEMONADE, a large-scale conflict event dataset comprising 39,786 events across 20 languages and 171 countries, with extensive coverage of region-specific entities. LEMONADE is based on a partially reannotated subset of the Armed Conflict Location & Event Data (ACLED), which has documented global conflict events for over a decade. To address the challenge of aggregating multilingual sources for global event analysis, we introduce abstractive event extraction (AEE) and its subtask, abstractive entity linking (AEL). Unlike conventional span-based event extraction, our approach detects event arguments and entities through holistic document understanding and normalizes them across the multilingual dataset. We evaluate various large language models (LLMs) on these tasks, adapt existing zero-shot event extraction systems, and benchmark supervised models. Additionally, we introduce ZEST, a novel zero-shot retrieval-based system for AEL. Our best zero-shot system achieves an end-to-end F1 score of 58.3%, with LLMs outperforming specialized event extraction models such as GoLLIE. For entity linking, ZEST achieves an F1 score of 45.7%, significantly surpassing OneNet, a state-of-the-art zero-shot baseline that achieves only 23.7%. However, these zero-shot results lag behind the best supervised systems by 20.1% and 37.0% in the end-to-end and AEL tasks, respectively, highlighting the need for further research.
pdf
bib
abs
Memorization vs. Reasoning: Updating LLMs with New Knowledge
Aochong Oliver Li
|
Tanya Goyal
Large language models (LLMs) encode vast amounts of pre-trained knowledge in their parameters, but updating them as real-world information evolves remains a challenge. Existing methodologies and benchmarks primarily target entity substitutions, failing to capture the full breadth of complex real-world dynamics. In this paper, we introduce Knowledge Update Playground (KUP), an automatic pipeline for simulating realistic knowledge updates reflected in an evidence corpus. KUP’s evaluation framework includes direct and indirect probes to test both memorization of updated facts and reasoning over them, for any update learning method. Next, we present a lightweight method called memory conditioned training (MCT), which conditions tokens in the update corpus on self-generated “memory” tokens during training. Our strategy encourages LLMs to surface and reason over newly memorized knowledge at inference. Our results on two LLM families show that (1) the KUP benchmark is highly challenging, with the best CPT models achieving <2% in the indirect probing setting (reasoning) and (2) MCT training significantly outperforms prior continued pre-training (CPT) baselines, improving direct probing (memorization) results by up to 25.4%.
pdf
bib
abs
CourtEval: A Courtroom-Based Multi-Agent Evaluation Framework
Sandeep Kumar
|
Abhijit A Nargund
|
Vivek Sridhar
Automated evaluation is crucial for assessing the quality of natural language text, especially in open-ended generation tasks, given the costly and time-consuming nature of human evaluation. Existing automatic evaluation metrics like ROUGE and BLEU often show low correlation with human judgments. As large language models (LLMs) continue to evolve, researchers have explored their use as alternatives to human evaluators. Although single-agent approaches have shown potential, results indicate that further progress is required to close the gap between their performance and the quality of human assessments. Acknowledging that human evaluations involve multiple annotators, the multi-agent approach allows LLMs to collaborate, enhancing efficiency and effectiveness in handling complex tasks. In this paper, we present CourtEval, a novel Multi-Agent Evaluation Framework modeled after courtroom dynamics. Each agent takes on a distinct role: the Grader, similar to a judge, assigns an initial score; the Critic, like a prosecutor, challenges this score; and the Defender, akin to a defense attorney, defends it. Based on the input from both the Critic and Defender, the Grader re-evaluates the score, leading to a more balanced and fair final decision through this adversarial process. CourtEval substantially outperforms the previous state-of-the-art methods in two meta-evaluation benchmarks in NLG evaluation, SummEval and TopicalChat.
pdf
bib
abs
Multilingual Definition Modeling
Edison Marrese-Taylor
|
Erica K. Shimomoto
|
A. Solano
|
Enrique Reid
In this paper, we propose the first multilingual study on definition modeling. We use monolingual dictionary data for four new languages (Spanish, French, Portuguese, and German) and perform an in-depth empirical study to test the performance of pre-trained multilingual language models on definition modeling of monosemic words when finetuned on this data. Furthermore, we use a zero-shot approach to test the multilingual capabilities of two popular chat-based Large Language Models (LLMs) in the task. Results show that multilingual language models can perform on par with English but cannot leverage potential cross-lingual synergies, with LLMs generally offering better performance overall. A comprehensive human evaluation of the LLM-generated definitions highlights the zero and few-shot capabilities of these models in this new task, also showing their shortcomings. Finally, we show that performance on our task via BERTScore strongly correlates to the performance on multilingual LLM benchmarks, suggesting that our task offers a viable compute-constrained, stable and natural alternative to these.
pdf
bib
abs
Human Bias in the Face of AI: Examining Human Judgment Against Text Labeled as AI Generated
Tiffany Zhu
|
Iain Weissburg
|
Kexun Zhang
|
William Yang Wang
As AI advances in text generation, human trust in AI generated content remains constrained by biases that go beyond concerns of accuracy. This study explores how bias shapes the perception of AI versus human generated content. Through three experiments involving text rephrasing, news article summarization, and persuasive writing, we investigated how human raters respond to labeled and unlabeled content. While the raters could not differentiate the two types of texts in the blind test, they overwhelmingly favored content labeled as “Human Generated,” over those labeled “AI Generated,” by a preference score of over 30%. We observed the same pattern even when the labels were deliberately swapped. This human bias against AI has broader societal and cognitive implications, as it undervalues AI performance. This study highlights the limitations of human judgment in interacting with AI and offers a foundation for improving human-AI collaboration, especially in creative fields.
pdf
bib
abs
Redundancy, Isotropy, and Intrinsic Dimensionality of Prompt-based Text Embeddings
Hayato Tsukagoshi
|
Ryohei Sasano
Prompt-based text embedding models, which generate task-specific embeddings upon receiving tailored prompts, have recently demonstrated remarkable performance. However, their resulting embeddings often have thousands of dimensions, leading to high storage costs and increased computational costs of embedding-based operations. In this paper, we investigate how post-hoc dimensionality reduction applied to the embeddings affects the performance of various tasks that leverage these embeddings, specifically classification, clustering, retrieval, and semantic textual similarity (STS) tasks. Our experiments show that even a naive dimensionality reduction, which keeps only the first 25% of the dimensions of the embeddings, results in a very slight performance degradation, indicating that these embeddings are highly redundant. Notably, for classification and clustering, even when embeddings are reduced to less than 0.5% of the original dimensionality the performance degradation is very small. To quantitatively analyze this redundancy, we perform an analysis based on the intrinsic dimensionality and isotropy of the embeddings. Our analysis reveals that embeddings for classification and clustering, which are considered to have very high dimensional redundancy, exhibit lower intrinsic dimensionality and less isotropy compared with those for retrieval and STS.
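The naive reduction described in the abstract above, keeping only the first 25% of the embedding dimensions, can be illustrated with a minimal sketch. The function name `truncate_embedding`, the `keep_ratio` parameter, and the L2 re-normalization step are assumptions made for illustration; the abstract does not state how the authors implement or post-process the truncation.

```python
import math

def truncate_embedding(vec, keep_ratio=0.25):
    """Keep only the leading fraction of dimensions of an embedding.

    Hypothetical helper: re-normalizing to unit length afterwards is an
    assumption (convenient for cosine similarity), not a detail taken
    from the abstract.
    """
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")
    k = max(1, int(len(vec) * keep_ratio))  # number of dimensions kept
    head = list(vec[:k])                    # naive cut: first k dimensions
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head] if norm > 0 else head

# Toy 8-dimensional vector; real prompt-based embeddings have thousands.
reduced = truncate_embedding([3.0, 4.0, 1.0, 2.0, 0.0, 0.0, 0.0, 0.0])
print(len(reduced))
```

Storage and similarity cost both scale with the number of retained dimensions, which is why a cut of this kind is attractive when the embeddings are redundant.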
pdf
bib
abs
Harnessing Whisper for Prosodic Stress Analysis
Samuel S. Sohn
|
Sten Knutsen
|
Karin Stromswold
Prosody affects how people produce and understand language, yet studies of how it does so have been hindered by the lack of efficient tools for analyzing prosodic stress. We fine-tune OpenAI Whisper large-v2, a state-of-the-art speech recognition model, to recognize phrasal, lexical, and contrastive stress using a small, carefully annotated dataset. Our results show that Whisper can learn distinct, gender-specific stress patterns to achieve near-human and super-human accuracy in stress classification and transfer its learning from one type of stress to another, surpassing traditional machine learning models. Furthermore, we explore how acoustic context influences its performance and propose a novel black-box evaluation method for characterizing the decision boundaries used by Whisper for prosodic stress interpretation. These findings open new avenues for large-scale, automated prosody research. Models can be found at github.com/SSSohn/ProsodyBench.
pdf
bib
abs
Can You Share Your Story? Modeling Clients’ Metacognition and Openness for LLM Therapist Evaluation
Minju Kim
|
Dongje Yoo
|
Yeonjun Hwang
|
Minseok Kang
|
Namyoung Kim
|
Minju Gwak
|
Beong-woo Kwak
|
Hyungjoo Chae
|
Harim Kim
|
Yunjoong Lee
|
Min Hee Kim
|
Dayi Jung
|
Kyong-Mee Chung
|
Jinyoung Yeo
Understanding clients’ thoughts and beliefs is fundamental in counseling, yet current evaluations of LLM therapists often fail to assess this ability. Existing evaluation methods rely on client simulators that clearly disclose internal states to the therapist, making it difficult to determine whether an LLM therapist can uncover unexpressed perspectives. To address this limitation, we introduce MindVoyager, a novel evaluation framework featuring a controllable and realistic client simulator which dynamically adapts itself based on the ongoing counseling session, offering a more realistic and challenging evaluation environment. We further introduce evaluation metrics that assess the exploration ability of LLM therapists by measuring their thorough understanding of client’s beliefs and thoughts.
pdf
bib
abs
Dictionaries to the Rescue: Cross-Lingual Vocabulary Transfer for Low-Resource Languages Using Bilingual Dictionaries
Haruki Sakajo
|
Yusuke Ide
|
Justin Vasselli
|
Yusuke Sakai
|
Yingtao Tian
|
Hidetaka Kamigaito
|
Taro Watanabe
Cross-lingual vocabulary transfer plays a promising role in adapting pre-trained language models to new languages, including low-resource languages. Existing approaches that utilize monolingual or parallel corpora face challenges when applied to languages with limited resources. In this work, we propose a simple yet effective vocabulary transfer method that utilizes bilingual dictionaries, which are available for many languages, thanks to descriptive linguists. Our proposed method leverages a property of BPE tokenizers where removing a subword from the vocabulary causes a fallback to shorter subwords. The embeddings of target subwords are estimated iteratively by progressively removing them from the tokenizer. The experimental results show that our approach outperforms existing methods for low-resource languages, demonstrating the effectiveness of a dictionary-based approach for cross-lingual vocabulary transfer.
pdf
bib
abs
When Should Dense Retrievers Be Updated in Evolving Corpora? Detecting Out-of-Distribution Corpora Using GradNormIR
Dayoon Ko
|
Jinyoung Kim
|
Sohyeon Kim
|
Jinhyuk Kim
|
Jaehoon Lee
|
Seonghak Song
|
Minyoung Lee
|
Gunhee Kim
Dense retrievers encode texts into embeddings to efficiently retrieve relevant documents from large databases in response to user queries. However, real-world corpora continually evolve, leading to a shift from the original training distribution of the retriever. Without timely updates or retraining, indexing newly emerging documents can degrade retrieval performance for future queries. Thus, identifying when a dense retriever requires an update is critical for maintaining robust retrieval systems. In this paper, we propose a novel task of predicting whether a corpus is out-of-distribution (OOD) relative to a dense retriever before indexing. Addressing this task allows us to proactively manage retriever updates, preventing potential retrieval failures. We introduce GradNormIR, an unsupervised approach that leverages gradient norms to detect OOD corpora effectively. Experiments on the BEIR benchmark demonstrate that GradNormIR enables timely updates of dense retrievers in evolving document collections, significantly enhancing retrieval robustness and efficiency.
pdf
bib
abs
The Million Authors Corpus: A Cross-Lingual and Cross-Domain Wikipedia Dataset for Authorship Verification
Abraham Israeli
|
Shuai Liu
|
Jonathan May
|
David Jurgens
Authorship verification (AV) is a crucial task for applications like identity verification, plagiarism detection, and AI-generated text identification. However, datasets for training and evaluating AV models are primarily in English and primarily in a single domain. This precludes analysis of AV techniques for generalizability and can cause seemingly valid AV solutions to, in fact, rely on topic-based features rather than actual authorship features. To address this limitation, we introduce the Million Authors Corpus, a novel dataset encompassing contributions from dozens of languages on Wikipedia. It includes only long and contiguous textual chunks taken from Wikipedia edits and links those texts to their authors. The corpus includes 60.08M textual chunks, contributed by 1.29M Wikipedia authors. It enables broad-scale cross-lingual and cross-domain AV evaluation to ensure accurate analysis of model capabilities that are not overly optimistic. We provide baseline evaluations using state-of-the-art AV models as well as information retrieval models that are not AV-specific in order to demonstrate the corpus’s unique cross-lingual and cross-domain ablation capabilities.
pdf
bib
abs
BridG MT: Enhancing LLMs’ Machine Translation Capabilities with Sentence Bridging and Gradual MT
Seungwoo Choi
|
Gahyun Yoo
|
Jay-Yoon Lee
Recent Large Language Models (LLMs) have demonstrated impressive translation performance without requiring fine-tuning on additional parallel corpora. However, they still face significant challenges in certain scenarios, particularly when translating low-resource languages. A common approach to address this issue is to provide external knowledge, such as few-shot examples, to assist LLMs in translating specific source sentences. However, this method is fundamentally limited by the quality or quantity of relevant sources, which cannot always be guaranteed. To reduce LLMs’ reliance on external sources, we propose BridG MT, a method that combines Sentence Bridging, which generates a sequence of sentences as a bridge that gradually transitions from easy-to-translate to more difficult, and Gradual MT, which sequentially translates these sentences using earlier translations as few-shot examples for subsequent ones. Experiments conducted on four LLMs across seven languages demonstrate that our method effectively enhances translation performance, even outperforming translation methods that rely on a large number of few-shot examples.
pdf
bib
abs
Text2World: Benchmarking Large Language Models for Symbolic World Model Generation
Mengkang Hu
|
Tianxing Chen
|
Yude Zou
|
Yuheng Lei
|
Qiguang Chen
|
Ming Li
|
Yao Mu
|
Hongyuan Zhang
|
Wenqi Shao
|
Ping Luo
Recently, there has been growing interest in leveraging large language models (LLMs) to generate symbolic world models from textual descriptions. Although LLMs have been extensively explored in the context of world modeling, prior studies encountered several challenges, including evaluation randomness, dependence on indirect metrics, and a limited domain scope. To address these limitations, we introduce a novel benchmark, Text2World, based on planning domain definition language (PDDL), featuring hundreds of diverse domains and employing multi-criteria, execution-based metrics for a more robust evaluation. We benchmark current LLMs using Text2World and find that reasoning models trained with large-scale reinforcement learning outperform others. However, even the best-performing model still demonstrates limited capabilities in world modeling. Building on these insights, we examine several promising strategies to enhance the world modeling capabilities of LLMs, including test-time scaling, agent training, and more. We hope that Text2World can serve as a crucial resource, laying the groundwork for future research in leveraging LLMs as world models.
pdf
bib
abs
Blinded by Context: Unveiling the Halo Effect of MLLM in AI Hiring
Kyusik Kim
|
Jeongwoo Ryu
|
Hyeonseok Jeon
|
Bongwon Suh
This study investigates the halo effect in AI-driven hiring evaluations using Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs). Through experiments with hypothetical job applications, we examined how these models’ evaluations are influenced by non-job-related information, including extracurricular activities and social media images. By analyzing models’ responses to Likert-scale questions across different competency dimensions, we found that AI models exhibit significant halo effects, particularly in image-based evaluations, while text-based assessments showed more resistance to bias. The findings demonstrate that supplementary multimodal information can substantially influence AI hiring decisions, highlighting potential risks in AI-based recruitment systems.
pdf
bib
abs
CoT-UQ: Improving Response-wise Uncertainty Quantification in LLMs with Chain-of-Thought
Boxuan Zhang
|
Ruqi Zhang
Large language models (LLMs) excel in many tasks but struggle to accurately quantify uncertainty in their generated responses. This limitation makes it challenging to detect misinformation and ensure reliable decision-making. Existing uncertainty quantification (UQ) methods for LLMs are primarily prompt-wise rather than response-wise, often requiring multiple response samples, which leads to inefficiency. Moreover, LLMs have been shown to be overconfident, particularly when using reasoning steps to derive their answers. In this work, we introduce a novel approach to quantify response-wise uncertainty by integrating LLMs’ inherent reasoning capabilities through Chain-of-Thought (CoT) into the UQ process. Our CoT-UQ framework captures critical information during inference by extracting keywords from each reasoning step and assessing their importance to the final answer. The uncertainty scores of keywords are then aggregated based on their significance to produce a final uncertainty estimate. We conduct extensive experiments based on the Llama family with model sizes varying from 8B to 13B across logical and mathematical reasoning tasks. Experimental results demonstrate that CoT-UQ significantly outperforms existing UQ methods, achieving an average improvement of 5.9% AUROC compared to current UQ methods.
pdf
bib
abs
ADO: Automatic Data Optimization for Inputs in LLM Prompts
Sam Lin
|
Wenyue Hua
|
Lingyao Li
|
Zhenting Wang
|
Yongfeng Zhang
This study explores a novel approach to enhance the performance of Large Language Models (LLMs) through the optimization of input data within prompts. While previous research has primarily focused on refining instruction components and augmenting input data with in-context examples, our work investigates the potential benefits of optimizing the input data itself. We introduce a two-pronged strategy for input data optimization: content engineering and structural reformulation. Content engineering involves imputing missing values, removing irrelevant attributes, and enriching profiles by generating additional information inferred from existing attributes. Subsequent to content engineering, structural reformulation is applied to optimize the presentation of the modified content to LLMs, given their sensitivity to input format. Our findings suggest that these optimizations can significantly improve the performance of LLMs in various tasks, offering a promising avenue for future research in prompt engineering. The source code is available at https://github.com/glin2229/Automatic-Data-Optimization.
pdf
bib
abs
Large Language Models Still Exhibit Bias in Long Text
Wonje Jeung
|
Dongjae Jeon
|
Ashkan Yousefpour
|
Jonghyun Choi
Existing fairness benchmarks for large language models (LLMs) primarily focus on simple tasks, such as multiple-choice questions, overlooking biases that may arise in more complex scenarios like long-text generation. To address this gap, we introduce the Long Text Fairness Test (LTF-TEST), a framework that evaluates biases in LLMs through essay-style prompts. LTF-TEST covers 14 topics and 10 demographic axes, including gender and race, resulting in 11,948 samples. By assessing both model responses and the reasoning behind them, LTF-TEST uncovers subtle biases that are difficult to detect in simple responses. In our evaluation of five recent LLMs, including GPT-4o and LLaMA3, we identify two key patterns of bias. First, these models frequently favor certain demographic groups in their responses. Second, they show excessive sensitivity toward traditionally disadvantaged groups, often providing overly protective responses while neglecting others. To mitigate these biases, we propose REGARD-FT, a finetuning approach that pairs biased prompts with neutral responses. REGARD-FT reduces gender bias by 34.6% and improves performance by 1.4 percentage points on the BBQ benchmark, offering a promising approach to addressing biases in long-text generation tasks.
pdf
bib
abs
Do Vision-Language Models Have Internal World Models? Towards an Atomic Evaluation
Qiyue Gao
|
Xinyu Pi
|
Kevin Liu
|
Junrong Chen
|
Ruolan Yang
|
Xinqi Huang
|
Xinyu Fang
|
Lu Sun
|
Gautham Kishore
|
Bo Ai
|
Stone Tao
|
Mengyang Liu
|
Jiaxi Yang
|
Chao-Jung Lai
|
Chuanyang Jin
|
Jiannan Xiang
|
Benhao Huang
|
Zeming Chen
|
David Danks
|
Hao Su
|
Tianmin Shu
|
Ziqiao Ma
|
Lianhui Qin
|
Zhiting Hu
Internal world models (WMs) enable agents to understand the world’s state and predict transitions, serving as the basis for advanced deliberative reasoning. Recent large Vision-Language Models (VLMs), such as GPT-4o and Gemini, exhibit potential as general-purpose WMs. While the latest studies have evaluated and shown limitations in specific capabilities such as visual understanding, a systematic evaluation of VLMs’ fundamental WM abilities remains absent. Drawing on comparative psychology and cognitive science, we propose a two-stage framework that assesses **perception** (visual, spatial, temporal, quantitative, and motion) and **prediction** (mechanistic simulation, transitive inference, compositional inference) to provide an atomic evaluation of VLMs as WMs. Guided by this framework, we introduce **WM-ABench**, a large-scale benchmark comprising 23 fine-grained evaluation dimensions across 6 diverse simulated environments with controlled counterfactual simulations. Through 660 experiments on 15 latest commercial and open-source VLMs, we find that these models exhibit striking limitations in basic world modeling abilities. For instance, all models perform at near-random accuracy when distinguishing motion trajectories. Additionally, they lack disentangled understanding, e.g., they tend to believe blue objects move faster than green ones. Richer results and analyses reveal significant gaps between VLMs and human-level world modeling.
pdf
bib
abs
Protecting Users From Themselves: Safeguarding Contextual Privacy in Interactions with Conversational Agents
Ivoline C. Ngong
|
Swanand Ravindra Kadhe
|
Hao Wang
|
Keerthiram Murugesan
|
Justin D. Weisz
|
Amit Dhurandhar
|
Karthikeyan Natesan Ramamurthy
Conversational agents are increasingly woven into individuals’ personal lives, yet users often underestimate the privacy risks associated with them. The moment users share information with these agents, such as large language models (LLMs), their private information becomes vulnerable to exposure. In this paper, we characterize the notion of contextual privacy for user interactions with LLM-based Conversational Agents (LCAs). It aims to minimize privacy risks by ensuring that users (senders) disclose only information that is both relevant and necessary for achieving their intended goals when interacting with LCAs (untrusted receivers). Through a formative design user study, we observe how even “privacy-conscious” users inadvertently reveal sensitive information through indirect disclosures. Based on insights from this study, we propose a locally deployable framework that operates between users and LCAs, identifying and reformulating out-of-context information in user prompts. Our evaluation using examples from ShareGPT shows that lightweight models can effectively implement this framework, achieving strong gains in contextual privacy while preserving the user’s intended interaction goals. Notably, about 76% of participants in our human evaluation preferred the reformulated prompts over the original ones, validating the usability and effectiveness of contextual privacy in our proposed framework. We open source the code at https://github.com/IBM/contextual-privacy-LLM.
pdf
bib
abs
Enhancing Persona Consistency for LLMs’ Role-Playing using Persona-Aware Contrastive Learning
Ke Ji
|
Yixin Lian
|
Linxu Li
|
Jingsheng Gao
|
Weiyuan Li
|
Bin Dai
In recent years, large language models (LLMs) have achieved breakthrough progress in many dialogue generation tasks. However, their lack of emotion and fine-grained role awareness limits their ability to provide personalized and diverse interactions. Current methods face high costs in collecting high-quality annotated data for scenarios such as role-playing, and traditional human alignment methods are difficult to deploy due to the inherent diversity of model behavior in role-playing scenarios. Inspired by the alignment of models for safety behaviors through RLHF (Reinforcement Learning from Human Feedback), in this paper, we revisit model role-playing behavior from the perspective of persona alignment and propose a novel annotation-free framework named Persona-Aware Contrastive Learning (PCL) to align LLMs’ behavior during role-playing, enhancing the model’s role consistency. Specifically, we first design a role chain method to encourage the model to self-question based on the role characteristics and dialogue context to adjust personality consistency. Then, we further enhance the model’s role-playing strategy through iterative adversarial modeling between responses generated with and without the role characteristics. Experiments on both black-box and white-box LLMs show that LLMs equipped with PCL significantly outperform vanilla LLMs under automatic evaluation methods (CharEval & GPT-4) and human expert evaluation.
pdf
bib
abs
M2-TabFact: Multi-Document Multi-Modal Fact Verification with Visual and Textual Representations of Tabular Data
Mingyang Zhou
|
Lingyu Zhang
|
Sophia Horng
|
Maximillian Chen
|
Kung-Hsiang Huang
|
Shih-Fu Chang
Tabular data is used to store information in many real-world systems ranging from finance to healthcare. However, such structured data is often communicated to humans in visually interpretable formats (e.g. charts and textual paragraphs), making it imperative that fact-checking models should be able to reason over multiple pieces of structured evidence presented across different modalities. In this paper, we propose Multi-Document Multi-Modal Table-based Fact Verification (M2-TabFact), a challenging fact verification task that requires jointly reasoning over visual and textual representations of structured data. We design an automatic data generation pipeline that converts existing tabular data into descriptive visual and textual evidence. We then use Large Language Models to generate complex claims that depend on multi-document, multi-modal evidence. In total, we create 8,856 pairs of complex claims and multi-modal evidence through this procedure and systematically evaluate M2-TabFact with a set of strong vision-language models (VLM). We find that existing VLMs have large gaps in fact verification performance compared to humans. Moreover, we find that they are imbalanced in their ability to reason about different modalities, and currently struggle to reason about information extracted from multiple documents.
pdf
bib
abs
Fuzzy Speculative Decoding for a Tunable Accuracy-Runtime Tradeoff
Maximilian Holsman
|
Yukun Huang
|
Bhuwan Dhingra
Speculative Decoding (SD) enforces strict distributional equivalence to the target model when accepting candidate tokens. While it maintains the target model’s generation quality, this strict equivalence limits the speedup achievable by SD and prevents users from trading deviations from the target distribution in exchange for further inference speed gains. To address these limitations, we introduce Fuzzy Speculative Decoding (FSD) - a decoding algorithm that generalizes SD by accepting candidate tokens based on the divergences between the target and draft model distributions. By allowing for controlled divergence from the target model, FSD enables users to flexibly trade generation quality for inference speed. Across several benchmarks, our method is able to achieve significant runtime improvements of over 5 tokens per second faster than SD at only an approximate 2% absolute reduction in benchmark accuracy. In many cases, FSD is even able to match SD benchmark accuracy at over 2 tokens per second faster, demonstrating that distributional equivalence is not necessary to maintain target model performance. Furthermore, FSD can be seamlessly integrated into existing SD extensions; we demonstrate this by applying FSD to EAGLE-2, greatly enhancing this existing extension’s efficiency while allowing it to leverage FSD’s tunable quality-speed trade-off.
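The divergence-based acceptance idea in this abstract can be illustrated with a minimal sketch. It assumes Jensen-Shannon divergence as the divergence measure and a single fixed threshold; both are illustrative choices that the abstract does not specify, and the names `js_divergence` and `fuzzy_accept` are hypothetical rather than taken from the authors' code.

```python
import math


def js_divergence(p, q):
    """Jensen-Shannon divergence (natural log) between two discrete distributions."""
    def kl(a, b):
        # Terms with a_i == 0 contribute nothing to the KL sum.
        return sum(ai * math.log(ai / bi) for ai, bi in zip(a, b) if ai > 0)

    # Mixture distribution; m_i > 0 wherever p_i > 0 or q_i > 0.
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def fuzzy_accept(p_target, q_draft, threshold):
    """Accept the draft token when the two next-token distributions are close.

    Assumption: a larger threshold tolerates more deviation from the target
    model, trading generation quality for speed.
    """
    return js_divergence(p_target, q_draft) <= threshold
```

With a threshold of 0 this sketch accepts only when the two distributions coincide, and raising the threshold accepts more draft tokens, which is the tunable trade-off the abstract describes.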
pdf
bib
abs
PLAY2PROMPT: Zero-shot Tool Instruction Optimization for LLM Agents via Tool Play
Wei Fang
|
Yang Zhang
|
Kaizhi Qian
|
James R. Glass
|
Yada Zhu
Large language models (LLMs) are increasingly integrated with specialized external tools, yet many tasks demand zero-shot tool usage with minimal or noisy documentation. Existing solutions rely on manual rewriting or labeled data for validation, making them inapplicable in true zero-shot settings. To address these challenges, we propose PLAY2PROMPT, an automated framework that systematically “plays” with each tool to explore its input-output behaviors. Through this iterative trial-and-error process, PLAY2PROMPT refines tool documentation and generates usage examples without any labeled data. These examples not only guide LLM inference but also serve as validation to further enhance tool utilization. Extensive experiments on real-world tasks demonstrate that PLAY2PROMPT significantly improves zero-shot tool performance across both open and closed models, offering a scalable and effective solution for domain-specific tool integration.
pdf
bib
abs
Towards the Pedagogical Steering of Large Language Models for Tutoring: A Case Study with Modeling Productive Failure
Romain Puech
|
Jakub Macina
|
Julia Chatain
|
Mrinmaya Sachan
|
Manu Kapur
One-to-one tutoring is one of the most efficient methods of teaching. With the growing popularity of Large Language Models (LLMs), there have been efforts to create LLM-based conversational tutors which can expand the benefits of one-to-one tutoring to everyone. However, current LLMs are trained primarily to be helpful assistants and lack crucial pedagogical skills. For example, they often quickly reveal the solution to the student and fail to plan for a richer multi-turn pedagogical interaction. To use LLMs in pedagogical settings, they need to be steered to use effective teaching strategies: a problem we introduce as Pedagogical Steering. We develop StratL, an algorithm to optimize LLM prompts and steer it to follow a predefined multi-turn tutoring plan represented as a transition graph. As a case study, we create a prototype tutor for high school math following Productive Failure (PF), an advanced and effective learning design. To validate our approach in a real-world setting, we run a field study with 17 high school students in Singapore and show that StratL succeeds in steering the LLM to follow the PF tutoring strategy. Finally, we highlight challenges in Pedagogical Steering of LLMs and offer opportunities for further improvements by publishing a dataset of PF problems and our code.
pdf
bib
abs
Spotting Out-of-Character Behavior: Atomic-Level Evaluation of Persona Fidelity in Open-Ended Generation
Jisu Shin
|
Juhyun Oh
|
Eunsu Kim
|
Hoyun Song
|
Alice Oh
Ensuring persona fidelity in large language models (LLMs) is essential for maintaining coherent and engaging human-AI interactions. However, LLMs often exhibit Out-of-Character (OOC) behavior, where generated responses deviate from an assigned persona, leading to inconsistencies that affect model reliability. Existing evaluation methods typically assign single scores to entire responses, struggling to capture subtle persona misalignment, particularly in long-form text generation. To address this limitation, we propose an atomic-level evaluation framework that quantifies persona fidelity at a finer granularity. Our three key metrics measure the degree of persona alignment and consistency within and across generations. Our approach enables a more precise and realistic assessment of persona fidelity by identifying subtle deviations that real users would encounter. Through our experiments, we demonstrate that our framework effectively detects persona inconsistencies that prior methods overlook. By analyzing persona fidelity across diverse tasks and personality types, we reveal how task structure and persona desirability influence model adaptability, highlighting challenges in maintaining consistent persona expression.
pdf
bib
abs
What Language Do Non-English-Centric Large Language Models Think in?
Chengzhi Zhong
|
Qianying Liu
|
Fei Cheng
|
Junfeng Jiang
|
Zhen Wan
|
Chenhui Chu
|
Yugo Murawaki
|
Sadao Kurohashi
In this study, we investigate whether non-English-centric large language models ‘think’ in their specialized language. Specifically, we analyze how intermediate layer representations, when projected into the vocabulary space, favor certain languages during generation—termed as latent languages. We categorize non-English-centric models into two groups: CPMs, which are English-centric models with continued pre-training on their specialized language, and BLMs, which are pre-trained on a balanced mix of multiple languages from scratch. Our findings reveal that while English-centric models rely exclusively on English as their latent language, non-English-centric models activate multiple latent languages, dynamically selecting the most similar one based on both the source and target languages. This also influences responses to culture difference questions, reducing English-centric biases in non-English models. This study deepens our understanding of language representation in non-English-centric LLMs, shedding light on the intricate dynamics of multilingual processing at the representational level.
pdf
bib
abs
T5Score: A Methodology for Automatically Assessing the Quality of LLM Generated Multi-Document Topic Sets
Itamar Trainin
|
Omri Abend
Using LLMs for Multi-Document Topic Extraction has recently gained popularity due to their apparent high-quality outputs, expressiveness, and ease of use. However, most existing evaluation practices are not designed for LLM-generated topics and result in low inter-annotator agreement scores, hindering the reliable use of LLMs for the task. To address this, we introduce T5Score, an evaluation methodology that decomposes the quality of a topic set into quantifiable aspects, measurable through easy-to-perform annotation tasks. This framing enables a convenient, manual or automatic, evaluation procedure resulting in a strong inter-annotator agreement score. To substantiate our methodology and claims, we perform extensive experimentation on multiple datasets and report the results.
pdf
bib
abs
Uncertainty-Aware Contrastive Decoding
Hakyung Lee
|
Subeen Park
|
Joowang Kim
|
Sungjun Lim
|
Kyungwoo Song
Large language models excel in a wide range of natural language processing tasks, but generating factually accurate and consistent outputs remains a challenge. To improve text reliability, Contrastive Decoding (CD) refines token selection by leveraging differences between an expert and base model, penalizing low-quality token choices. However, CD employs static weighting between models, making it sensitive to variations in model architecture and input characteristics, often resulting in suboptimal token selection and error propagation throughout generation. We propose Uncertainty-Aware Contrastive Decoding (UCD), a method that dynamically adjusts model contributions at each decoding step based on uncertainty. We introduce a cumulative energy function, where uncertainty is quantified as the negative log-sum-exp over logits, and decomposed into entropy and expected logit components. This energy serves as a dynamic confidence signal, guiding adaptive model weighting during generation. We demonstrate through extensive experiments that UCD significantly improves factual accuracy and reliability over existing decoding methods. Finally, we provide a theoretical analysis showing that our energy function serves as a well-defined uncertainty metric capturing model confidence. Our code is available at: https://github.com/MLAI-Yonsei/UCD.
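The decomposition mentioned in this abstract follows from a standard identity: for logits z with p = softmax(z), the negative log-sum-exp equals -(H(p) + E_p[z]), where H is the Shannon entropy and E_p[z] is the expected logit. The sketch below checks this numerically for a single decoding step. It is not the authors' cumulative energy function, whose exact form the abstract does not give, and the function names are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def energy(logits):
    """Negative log-sum-exp of the logits (computed stably)."""
    m = max(logits)
    return -(m + math.log(sum(math.exp(z - m) for z in logits)))


def entropy(probs):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def expected_logit(logits, probs):
    """Expectation of the logits under the softmax distribution."""
    return sum(p * z for p, z in zip(probs, logits))


# One decoding step with arbitrary illustrative logits.
logits = [2.0, 0.5, -1.0]
probs = softmax(logits)
# Identity: energy == -(entropy + expected logit), up to floating-point error.
gap = energy(logits) + entropy(probs) + expected_logit(logits, probs)
```

Under this identity, energy rises (becomes less negative) when the logits are small on average and the distribution is peaked, which is one way to read it as a confidence signal.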
pdf
bib
abs
GEMS: Generation-Based Event Argument Extraction via Multi-perspective Prompts and Ontology Steering
Run Lin
|
Yao Liu
|
Yanglei Gan
|
Yuxiang Cai
|
Tian Lan
|
Qiao Liu
Generative methods significantly advance event argument extraction by probabilistically generating event argument sequences in a structured format. However, existing approaches primarily rely on a single prompt to generate event arguments in a fixed, predetermined order. Such a rigid approach overlooks the complex structural and dynamic interdependencies among event arguments. In this work, we present GEMS, a multi-prompt learning framework that Generates Event arguments via Multi-perspective prompts and ontology Steering. Specifically, GEMS utilizes multiple unfilled prompts for each sentence, predicting event arguments in varying sequences to explicitly capture the interrelationships between arguments. These predictions are subsequently aggregated using a voting mechanism. Furthermore, an ontology-driven steering mechanism is proposed to ensure that the generated arguments are contextually appropriate and consistent with event-specific knowledge. Extensive experiments on two benchmark datasets demonstrate that GEMS achieves state-of-the-art performance, particularly in low-resource settings. The source code is available at: https://github.com/AONE-NLP/EAE-GEMS
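The aggregation step in this abstract can be sketched as a per-role majority vote over the argument predictions produced by different prompts. The abstract only mentions "a voting mechanism", so simple majority voting, the dictionary representation, and the name `aggregate_by_vote` are all assumptions made for illustration.

```python
from collections import Counter


def aggregate_by_vote(predictions):
    """Pick, for each argument role, the value predicted by the most prompts.

    `predictions` is a list of dicts mapping role name -> predicted argument,
    one dict per prompt. A role missing from a dict counts as an abstention.
    """
    roles = {role for pred in predictions for role in pred}
    result = {}
    for role in sorted(roles):
        counts = Counter(pred[role] for pred in predictions if role in pred)
        result[role] = counts.most_common(1)[0][0]
    return result


# Three hypothetical prompt outputs for one sentence.
prompt_outputs = [
    {"Attacker": "rebels", "Target": "convoy"},
    {"Attacker": "rebels", "Target": "base"},
    {"Attacker": "army", "Target": "convoy"},
]
final_arguments = aggregate_by_vote(prompt_outputs)
```

A real system would also need a tie-breaking rule and possibly confidence weighting; this sketch leaves both out.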
pdf
bib
abs
RomanLens: The Role Of Latent Romanization In Multilinguality In LLMs
Alan Saji
|
Jaavid Aktar Husain
|
Thanmay Jayakumar
|
Raj Dabre
|
Anoop Kunchukuttan
|
Ratish Puduppully
Large Language Models (LLMs) exhibit strong multilingual performance despite being predominantly trained on English-centric corpora. This raises a fundamental question: How do LLMs achieve such multilingual capabilities? Focusing on languages written in non-Roman scripts, we investigate the role of Romanization—the representation of non-Roman scripts using Roman characters—as a potential bridge in multilingual processing. Using mechanistic interpretability techniques, we analyze next-token generation and find that intermediate layers frequently represent target words in Romanized form before transitioning to native script, a phenomenon we term Latent Romanization. Further, through activation patching experiments, we demonstrate that LLMs encode semantic concepts similarly across native and Romanized scripts, suggesting a shared underlying representation. Additionally, for translation into non-Roman script languages, our findings reveal that when the target language is in Romanized form, its representations emerge earlier in the model’s layers compared to native script. These insights contribute to a deeper understanding of multilingual representation in LLMs and highlight the implicit role of Romanization in facilitating language transfer.
pdf
bib
abs
7 Points to Tsinghua but 10 Points to ? Assessing Large Language Models in Agentic Multilingual National Bias
Qianying Liu
|
Katrina Qiyao Wang
|
Fei Cheng
|
Sadao Kurohashi
Large Language Models have garnered significant attention for their capabilities in multilingual natural language processing, while studies on risks associated with cross biases are limited to immediate context preferences. Cross-language disparities in reasoning-based recommendations remain largely unexplored, with a lack of even descriptive analysis. This study is the first to address this gap. We test LLMs’ applicability and capability in providing personalized advice across three key scenarios: university applications, travel, and relocation. We investigate multilingual bias in state-of-the-art LLMs by analyzing their responses to decision-making tasks across multiple languages. We quantify bias in model-generated scores and assess the impact of demographic factors and reasoning strategies (e.g., Chain-of-Thought prompting) on bias patterns. Our findings reveal significant biases in both the scores and the reasoning structure of non-English languages. We also draw future implications for improving multilingual alignment in AI systems.
pdf
bib
abs
Search-in-Context: Efficient Multi-Hop QA over Long Contexts via Monte Carlo Tree Search with Dynamic KV Retrieval
Jiabei Chen
|
Guang Liu
|
Shizhu He
|
Kun Luo
|
Yao Xu
|
Jun Zhao
|
Kang Liu
Recent advancements in large language models (LLMs) have demonstrated remarkable capabilities in complex reasoning tasks, such as math problem-solving and code generation. However, multi-hop question answering (MHQA) over long contexts, which demands both robust knowledge-intensive reasoning and efficient processing of lengthy documents, remains a significant challenge. Existing approaches often struggle to balance these requirements, either neglecting explicit reasoning or incurring expensive computational costs due to full-attention mechanisms over long contexts. To address this, we propose Search-in-Context (SIC), a novel framework that integrates Monte Carlo Tree Search (MCTS) with dynamic key-value (KV) retrieval to enable iterative, context-aware reasoning. SIC dynamically retrieves critical KV pairs (e.g., 4K tokens) at each step, prioritizing relevant evidence while mitigating the “lost in the middle” problem. Furthermore, the paper introduces a Process-Reward Model (PRM) trained on auto-labeled data to guide the MCTS process with stepwise rewards, promoting high-quality reasoning trajectories without manual annotation. Experiments on three long-context MHQA benchmarks (HotpotQA, 2WikiMultihopQA, MuSiQue) and a counterfactual multi-hop dataset demonstrate SIC’s superiority, achieving state-of-the-art performance while significantly reducing computational overhead.
pdf
bib
abs
LLM-as-an-Interviewer: Beyond Static Testing Through Dynamic LLM Evaluation
Eunsu Kim
|
Juyoung Suk
|
Seungone Kim
|
Niklas Muennighoff
|
Dongkwan Kim
|
Alice Oh
We introduce LLM-as-an-Interviewer, a novel paradigm for evaluating large language models (LLMs). This approach leverages multi-turn interactions where the LLM interviewer actively provides feedback on responses and poses follow-up questions to the evaluated LLM. At the start of the interview, the LLM interviewer dynamically modifies datasets to generate initial questions, mitigating data contamination. We apply the LLM-as-an-Interviewer framework to evaluate six models on the reasoning, factuality and instruction-following tasks. Our results show that the framework effectively provides insights into LLM performance, including the quality of initial responses, adaptability to feedback, and ability to address follow-up queries like clarification or additional knowledge requests. The framework also addresses key limitations of conventional methods like LLM-as-a-Judge, including verbosity bias and inconsistency across runs. Finally, we propose the Interview Report, which aggregates insights from the interview process, providing examples and a comprehensive analysis of the LLM’s strengths and weaknesses. This report offers a detailed snapshot of the model’s real-world applicability.
pdf
bib
abs
IntentionESC: An Intention-Centered Framework for Enhancing Emotional Support in Dialogue Systems
Xinjie Zhang
|
Wenxuan Wang
|
Qin Jin
In emotional support conversations, unclear intentions can lead supporters to employ inappropriate strategies, inadvertently imposing their expectations or solutions on the seeker. Clearly defined intentions are essential for guiding both the supporter’s motivations and the overall emotional support process. In this paper, we propose the Intention-centered Emotional Support Conversation (IntentionESC) framework, which defines the possible intentions of supporters in emotional support conversations, identifies key emotional state aspects for inferring these intentions, and maps them to appropriate support strategies. While Large Language Models (LLMs) excel in text generation, they fundamentally operate as probabilistic models trained on extensive datasets, lacking a true understanding of human thought processes and intentions. To address this limitation, we introduce the Intention CEntric Chain-of-Thought (ICECoT) mechanism. ICECoT enables LLMs to mimic human reasoning by analyzing emotional states, inferring intentions, and selecting suitable support strategies, thereby generating more effective emotional support responses. To train the model with ICECoT and integrate expert knowledge, we design an automated annotation pipeline that produces high-quality training data. Furthermore, we develop a comprehensive evaluation scheme to assess emotional support efficacy and conduct extensive experiments to validate our framework. Our data and code will be publicly released to facilitate further research.
pdf
bib
abs
Beyond Context to Cognitive Appraisal: Emotion Reasoning as a Theory of Mind Benchmark for Large Language Models
Gerard Christopher Yeo
|
Kokil Jaidka
Datasets used for emotion recognition tasks typically contain overt cues that can be used in predicting the emotions expressed in a text. However, one challenge is that texts sometimes contain covert contextual cues that are rich in affective semantics, which warrant higher-order reasoning abilities to infer emotional states, not simply the emotions conveyed. This study advances beyond surface-level perceptual features to investigate how large language models (LLMs) reason about others’ emotional states using contextual information, within a Theory-of-Mind (ToM) framework. Grounded in Cognitive Appraisal Theory, we curate a specialized ToM evaluation dataset to assess both forward reasoning—from context to emotion—and backward reasoning—from emotion to inferred context. We showed that LLMs can reason to a certain extent, although they are poor at associating situational outcomes and appraisals with specific emotions. Our work highlights the need for psychological theories in the training and evaluation of LLMs in the context of emotion reasoning.
pdf
bib
abs
CSTRL: Context-Driven Sequential Transfer Learning for Abstractive Radiology Report Summarization
Mst. Fahmida Sultana Naznin
|
Adnan Ibney Faruq
|
Mostafa Rifat Tazwar
|
Md Jobayer
|
Md. Mehedi Hasan Shawon
|
Md Rakibul Hasan
A radiology report comprises several sections, including the Findings and Impression of the diagnosis. Automatically generating the Impression from the Findings is crucial for reducing radiologists’ workload and improving diagnostic accuracy. Pretrained models that excel in common abstractive summarization problems encounter challenges when applied to specialized medical domains largely due to the complex terminology and the necessity for accurate clinical context. Such tasks in medical domains demand extracting core information, avoiding context shifts, and maintaining proper flow. Misuse of medical terms can lead to drastic clinical errors. To address these issues, we introduce a sequential transfer learning approach that ensures key content extraction and coherent summarization. Sequential transfer learning often faces challenges like initial parameter decay and knowledge loss, which we resolve with the Fisher matrix regularization. Using MIMIC-CXR and Open-I datasets, our model, CSTRL — Context-driven Sequential TRansfer Learning — achieved state-of-the-art performance, showing 56.2% improvement in BLEU-1, 40.5% in BLEU-2, 84.3% in BLEU-3, 28.9% in ROUGE-1, 41.0% in ROUGE-2 and 26.5% in ROUGE-3 score over benchmark studies. We also analyze factual consistency scores while preserving the medical context. Our code is publicly available at https://github.com/fahmidahossain/Report_Summarization.
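One common way to realize Fisher-based regularization in sequential transfer learning is an elastic-weight-consolidation-style quadratic penalty, (λ/2) Σ_i F_i (θ_i − θ*_i)², where θ* are the parameters learned on the earlier task and F_i are diagonal Fisher information estimates. The sketch below shows only that penalty term. Whether CSTRL uses exactly this form is an assumption, since the abstract names the technique without giving the formula, and `fisher_penalty` is a hypothetical name.

```python
def fisher_penalty(params, anchor_params, fisher_diag, lam=1.0):
    """EWC-style penalty: (lam / 2) * sum_i F_i * (theta_i - theta_star_i) ** 2.

    params        -- current parameter values (flat list of floats)
    anchor_params -- parameter values after the previous training stage
    fisher_diag   -- diagonal Fisher information estimates, one per parameter
    lam           -- regularization strength (illustrative default)
    """
    if not (len(params) == len(anchor_params) == len(fisher_diag)):
        raise ValueError("all inputs must have the same length")
    weighted_sq_dist = sum(
        f * (theta - anchor) ** 2
        for theta, anchor, f in zip(params, anchor_params, fisher_diag)
    )
    return 0.5 * lam * weighted_sq_dist


# The second parameter moved by 1.0 and has Fisher weight 2.0, so it dominates.
penalty = fisher_penalty([1.0, 2.0], [1.0, 1.0], [0.5, 2.0], lam=1.0)
```

In training, a term like this would be added to the task loss of the later stage, so parameters that mattered for the earlier stage (large F_i) are discouraged from drifting.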
pdf
bib
abs
Rethinking Prompt-based Debiasing in Large Language Model
Xinyi Yang
|
Runzhe Zhan
|
Shu Yang
|
Junchao Wu
|
Lidia S. Chao
|
Derek F. Wong
Investigating bias in large language models (LLMs) is crucial for developing trustworthy AI. While prompt-based debiasing through prompt engineering is common, its effectiveness relies on the assumption that models inherently understand biases. Our study systematically analyzed this assumption using the BBQ and StereoSet benchmarks on both open-source models and a commercial GPT model. Experimental results indicate that prompt-based debiasing is often superficial; for instance, the Llama2-7B-Chat model misclassified over 90% of unbiased content as biased, despite achieving high accuracy in identifying bias issues on the BBQ dataset. Additionally, specific evaluation and question settings in bias benchmarks often lead LLMs to choose “evasive answers”, disregarding the core of the question and the relevance of the response to the context. Moreover, the apparent success of previous methods may stem from flawed evaluation metrics. Our research highlights a potential “false prosperity” in prompt-based debiasing efforts and emphasizes the need to rethink bias evaluation metrics to ensure truly trustworthy AI. We will release our data and code upon acceptance.
pdf
bib
abs
Exploring In-context Example Generation for Machine Translation
Dohyun Lee
|
Seungil Chad Lee
|
Chanwoo Yang
|
Yujin Baek
|
Jaegul Choo
Large language models (LLMs) have demonstrated strong performance across various tasks, leveraging their exceptional in-context learning ability with only a few examples. Accordingly, the selection of optimal in-context examples has been actively studied in the field of machine translation. However, these studies presuppose the presence of a demonstration pool with human-annotated pairs, making them less applicable to low-resource languages where such an assumption is challenging to meet. To overcome this limitation, this paper explores the research direction of in-context example generation for machine translation. Specifically, we propose Demonstration Augmentation for Translation (DAT), a simple yet effective approach that generates example pairs without relying on any external resources. This method builds upon two prior criteria, relevance and diversity, which have been highlighted in previous work as key factors for in-context example selection. Through experiments and analysis on low-resource languages where human-annotated pairs are scarce, we show that DAT achieves superior translation quality compared to the baselines. Furthermore, we investigate the potential of progressively accumulating generated pairs during test time to build and reuse a demonstration pool. Our implementation is publicly available at https://github.com/aiclaudev/DAT.
pdf
bib
abs
Knowledge Base Construction for Knowledge-Augmented Text-to-SQL
Jinheon Baek
|
Horst Samulowitz
|
Oktie Hassanzadeh
|
Dharmashankar Subramanian
|
Sola Shirai
|
Alfio Gliozzo
|
Debarun Bhattacharjya
Text-to-SQL aims to translate natural language queries into SQL statements, which is practical as it enables anyone to easily retrieve the desired information from databases. Recently, many existing approaches tackle this problem with Large Language Models (LLMs), leveraging their strong capability in understanding user queries and generating corresponding SQL code. Yet, the parametric knowledge in LLMs might be insufficient to cover all the diverse and domain-specific queries that require grounding in various database schemas, which often makes the generated SQL less accurate. To tackle this, we propose constructing the knowledge base for text-to-SQL, a foundational source of knowledge, from which we retrieve and generate the necessary knowledge for given queries. In particular, unlike existing approaches that either manually annotate knowledge or generate only a few pieces of knowledge for each query, our knowledge base is comprehensive, which is constructed based on a combination of all the available questions and their associated database schemas along with their relevant knowledge, and can be reused for unseen databases from different datasets and domains. We validate our approach on multiple text-to-SQL datasets, considering both the overlapping and non-overlapping database scenarios, where it outperforms relevant baselines substantially.
pdf
bib
abs
NBDESCRIB: A Dataset for Text Description Generation from Tables and Code in Jupyter Notebooks with Guidelines
Xuye Liu
|
Tengfei Ma
|
Yimu Wang
|
Fengjie Wang
|
Jian Zhao
Generating cell-level descriptions for Jupyter Notebooks, which are a major resource consisting of code, tables, and descriptions, has been attracting increasing research attention. However, existing methods for Jupyter Notebooks mostly focus on generating descriptions from code snippets or table outputs independently. On the other hand, descriptions should be personalized, as users have different purposes in different scenarios, while previous work ignored this situation during description generation. In this work, we formulate a new task, personalized description generation with code, tables, and user-written guidelines in Jupyter Notebooks. To evaluate this new task, we collect and propose a benchmark, namely NBDESCRIB, containing code, tables, and user-written guidelines as inputs and personalized descriptions as targets. Extensive experiments show that while existing models of text generation are able to generate fluent and readable descriptions, they still struggle to produce factually correct descriptions without user-written guidelines. CodeT5 achieved the highest scores in Orientation (1.27) and Correctness (-0.43) among foundation models in human evaluation, while the ground truth scored higher in Orientation (1.45) and Correctness (1.19). Common error patterns involve misalignment with guidelines, incorrect variable values, omission of important code information, and reasoning errors. Moreover, ablation studies show that adding guidelines significantly enhances performance, both qualitatively and quantitatively.
pdf
bib
abs
ECoRAG: Evidentiality-guided Compression for Long Context RAG
Yeonseok Jeong
|
Jinsu Kim
|
Dohyeon Lee
|
Seung-won Hwang
Large Language Models (LLMs) have shown remarkable performance in Open-Domain Question Answering (ODQA) by leveraging external documents through Retrieval-Augmented Generation (RAG). To reduce the RAG overhead incurred by longer contexts, context compression is necessary. However, prior compression methods do not focus on filtering out non-evidential information, which limits the performance in LLM-based RAG. We thus propose Evidentiality-guided RAG, or the ECoRAG framework. ECoRAG improves LLM performance by compressing retrieved documents based on evidentiality, ensuring that answer generation is supported by the correct evidence. As an additional step, ECoRAG reflects whether the compressed content provides sufficient evidence, and if not, retrieves more until sufficient. Experiments show that ECoRAG improves LLM performance on ODQA tasks, outperforming existing compression methods. Furthermore, ECoRAG is highly cost-efficient, as it not only reduces latency but also minimizes token usage by retaining only the necessary information to generate the correct answer. Code is available at https://github.com/ldilab/ECoRAG.
pdf
bib
abs
From Complexity to Clarity: AI/NLP’s Role in Regulatory Compliance
Jivitesh Jain
|
Nivedhitha Dhanasekaran
|
Mona T. Diab
Regulatory data compliance is a cornerstone of trust and accountability in critical sectors like finance, healthcare, and technology, yet its complexity poses significant challenges for organizations worldwide. Recent advances in natural language processing, particularly large language models, have demonstrated remarkable capabilities in text analysis and reasoning, offering promising solutions for automating compliance processes. This survey examines the current state of automated data compliance, analyzing key challenges and approaches across problem areas. We identify critical limitations in current datasets and techniques, including issues of adaptability, completeness, and trust. Looking ahead, we propose research directions to address these challenges, emphasizing standardized evaluation frameworks and balanced human-AI collaboration.
pdf
bib
abs
EXPERT: An Explainable Image Captioning Evaluation Metric with Structured Explanations
Hyunjong Kim
|
Sangyeop Kim
|
Jongheon Jeong
|
Yeongjae Cho
|
Sungzoon Cho
Recent advances in large language models and vision-language models have led to growing interest in explainable evaluation metrics for image captioning. However, these metrics generate explanations without standardized criteria, and the overall quality of the generated explanations remains unverified. In this paper, we propose EXPERT, a reference-free evaluation metric that provides structured explanations based on three fundamental criteria: fluency, relevance, and descriptiveness. By constructing large-scale datasets of high-quality structured explanations, we develop a two-stage evaluation template to effectively supervise a vision-language model for both scoring and explanation generation. EXPERT achieves state-of-the-art results on benchmark datasets while providing significantly higher-quality explanations than existing metrics, as validated through comprehensive human evaluation. Our code and datasets are available at https://github.com/hjkim811/EXPERT.
pdf
bib
abs
Mind Your Theory: Theory of Mind Goes Deeper Than Reasoning
Eitan Wagner
|
Nitay Alon
|
Joseph M Barnby
|
Omri Abend
Theory of Mind (ToM) capabilities in LLMs have recently become a central object of investigation, sparking debates and discussions. In this position paper, we explore many lines of work in different communities in AI and cognitive science. Inspired by cognitive work, we view ToM tasks as a two-step process: (I) first, determining whether and how to invoke ToM, which includes setting the appropriate Depth of Mentalizing (DoM); and (II) second, applying correct inference given the appropriate DoM. We identify that many works about ToM in LLMs, such as benchmarks and add-on modules, tend to unjustly overlook the first step and focus exclusively on the second one, which can be framed as a logic-reasoning task. We support our distinction with empirical evidence about the difficulty of the different steps in existing benchmarks. We conclude with suggestions for improved evaluation of ToM capabilities, inspired by dynamic environments used in cognitive tasks in biological agents.
pdf
bib
abs
LLMs are Biased Evaluators But Not Biased for Fact-Centric Retrieval Augmented Generation
Yen-Shan Chen
|
Jing Jin
|
Peng-Ting Kuo
|
Chao-Wei Huang
|
Yun-Nung Chen
Recent studies have demonstrated that large language models (LLMs) exhibit significant biases in evaluation tasks, particularly in preferentially rating and favoring self-generated content. However, the extent to which this bias manifests in fact-oriented tasks, especially within retrieval-augmented generation (RAG) frameworks—where keyword extraction and factual accuracy take precedence over stylistic elements—remains unclear. Our study addresses this knowledge gap by simulating two critical phases of the RAG framework. In the first phase, LLMs evaluated human-authored and model-generated passages, emulating the pointwise reranking phase. The second phase involves conducting pairwise reading comprehension tests to simulate the generation phase. Contrary to previous findings indicating a self-preference in rating tasks, our results reveal no significant self-preference effect in RAG frameworks. Instead, we observe that factual accuracy significantly influences LLMs’ output, even in the absence of prior knowledge. These findings are consistent among three common QA datasets (NQ, MARCO, TriviaQA) and 5 widely adopted language models (GPT-3.5, GPT-4o-mini, Gemini, LLaMA3, and Mistral). Our research contributes to the ongoing discourse on LLM biases and their implications for RAG-based systems, offering insights that may inform the development of more robust and unbiased LLM systems.
pdf
bib
abs
Standard Quality Criteria Derived from Current NLP Evaluations for Guiding Evaluation Design and Grounding Comparability and AI Compliance Assessments
Anya Belz
|
Simon Mille
|
Craig Thomson
Research shows that two evaluation experiments reporting results for the same quality criterion name (e.g. Fluency) do not necessarily evaluate the same aspect of quality. Not knowing when two evaluations are comparable in this sense means we currently lack the ability to draw conclusions based on multiple independently conducted evaluations. It is hard to see how this issue can be fully addressed other than by the creation of a standard set of quality criterion names and definitions that the evaluations in use in NLP can be grounded in. Taking a descriptivist approach, the QCET Quality Criteria for Evaluation Taxonomy derives a standard set of 114 quality criterion names and definitions from three surveys of a combined total of 933 evaluation experiments in NLP, and structures them into a reference taxonomy. We present QCET and its uses in (i) establishing comparability of existing evaluations, (ii) guiding the design of new evaluations, and (iii) assessing regulation compliance.
pdf
bib
abs
skLEP: A Slovak General Language Understanding Benchmark
Marek Suppa
|
Andrej Ridzik
|
Daniel Hládek
|
Tomáš Javůrek
|
Viktória Ondrejová
|
Kristína Sásiková
|
Martin Tamajka
|
Marian Simko
In this work, we introduce skLEP, the first comprehensive benchmark specifically designed for evaluating Slovak natural language understanding (NLU) models. We have compiled skLEP to encompass nine diverse tasks that span token-level, sentence-pair, and document-level challenges, thereby offering a thorough assessment of model capabilities. To create this benchmark, we curated new, original datasets tailored for Slovak and meticulously translated established English NLU resources. Within this paper, we also present the first systematic and extensive evaluation of a wide array of Slovak-specific, multilingual, and English pre-trained language models using the skLEP tasks. Finally, we also release the complete benchmark data, an open-source toolkit facilitating both fine-tuning and evaluation of models, and a public leaderboard at
https://github.com/slovak-nlp/sklep in the hopes of fostering reproducibility and driving future research in Slovak NLU.
pdf
bib
abs
Can Vision Language Models Understand Mimed Actions?
Hyundong Justin Cho
|
Spencer Lin
|
Tejas Srinivasan
|
Michael Saxon
|
Deuksin Kwon
|
Natali T. Chavez
|
Jonathan May
Non-verbal communication (NVC) is an integral part of human language, but it has been overlooked in natural language processing research. Studying NVC in general is challenging because of its high variance in interpretation among individuals and cultures, but mime—the theatrical technique of suggesting intent using only gesture, expression, and movement—is a subset of NVC with much lower human interpretation variance. As a gateway for evaluating vision-language models on their understanding of NVC, we propose Mime Identification-based Multimodal Evaluation (MIME), a gesture recognition task built upon a novel corpus of mimed activity comprising 86 unique gestures with a variety of perturbations applied to the avatar, background, and viewpoint for evaluating recognition robustness. We find that both open-weight and API-based vision-language models perform significantly worse than humans at identifying mimed gestures in MIME, motivating the need for increased research for instilling more robust understanding of human actions for VLMs.
pdf
bib
abs
Training Language Model to Critique for Better Refinement
Tianshu Yu
|
Chao Xiang
|
Mingchuan Yang
|
Pei Ke
|
Bosi Wen
|
Cunxiang Wang
|
Jiale Cheng
|
Li Zhang
|
Xinyu Mu
|
Chuxiong Sun
|
Minlie Huang
Large language models (LLMs) have demonstrated remarkable evaluation and critique capabilities, providing insightful feedback and identifying flaws in various tasks. However, limited research has explored which types of critiques are most effective for improving model responses or how to generate such critiques. To address this gap, we introduce Refinement-oriented Critique Optimization (RCO), a novel framework designed to train critic models using refinement signals. RCO uses a feedback loop where critiques, generated by the critic model, guide the actor model in refining its responses. The critique utility (CU) quantifies the effectiveness of these refinements, serving as the reward signal for training the critic model. By focusing on critiques that lead to better refinements, RCO eliminates the need for direct critique preference assessment, ensuring that critiques driving meaningful improvements are rewarded. We evaluate RCO across five tasks—dialog generation, summarization, question answering, mathematical reasoning, and code generation—and show that it significantly outperforms traditional methods and open-source models in terms of critique quality and refinement outcomes. Our contributions include the introduction of RCO, a novel supervision scheme based on refined response preferences, and comprehensive experimental results that highlight the method’s effectiveness in enhancing LLM critique-refinement loops. Code and data will be publicly available upon acceptance of this paper.
pdf
bib
abs
Dynamic Task Vector Grouping for Efficient Multi-Task Prompt Tuning
Peiyi Zhang
|
Richong Zhang
|
Zhijie Nie
|
Ziqiao Wang
Multi-task prompt tuning utilizes multiple high-resource source tasks to improve performance on low-resource target tasks. Existing approaches transfer, in a single one-time step, the soft prompt trained either on all source tasks combined or on a single “high-similar” source task. However, we find that the optimal transfer performance often comes from a combination of source tasks, which is neither one nor all. Further, we find that the similarity between source and target tasks also changes dynamically during fine-tuning after transferring, making similarity calculation in the initiation stage inadequate. To address these issues, we propose a method called Dynamic Task Vector Grouping (DTVG), whose core ideas are (1) measuring task similarity with task vectors instead of soft prompts, (2) grouping the optimal source task combination based on two metrics, target similarity and knowledge consistency, and (3) dynamically updating the combination in each iteration step. Extensive experiments on 26 NLP datasets under different settings demonstrate that DTVG effectively groups similar source tasks while reducing negative transfer, achieving state-of-the-art performance.
pdf
bib
abs
DICE-BENCH: Evaluating the Tool-Use Capabilities of Large Language Models in Multi-Round, Multi-Party Dialogues
Kyochul Jang
|
Donghyeon Lee
|
Kyusik Kim
|
Dongseok Heo
|
Taewhoo Lee
|
Woojeong Kim
|
Bongwon Suh
Existing function-calling benchmarks focus on single-turn interactions. However, they overlook the complexity of real-world scenarios. To quantify how existing benchmarks address practical applications, we introduce DICE-SCORE, a metric that evaluates the dispersion of tool-related information such as function name and parameter values throughout the dialogue. Analyzing existing benchmarks through DICE-SCORE reveals notably low scores, highlighting the need for more realistic scenarios. To address this gap, we present DICE-BENCH, a framework that constructs practical function-calling datasets by synthesizing conversations through a tool graph that maintains dependencies across rounds and a multi-agent system with distinct personas to enhance dialogue naturalness. The final dataset comprises 1,607 high-DICE-SCORE instances. Our experiments on 19 LLMs with DICE-BENCH show that significant advances are still required before such models can be deployed effectively in real-world settings. Our code and data are all publicly available.
pdf
bib
abs
HASH-RAG: Bridging Deep Hashing with Retriever for Efficient, Fine Retrieval and Augmented Generation
Jinyu Guo
|
Xunlei Chen
|
Qiyang Xia
|
Zhaokun Wang
|
Jie Ou
|
Libo Qin
|
Shunyu Yao
|
Wenhong Tian
Retrieval-Augmented Generation (RAG) encounters efficiency challenges when scaling to massive knowledge bases while preserving contextual relevance. We propose Hash-RAG, a framework that integrates deep hashing techniques with systematic optimizations to address these limitations. Our queries directly learn binary hash codes from knowledgebase code, eliminating intermediate feature extraction steps and significantly reducing storage and computational overhead. Building upon this hash-based efficient retrieval framework, we establish the foundation for fine-grained chunking. Consequently, we design a Prompt-Guided Chunk-to-Context (PGCC) module that leverages retrieved hash-indexed propositions and their original document segments through prompt engineering to enhance the LLM’s contextual awareness. Experimental evaluations on NQ, TriviaQA, and HotpotQA datasets demonstrate that our approach achieves a 90% reduction in retrieval time compared to conventional methods while maintaining considerable recall performance. Additionally, the proposed system outperforms retrieval/non-retrieval baselines by 1.4-4.3% in EM scores.
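To make the idea of retrieval over binary hash codes concrete, the following is a minimal sketch of generic Hamming-distance lookup. It is not the Hash-RAG implementation: the abstract does not specify how codes are learned or searched, so the 8-bit codes, the passage ids, and the `hamming` and `retrieve` helpers are all hypothetical and chosen only for illustration.

```python
def hamming(a: int, b: int) -> int:
    """Count the bits that differ between two binary codes stored as ints."""
    return bin(a ^ b).count("1")


def retrieve(query_code: int, index: dict[str, int], k: int = 2) -> list[str]:
    """Return the ids of the k stored codes closest to the query.

    Ties in Hamming distance are broken by id so the result is deterministic.
    """
    ranked = sorted(
        index,
        key=lambda doc_id: (hamming(query_code, index[doc_id]), doc_id),
    )
    return ranked[:k]


# Hypothetical 8-bit codes; a real system would use learned, longer codes.
index = {"p1": 0b10110010, "p2": 0b10110011, "p3": 0b01001101}
top = retrieve(0b10110110, index, k=2)
print(top)
```

Comparing codes needs only an XOR and a bit count, which is why binary codes tend to reduce both storage and comparison cost relative to dense floating-point vectors.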
pdf
bib
abs
A Constrained Text Revision Agent via Iterative Planning and Searching
Hannan Cao
|
Hwee Tou Ng
Existing text revision systems are capable of generating fluent and coherent text, but struggle with constrained text revision (CTR), which requires adherence to specific constraints. Furthermore, adapting these systems to diverse constraints is challenging. To bridge this gap, we introduce TRIPS, a Text Revision agent via Iterative Planning and Searching, focusing on CTR. TRIPS utilizes a planner, a reviser (i.e., a large language model), and adaptable tools to generate revisions tailored to different scenarios. Specifically, we propose an iterative self-training alignment method to construct the planner, which generates tool usage and text revision plans. Furthermore, we propose Tool-Guided Monte Carlo Tree Search (TG-MCTS), a novel CTR algorithm that extends MCTS with tool-guided expansion and evaluation, enabling the search for optimal revision strategies across various scenarios. To evaluate TRIPS, we introduce ConsTRev, a dataset with multi-level constrained instructions for paragraph-level revision. Experimental results show that TRIPS outperforms baselines in both constraint adherence and revision quality. Furthermore, TRIPS exhibits robust performance across diverse use cases, including plain text and LaTeX revision.
pdf
bib
abs
MMRefine: Unveiling the Obstacles to Robust Refinement in Multimodal Large Language Models
Gio Paik
|
Geewook Kim
|
Jinbae Im
This paper introduces MMRefine, a MultiModal Refinement benchmark designed to evaluate the error refinement capabilities of Multimodal Large Language Models (MLLMs). As the emphasis shifts toward enhancing reasoning during inference, MMRefine provides a framework that evaluates MLLMs’ abilities to detect and correct errors across six distinct scenarios beyond just comparing final accuracy before and after refinement. Furthermore, the benchmark analyzes the refinement performance by categorizing errors into six error types. Experiments with various open and closed MLLMs reveal bottlenecks and factors impeding refinement performance, highlighting areas for improvement in effective reasoning enhancement. Our code and dataset are publicly available at
https://github.com/naver-ai/MMRefine.
pdf
bib
abs
How Programming Concepts and Neurons Are Shared in Code Language Models
Amir Hossein Kargaran
|
Yihong Liu
|
François Yvon
|
Hinrich Schuetze
Several studies have explored the mechanisms of large language models (LLMs) in coding tasks, but most have focused on programming languages (PLs) in a monolingual setting. In this paper, we investigate the relationship between multiple PLs and English in the concept space of LLMs. We perform a few-shot translation task on 21 PL pairs using two Llama-based models. By decoding the embeddings of intermediate layers during this task, we observe that the concept space is closer to English (including PL keywords) and assigns high probabilities to English tokens in the second half of the intermediate layers. We analyze neuron activations for 11 PLs and English, finding that while language-specific neurons are primarily concentrated in the bottom layers, those exclusive to each PL tend to appear in the top layers. For PLs that are highly aligned with multiple other PLs, identifying language-specific neurons is not feasible. These PLs also tend to have a larger keyword set than other PLs and are closer to the model’s concept space regardless of the input/output PL in the translation task. Our findings provide insights into how LLMs internally represent PLs, revealing structural patterns in the model’s concept space. Code is available at https://github.com/cisnlp/code-specific-neurons.
pdf
bib
abs
DynaQuest: A Dynamic Question Answering Dataset Reflecting Real-World Knowledge Updates
Qian Lin
|
Junyi Li
|
Hwee Tou Ng
The rapidly changing nature of real-world information presents challenges for large language models (LLMs), which are typically trained on static datasets. This limitation makes it difficult for LLMs to accurately perform tasks that require up-to-date knowledge, such as time-sensitive question answering (QA). In this paper, we introduce DynaQuest, a Dynamic Question answering dataset reflecting knowledge updates in the real world. DynaQuest is based on Wikipedia Infoboxes, which are frequently updated to reflect real-world changes. Our dataset is created by automatically identifying and comparing changes between different versions of Wikipedia pages and generating question-answer pairs based on these updates. To address the challenges posed by our dynamic dataset, we propose CARL, a Context-Aware Reinforcement Learning framework to improve the performance of LLMs on time-sensitive question answering. We conduct experiments on our collected dataset across recent time periods and demonstrate the effectiveness of our approach. Furthermore, we maintain a dynamic knowledge updating process, providing a periodically evolving benchmark to continually evaluate LLMs’ ability to answer time-sensitive questions.
pdf
bib
abs
ProcrustesGPT: Compressing LLMs with Structured Matrices and Orthogonal Transformations
Ekaterina Grishina
|
Mikhail Gorbunov
|
Maxim Rakhuba
Large language models (LLMs) demonstrate impressive results in natural language processing tasks but require a significant amount of computational and memory resources. Structured matrix representations are a promising way for reducing the number of parameters of these models. However, it seems unrealistic to expect that weight matrices of pretrained models can be accurately represented by structured matrices without any fine-tuning. To overcome this issue, we utilize the fact that LLM output is invariant under certain orthogonal transformations of weight matrices. This insight can be leveraged to identify transformations that significantly improve the compressibility of weights within structured classes. The proposed approach is applicable to various types of structured matrices that support efficient projection operations. Code is available at: https://github.com/GrishKate/ProcrustesGPT.
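The invariance mentioned in this abstract can be illustrated with a toy case. The sketch below assumes two consecutive purely linear layers, which is a simplification and not the paper's setting (a real transformer also has normalization and nonlinearities, which the abstract does not detail). The matrices `W1`, `W2`, the rotation angle, and all helper functions are hypothetical. It checks that replacing `W1` with `Q W1` and `W2` with `W2 Qᵀ` for an orthogonal `Q` leaves the product `W2 W1` unchanged.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def transpose(m):
    """Return the transpose of a matrix given as a list of rows."""
    return [list(row) for row in zip(*m)]


def rotation(theta):
    """A 2x2 rotation matrix, which is orthogonal by construction."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s], [s, c]]


# Hypothetical weights of two stacked linear layers: y = W2 (W1 x).
W1 = [[1.0, 2.0], [3.0, 4.0]]
W2 = [[0.5, -1.0], [2.0, 0.0]]
Q = rotation(0.7)

# Rotate the hidden space: the individual weights change, the product does not.
W1_rot = matmul(Q, W1)
W2_rot = matmul(W2, transpose(Q))

original = matmul(W2, W1)
rotated = matmul(W2_rot, W1_rot)
max_diff = max(
    abs(x - y)
    for row_o, row_r in zip(original, rotated)
    for x, y in zip(row_o, row_r)
)
print(max_diff)  # zero up to floating-point rounding
```

Because any orthogonal `Q` preserves the composed map, there is freedom to pick the `Q` whose rotated weights are best approximated by a chosen structured class; how that choice is made is specific to the paper and is not reproduced here.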
pdf
bib
abs
Revisiting In-Context Learning with Long Context Language Models
Jinheon Baek
|
Sun Jae Lee
|
Prakhar Gupta
|
Geunseob Oh
|
Siddharth Dalmia
|
Prateek Kolhar
In-Context Learning (ICL) is a technique by which language models make predictions based on examples provided in their input context. Previously, their context window size imposed a limit on the number of examples that can be shown, making example selection techniques crucial for identifying the maximally effective set of examples. However, the recent advent of Long Context Language Models (LCLMs) has significantly increased the number of examples that can be included in context, raising an important question of whether ICL performance in a many-shot regime is still sensitive to the method of sample selection. To answer this, we revisit these approaches in the context of LCLMs through extensive experiments on 18 datasets spanning 4 tasks. Surprisingly, we observe that sophisticated example selection techniques do not yield significant improvements over a simple random sample selection method. Instead, we discover that the advent of LCLMs has fundamentally shifted the challenge of ICL from that of selecting the most effective examples to that of collecting sufficient examples to fill the context window. Specifically, in certain datasets, including all available examples does not fully utilize the context window; however, by augmenting the examples in context with a simple data augmentation approach, we substantially improve ICL performance by 5%.
pdf
bib
abs
Rationalize and Align: Enhancing Writing Assistance with Rationale via Self-Training for Improved Alignment
Hannan Cao
|
Hai Ye
|
Hwee Tou Ng
A Writing Assistant (WA) is a system that offers writing suggestions based on user instructions. Existing WAs are typically built by training large language models (LLMs) on domain-specific instruction data through supervised fine-tuning (SFT) only. However, SFT optimizes models to match a single reference, failing to capture the inherent flexibility of text editing, where multiple valid revisions exist. Therefore, solely relying on SFT limits WA performance. To address this limitation, we propose the Rationalize and Align framework, which enhances the WA performance with rationale (i.e., linguistic explanations) and alignment. Our framework automatically generates the rationale and preference data for writing tasks via distillation and self-training, eliminating the need for human annotation. These data are then leveraged to refine WA using a novel preference optimization method. Empirical results show that our framework significantly improves WA performance. Our WA outperforms both open-source state-of-the-art WAs and the closed-source GPT-4o by 3.9 and 7.1 points on average, respectively, across eight well-established writing-related test sets.
pdf
bib
abs
Accelerating Adaptive Retrieval Augmented Generation via Instruction-Driven Representation Reduction of Retrieval Overlaps
Jie Ou
|
Jinyu Guo
|
Shuaihong Jiang
|
Zhaokun Wang
|
Libo Qin
|
Shunyu Yao
|
Wenhong Tian
Retrieval-augmented generation (RAG) has emerged as a pivotal method for expanding the knowledge of large language models. To handle complex queries more effectively, researchers developed Adaptive-RAG (A-RAG) to enhance generation quality through multiple interactions with external knowledge bases. Despite its effectiveness, A-RAG exacerbates the pre-existing efficiency challenges inherent in RAG, which are attributable to its reliance on multiple iterations of generation. Existing A-RAG approaches process all retrieved contents from scratch. However, they ignore the situation where there is a significant overlap in the content of the retrieval results across rounds. The overlapping content is redundantly represented, which leads to a large proportion of repeated computations, thus affecting the overall efficiency. To address this issue, this paper introduces a model-agnostic approach that can be generally applied to A-RAG methods and is dedicated to reducing the redundant representation process caused by the overlapping of retrieval results. Specifically, we use cache access and parallel generation to speed up the prefilling and decoding stages, respectively. We also propose an instruction-driven module to further guide the model to attend to each part of the content in a way more suitable for LLMs. Experiments show that our approach achieves average speedups of 2.79 and 2.33 times for prefilling and decoding, respectively, while maintaining equal generation quality.
pdf
bib
abs
MEXA: Multilingual Evaluation of English-Centric LLMs via Cross-Lingual Alignment
Amir Hossein Kargaran
|
Ali Modarressi
|
Nafiseh Nikeghbal
|
Jana Diesner
|
François Yvon
|
Hinrich Schuetze
English-centric large language models (LLMs) often show strong multilingual capabilities. However, their multilingual performance remains unclear and is under-evaluated for many other languages. Most benchmarks for multilinguality focus on classic NLP tasks or cover a minimal number of languages. We introduce MEXA, a method for assessing the multilingual capabilities of pre-trained English-centric LLMs using parallel sentences, which are available for more languages than existing downstream tasks. MEXA leverages the fact that English-centric LLMs use English as a pivot language in their intermediate layers. MEXA computes the alignment between English and non-English languages using parallel sentences to evaluate the transfer of language understanding from English to other languages. This alignment can be used to estimate model performance in different languages. We conduct controlled experiments using various parallel datasets (FLORES-200 and Bible), models (Llama family, Gemma family, Mistral, and OLMo), and established downstream tasks (Belebele, m-MMLU, and m-ARC). We explore different methods to compute embeddings in decoder-only models. Our results show that MEXA, in its default settings, achieves an average Pearson correlation of 0.90 between its predicted scores and actual task performance across languages. This suggests that MEXA is a reliable method for estimating the multilingual capabilities of English-centric LLMs, providing a clearer understanding of their multilingual potential and the inner workings of LLMs. Leaderboard: https://cis-lmu-mexa.hf.space, Code: https://github.com/cisnlp/MEXA.
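For readers unfamiliar with the statistic reported in this abstract, the sketch below shows how a Pearson correlation between per-language alignment scores and per-language task accuracy could be computed. The numbers are invented placeholders, not results from the paper, and the `pearson` helper is a plain textbook implementation rather than the authors' code.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Hypothetical values, one entry per language (illustration only).
alignment_scores = [0.92, 0.75, 0.40, 0.15]
task_accuracy = [0.81, 0.70, 0.45, 0.30]

r = pearson(alignment_scores, task_accuracy)
print(round(r, 3))
```

A value close to 1 means that languages with higher alignment scores also tend to have higher downstream accuracy, which is the sense in which an alignment score can stand in for task performance.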
pdf
bib
abs
Automated Fine-Grained Mixture-of-Experts Quantization
Zhanhao Xie
|
Yuexiao Ma
|
Xiawu Zheng
|
Fei Chao
|
Wanchen Sui
|
Yong Li
|
Shen Li
|
Rongrong Ji
The Mixture of Experts (MoE) architecture enables efficient model scaling through conditional computation, where only a subset of parameters is activated per input. However, this distributed architecture poses unprecedented challenges for model compression, as conventional quantization methods optimized for dense networks prove inadequate. This paper introduces a specialized quantization framework for MoE architectures, motivated by our discovery that weight matrices across expert networks exhibit distinctive channel-wise outlier distributions, necessitating a more nuanced compression approach. Through theoretical analysis incorporating Fisher Information matrices and condition number characteristics, we establish a fundamental relationship between layer functionality and quantization sensitivity, demonstrating that down-projection layers inherently demand higher precision compared to up-projection layers. Leveraging these insights, we develop an automated channel-wise quantization framework that dynamically determines optimal bit-width allocations while maintaining minimal computational overhead through efficient statistical approximations. When evaluated on the Mixtral-8x7b-v0.1 architecture, our methodology demonstrates a 3.96% improvement over existing state-of-the-art approaches across natural language understanding benchmarks, while achieving superior compression ratios.
pdf
bib
abs
Enhancing Complex Reasoning in Knowledge Graph Question Answering through Query Graph Approximation
Hongjun Jeong
|
Minji Kim
|
Heesoo Jung
|
Ko Keun Kim
|
Hogun Park
Knowledge-grounded Question Answering (QA) aims to provide answers to structured queries or natural language questions by leveraging Knowledge Graphs (KGs). Existing approaches are mainly divided into Knowledge Graph Question Answering (KGQA) and Complex Query Answering (CQA). Both approaches have limitations: the first struggles to utilize KG context effectively when essential triplets related to the questions are missing in the given KGs, while the second depends on structured first-order logic queries. To overcome these limitations, we propose a novel framework termed Aqua-QA. Aqua-QA approximates query graphs from natural language questions, enabling reasoning over KGs. We evaluate Aqua-QA on challenging QA tasks where KGs are incomplete in the context of QA, and complex logical reasoning is required to answer natural language questions. Experimental results on these datasets demonstrate that Aqua-QA outperforms existing methods, showcasing its effectiveness in handling complex reasoning tasks in knowledge-grounded QA settings.